Abstract
Existing Chinese word segmentation methods are serial and cannot cope with massive data sets. This paper proposes a parallel word segmentation method based on MapReduce. The MapReduce programming model uses TextInputFormat as its default input format, which is poorly suited to data sets made up of many small text files. First, a custom input format, MyInputFormat, is defined on the basis of the CombineFileInputFormat class, and its createRecordReader method is implemented to return a RecordReader object. Second, a custom MyRecordReader class specifies the concrete logic for reading the text into <key, value> pairs. Finally, custom map and reduce functions produce the word segmentation results for texts of different categories. Experimental results show that the improved MyInputFormat handles large collections of small text files far better than the default TextInputFormat, saving considerable segmentation time.
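The abstract only names the classes involved; the following is a minimal Java sketch of how such an input format could be written against Hadoop's org.apache.hadoop.mapreduce API. The class names MyInputFormat and MyRecordReader follow the abstract, but the choice of <file path, file contents> as the key/value pair, the 64 MB split size, and every other detail are assumptions for illustration, not the authors' implementation.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Sketch: pack many small text files into combined splits instead of one split per file.
public class MyInputFormat extends CombineFileInputFormat<Text, Text> {

    public MyInputFormat() {
        setMaxSplitSize(64 * 1024 * 1024); // assumed upper bound per combined split
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // each small file is read as a whole
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // CombineFileRecordReader hands each file inside the combined split to MyRecordReader.
        return new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, MyRecordReader.class);
    }
}

// Reads one whole small file as a single <file path, file contents> record (in practice, its own .java file).
class MyRecordReader extends RecordReader<Text, Text> {
    private final CombineFileSplit split;
    private final TaskAttemptContext context;
    private final int index;            // position of this file inside the combined split
    private final Text key = new Text();
    private final Text value = new Text();
    private boolean processed = false;

    // This three-argument constructor is required by CombineFileRecordReader.
    public MyRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
        this.split = split;
        this.context = context;
        this.index = index;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        Path path = split.getPath(index);
        int length = (int) split.getLength(index);
        byte[] contents = new byte[length];
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(0, contents, 0, length);
        }
        key.set(path.toString());        // key: file path (carries the document's category)
        value.set(contents, 0, length);  // value: whole file text to be segmented in the mapper
        processed = true;
        return true;
    }

    @Override public Text getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
    @Override public void close() { }
}
```

With records shaped this way, a mapper would run the word segmenter over each value and emit results keyed by the category encoded in the file path; the paper's own map and reduce logic is not reproduced here.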
Authors
XU Hong-bo, ZHAO Wen-tao, MENG Ling-jun
College of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454000, China
Source
《电脑知识与技术》
2016, Issue 8, pp. 171-175 (5 pages)
Computer Knowledge and Technology