Abstract
Most research on data cleaning targets English data, for which the relevant algorithms are relatively mature. Chinese text data has received far less attention, and because Chinese and English differ considerably, cleaning methods designed for English do not fully carry over to Chinese. This paper therefore proposes a method for cleaning similar duplicate data in Chinese text. The method takes into account polysemy (one word with several meanings) and synonymy (several words with one meaning) in Chinese, and introduces position vectors during text vectorization to reduce the loss of semantic information when text is converted into a mathematical representation.
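The abstract does not say how the position vectors are built or combined with the word vectors, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes a sinusoidal position encoding and an element-wise product between token and position vectors (a plain sum followed by pooling would cancel out word order). The function `token_vector` is a hypothetical stand-in for a trained embedding, and texts are split into single characters instead of being word-segmented.

```python
import hashlib
import math

DIM = 16  # illustrative vector size, not taken from the paper


def token_vector(token, dim=DIM):
    """Hypothetical stand-in for a trained embedding:
    deterministic pseudo-random values in [-1, 1] derived from a hash."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [digest[i] / 127.5 - 1.0 for i in range(dim)]


def position_vector(pos, dim=DIM):
    """Sinusoidal position encoding (an assumption, not the paper's definition)."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def text_vector(tokens, use_position=True, dim=DIM):
    """Sum token vectors; optionally scale each one element-wise
    by the vector of the position it occupies."""
    total = [0.0] * dim
    for pos, tok in enumerate(tokens):
        vec = token_vector(tok, dim)
        if use_position:
            pvec = position_vector(pos, dim)
            vec = [a * b for a, b in zip(vec, pvec)]
        for k in range(dim):
            total[k] += vec[k]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Same characters, different order, different meaning.
a = list("我爱你")
b = list("你爱我")

sim_without = cosine(text_vector(a, False), text_vector(b, False))
sim_with = cosine(text_vector(a, True), text_vector(b, True))
print("without position vectors:", sim_without)
print("with position vectors:   ", sim_with)
```

Without position information the two texts map to the same vector and look like exact duplicates; with it, their similarity drops below 1, so a downstream clustering step can tell them apart.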
Authors
李碧秋
王佳斌
刘雪丽
LI Biqiu; WANG Jiabin; LIU Xueli (College of Engineering, Huaqiao University, Quanzhou 362000)
Source
《现代计算机》
2021, No. 19, pp. 58-61 (4 pages)
Modern Computer
Funding
Industry-University-Research Innovation Project of the Xiamen Science and Technology Bureau (No. 3502Z20173046).
Keywords
Chinese Text
Data Cleaning
Similar Duplicate Data
Text Vectorization
Clustering