Abstract
Among existing text similarity methods, the TFIDF algorithm used to obtain keyword weights does not fully account for how a keyword's position in a text and its dispersion across the text library affect its weight, and it runs inefficiently when the library holds a large volume of information. To address these problems, this paper proposes TFIDFWGE, a TFIDF algorithm based on semantic information entropy and information gain. The algorithm adds a position weight to each given keyword and computes its entropy and information gain to obtain the keyword's final weight. The TFIDFWGE algorithm and the Vector Space Model (VSM) text similarity computation are implemented with the Map/Reduce framework of the Hadoop platform. Experiments on two real datasets show that TFIDFWGE achieves higher recall and precision than the existing TFIDF algorithm, and that the text similarity detection system implemented on Hadoop processes large text libraries more efficiently.
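The abstract does not state the weighting formulas, so the following is only a minimal single-machine sketch of the general idea: a position-weighted term frequency, scaled by an inverse document frequency and by an entropy-based dispersion factor, followed by VSM cosine similarity. The position weights, the smoothed IDF, the form of the dispersion factor, and the names `POSITION_WEIGHT`, `keyword_weights`, and `cosine` are all assumptions made for illustration, not the paper's definitions. The information gain term and the Map/Reduce decomposition are omitted because the abstract gives no detail on either.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not give the actual values.
POSITION_WEIGHT = {"title": 2.0, "body": 1.0}


def keyword_weights(docs):
    """docs: list of documents, each a list of (term, position) pairs.

    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Position-weighted term frequency for each document.
    tfs = []
    for doc in docs:
        tf = Counter()
        for term, position in doc:
            tf[term] += POSITION_WEIGHT[position]
        tfs.append(tf)

    # Document frequency and corpus-wide weighted count for each term.
    df = Counter()
    total = Counter()
    for tf in tfs:
        for term, freq in tf.items():
            df[term] += 1
            total[term] += freq

    def dispersion_factor(term):
        # Assumed form: 1 minus the normalized entropy of the term's
        # distribution over documents. Evenly spread terms score near 0,
        # terms concentrated in one document score 1.
        if n_docs < 2:
            return 1.0
        entropy = 0.0
        for tf in tfs:
            if term in tf:
                p = tf[term] / total[term]
                entropy -= p * math.log(p)
        return max(0.0, 1.0 - entropy / math.log(n_docs))

    weights = []
    for tf in tfs:
        length = sum(tf.values())
        weights.append({
            term: (freq / length)
            * math.log(1.0 + n_docs / df[term])  # smoothed IDF (assumed)
            * dispersion_factor(term)
            for term, freq in tf.items()
        })
    return weights


def cosine(a, b):
    """VSM cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, a term that occurs equally often in every document receives a weight close to zero, so two documents that share only such terms come out as nearly dissimilar, while a term that appears in a title counts for more than the same term in the body.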
Source
《计算机技术与发展》 (Computer Technology and Development)
2015, No. 8, pp. 90-93 (4 pages)
Funding
National Natural Science Foundation of China (6100311)
Key Natural Science Research Project of Anhui Province (KJ2013Z023, KJ2013A058)