Abstract
To address the problem that new Chinese Internet words identified by existing methods tend to have short life cycles and soon fall out of use, a method for discovering new Chinese words based on information propagation characteristics is proposed. The method exploits the observation that genuine new words spread widely and persist over time, and defines three statistics: user coverage rate, topic coverage rate, and new-word life cycle. An N-gram algorithm generates the list of candidate strings, and junk strings are then filtered out based on word frequency and word flexibility. In experiments using microblog text as the corpus, compared with existing methods, the user statistic raises the accuracy of new word recognition by 11%, the topic statistic by 10%, and the time statistic by 13%; combining all three raises it by 16%. The results show that each statistic improves the accuracy of recognizing new Chinese Internet words, and that considering all three together yields higher accuracy than considering any single one.
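The abstract names the pipeline stages but not their formulas, so the following is only a minimal sketch of how N-gram candidate generation, a frequency filter, and the three propagation statistics could fit together. The post schema `(user_id, topic_id, day, text)`, the function names, the `min_freq` threshold, and the definitions of coverage (fraction of distinct users or topics in which a candidate appears) and life cycle (days between first and last appearance, inclusive) are all assumptions for illustration, not the paper's definitions. The word-flexibility filter is omitted because the abstract does not define it.

```python
from collections import Counter, defaultdict


def ngram_candidates(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max found in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def propagation_stats(posts, min_freq=2):
    """Compute illustrative propagation statistics for candidate strings.

    posts: iterable of (user_id, topic_id, day, text) tuples; this schema
    is hypothetical, with `day` given as an integer day index.
    Returns {candidate: (user_coverage, topic_coverage, life_span_days)}.
    """
    posts = list(posts)
    all_users = {user for user, _, _, _ in posts}
    all_topics = {topic for _, topic, _, _ in posts}

    freq = Counter()
    users = defaultdict(set)
    topics = defaultdict(set)
    days = defaultdict(set)
    for user, topic, day, text in posts:
        for cand in ngram_candidates(text):
            freq[cand] += 1
            users[cand].add(user)
            topics[cand].add(topic)
            days[cand].add(day)

    stats = {}
    for cand, count in freq.items():
        # Frequency filter only; the threshold value is an assumption.
        if count < min_freq:
            continue
        stats[cand] = (
            len(users[cand]) / len(all_users),    # assumed user coverage
            len(topics[cand]) / len(all_topics),  # assumed topic coverage
            max(days[cand]) - min(days[cand]) + 1,  # assumed life cycle
        )
    return stats
```

Under these assumptions, a candidate used by many distinct users, across many topics, and over a long span of days would score high on all three statistics, which matches the abstract's intuition that lasting new words spread widely and persist.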
Source
Journal of Xi'an Jiaotong University (《西安交通大学学报》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2015, Issue 12, pp. 59-64 (6 pages)
Funding
National Natural Science Foundation of China (61221063, 61572397, 61502383)
Natural Science Basic Research Plan of Shaanxi Province (2015JM6298)
Keywords
new word discovery
information propagation
user behavior
temporal characteristics