
A Method of Discovering New Chinese Words from Internet Based on Information Propagation
(利用信息传播特性的中文网络新词发现方法)

Cited by: 5
Abstract: Existing methods for recognizing new Chinese words on the Internet tend to pick up words that have short life cycles and quickly fall out of use. To address this problem, a new-word discovery method based on information propagation characteristics is proposed. Drawing on the observation that new words spread widely and persist over time, the method defines three statistics: the coverage rate of users, the coverage rate of topics, and the life cycle of a new word. An N-gram algorithm generates the list of candidate strings, and junk strings are then filtered out based on word frequency and word flexibility. In experiments that use microblog text as the corpus, compared with existing methods, the user statistic raises the accuracy of new-word recognition by 11%, the topic statistic raises it by 10%, and the time statistic raises it by 13%; combining the user, topic, and time statistics raises it by 16%. The results show that each statistic improves the accuracy of Chinese Internet new-word recognition, and that considering all three together gives higher accuracy than considering any single one.
Source: Journal of Xi'an Jiaotong University (《西安交通大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list; 2015, No. 12, pp. 59-64 (6 pages)
Funding: National Natural Science Foundation of China (61221063, 61572397, 61502383); Natural Science Basic Research Plan of Shaanxi Province (2015JM6298)
Keywords: new word discovery; information propagation; user behavior; temporal characteristics
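
The pipeline outlined in the abstract (N-gram candidates, then statistics over users, topics, and time) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the definitions used here (coverage as the fraction of distinct users or topics in which a string appears, life cycle as the span of days between first and last occurrence) are guesses, and the post schema, function names, and thresholds are hypothetical. The word-flexibility filter is omitted because the abstract does not define it.

```python
from collections import Counter, defaultdict


def ngram_candidates(text, max_n=4):
    """Yield every character n-gram of length 2..max_n found in text."""
    for n in range(2, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def propagation_stats(posts, max_n=4):
    """Compute per-candidate statistics from microblog posts.

    posts: iterable of (user_id, topic_id, day, text) tuples.
    This schema is a hypothetical one chosen for the sketch.
    """
    freq = Counter()
    users = defaultdict(set)
    topics = defaultdict(set)
    days = defaultdict(set)
    all_users, all_topics = set(), set()

    for user, topic, day, text in posts:
        all_users.add(user)
        all_topics.add(topic)
        for gram in ngram_candidates(text, max_n):
            freq[gram] += 1
            users[gram].add(user)
            topics[gram].add(topic)
            days[gram].add(day)

    stats = {}
    for gram, count in freq.items():
        stats[gram] = {
            "freq": count,
            # Assumed definition: share of all users who used the string.
            "user_coverage": len(users[gram]) / len(all_users),
            # Assumed definition: share of all topics containing the string.
            "topic_coverage": len(topics[gram]) / len(all_topics),
            # Assumed definition: days from first to last occurrence, inclusive.
            "life_cycle": max(days[gram]) - min(days[gram]) + 1,
        }
    return stats


def select_new_words(stats, min_freq=2, min_user=0.5, min_topic=0.5, min_life=2):
    """Keep candidates that pass every threshold (all thresholds are illustrative)."""
    return sorted(
        gram
        for gram, s in stats.items()
        if s["freq"] >= min_freq
        and s["user_coverage"] >= min_user
        and s["topic_coverage"] >= min_topic
        and s["life_cycle"] >= min_life
    )
```

Calling `propagation_stats` on a list of posts and passing the result to `select_new_words` returns the strings that are frequent, used by many users across many topics, and seen over several days. A string that one user posts repeatedly on a single day would be rejected, which is the behavior the abstract attributes to the propagation statistics.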


