Research on Scalable Clustering of User-Oriented Short Text
Abstract With the arrival of the big data era, user short-text data is growing explosively, and extracting useful information from short texts through cluster analysis has become increasingly important. Cluster analysis, an important means of knowledge discovery, is the process of grouping objects according to the similarity of their features. To this end, a scalable (incremental) clustering method for user-oriented short text is proposed. The method consists of two phases, offline clustering and online clustering. In the offline phase, after the short texts are preprocessed, an irrelevant-word dictionary is used to identify and remove irrelevant words, and a word-class dictionary is used to normalize the texts semantically; a similarity calculation method based on multi-feature fusion is also proposed to cluster the texts by relatedness. In the online phase, the offline clustering results serve as features for clustering incoming online texts, and the offline and online results are then merged to produce the final clustering. To verify the effectiveness and feasibility of the method, it was compared experimentally with a similarity method based on feature vectors. The results show that the method reaches a clustering recall of 73%, a clustering precision of 87.7%, and an F-measure of 79.6%, all better than the feature-vector-based method.
Source Computer Technology and Development (《计算机技术与发展》), 2017, No. 11, pp. 83-87 (5 pages)
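Funding National Natural Science Foundation of China (61371114, 61170156); Self-Cultivation Project of the Marine Equipment Research Institute, Jiangsu University of Science and Technology (HZ2016004)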
Keywords short text; semantic normalization; offline clustering; online clustering
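
The two-phase procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the dictionaries, the fused features, the clustering algorithm, or any thresholds. The dictionaries `IRRELEVANT_WORDS` and `WORD_CLASSES`, the two fused features (word overlap and bigram overlap), the weight `w`, the single-pass assignment rule, and the `threshold` value are all assumptions chosen only to make the flow concrete.

```python
from typing import Dict, List, Set

# Hypothetical dictionaries; the paper's actual dictionaries are not given.
IRRELEVANT_WORDS: Set[str] = {"please", "hello", "thanks"}
WORD_CLASSES: Dict[str, str] = {"cellphone": "phone", "mobile": "phone"}

Cluster = List[List[str]]  # a cluster is a list of normalized token lists


def normalize(text: str) -> List[str]:
    """Drop irrelevant words, then map each word to its class label."""
    tokens = [t for t in text.lower().split() if t not in IRRELEVANT_WORDS]
    return [WORD_CLASSES.get(t, t) for t in tokens]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bigrams(tokens: List[str]) -> Set[str]:
    """Set of adjacent word pairs in a token list."""
    return {f"{x} {y}" for x, y in zip(tokens, tokens[1:])}


def similarity(a: List[str], b: List[str], w: float = 0.7) -> float:
    """Weighted fusion of two illustrative features (assumed, not the paper's)."""
    word_sim = jaccard(set(a), set(b))
    bigram_sim = jaccard(bigrams(a), bigrams(b))
    return w * word_sim + (1 - w) * bigram_sim


def assign(tokens: List[str], clusters: List[Cluster], threshold: float) -> None:
    """Add tokens to the most similar cluster, or open a new cluster."""
    best, best_sim = None, threshold
    for cluster in clusters:
        sim = similarity(tokens, cluster[0])  # first member acts as representative
        if sim >= best_sim:
            best, best_sim = cluster, sim
    if best is None:
        clusters.append([tokens])
    else:
        best.append(tokens)


def cluster_offline(texts: List[str], threshold: float = 0.5) -> List[Cluster]:
    """Offline phase: cluster a stored batch of short texts."""
    clusters: List[Cluster] = []
    for text in texts:
        assign(normalize(text), clusters, threshold)
    return clusters


def cluster_online(text: str, clusters: List[Cluster],
                   threshold: float = 0.5) -> List[Cluster]:
    """Online phase: place one incoming text using the offline clusters,
    merging the result into the same cluster list."""
    assign(normalize(text), clusters, threshold)
    return clusters
```

Calling `cluster_offline` on a stored batch and then `cluster_online` for each new text keeps a single cluster list that grows incrementally, which mirrors the merge of offline and online results that the abstract describes. Using the first member of a cluster as its representative is a simplification; a centroid or keyword profile would be an equally plausible choice.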