Research on Scalable Clustering of User-oriented Short Text (可增量的用户短文本聚类方法研究)

Abstract: With the arrival of the big data era, the volume of user short text is growing explosively, and using cluster analysis to extract useful information from short text has become increasingly important. Cluster analysis, an important means of knowledge discovery, is the process of grouping objects according to the similarity of their features. To this end, a scalable (incremental) clustering method for user-oriented short text is proposed. The method consists of two phases, offline clustering and online clustering. In the offline phase, after short-text preprocessing, an irrelevant-word dictionary is used to identify and remove irrelevant words, and a word-class dictionary is used to normalize the semantics of the short text; a similarity calculation method based on multi-feature fusion is also proposed to cluster texts by relevance. In the online phase, the offline clustering results serve as features for clustering the online text, and the offline and online results are then merged to produce the final clustering result. To verify the effectiveness and feasibility of the method, comparison experiments were conducted against a similarity method based on feature vectors. The experimental results show that the method reaches a clustering recall of 73%, a clustering precision of 87.7%, and an F-measure of 79.6%, all better than the feature-vector-based method.
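The two-phase flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the dictionaries, the fused features, the weights, or the threshold, so `IRRELEVANT_WORDS`, `WORD_CLASSES`, the two Jaccard features, the weights 0.6/0.4, the threshold 0.5, and the single-pass assignment rule are all assumptions made for the example.

```python
# Hypothetical dictionaries (assumptions; the paper's dictionaries are not in the abstract).
IRRELEVANT_WORDS = {"please", "hello", "thanks", "the", "a", "my"}
WORD_CLASSES = {"cellphone": "phone", "mobile": "phone",
                "recharge": "topup", "top-up": "topup"}


def normalize(text):
    """Lowercase, split on whitespace, drop irrelevant words, map words to a class label."""
    tokens = [t for t in text.lower().split() if t not in IRRELEVANT_WORDS]
    return [WORD_CLASSES.get(t, t) for t in tokens]


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def char_bigrams(tokens):
    """Set of character bigrams of the space-joined token sequence."""
    s = " ".join(tokens)
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b, w_word=0.6, w_char=0.4):
    """Weighted fusion of two assumed features: word overlap and character-bigram overlap."""
    word_sim = jaccard(set(a), set(b))
    char_sim = jaccard(char_bigrams(a), char_bigrams(b))
    return w_word * word_sim + w_char * char_sim


def cluster_texts(texts, clusters=None, threshold=0.5):
    """Single-pass clustering. Passing existing clusters makes the call incremental:
    new texts join the most similar existing cluster or open a new one."""
    clusters = [] if clusters is None else clusters
    for text in texts:
        tokens = normalize(text)
        best_idx, best_sim = None, threshold
        for i, c in enumerate(clusters):
            s = similarity(tokens, c["rep"])
            if s >= best_sim:
                best_idx, best_sim = i, s
        if best_idx is None:
            # No cluster is similar enough: the text becomes a new cluster representative.
            clusters.append({"rep": tokens, "members": [text]})
        else:
            clusters[best_idx]["members"].append(text)
    return clusters


if __name__ == "__main__":
    # Offline phase on a made-up historical batch.
    offline = cluster_texts(["please recharge my mobile",
                             "top-up the cellphone thanks",
                             "cancel broadband contract"])
    # Online phase: reuse the offline clusters, so both results end up in one list.
    merged = cluster_texts(["hello recharge cellphone",
                            "report a stolen card"], clusters=offline)
    for c in merged:
        print(c["rep"], c["members"])
```

In this sketch the online phase extends the offline clusters in place, which is one simple way to read "merging" the two results; the paper may merge them differently.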

Authors: Zhang Yi (张仪), Chen Guo (陈国), Zhang Zaiyue (张再跃)

Affiliation: School of Computer Science and Engineering, Jiangsu University of Science and Technology

Source: Computer Technology and Development (《计算机技术与发展》), 2017, No. 11, pp. 83-87 (5 pages)

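Funding: National Natural Science Foundation of China (61371114, 61170156); Self-Cultivation Project of the Marine Equipment Research Institute, Jiangsu University of Science and Technology (HZ2016004)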

Keywords: short text; semantic normalization; offline clustering; online clustering

Classification code: TP301 [Automation and Computer Technology: Computer System Architecture]
