Abstract
In recent years, XML has gradually become a standard for data representation and data exchange on the Internet, owing to its simplicity, semi-structured nature, extensibility, and self-description. XML document clustering is an active research topic in data mining and provides technical support for collecting, organizing, and retrieving web information resources. This paper first reviews the main existing XML document clustering algorithms. Then, after using WordNet to perform word sense disambiguation on the tags in XML documents, it proposes a new method for computing XML document similarity based on semantic tag trees, and clusters the documents with a nearest-neighbor (KNN) algorithm. Finally, experiments on a data set used in XML retrieval research indicate that the method is effective for XML document clustering.
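The abstract does not give the similarity formula, so the following is only a minimal sketch of the general pipeline (tag structure, pairwise similarity, nearest-neighbor grouping), not the authors' method. It assumes a plain Jaccard similarity over root-to-node tag paths as a stand-in for the semantic tag tree similarity, and it omits the WordNet disambiguation step entirely. The function names `tag_paths`, `similarity`, and `nearest_neighbor_cluster`, and the `threshold` parameter, are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def similarity(a, b):
    """Jaccard similarity of two tag-path sets (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_neighbor_cluster(docs, threshold=0.5):
    """Give each document the label of its most similar earlier document,
    or open a new cluster when no similarity reaches the threshold."""
    features = [tag_paths(d) for d in docs]
    labels = []
    for i, feat in enumerate(features):
        best_j, best_sim = None, threshold
        for j in range(i):
            s = similarity(feat, features[j])
            if s >= best_sim:
                best_j, best_sim = j, s
        if best_j is not None:
            labels.append(labels[best_j])
        else:
            labels.append(max(labels, default=-1) + 1)
    return labels


docs = [
    "<book><title/><author/></book>",
    "<book><title/><author/><year/></book>",
    "<cd><artist/><track/></cd>",
]
labels = nearest_neighbor_cluster(docs)
```

In this toy input the two `book` documents share three of four tag paths, so they fall into one cluster, while the `cd` document shares none and starts its own. A semantic variant would first map each tag to a disambiguated WordNet sense, so that differently named tags with the same meaning could still match.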
Source
《情报学报》
CSSCI
Peking University Core Journals (北大核心)
2012, Issue 5, pp. 508-514 (7 pages)
Journal of the China Society for Scientific and Technical Information
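Funding
This paper is a research outcome of the National Natural Science Foundation of China project "Research on Automatic Clustering and Classification of XML Documents Based on Tag Trees" (70803046).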