Abstract
This paper adopts a word-based categorization technique. The specialty of each category word (how specific the word is to one category) is computed from its frequency, concentration, and distribution in the corpus, together with the size of each corpus. Exploiting the tag structure of HTML, a three-dimensional weighted classification algorithm computes category weights from the frequency, location, and specialty of the category words. A variant of Bayes' formula is used to compute the categorization reliability of Chinese WWW documents, and each document is assigned to the category with the highest reliability. In tests on 108 documents, the categorization accuracy was 98.1% in closed testing and 83.3% in open testing.
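The abstract does not give the formulas, so the following is only a minimal sketch of how the three dimensions (frequency, location, specialty) and a Bayes-style normalization might fit together. The location weights in `LOCATION_WEIGHT`, the specialty table, the priors, and the function names are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical weights for the HTML element in which a category word appears.
# The paper's actual values are not given in the abstract.
LOCATION_WEIGHT = {"title": 3.0, "h1": 2.0, "body": 1.0}


def category_scores(occurrences, specialty):
    """Accumulate a weighted score per category for one document.

    occurrences: list of (word, location) pairs, one per occurrence,
                 so repeated words contribute their frequency.
    specialty:   dict mapping word -> {category: specialty degree}.
    """
    scores = defaultdict(float)
    for word, location in occurrences:
        weight = LOCATION_WEIGHT.get(location, 1.0)
        for category, degree in specialty.get(word, {}).items():
            # frequency (one term per occurrence) x location x specialty
            scores[category] += weight * degree
    return dict(scores)


def classify(occurrences, specialty, priors):
    """Return (best_category, reliability) using a Bayes-style normalization.

    Assumption: reliability = prior * score, normalized over all categories.
    """
    scores = category_scores(occurrences, specialty)
    joint = {c: priors.get(c, 0.0) * s for c, s in scores.items()}
    total = sum(joint.values())
    if total == 0:
        return None, {}
    reliability = {c: v / total for c, v in joint.items()}
    best = max(reliability, key=reliability.get)
    return best, reliability


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    specialty = {
        "football": {"sports": 0.9, "news": 0.1},
        "stock": {"finance": 0.8},
    }
    priors = {"sports": 1 / 3, "news": 1 / 3, "finance": 1 / 3}
    occurrences = [("football", "title"), ("football", "body"), ("stock", "body")]
    print(classify(occurrences, specialty, priors))
```

With the toy data above, "sports" receives the highest reliability (about 0.75), because "football" occurs twice and once in the heavily weighted title. Treating each occurrence as a separate term keeps frequency implicit in the sum instead of requiring a separate counting pass.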
Source
《情报学报》
CSSCI
PKU Core (北大核心)
2002, No. 5, pp. 532-536 (5 pages)
Journal of the China Society for Scientific and Technical Information