Abstract
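Association rule-based classification algorithms consider only whether a feature word is present when constructing the classifier and ignore the weights of text features. To address this problem, this paper builds on the association rule-based text classification method ARC-BC and proposes ISARC (ItemSet Significance-based ARC), an algorithm that can improve the accuracy of associative text classification. The algorithm uses feature-item weights to define the significance of a k-itemset, generates association rules by mining significant itemsets, and takes into account the effect of lift on the text to be classified. Experimental results show that ISARC, which mines significant itemsets, can improve the accuracy of associative text classification.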
Text classification is an important foundation of information retrieval and text mining; its main task is to assign category labels to texts according to a given category set. Text classification has a wide range of applications in natural language processing and understanding, information organization and management, information filtering, and other areas. At present, text classification methods fall mainly into three groups: statistical methods, connection-based methods, and rule-based methods. The basic idea of the traditional associative text classification algorithm, associative rule-based classifier by category (ARC-BC), is to use the association rule mining algorithm Apriori to generate frequent items, that is, feature items or itemsets that appear frequently. Each frequent item serves as a rule antecedent and a category serves as the rule consequent; together these rules form the rule set that constitutes the classifier. When a test sample is classified, each rule whose antecedent matches the sample adds its confidence to the counter of the category the rule belongs to. The test sample is then assigned to the category whose counter holds the maximum cumulative confidence.
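The ARC-BC idea described above can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the paper's implementation: frequent itemsets are enumerated by brute force with `itertools.combinations` rather than by a real Apriori pass, and the names `mine_rules`, `classify`, `min_support`, and `max_len`, as well as the toy documents, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mine_rules(docs, min_support=0.5, max_len=2):
    """Mine rules (itemset -> category) separately for each category.

    docs: list of (set_of_terms, category) pairs.
    Returns a list of (itemset, category, confidence) triples.
    Brute-force enumeration stands in for Apriori (assumption).
    """
    by_cat = defaultdict(list)
    for terms, cat in docs:
        by_cat[cat].append(frozenset(terms))
    all_docs = [frozenset(terms) for terms, _ in docs]

    rules = []
    for cat, cat_docs in by_cat.items():
        # Count every itemset of size 1..max_len inside this category.
        counts = defaultdict(int)
        for terms in cat_docs:
            for k in range(1, max_len + 1):
                for combo in combinations(sorted(terms), k):
                    counts[frozenset(combo)] += 1
        for itemset, n in counts.items():
            # Keep itemsets that are frequent within the category.
            if n / len(cat_docs) >= min_support:
                total = sum(1 for d in all_docs if itemset <= d)
                # confidence = P(category | itemset) over the whole corpus
                rules.append((itemset, cat, n / total))
    return rules


def classify(terms, rules):
    """Add up the confidence of every matching rule per category, pick the max."""
    terms = frozenset(terms)
    scores = defaultdict(float)
    for itemset, cat, conf in rules:
        if itemset <= terms:  # the rule antecedent matches the sample
            scores[cat] += conf
    return max(scores, key=scores.get) if scores else None


# Toy corpus (made up for illustration).
docs = [
    ({"goal", "match", "team"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"stock", "market", "team"}, "finance"),
    ({"stock", "market"}, "finance"),
]
rules = mine_rules(docs)
print(classify({"goal", "team", "player"}, rules))
```

Note that the sketch treats every term as simply present or absent, which is exactly the limitation the abstract goes on to criticize.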
However, the ARC-BC algorithm has two main drawbacks: (1) When constructing the classifier, it considers only whether feature words are present and ignores the weights of text features; the frequent itemsets mined and the association rules generated in this way may affect the classification results. (2) In the class prediction stage, it places too much emphasis on rule confidence. The mining process can produce rules that have the same antecedent but different consequents, and if prediction considers only rule confidence, without considering the correlation between rule antecedent and consequent, classification accuracy also suffers. To solve these two problems, in this paper a new algorithm, the itemset significance-based association rule-based categorizer (ISARC), is propo
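The correlation between a rule's antecedent and its consequent is commonly measured by lift, which the abstract names as the quantity ISARC takes into account. The sketch below computes the standard lift of a rule A -> c, that is, confidence(A -> c) / P(c). How ISARC combines lift with confidence is not stated in this abstract, so the code shows only the measure itself; the function name `lift` and the toy documents are hypothetical.

```python
def lift(docs, antecedent, category):
    """Standard lift of the rule antecedent -> category.

    docs: list of (set_of_terms, category) pairs.
    lift = P(category | antecedent) / P(category).
    Values above 1 indicate positive correlation, below 1 negative.
    """
    antecedent = frozenset(antecedent)
    n = len(docs)
    n_a = sum(1 for terms, _ in docs if antecedent <= set(terms))
    n_c = sum(1 for _, cat in docs if cat == category)
    n_ac = sum(
        1 for terms, cat in docs
        if cat == category and antecedent <= set(terms)
    )
    if n_a == 0 or n_c == 0:
        return 0.0  # rule never fires or category is empty
    return (n_ac / n_a) / (n_c / n)


# Toy corpus (made up for illustration): "team" occurs in both categories,
# so rules with the same antecedent point at different consequents.
docs = [
    ({"goal", "match", "team"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"stock", "market", "team"}, "finance"),
    ({"stock", "market"}, "finance"),
]
print(lift(docs, {"team"}, "sports"))
print(lift(docs, {"team"}, "finance"))
```

In this toy corpus the two rules share the antecedent `team`; lift is above 1 for `sports` and below 1 for `finance`, which is the kind of distinction that confidence alone can blur when class sizes differ.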
Source
《南京大学学报(自然科学版)》
CAS
CSCD
Peking University Core Journals
2011, No. 5, pp. 544-550 (7 pages)
Journal of Nanjing University(Natural Science)
Funding
National Natural Science Foundation of China (61070062)
Keywords
text classification
association rule-based categorizer by category (ARC-BC)
weight
itemset significance