Abstract
To address the problem of filtering objectionable Uyghur-language text in web pages, this paper proposes a filtering scheme based on mutual information and cosine similarity. First, the input document is preprocessed to remove uninformative words. Next, document frequency (DF) and mutual information (MI) are combined to extract highly discriminative features from the document. Finally, the features are weighted using TF-IDF, and the cosine similarity between the document's weighted feature vector and the weighted feature vector of each class in the classification template is computed, in order to classify the document and filter out objectionable content. Experimental results show that the scheme effectively filters objectionable Uyghur documents, with a correct filtering rate of 83.5%.
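The three stages named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The abstract does not say how DF and MI are combined, so the sketch assumes a DF threshold followed by ranking on the maximum pointwise MI across classes. The smoothed IDF, the summed class templates, the function names, and the English toy tokens (standing in for preprocessed Uyghur words) are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def select_features(docs, labels, min_df=2, top_k=6):
    """Assumed DF+MI combination: drop rare terms, then rank by max pointwise MI."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequency per term
    class_size = Counter(labels)
    joint = defaultdict(Counter)                    # joint[c][t]: docs of class c containing t
    for d, c in zip(docs, labels):
        joint[c].update(set(d))
    scores = {}
    for t, f in df.items():
        if f < min_df:                              # DF filter
            continue
        # MI(t, c) = log( P(t, c) / (P(t) * P(c)) ), estimated from document counts
        scores[t] = max(
            math.log(joint[c][t] * n / (f * class_size[c]))
            for c in class_size if joint[c][t] > 0
        )
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k], df


def tfidf(doc, features, df, n):
    """TF-IDF weights over the selected features (smoothed IDF is an assumption)."""
    tf = Counter(doc)
    return [tf[t] * math.log(1 + n / df[t]) for t in features]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_templates(docs, labels, features, df):
    """One weighted vector per class: the sum of its documents' TF-IDF vectors."""
    n = len(docs)
    templates = {}
    for d, c in zip(docs, labels):
        vec = tfidf(d, features, df, n)
        prev = templates.get(c, [0.0] * len(features))
        templates[c] = [a + b for a, b in zip(prev, vec)]
    return templates


def classify(doc, templates, features, df, n):
    """Return the most similar class and all similarities."""
    vec = tfidf(doc, features, df, n)
    sims = {c: cosine(vec, t) for c, t in templates.items()}
    return max(sims, key=sims.get), sims


# Toy training data: hypothetical tokens, already preprocessed.
train_docs = [
    ["gamble", "casino", "win", "money"],
    ["casino", "bet", "money", "gamble"],
    ["school", "teacher", "lesson", "book"],
    ["book", "school", "student", "lesson"],
]
train_labels = ["bad", "bad", "normal", "normal"]

features, df = select_features(train_docs, train_labels)
templates = build_templates(train_docs, train_labels, features, df)

label, sims = classify(["casino", "money", "tonight"], templates, features, df, len(train_docs))
if label == "bad":
    print("filtered:", sims)
```

A document whose vector is all zeros (no selected feature present) has similarity 0 to every class, so a real filter would likely need a minimum-similarity threshold before acting on the predicted label.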
Source
《电子设计工程》
2016, No. 16, pp. 109-112 (4 pages)
Electronic Design Engineering
Funding
Research Project of the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2015211A016)