Abstract
Latent Semantic Indexing (LSI) is an effective method for information retrieval (IR) and has also been applied successfully to text classification. LSI addresses synonymy and polysemy: by reducing noise in the raw document-term matrix, it brings out the semantic relations between terms and documents. To identify and filter harmful or unsolicited topic-specific messages and email in a bilingual (Chinese and English) setting, this paper proposes a topic-specific email filtering system based on an improved LSI method. The system represents the information categories to be filtered with a latent semantic model and, using Singular Value Decomposition (SVD) and supervised learning from positive examples, applies a Support Vector Machine (SVM) to recognize and classify the predefined topic-specific information. Experimental results show that SVM classification with LSI-based feature selection is a more effective approach to information identification and text classification: it achieves good classification performance and also greatly reduces computational complexity.
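The pipeline summarized in the abstract (truncated SVD of a document-term matrix, then an SVM on the reduced features) can be illustrated with a minimal sketch. This is not the authors' implementation: the six-word vocabulary, the toy counts, the choice of k = 2, and the helper names `lsi_fit`, `train_linear_svm`, and `predict` are all assumptions made for illustration. The sketch trains a plain linear SVM by batch subgradient descent on both positive and negative examples, and it omits tokenization (including Chinese word segmentation) and the paper's specific LSI improvements.

```python
import numpy as np

# Hypothetical vocabulary: the first three terms are typical of unwanted
# mail, the last three of ordinary mail.
VOCAB = ["free", "offer", "money", "meeting", "report", "schedule"]

def lsi_fit(X, k):
    """Return a (terms x k) projection from the truncated SVD of X (docs x terms)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def train_linear_svm(Z, y, lam=0.01, lr=0.1, epochs=500):
    """Batch subgradient descent on the L2-regularised hinge loss; y in {-1, +1}."""
    w, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (Z @ w + b) < 1           # documents violating the margin
        grad_w = lam * w - Z[active].T @ y[active] / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy document-term counts (rows: documents, columns: VOCAB terms).
X_train = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 0, 2, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 0, 2],
    [0, 0, 0, 0, 2, 1],
], dtype=float)
y_train = np.array([1, 1, 1, -1, -1, -1], dtype=float)  # +1 = filter, -1 = keep

V_k = lsi_fit(X_train, k=2)                      # latent semantic space
w, b = train_linear_svm(X_train @ V_k, y_train)  # SVM on the reduced features

def predict(X):
    """Project documents into the latent space and apply the linear SVM."""
    return np.sign((X @ V_k) @ w + b)

# Unseen documents are folded into the same latent space before classification.
X_new = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
predictions = predict(X_new)
```

The classifier here works with 2 latent features instead of 6 raw term counts, which is the source of the computational saving the abstract refers to; on a realistic vocabulary the reduction is from many thousands of terms to a few hundred dimensions.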
Source
《计算机工程与应用》
CSCD
Peking University Core Journal (北大核心)
2006, Issue 35, pp. 168-171 (4 pages)
Computer Engineering and Applications
Funding
Natural Science Foundation of Hunan Province (06JJ50132)
Hunan Provincial Outstanding Youth Foundation (03JJY1012)
Keywords
Support Vector Machine (SVM)
Latent Semantic Indexing (LSI)
Information Retrieval (IR)
supervised learning
text classification