Abstract
Principal component analysis (PCA) is one of the most important methods in data analysis. It constructs a series of linear combinations of the original variables such that the combinations are mutually uncorrelated and each captures as much of the information in the original variables as possible. To address shortcomings in current spam e-mail handling, this paper applies PCA to a large number of spam samples to identify the words that commonly occur in spam and their contribution rates to spam. These serve as the basis for dimensionality reduction when judging whether an unknown e-mail is spam. On this basis, the e-mail information is compressed into low-dimensional vectors that retain a large amount of information.
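The contribution-rate idea described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the paper's implementation: the function name `pca_contribution`, the 0.85 cumulative threshold, and the toy word-count matrix are all assumptions introduced here. It uses the covariance matrix of the features, takes its eigenvalues, and treats each eigenvalue's share of the total as that component's contribution rate.

```python
import numpy as np

def pca_contribution(X, threshold=0.85):
    """Return (eigenvalues, contribution rates, reduced data).

    X: (n_samples, n_features) array, e.g. word counts per e-mail.
    threshold: hypothetical cumulative contribution cutoff (an assumption,
    not a value taken from the paper).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    rates = eigvals / eigvals.sum()         # contribution rate of each component
    # smallest k whose cumulative contribution reaches the threshold
    k = int(np.searchsorted(np.cumsum(rates), threshold)) + 1
    k = min(k, len(rates))
    return eigvals, rates, Xc @ eigvecs[:, :k]

# Toy data (made up): rows are e-mails, columns are counts of four words.
X = np.array([
    [3, 2, 0, 1],
    [4, 3, 1, 0],
    [0, 0, 5, 4],
    [1, 0, 4, 5],
    [5, 4, 0, 0],
    [0, 1, 6, 5],
])

eigvals, rates, Z = pca_contribution(X)
print("contribution rates:", np.round(rates, 4))
print("reduced shape:", Z.shape)
```

The columns of `Z` are mutually uncorrelated by construction, and only as many are kept as are needed to reach the chosen cumulative contribution rate. Using the correlation matrix instead of the covariance matrix would amount to standardizing each column of `X` first.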
Source
Computer and Modernization (《计算机与现代化》)
2006, No. 1, pp. 13-15, 33 (4 pages)
Funding
Project funded by the Zhejiang Provincial Department of Education (20030718)
Keywords
principal components
contribution rate
eigenvalue
covariance matrix
correlation matrix