Abstract
To accurately mine users' latent interests and cluster them during personalized search, this paper proposes combining the Zipf-distribution characteristics of the latent semantic space with PLSA (probabilistic latent semantic analysis) to capture the semantics of the full text. The Zipf distribution principle is first used to find the latent semantic space of the documents. User interests are then clustered in this space according to the underlying factors, and a user profile is built in the form of a user interest hierarchy tree. Experiments show that the proposed clustering algorithm clearly outperforms clustering based on the conventional VSM (vector space model). In addition, personalized recommendation experiments on the well-known CTI data set indicate that the user profile built on the latent semantic space agrees with users' actual interests.
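As a rough illustration of the PLSA component only, the following minimal sketch fits the standard asymmetric PLSA model, P(w|d) = Σ_z P(z|d) P(w|z), with EM and assigns each document to its most probable latent topic. It is not the paper's algorithm: the Zipf-based selection of the latent semantic space and the interest hierarchy tree are not reproduced, and the function names `normalize` and `plsa`, the toy count matrix, the number of topics, the iteration count, and the random seed are all assumptions made for the example.

```python
import math
import random


def normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if it is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, n_topics, n_iter=30, seed=0):
    """Fit P(z|d) and P(w|z) to a document-term count matrix with EM.

    Returns (p_z_d, p_w_z, log_likelihoods); the names are illustrative.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random positive start so no probability is zero before the first E-step.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    log_likelihoods = []

    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        ll = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                p_w_d = sum(joint)
                if p_w_d == 0:
                    continue  # guard against numerical underflow
                ll += n_dw * math.log(p_w_d)
                for z in range(n_topics):
                    expected = n_dw * joint[z] / p_w_d
                    acc_z_d[d][z] += expected
                    acc_w_z[z][w] += expected
        log_likelihoods.append(ll)
        # M-step: renormalise the expected counts into probabilities.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z, log_likelihoods


# Toy corpus (assumed data): rows are documents, columns are term counts.
toy_counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 0, 2, 4, 3],
    [0, 0, 0, 1, 3, 4],
]
p_z_d, p_w_z, log_likelihoods = plsa(toy_counts, n_topics=2)
# Assign each document to its most probable latent topic.
clusters = [max(range(2), key=lambda z: row[z]) for row in p_z_d]
```

On a toy matrix like this one, documents that share a vocabulary would be expected to receive the same dominant topic, and the recorded log-likelihood should not decrease from one EM iteration to the next. In a user-profile setting, the rows could instead be pages visited by a user, so that the dominant topics act as candidate interest clusters. That mapping is an assumption of this sketch rather than a detail given in the abstract.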
Source
Journal of Northeastern University (Natural Science)
EI
CAS
CSCD
PKU Core Journals
2008, No. 1, pp. 53-56 (4 pages)
Funding
National Natural Science Foundation of China (60573090, 60673139)
Keywords
user profile
PLSA (probabilistic latent semantic analysis)
latent semantic space
Zipf distribution
user interest hierarchy tree