Abstract
The classic PageRank algorithm has several known defects: it favors older web pages, it is prone to topic drift, and it divides a page's link weight equally among its outgoing links. To address these defects, the vector space model and the information entropy defined in information theory are introduced, and an improved algorithm, called PRKE, is proposed. The algorithm represents each web page as a vector of keywords that characterize the page, and uses the weight of each keyword within the page as the value of the corresponding vector component. Existing web pages are clustered with the K-means clustering algorithm, and the weight of each cluster is expressed as its information entropy, which produces a macroscopic ranking of the pages. Parameters such as a time factor and topic relevance are then incorporated to produce a microscopic ranking. Experimental results show that, compared with the classic PageRank algorithm, the improved PRKE algorithm achieves considerably better first-page hit rate and retrieval accuracy.
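The two-level ordering described in the abstract can be illustrated with a minimal sketch. The abstract does not give the PRKE formulas, so everything specific here is an assumption made for illustration: the helper names (`shannon_entropy`, `kmeans`, `two_level_rank`), the choice to compute a cluster's entropy from its normalized centroid keyword weights, the rule that higher-entropy clusters rank first, and the `micro_scores` input, which stands in for a per-page score that already folds in the time factor and topic relevance.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (in bits) of a non-negative weight list after normalisation."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


def kmeans(vectors, k, iterations=20):
    """Plain K-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def two_level_rank(vectors, micro_scores, k):
    """Order pages by cluster weight first (macro), then by page score (micro).

    Assumption: cluster weight = entropy of the centroid's keyword weights,
    and larger weights rank first. The paper may define both differently.
    """
    labels, centroids = kmeans(vectors, k)
    cluster_weight = [shannon_entropy(c) for c in centroids]
    order = sorted(
        range(len(vectors)),
        key=lambda i: (-cluster_weight[labels[i]], labels[i], -micro_scores[i]),
    )
    return order, labels, cluster_weight


# Hypothetical data: four pages described by weights of three keywords.
pages = [[0.9, 0.1, 0.0], [0.1, 0.45, 0.45], [0.8, 0.2, 0.0], [0.0, 0.5, 0.5]]
scores = [0.3, 0.2, 0.5, 0.4]  # stand-in for time- and topic-adjusted page scores
order, labels, weights = two_level_rank(pages, scores, k=2)
print(order, labels)
```

In this sketch the macro step decides only which cluster comes first; pages never interleave across clusters, and the micro score only breaks the order within a cluster.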
Source
Computer Engineering and Design (《计算机工程与设计》)
CSCD
Peking University Core Journals (北大核心)
2013, Issue 5, pp. 1695-1699 (5 pages)
Funding
Science and Technology Program Fund Project of the Chongqing Municipal Education Commission (KJ100821)