Abstract
In the era of big data, the volume of information on the web is growing explosively, and accurate web page classification helps users quickly locate the information they are interested in among massive numbers of pages. Web page classification plays a vital role in many applications and can be broadly divided into content-based and URL-based approaches. Because content-based classification falls short in some scenarios, this paper proposes classifying web pages using only their URLs. Drawing on the idea of the n-gram model, URL features are extracted with the character as the basic unit. Since the fields of a URL differ in how well they discriminate between page classes, some fields are discarded and the important path field is given a higher weight; the n-gram model is improved on this basis. Experimental results show that applying the improved n-gram model to URL classification improves both efficiency and accuracy: training time falls by 9.34% and the F1 score of the classification results rises by 12.63%.
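The feature extraction outlined in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which URL fields are discarded or what weight the path receives, so the choice to drop the scheme, query, and fragment, the default `n=3`, and the `path_weight=2.0` value are all hypothetical, as are the function names.

```python
from collections import Counter
from urllib.parse import urlsplit


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url, n=3, path_weight=2.0):
    """Build weighted character n-gram counts for a URL.

    Assumptions (not taken from the paper): the scheme, query, and
    fragment are dropped, host n-grams count 1.0 each, and path
    n-grams are scaled by `path_weight`.
    """
    parts = urlsplit(url.lower())
    features = Counter()
    # Host field: kept with unit weight (assumed).
    for gram in char_ngrams(parts.netloc, n):
        features[gram] += 1.0
    # Path field: emphasized with a higher weight (value assumed).
    for gram in char_ngrams(parts.path, n):
        features[gram] += path_weight
    return features
```

The resulting sparse counts could then be vectorized and passed to any standard classifier; dropping fields shrinks the feature space, which is one plausible reason for the shorter training time the abstract reports.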
Authors
LUO Cong (骆聪)
ZHOU Cheng (周城)
Jiangnan Institute of Computing Technology, Wuxi 214083, China
Source
Computer Technology and Development (《计算机技术与发展》)
2018, No. 9, pp. 38-41 (4 pages)
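Funding
National "He-Gao-Ji" (Core Electronic Devices, High-end Generic Chips and Basic Software) Major Science and Technology Project (2015ZX01040-201)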