Abstract
The Internet reflects human behavior in cyberspace and contains a large amount of information about individual people. Because this information is scattered across the network, extracting and classifying it has both research significance and practical value. This paper proposes a new model for automatically extracting person information from the Internet. It analyzes in detail the technical principles and implementation of four components: web information collection based on a web crawler, person feature extraction based on semantic analysis, person clustering based on the vector space model, and person information retrieval. Together, these allow person information on the Internet to be both extracted and analyzed.
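The abstract names clustering based on the vector space model but gives no algorithmic detail. As a rough illustration only, the following Python sketch represents each text snippet about a person as a term-frequency vector and groups snippets by cosine similarity using a simple single-pass scheme. The function names (`to_vector`, `cosine`, `cluster`), the threshold value, the single-pass strategy, and the sample snippets are all assumptions made for this example; they are not taken from the paper.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(snippets, threshold=0.3):
    """Single-pass clustering (an assumed, simplified scheme): put each
    snippet into the first cluster whose seed vector is similar enough,
    otherwise start a new cluster seeded by that snippet."""
    clusters = []  # list of (seed_vector, [snippet indices])
    for i, text in enumerate(snippets):
        vec = to_vector(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


# Invented sample data: two snippets each about two hypothetical people.
snippets = [
    "li wei professor computer science beijing university",
    "zhang min pianist concert shanghai orchestra",
    "professor li wei computer science lecture beijing",
    "shanghai orchestra concert pianist zhang min",
]
groups = cluster(snippets)
print(groups)
```

A production system would likely use weighted features (for example TF-IDF) and proper word segmentation for Chinese text; the sketch keeps raw term counts so that the vector space idea stays visible.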
Source
Information Security and Communications Privacy (《信息安全与通信保密》)
2013, No. 12, pp. 103-108 (6 pages)
Keywords
semantics
vector space model
clustering algorithm