Abstract
To make downloading navigation notices (NOTAMs) more efficient and to reduce manual effort, this article uses the Scrapy crawler framework, combined with text classification from natural language processing, to crawl the NOTAM texts published by various civil aviation administrations. First, websites are categorized by their web data interaction mode, and pages are downloaded with the help of the Selenium automated testing tool. Then a Naive Bayes algorithm classifies all links on a website as either target or non-target links, so that the links to NOTAM texts can be extracted. The classification model achieves an accuracy of 95.97% on domain-specific websites.
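The link-classification step can be illustrated with a minimal sketch. The abstract does not give the article's features, training data, or implementation, so everything below is an assumption: the tokenizer, the class name NaiveBayesLinkClassifier, the "target" and "non-target" labels as strings, and the tiny hand-written training set are all hypothetical. The sketch only shows how a multinomial Naive Bayes model with add-one smoothing could separate NOTAM links from other links based on anchor text and URL tokens.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesLinkClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Hypothetical helper for illustration; not the article's implementation.
    """

    def __init__(self):
        self.class_counts = Counter()             # training links per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocabulary = set()

    def fit(self, samples):
        """Count tokens per label from (text, label) pairs."""
        for text, label in samples:
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.token_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        """Return the label with the highest log-posterior score."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.token_counts[label].values()) + vocab_size
            for token in tokens:
                # Smoothed log likelihood of each token given the label.
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training links: anchor text followed by the URL path.
TRAINING_LINKS = [
    ("NOTAM list /notam/2020/list.html", "target"),
    ("Aeronautical information notice /aip/notam/detail", "target"),
    ("NOTAM summary download /notam/summary.pdf", "target"),
    ("About us /about/index.html", "non-target"),
    ("Contact /contact.html", "non-target"),
    ("News and events /news/2020/index.html", "non-target"),
]

classifier = NaiveBayesLinkClassifier().fit(TRAINING_LINKS)
```

In a crawler, a call such as classifier.predict("Current NOTAM /notam/current") would be applied to every extracted link, and only links labelled "target" would be followed and downloaded. A real model would need a far larger labelled link set than this toy example.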
Authors
邹维
李廷元
ZOU Wei; LI Tingyuan (School of Computer Science, Civil Aviation Flight University of China, Guanghan 618307, China)
Source
《现代信息科技》
2020, No. 21, pp. 6-9 (4 pages)
Modern Information Technology