Abstract
This paper discusses how to build a general-purpose corpus for Central Asian countries, with the analysis of online public opinion on the Belt and Road Initiative in those countries as the target application. First, the source media are sorted out and a crawler is used to collect news texts. Second, after preprocessing, each text is assigned a unique code, and a relational database is used for the structured organization and persistent storage of the corpus. Then the corpus content is classified by topic through a combination of manual and automatic annotation. Finally, information service modes for the corpus are studied in order to maximize its value. The corpus currently contains 150 million words and is still being updated, but it remains a raw corpus; further annotation will be needed for specific application fields. The corpus not only provides a basis for analyzing online public opinion on the Belt and Road Initiative in Central Asian countries, but can also be used for the study and teaching of, and scientific research on, the languages of Central Asian countries.
Authors
席耀一
王小明
云建飞
高鑫
XI Yaoyi; WANG Xiaoming; YUN Jianfei; GAO Xin (Information Engineering University, Zhengzhou 450001, China)
Source
《信息工程大学学报》
2020, No. 6, pp. 741-746, 751 (7 pages in total)
Journal of Information Engineering University
Funding
National Social Science Fund of China, Youth Project (19CXW027).
Keywords
corpus
Central Asian countries
Belt and Road Initiative