
Research of Construction of the Central Asian Languages General Corpus (中亚语种通用语料库构建研究)
Cited by: 1
Abstract: With the application goal of analyzing online public opinion on the 'Belt and Road' in Central Asian countries, this study discusses how to build a general corpus for those countries. First, the source media are sorted out and a crawler is used to collect the news corpus. Second, after preprocessing, each corpus item is assigned a unique code, and a relational database is used for structured organization and persistent storage. The corpus content is then classified by topic through a combination of manual and automatic annotation. Finally, the modes of corpus information service are studied in order to maximize the value of the corpus. At present the corpus contains 150 million words and is still being updated, but it remains a raw corpus, and annotation will need to be completed later according to the specific application field. The corpus built in this paper not only provides a dependable basis for analyzing 'Belt and Road' online public opinion in Central Asian countries, but can also be used for the study and learning of Central Asian languages, for teaching, and for scientific research.
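The abstract's unique-coding and relational-storage step can be illustrated with a minimal sketch. This is not the authors' implementation: the table layout, the hash-based code scheme, the whitespace normalization used as a stand-in for preprocessing, and the names `article_code`, `open_corpus`, and `store_article` are all assumptions made for illustration.

```python
import hashlib
import sqlite3


def article_code(language: str, url: str) -> str:
    """Derive a stable code from a language tag and source URL (assumed scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return f"{language}-{digest}"


def open_corpus(path: str = ":memory:") -> sqlite3.Connection:
    """Open a SQLite database and create the hypothetical article table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS article (
               code     TEXT PRIMARY KEY,  -- unique code per news item
               language TEXT NOT NULL,
               url      TEXT NOT NULL,
               topic    TEXT,              -- filled in by later annotation
               body     TEXT NOT NULL
           )"""
    )
    return conn


def store_article(conn, language, url, body, topic=None) -> bool:
    """Store one item; return False if its code is already present."""
    cleaned = " ".join(body.split())  # placeholder preprocessing: collapse whitespace
    cur = conn.execute(
        "INSERT OR IGNORE INTO article (code, language, url, topic, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (article_code(language, url), language, url, topic, cleaned),
    )
    conn.commit()
    return cur.rowcount == 1
```

Making the code the primary key means a page that is crawled twice is stored only once. The `topic` column can stay empty until annotation is done.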
Authors: XI Yaoyi (席耀一); WANG Xiaoming (王小明); YUN Jianfei (云建飞); GAO Xin (高鑫) (Information Engineering University, Zhengzhou 450001, China)
Affiliation: Information Engineering University
Source: Journal of Information Engineering University (《信息工程大学学报》), 2020, No. 6, pp. 741-746, 751 (7 pages)
Funding: National Social Science Fund of China, Youth Project (19CXW027).
Keywords: corpus; Central Asian countries; the Belt and Road Initiative