Abstract
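In corpus processing, Chinese is character-based, lacks morphological marking, and has words whose part of speech is indeterminate. These properties complicate corpus alignment, word segmentation, and retrieval, and call for effective solutions. This paper discusses how to handle these problems in the annotation, retrieval, and comparable study of diachronic corpora. It argues that a diachronic corpus retrieval platform can support searching language data by year and by decade, and that regular-expression search can compensate for limited precision in segmentation and tagging and so improve retrieval quality. In addition, effective use of header metadata provides the technical basis for multi-level comparable analysis of translated Chinese and original Chinese corpora, giving strong support to research on diachronic change in Chinese.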
The paper reports the work done towards compiling Diachronic English-Chinese Parallel Corpora and a Chinese Diachronic Reference Corpus, and looks into the design, sampling, data retrieval and practical use of the historical corpora. It argues that special attention should be paid to English-Chinese alignment, segmentation and query, because written Chinese, being character-based, lacks morphological change and abounds with POS conversion. The paper also finds that the search platform designed for the corpora supports data retrieval by year or period, that regex search can make up for imprecision in segmentation and tagging, and that metadata, when effectively used in queries, contributes much to the comparison of translated and original Chinese texts and provides strong support for the diachronic study of Mandarin Chinese.
Source
《外语教学与研究》
CSSCI
Peking University Core Journal (北大核心)
2012, Issue 6, pp. 822-834, 959 (13 pages in total)
Foreign Language Teaching and Research
Funding
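National Social Science Fund of China Major Bidding Project "Construction and Processing of Large-Scale English-Chinese Parallel Corpora" (10&ZD127)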
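Interim research output of the National Social Science Fund of China General Project "Interaction between Translation and Modern Chinese: A Study Based on Diachronic Comparable Corpora" (10BYY008), led by Qin Hongwu (秦洪武)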