Abstract
To address the limitation that weakly supervised text classification relies heavily on expert-provided seed words, a weakly supervised text classification method that generates seed words under class-name guidance is proposed. A Skip-Gram model is used to learn vector representations of words, and the von Mises-Fisher (vMF) distribution is used to model the relationship between the user-provided class names and the corpus. By jointly considering semantic relevance and semantic specificity, a set of high-quality seed words is generated without relying on expert experience. The seed words are used iteratively to generate pseudo-labels and a document classifier, and the seed-word set is then expanded to further improve model performance. Experimental results (F1-score) on two public datasets, NYT and 20 Newsgroups, demonstrate the effectiveness of the proposed method.
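The seed-word generation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's formulation. It relies only on the fact that a vMF density is proportional to exp(κ μᵀx), so for a fixed concentration κ, ranking unit word vectors by their vMF log-density around a class direction μ is the same as ranking by cosine similarity. The scoring rule (relevance to the target class minus the highest similarity to any other class, as a stand-in for specificity), the function name `rank_seed_words`, and the toy 2-D embeddings are all assumptions made for illustration. In practice the vectors would come from a trained Skip-Gram model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_seed_words(embeddings, class_names, top_k=2):
    """Pick top_k candidate seed words per class (hypothetical scoring rule).

    relevance   = cosine(word, class name); this ranks words exactly like a
                  vMF log-density with fixed kappa centred on the class name.
    specificity = penalty for being close to any *other* class name.
    """
    seeds = {}
    for target in class_names:
        scores = {}
        for word, vec in embeddings.items():
            if word in class_names:
                continue  # class names themselves are not candidates
            relevance = cosine(vec, embeddings[target])
            closest_other = max(
                (cosine(vec, embeddings[other])
                 for other in class_names if other != target),
                default=0.0,
            )
            scores[word] = relevance - closest_other
        seeds[target] = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return seeds


# Toy 2-D "embeddings" (assumed values, not learned from any corpus).
toy_embeddings = {
    "sports": [1.0, 0.0],
    "politics": [0.0, 1.0],
    "football": [0.9, 0.1],
    "team": [0.8, 0.3],
    "news": [0.7, 0.7],      # relevant to both classes, so not specific
    "election": [0.1, 0.9],
    "senate": [0.2, 0.8],
}

seeds = rank_seed_words(toy_embeddings, ["sports", "politics"])
print(seeds)
```

In this toy setup, "news" is equally close to both class names, so its score is zero and it is not selected for either class. This is the intuition behind combining relevance with specificity.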
Authors
ZHOU Yue-yao (周悦尧)
XI Xue-feng (奚雪峰)
CUI Zhi-ming (崔志明)
SHENG Sheng-li (盛胜利)
QIU Ya-jin (仇亚进)
Affiliations: School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215000, China; Suzhou Key Laboratory of Virtual Reality Intelligent Interaction and Application Technology, Suzhou Science and Technology Bureau, Suzhou 215000, China; Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou 215000, China; School of Computer Science, Texas Institute of Technology, Lubbock 79401, USA
Source
Computer Engineering and Design (《计算机工程与设计》)
Indexed in the Peking University Core Journals list (北大核心)
2023, Issue 8, pp. 2329-2336 (8 pages)
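Funding
National Natural Science Foundation of China (61876217, 62176175)
Jiangsu Province "Six Talent Peaks" High-Level Talent Fund Project (XYDXX-086)
Suzhou Science and Technology Plan Fund Project (SGC2021078)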
Keywords
weak supervision
text classification
word embedding
von Mises-Fisher (vMF) distribution
semantic relevance
semantic specificity
deep learning