Abstract
As a popular text mining technique, topic models are increasingly applied to patent analysis. However, patent abstracts, the text most commonly used as the corpus, are short and contain numerous technical terms (often consisting of multiple words) and many synonyms. As a result, the topics extracted by traditional topic models such as LDA are often obscure and refer to technologies ambiguously, which limits their further application. To address this, we propose a new topic model, Patent Classification LDA, which uses the patent classification taxonomy and the classification codes assigned to each patent to guide topic extraction. This improves the interpretability of the extracted topics and allows a patent's probability distribution over the classification taxonomy to be inferred. We then present a Gibbs sampling method for estimating the model's parameters. Finally, experiments on patents in the hard disk drive head field demonstrate the feasibility and effectiveness of Patent Classification LDA.
Source
《情报学报》
CSSCI
PKU Core Journal (北大核心)
2016, Issue 8, pp. 864-874 (11 pages)
Journal of the China Society for Scientific and Technical Information