Abstract
Mechanical patent documents contain a large amount of domain knowledge in which component names serve as the basic information units. Component names are worded flexibly and tend to be unique, complex, and rare, so computers struggle to recognize them accurately, which hinders patent knowledge mining. To recognize component names efficiently, this paper analyzes and extracts the word-formation features of component names in patent text. Starting from the external words associated with component names, the method marks the appended drawing reference signs (ADRS), identifies the name characters to their left, automatically retrieves candidate names from the text, and builds a set of candidate names. A character frequency difference algorithm is proposed to filter redundant characters from the candidate set, and an algorithm that dynamically builds a left-segmentation library (LSL) removes the redundant characters that the filter misses. Cross-over experiments test and analyze how the character frequency difference prior threshold (CFDV-Ⅰ), the word frequency threshold (LSWF), and the character frequency difference threshold (CFDV-Ⅱ) affect recognition results. Together these steps form a three-stage method for recognizing component names in Chinese patents in the mechanical field. A comparative analysis of the experimental results verifies that the method is effective and efficient.
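The first two stages described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual algorithms or thresholds, so everything here is an assumption made for illustration: the names `left_contexts`, `trim_redundant`, and `recognize`, the `MAX_LEFT` window, the `max_drop` parameter, and the rule that a reference sign is a run of digits. The trimming step is a simple suffix-frequency heuristic standing in for the character frequency difference algorithm, not a reproduction of it, and the left-segmentation library stage is not covered.

```python
import re
from collections import Counter, defaultdict

# Assumption: a reference sign is a run of digits 0-9, and a name uses
# only characters in the basic CJK block (U+4E00 to U+9FFF).
MAX_LEFT = 6  # hypothetical cap on how many characters to read leftwards
SIGN_WITH_LEFT = re.compile(r"([\u4e00-\u9fff]{1,%d})([0-9]+)" % MAX_LEFT)


def left_contexts(text):
    """Group the characters found left of each reference sign by sign."""
    groups = defaultdict(list)
    for match in SIGN_WITH_LEFT.finditer(text):
        groups[match.group(2)].append(match.group(1))
    return groups


def trim_redundant(contexts, max_drop=0):
    """Grow a shared suffix leftwards until its frequency drops too far.

    Stand-in heuristic: a character is kept while adding it to the suffix
    loses at most `max_drop` supporting contexts.
    """
    best, best_count = "", len(contexts)
    longest = max((len(c) for c in contexts), default=0)
    for k in range(1, longest + 1):
        counts = Counter(
            c[-k:] for c in contexts if len(c) >= k and c.endswith(best)
        )
        if not counts:
            break
        suffix, count = counts.most_common(1)[0]
        if best_count - count > max_drop:
            break  # frequency fell sharply: the new character is redundant
        best, best_count = suffix, count
    return best


def recognize(text):
    """Map each reference sign to its trimmed candidate name."""
    return {sign: trim_redundant(ctx) for sign, ctx in left_contexts(text).items()}


sample = "所述连接杆3安装在固定座12上,连接杆3的一端与固定座12铰接,转动连接杆3即可。"
print(recognize(sample))  # {'3': '连接杆', '12': '固定座'}
```

In the sample, the candidates for sign 3 are "所述连接杆", "连接杆", and "转动连接杆". The suffix "连接杆" is supported by all three, while extending it one more character leaves only one supporter, so the prefix characters are dropped. A sign that appears only once gives the heuristic nothing to compare, and its whole left context is returned unchanged.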
Authors
KONG Jiabin (孔嘉斌)
LYU Jianwen (吕剑文)
LIU Jiangnan (刘江南)
DU Wenxuan (杜文轩)
Affiliation: State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Changsha 410082, China
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2023, Issue 7, pp. 229-236 (8 pages)
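Funding
Innovation Method Special Project of the Ministry of Science and Technology of China (2019IM050100)
Natural Science Foundation of Hunan Province (2018JJ2039)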
Keywords
Patent text
Redundant characters
Appended drawing reference signs
Character frequency difference
Left-segmentation words