A LDA-based approach to keyphrase extraction
Original Chinese title: 一种基于LDA模型的关键词抽取方法

Zhu Zede (朱泽德) 1; Li Miao (李淼) 2 *; Zhang Jian (张健) 2; Zeng Weihui (曾伟辉) 2; Zeng Xinhua (曾新华) 2
Abstract Existing methods do not jointly consider how fully the extracted keyphrases cover a document's topics, how readable the keyphrases are, and how much they differ from one another. To address this, a new keyphrase extraction algorithm, TFITF, based on the latent topics of a document is proposed. The algorithm builds a latent topic model from a large-scale corpus, computes the TFITF weight of each word with respect to each topic, and from these derives the weight of each word with respect to the document. It then uses co-occurrence information to rank and select adjacent words, forming candidate keyphrases, and uses topic similarity to eliminate redundant phrases whose latent topics coincide. Comparative keyphrase extraction tests were carried out against approaches based on document statistics, lexical chains, and topic analysis, on a test set of 1,040 Chinese abstracts with 5,408 reference keyphrases. The results show that the algorithm effectively improves both the precision and the recall of keyphrase extraction.
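The abstract names the steps of the method but does not state the TFITF formula, so the following Python sketch is only an illustration of the described pipeline, not the authors' implementation. Everything concrete in it is an assumption made for the example: the weight is taken as P(word | topic) times a log-scaled inverse topic frequency (by analogy with TF-IDF), the document weight mixes the per-topic weights by the document's topic proportions, and redundancy is tested with cosine similarity over per-topic weight vectors. The function names, the `min_prob` and `max_sim` thresholds, and the toy distributions are all hypothetical.

```python
import math


def tfitf_like(topic_word, min_prob=0.01):
    """Per-topic word weights from topic_word[k][w] = P(w | topic k).

    ASSUMPTION: weight = P(w | k) * log(1 + K / n_w), where n_w is the
    number of topics in which w has probability >= min_prob.
    """
    num_topics = len(topic_word)
    vocab = {w for dist in topic_word for w in dist}
    weights = [dict() for _ in range(num_topics)]
    for w in vocab:
        n_w = sum(1 for dist in topic_word if dist.get(w, 0.0) >= min_prob)
        itf = math.log(1.0 + num_topics / max(n_w, 1))
        for k, dist in enumerate(topic_word):
            weights[k][w] = dist.get(w, 0.0) * itf
    return weights


def document_weights(doc_topics, weights):
    """Mix per-topic weights by the document's topic proportions P(k | d)."""
    vocab = set().union(*weights)
    return {
        w: sum(p * weights[k].get(w, 0.0) for k, p in enumerate(doc_topics))
        for w in vocab
    }


def topic_vector(phrase, weights):
    """One entry per topic: summed weight of the phrase's words in that topic."""
    return [sum(wk.get(w, 0.0) for w in phrase) for wk in weights]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_keyphrases(candidates, doc_w, weights, top_n=2, max_sim=0.9):
    """Rank candidates by summed document weight, skipping any phrase whose
    topic vector is too similar to that of an already kept phrase."""
    ranked = sorted(
        candidates,
        key=lambda p: sum(doc_w.get(w, 0.0) for w in p),
        reverse=True,
    )
    kept = []
    for phrase in ranked:
        vec = topic_vector(phrase, weights)
        if all(cosine(vec, topic_vector(q, weights)) < max_sim for q in kept):
            kept.append(phrase)
        if len(kept) == top_n:
            break
    return kept


# Toy data (hypothetical): two topics and a document that is 70% topic 0.
topic_word = [
    {"topic": 0.4, "model": 0.4, "the": 0.2},
    {"keyword": 0.5, "extraction": 0.3, "the": 0.2},
]
weights = tfitf_like(topic_word)
doc_w = document_weights([0.7, 0.3], weights)
candidates = [("topic", "model"), ("model",), ("keyword", "extraction"), ("the",)]
kept = select_keyphrases(candidates, doc_w, weights)
print(kept)
```

In this toy run the single-word candidate `("model",)` ranks second by weight but is dropped because its topic vector points the same way as the already kept `("topic", "model")`, which is the kind of topical redundancy the abstract says the method removes. In practice the topic-word and document-topic distributions would come from an LDA model trained on a large corpus rather than from hand-written dictionaries.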
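Source Journal of Central South University (Science and Technology) (中南大学学报. 自然科学版), 2015, 46(6): 2142-2148 [Core collection]
DOI 10.11817/j.issn.1672-7207.2015.06.023,10.11817/j.issn.1672-7207.2015.06
Keywords information extraction; keyword extraction; LDA model; topic similarity
Affiliations

1. Department of Automation, University of Science and Technology of China, Hefei, Anhui 230026

2. Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, Anhui 230031

Language Chinese
Document type Research article
ISSN 1672-7207
Subject Automation technology, computer technology
Funding Open Fund of the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences; Informatization Special Project of the Chinese Academy of Sciences; National Natural Science Foundation of China
Accession number CSCD:5501277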

