Study on Textual Topic Identification by Clustering Clique Structure in Multi-Relationship Text Graph

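Abstract Now that the web has become the primary channel for scholarly communication and information dissemination, more and more institutions publish their research output electronically, and these electronic texts carry rich semantic information. Faced with this volume of material, researchers struggle to grasp the main content of the texts quickly. How to present the core topics of a text collection efficiently and accurately, help researchers focus on the important associated information in it, and so improve research efficiency has long been an important problem in text mining. Drawing on existing work and on the characteristics of terms and the relations between them, the paper proposes an overall approach: represent the terms in a text and the co-occurrence, syntactic, and semantic relations among them as a graph, identify the tightly connected cliques in that text relationship graph, and cluster those cliques to reveal the sub-topics of the text. The work has two parts: (1) overlaying and merging the terms in the text collection and the attributes of the various relations among them to build a multi-relationship text overlay model; (2) clustering cliques by their similarity distance and semantic labels to identify the important sub-topics in the collection. The paper builds a text collection from roughly the last five years of literature on the topic "migraine disorders" and runs two validation experiments. Experiment 1 compares the results with the indexing terms assigned by domain experts, grouped by semantic type, and finds the proposed method consistent with that grouping. Experiment 2 compares the results with the widely used LDA method and finds improvements in both precision and recall. Both experiments support the effectiveness of the method.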
Published English abstract The Internet has become the most important channel for scientific communication and information dissemination. An increasing number of institutes present their research findings in electronic form, and these electronic texts contain rich semantic information. However, it is difficult for researchers to capture the core content quickly when faced with so many electronic texts. Helping researchers obtain the core topics and important associated information in these texts, quickly and accurately, is a pressing issue in text mining. Drawing on state-of-the-art technologies and algorithms and on the characteristics of terms and their relations, we propose a new method for topic identification, based on k-clique clustering, to identify text sub-themes. First, we merge the attributes of terms and their relationships according to rules to construct a multi-relationship overlay model. Second, we cluster semantic k-cliques based on the similarity distance and semantic content of each k-clique to identify the text sub-themes. We then used a dataset on the migraine disorders topic covering nearly five years to assess the effectiveness of the proposed method. Comparing the proposed method with the Latent Dirichlet Allocation (LDA) method, and with a grouping of indexing terms by semantic type given by professionals in the Medline database, we found that the proposed method was closer to the grouping based on semantic types and achieved better precision and recall than LDA.
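The two steps named in the abstract (overlaying several relation types into one graph, then clustering its cliques) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the record does not give the merging rules, the similarity distance, or the semantic labels, so the sketch assumes Jaccard overlap between cliques and a simple single-linkage grouping as stand-ins. The function names, the threshold, and the toy terms are all hypothetical.

```python
from itertools import combinations


def overlay_relations(layers):
    """Merge relation layers into one undirected graph.

    `layers` maps a relation name to a list of (term_a, term_b) pairs.
    Returns (adjacency, labels): adjacency maps term -> set of neighbours,
    labels maps frozenset({a, b}) -> set of relation names on that edge.
    """
    adjacency, labels = {}, {}
    for relation, pairs in layers.items():
        for a, b in pairs:
            if a == b:
                continue  # ignore self-loops
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
            labels.setdefault(frozenset((a, b)), set()).add(relation)
    return adjacency, labels


def maximal_cliques(adjacency, min_size=3):
    """Enumerate maximal cliques with Bron-Kerbosch (pivoting variant)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            if len(r) >= min_size:
                found.append(frozenset(r))
            return
        pivot = max(p | x, key=lambda v: len(adjacency[v] & p))
        for v in list(p - adjacency[pivot]):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return found


def jaccard(a, b):
    """Share of terms two cliques have in common (assumed similarity)."""
    return len(a & b) / len(a | b)


def cluster_cliques(cliques, threshold=0.5):
    """Group cliques whose Jaccard overlap reaches `threshold` (single linkage)."""
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if jaccard(cliques[i], cliques[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(group) for group in groups.values())


# Toy data, invented for illustration only (not taken from the paper).
layers = {
    "cooccurrence": [("migraine", "aura"), ("migraine", "headache"),
                     ("aura", "headache"), ("triptan", "serotonin")],
    "syntactic": [("migraine", "triptan"), ("triptan", "treatment"),
                  ("aura", "photophobia")],
    "semantic": [("migraine", "aura"), ("migraine", "treatment"),
                 ("migraine", "photophobia")],
}

adjacency, labels = overlay_relations(layers)
cliques = maximal_cliques(adjacency)
topics = cluster_cliques(cliques)
print(topics)
```

Each resulting group is the union of the terms in its member cliques and plays the role of one candidate sub-topic. A different distance or a hierarchical clustering step could be swapped in for `cluster_cliques` without changing the other two functions.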
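Source Journal of the China Society for Scientific and Technical Information (情报学报), 2017, 36(5): 433-442 [Extended database]
Keywords clique; multi-relationship text; text topic identification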
Address

National Science Library, Chinese Academy of Sciences (中国科学院文献情报中心), Beijing, 100190

Language Chinese
Document type Research article
ISSN 1000-0135
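Subject Social sciences, general
Funding Young Talent Frontier Project of the National Science Library, Chinese Academy of Sciences
Accession number CSCD:6009617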
