基于均值漂移算法的文本聚类数目优化研究
Determining Best Text Clustering Number with Mean Shift Algorithm

Zhao Huaming (赵华茗)*, Yu Li (余丽), Zhou Qiang (周强)

Abstract	[Objective] This paper explores a method for determining the optimal number of text clusters, as a reference for improving the effectiveness and quality of text clustering algorithms. [Methods] We combined the TF-IDF and Word2Vec algorithms to extract the Top N keyword vectors as the feature representation of the texts in the corpus. We then determined the optimal number of clusters by combining the mean shift algorithm with a cluster validity index (Silhouette) and the mean squared error (MSE). [Results] A set of the top 4,500 keyword vectors represented the text features well. The optimal number of clusters determined with the mean shift algorithm matched the number obtained through manual expert tuning. [Limitations] The experimental data sets are not sufficiently large, and the method has not been compared against applications in other domains. [Conclusions] The proposed method can determine the number of text clusters with high quality in an unsupervised way.
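The idea in the abstract of letting mean shift propose the number of clusters and checking it with the Silhouette index can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D synthetic points stand in for the keyword-vector document features, the flat kernel, the candidate bandwidths, and the helper names (`mean_shift_modes`, `assign`, `silhouette`) are all assumptions made for illustration, and the MSE criterion is not covered.

```python
import math
import random

def mean_shift_modes(points, bandwidth, max_iter=100, tol=1e-4):
    """Flat-kernel mean shift: move each point to the mean of its neighbours
    until it stops moving, then merge converged positions into modes."""
    modes = []
    for start in points:
        x = list(start)
        for _ in range(max_iter):
            near = [q for q in points if math.dist(x, q) <= bandwidth]
            if not near:  # defensive guard; should not happen in practice
                break
            shifted = [sum(axis) / len(near) for axis in zip(*near)]
            step = math.dist(x, shifted)
            x = shifted
            if step < tol:
                break
        # Keep the converged position only if no existing mode is within one bandwidth.
        if all(math.dist(x, m) >= bandwidth for m in modes):
            modes.append(x)
    return modes

def assign(points, modes):
    """Label each point with the index of its nearest mode."""
    return [min(range(len(modes)), key=lambda i: math.dist(p, modes[i]))
            for p in points]

def silhouette(points, labels):
    """Mean silhouette coefficient; returns 0.0 when there is a single cluster."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute a score of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != label)
        total += (b - a) / max(a, b)
    return total / len(points)

# Synthetic stand-in for document vectors: three well-separated blobs (assumption).
random.seed(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
          for cx, cy in centers for _ in range(30)]

# Try several hypothetical bandwidths and keep the clustering with the best silhouette.
results = []
for bandwidth in (0.5, 1.5, 3.0, 10.0):
    labels = assign(points, mean_shift_modes(points, bandwidth))
    results.append((silhouette(points, labels), len(set(labels)), bandwidth))
best_score, best_k, best_bandwidth = max(results)
print(best_k, round(best_score, 2))
```

In practice one would more likely use library implementations such as scikit-learn's `MeanShift` and `silhouette_score`; the pure-Python version here only keeps the sketch self-contained.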
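Source	数据分析与知识发现 (Data Analysis and Knowledge Discovery), 2019, 3(9): 27-35 [Extended database]
DOI	10.11925/infotech.2096-3467.2018.1259
Keywords	mean shift; text clustering; number of clusters; clustering validity
Address	National Science Library, Chinese Academy of Sciences, Beijing, 100190
Language	Chinese
Document type	Research article
ISSN	2096-3467
Subject	Social sciences, general
Funding	National Social Science Fund of China; Special Project for Library and Information Capacity Building of the National Science Library, Chinese Academy of Sciences
Accession number	CSCD:6698728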

References: 34 in total

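Citing documents: 3

1. Dong Fei (董霏). Research on cable conductor quality inspection based on machine vision. 电子测量与仪器学报 (Journal of Electronic Measurement and Instrumentation), 2020, 34(6): 144-153
Cited 5 times

2. Zhao Danning (赵丹宁). Research on automatic extraction of structural elements from abstracts of scientific and technical literature based on deep learning. 数据分析与知识发现 (Data Analysis and Knowledge Discovery), 2021, 5(7): 70-80
Cited 4 times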
