
基于文本密度模型的Web正文抽取
Web Content Extraction Based on Text Density Model


Authors: Zhu Zede (朱泽德) 1; Li Miao (李淼) 2 *; Zhang Jian (张健) 2; Chen Lei (陈雷) 2; Zeng Xinhua (曾新华) 2
Abstract: Extracting the main content of a page from the large amount of irrelevant material around it is an indispensable step in Web data applications. This paper proposes a method for extracting the main content of news web pages based on a text density model. A statistical model that combines page-structure features with language features converts the page, text line by text line, into a sequence of positive and negative density values. The sequence is then corrected with Gaussian smoothing, which takes the content continuity of adjacent lines into account. Finally, an improved maximum subsequence segmentation splits the sequence and extracts the main content. The method preserves the integrity of the content and excludes noise, and it requires neither human intervention nor repeated training. The experimental results indicate that extraction based on text density adapts well to different data sources, and that its accuracy and recall are better than those of existing statistical models.
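The three stages named in the abstract (per-line density, Gaussian smoothing, maximum subsequence segmentation) can be illustrated with a minimal Python sketch. This record does not give the paper's density model or its improvements to the segmentation step, so everything specific here is an assumption made for illustration: the scoring rule in `line_scores` (visible characters minus a fixed `threshold`), the kernel parameters `sigma` and `radius`, and the use of the plain Kadane maximum-subarray algorithm in place of the authors' improved variant.

```python
import math
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude tag stripper, adequate for a sketch


def line_scores(lines, threshold=20):
    """Hypothetical density: visible characters per line minus a threshold.

    Text-heavy lines score positive, markup-heavy lines score negative.
    This is NOT the paper's model, which also uses language features.
    """
    return [float(len(TAG_RE.sub("", line).strip()) - threshold) for line in lines]


def gaussian_smooth(values, sigma=1.0, radius=2):
    """Smooth a sequence with a normalised Gaussian kernel (edges are clamped)."""
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    n = len(values)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, kernel):
            j = min(max(i + k, 0), n - 1)  # clamp index at the boundaries
            acc += w * values[j]
        smoothed.append(acc)
    return smoothed


def max_subsequence(values):
    """Kadane's algorithm: (start, end) of the maximum-sum run, end exclusive."""
    best_sum, best = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, v in enumerate(values):
        if cur_sum <= 0:
            cur_sum, cur_start = v, i  # restart the run at this line
        else:
            cur_sum += v
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i + 1)
    return best


def extract_content(html_lines):
    """Score lines, smooth the scores, and keep the best contiguous run."""
    scores = gaussian_smooth(line_scores(html_lines))
    start, end = max_subsequence(scores)
    return [TAG_RE.sub("", line).strip() for line in html_lines[start:end]]
```

Smoothing is what lets a short or empty line that sits between two dense paragraphs stay inside the extracted run instead of splitting it, which matches the abstract's point about the continuity of adjacent lines.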
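Source: Pattern Recognition and Artificial Intelligence (模式识别与人工智能), 2013, 26(7): 667-672 [Core Collection]
Keywords: Web mining; content extraction; text density; Gaussian smoothing; maximum subsequence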
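Addresses:

1. Department of Automation, University of Science and Technology of China, Hefei 230026

2. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei 230031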

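Language: Chinese
Document type: Research article
ISSN: 1003-6059
Subject: Automation technology, computer technology
Funding: National Natural Science Foundation of China; National Key Technology R&D Program of China
Accession number: CSCD:4901889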

References (18)

1.  Gibson D. The Volume and Evolution of Web Page Templates. Proc of the 14th International Conference on World Wide Web, 2005: 830-839. CSCD citations: 2
2.  Chen Yu. Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices. Proc of the 12th International Conference on World Wide Web, 2003: 225-233. CSCD citations: 1
3.  Yu Shipeng. Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation. Proc of the 12th International Conference on World Wide Web, 2003: 11-18. CSCD citations: 1
4.  Uszkoreit J. Large Scale Parallel Document Mining for Machine Translation. Proc of the 23rd International Conference on Computational Linguistics, 2010: 1101-1109. CSCD citations: 2
5.  Adelberg B. NoDoSE: A Tool for Semi-Automatically Extracting Structured and Semistructured Data from Text Documents. Proc of the ACM SIGMOD International Conference on Management of Data, 1998: 283-294. CSCD citations: 1
6.  Kang D K. MetaNews: An Information Agent for Gathering News Articles on the Web. Proc of the 14th International Symposium on Methodologies for Intelligent Systems, 2003: 179-186. CSCD citations: 1
7.  Yang Shaohua. Automatic Data Extraction from Template-Generated Web Pages. Journal of Software, 2008, 19(2): 209-223. CSCD citations: 6
8.  Kohlschütter C. Boilerplate Detection Using Shallow Text Features. Proc of the 3rd ACM International Conference on Web Search and Data Mining, 2010: 441-450. CSCD citations: 1
9.  Song Ruihua. Learning Important Models for Webpage Blocks Based on Layout and Content Analysis. ACM SIGKDD Explorations Newsletter, 2004, 6(2): 14-23. CSCD citations: 2
10.  Gibson J. Adaptive Web-page Content Identification. Proc of the 9th ACM International Workshop on Web Information and Data Management, 2007: 105-112. CSCD citations: 1
11.  Ziegler C N. Content Extraction from News Pages Using Particle Swarm Optimization on Linguistic and Structural Features. Proc of the IEEE/WIC/ACM International Conference on Web Intelligence, 2007: 242-249. CSCD citations: 1
12.  Pasternack J. Extracting Article Text from the Web with Maximum Subsequence Segmentation. Proc of the 18th International Conference on World Wide Web, 2009: 971-980. CSCD citations: 3
13.  Finn A. Fact or Fiction: Content Classification for Digital Libraries. Proc of the 2nd DELOS Network of Excellence Workshop on Personalization and Recommender Systems in Digital Libraries, 2001: 1-6. CSCD citations: 1
14.  Pinto D. QuASM: A System for Question Answering Using Semi-Structured Data. Proc of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, 2002: 46-55. CSCD citations: 1
15.  Mantratzis C. Separating XHTML Content from Navigation Clutter Using DOM-Structure Block Analysis. Proc of the 16th ACM Conference on Hypertext and Hypermedia, 2005: 145-147. CSCD citations: 1
16.  Debnath S. Automatic Extraction of Informative Blocks from Webpages. Proc of the ACM Symposium on Applied Computing, 2005: 1722-1726. CSCD citations: 1
17.  Gottron T. Content Code Blurring: A New Approach to Content Extraction. Proc of the 19th International Conference on Database and Expert Systems Applications, 2008: 29-33. CSCD citations: 1
18.  Weninger T. CETR: Content Extraction via Tag Ratios. Proc of the 19th International Conference on World Wide Web, 2010: 971-980. CSCD citations: 4
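Citing documents (4; the first 2 are listed)

1.  Qin Chenglei (秦成磊). A Statistics-Based Method for Extracting Main Text from Complex Web Pages. Computer Applications and Software, 2015, 32(7): 90-92, 147. CSCD citations: 1

2.  Cheng Yusheng (程玉胜). Text Similarity Research Combining Keyword Micro-Variation and the LD Algorithm. Computer Engineering and Applications, 2016, 52(8): 70-73, 124. CSCD citations: 0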
