International Journal of Scientific & Technology Research

IJSTR, Volume 2, Issue 6, June 2013 Edition

Website: http://www.ijstr.org

ISSN 2277-8616

Informative Content Extraction by Using EIFCE [Effective Informative Content Extractor]


Chaw Su Win, Mie Mie Su Thwin



Index Terms: Informative Content Extraction, Main Content Extraction, Web Page Segmentation



Abstract: Web pages contain many items that are not “informative content,” e.g., search and filtering panels, navigation links, and advertisements. Most clients and end users are looking for the informative content and have little interest in the rest, so the need for Informative Content Extraction from web pages is evident. Web Informative Content Extraction requires two steps: Web Page Segmentation and Informative Content Extraction. DOM-based segmentation approaches often cannot provide satisfactory results, and vision-based segmentation approaches also have drawbacks. This paper therefore proposes the Effective Visual Block Extractor (EVBE) algorithm to overcome the problems of DOM-based approaches and reduce the drawbacks of previous work in Web Page Segmentation. It also proposes the Effective Informative Content Extractor (EIFCE) algorithm to reduce the drawbacks of previous work in Web Informative Content Extraction. Web page indexing systems, web page classification and clustering systems, and web information extraction systems can achieve significant savings and satisfactory results by applying the proposed algorithms.


