A Survey On Various Web Template Detection And Extraction Methods
Neethu Mary Varghese, Tenny Thomas Soman
Index Terms: Cluster, Homogeneous web page, Heterogeneous web page, Page-level detection, Search engine, Site-level detection, Template detection, Template extraction.
Abstract: In today’s digital world, the World Wide Web is relied on extensively as a source of information. Users increasingly depend on web-based search engines to provide accurate results on a wide range of topics that interest them. The search engines, in turn, parse the vast repository of web pages in search of relevant information. However, the majority of web portals are built using web templates, which give end users a consistent look and feel. The presence of these templates can influence search results, leading to inaccurate results being delivered to users. Therefore, to improve the accuracy and reliability of search results, it is essential to identify web templates and separate them from the actual content. A wide range of approaches is employed to achieve this, and this paper surveys the approaches to template detection and extraction that can be applied to homogeneous as well as heterogeneous web pages.