Abstract
Most documents on the Web are written in HTML and constitute a huge amount of legacy data: they are formatted for visual purposes only, and in different styles owing to diverse authorship and goals, which makes the retrieval and integration of Web content difficult to automate. We contribute to the solution of this problem by proposing a structured approach to data reverse engineering of data-intensive Web sites. We focus on data content and on the way such content is structured on the Web. We use a Web data model to describe abstract structural features of HTML pages and propose a method for segmenting HTML documents into special blocks that group semantically related Web objects. We have developed a tool based on this method that supports the identification of the structure, function, and meaning of data organized in Web object blocks. Using this tool, we demonstrate the feasibility and effectiveness of our approach on a set of real Web sites.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
De Virgilio, R., Torlone, R. (2009). A Structured Approach to Data Reverse Engineering of Web Applications. In: Gaedke, M., Grossniklaus, M., Díaz, O. (eds) Web Engineering. ICWE 2009. Lecture Notes in Computer Science, vol 5648. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02818-2_7
DOI: https://doi.org/10.1007/978-3-642-02818-2_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02817-5
Online ISBN: 978-3-642-02818-2
eBook Packages: Computer Science (R0)