Abstract
In this chapter we develop a representation model of web document networks. Based on the notion of uncertain web document structures, the model is defined as a template that captures nested manifestation levels of hypertext types. We then specify the model at the conceptual, formal and physical levels and exemplify it by reconstructing competing web document models.
Acknowledgements
Financial support of the Deutsche Forschungsgemeinschaft (DFG) via the project Induction of Web Genre Document Grammars of the Research Group 437 Text Technological Information Modeling and via the Project KnowCIT of the Excellence Cluster 277 Cognitive Interaction Technology is gratefully acknowledged.
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Mehler, A. (2010). Structure Formation in the Web. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_12
DOI: https://doi.org/10.1007/978-90-481-3331-4_12
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science (R0)