Structure Formation in the Web

Mehler, Alexander

doi:10.1007/978-90-481-3331-4_12

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

In this chapter we develop a representation model of web document networks. Based on the notion of uncertain web document structures, the model is defined as a template that captures nested manifestation levels of hypertext types. We further specify the model on the conceptual, formal and physical levels and exemplify it by reconstructing competing web document models.

Acknowledgements

Financial support of the Deutsche Forschungsgemeinschaft (DFG) via the project Induction of Web Genre Document Grammars of the Research Group 437 Text Technological Information Modeling and via the Project KnowCIT of the Excellence Cluster 277 Cognitive Interaction Technology is gratefully acknowledged.

Author information

Authors and Affiliations

Bielefeld University, Bielefeld, Germany
Alexander Mehler

Corresponding author

Correspondence to Alexander Mehler.

Editor information

Editors and Affiliations

Institut für Deutsche Sprache (IDS), Mannheim, 68161, Germany
Andreas Witt
Fak. Linguistik und, Universität Bielefeld, Universitätsstraße, Bielefeld, 33615, Germany
Dieter Metzing

About this chapter

Cite this chapter

Mehler, A. (2010). Structure Formation in the Web. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_12

DOI: https://doi.org/10.1007/978-90-481-3331-4_12
Published: 09 November 2009
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science, Computer Science (R0)
