Networks Generated from Natural Language Text

Part of the Modeling and Simulation in Science, Engineering and Technology book series (MSSET)

The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.

In this chapter, we examine power-law distributions and small world graphs (SWGs) originating from natural language data. There are several reasons for the special interest in these structures.
  1.

    Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes and use various rules to introduce edges between words. In many cases, this results in SWGs, which again often have a power-law distribution for their node degrees.

  2.

    SWGs appear in many other real-world datasets, such as social networks of many kinds, the link structure of the World Wide Web, and traffic networks. It is interesting to analyze all these networks in more detail to identify their similarities and differences.

  3.

    From an application-driven view, SWGs admit effective clustering strategies that run in nearly linear time. Because the clusters are related to the growth process of the underlying graph, they tend to be meaningful; in the case of natural language, they usually reflect semantic and/or syntactic structures.


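The rank-frequency power laws mentioned in point 1 can be illustrated with a short sketch (plain Python, no external libraries; the "monkey text" setup follows Miller's 1957 intermittent-silence argument, and the alphabet size and text length are arbitrary choices for illustration). A power law f(r) ∝ r^(−α) appears as a straight line with slope −α in a log-log plot of frequency against rank:

```python
import collections
import math
import random

# "Monkey text": characters typed uniformly at random, with the space
# acting as a word delimiter (Miller 1957). Even this random process
# yields a Zipf-like power law in its rank-frequency statistics.
random.seed(0)
alphabet = "abcd "                       # four letters plus a word-ending space
text = "".join(random.choice(alphabet) for _ in range(200_000))
words = [w for w in text.split(" ") if w]

# Word frequencies, ranked in descending order.
freqs = sorted(collections.Counter(words).values(), reverse=True)

# Least-squares slope of log(frequency) against log(rank):
# for f(r) ~ r**(-alpha), this line has slope -alpha.
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
print(f"fitted power-law exponent alpha = {-slope:.2f}")
```

For a four-letter alphabet the theoretical exponent is log 5 / log 4 ≈ 1.16; the crude least-squares fit over all ranks only approximates this value, since the true rank-frequency curve of random text is stepwise rather than a clean line.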

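One family of nearly linear clustering strategies alluded to in point 3 is randomized label propagation, e.g. in the spirit of Chinese-Whispers-style graph clustering. The sketch below is a minimal, hypothetical variant for illustration, not the chapter's own implementation; the function name, parameters, and toy word graph are all invented here. Each pass touches every edge a constant number of times, so it costs O(|E|):

```python
import random

def label_propagation(edges, iterations=20, seed=0):
    """Minimal sketch of randomized label propagation: each node
    repeatedly adopts the label that is most frequent among its
    neighbours, so densely connected regions converge to a shared
    label. One pass over all nodes runs in O(|E|)."""
    rng = random.Random(seed)
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    labels = {node: node for node in adj}   # every node starts in its own cluster
    nodes = sorted(adj)                     # sorted for reproducibility
    for _ in range(iterations):
        rng.shuffle(nodes)                  # visit nodes in random order
        for node in nodes:
            counts = {}
            for neighbour in adj[node]:
                lab = labels[neighbour]
                counts[lab] = counts.get(lab, 0) + 1
            # adopt the most frequent neighbour label (ties broken by name)
            labels[node] = max(sorted(counts), key=counts.get)
    return labels

# Toy word graph: two dense triangles joined by a single bridge edge.
edges = [("cat", "dog"), ("dog", "mouse"), ("cat", "mouse"),
         ("car", "bus"), ("bus", "train"), ("car", "train"),
         ("mouse", "car")]                  # weak bridge between the groups
clusters = label_propagation(edges)
print(clusters)
```

Because dense regions agree on a label quickly while single bridge edges rarely carry a majority, such a procedure tends to separate the two triangles into distinct clusters, which matches the intuition that clusters reflect the growth process of the graph.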
Keywords: Degree Distribution · Language Data · Small World Graph · Graph Generation Model · Structure Discovery Process



Copyright information

© Birkhäuser Boston, a part of Springer Science+Business Media, LLC 2009

Authors and Affiliations

Institute for Computer Science, NLP Department, University of Leipzig, Leipzig, Germany
