Networks Generated from Natural Language Text
The study of large-scale characteristics of graphs that arise in natural language processing is an essential step in finding structural regularities. Structure discovery processes have to be designed with an awareness of these properties. Examining and contrasting the effects of processes that generate graph structures similar to those observed in language data sheds light on the structure of language and its evolution.
Power laws appear in many rank-frequency statistics. Furthermore, we can construct graphs with words as nodes, using various rules to introduce edges between words. In many cases this yields small-world graphs (SWGs), whose node degrees again often follow a power-law distribution.
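To illustrate one such construction, the sketch below builds a word graph from a toy corpus, linking two words whenever they co-occur in the same sentence. Both the corpus and the edge rule are hypothetical stand-ins for the many possible rules mentioned above; the point is only that frequent words (here the determiner "the") become high-degree hubs, the seed of a skewed degree distribution.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(sentences):
    """Undirected word graph: nodes are words, edges connect words
    that co-occur in at least one sentence (one possible edge rule)."""
    adj = defaultdict(set)
    for sent in sentences:
        # use a set of tokens so repeated words create no self-loops
        for u, v in combinations(set(sent.lower().split()), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

# hypothetical toy corpus
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog barked at the mailman",
]
graph = cooccurrence_graph(corpus)
degrees = {w: len(nbrs) for w, nbrs in graph.items()}
# "the" co-occurs with every other word and ends up the hub
```

On real corpora the same construction produces a heavy-tailed degree distribution: a few hub words connect to thousands of others, while most words have only a handful of neighbours.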
SWGs appear in many other kinds of real-world data, such as social networks of many kinds, the link structure of the World Wide Web, and traffic networks. It is instructive to analyze all of these networks in more detail to identify similarities and differences.
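Two quantities commonly used when comparing such networks are the local clustering coefficient (high in small-world graphs) and the average shortest-path length (short in small-world graphs). A minimal sketch of both measures, on a hypothetical toy adjacency structure:

```python
from collections import deque

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
    return 2.0 * links / (k * (k - 1))

def avg_shortest_path(adj, source):
    """Mean BFS distance from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for n, d in dist.items() if n != source]
    return sum(others) / len(others) if others else 0.0

# hypothetical toy graph: a hub connected to a small ring of nodes
adj = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub", "b"},
    "b": {"hub", "a", "c"},
    "c": {"hub", "b", "d"},
    "d": {"hub", "c"},
}
```

A network counts as small-world when its clustering coefficient is much higher, at comparable average path length, than that of a random graph with the same number of nodes and edges.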
From an application-driven view, SWGs admit effective clustering strategies that run in nearly linear time. Because the resulting clusters are related to the growth process of the underlying graph, they tend to be meaningful; in the case of natural language, they usually reflect semantic and/or syntactic structures.
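One simple family of nearly linear-time graph-clustering strategies is label propagation, in which every node repeatedly adopts the label most frequent among its neighbours. The sketch below is an illustrative, deterministic variant (ties broken lexicographically) on a hypothetical toy word graph, not the specific procedure this text discusses:

```python
def label_propagation(adj, iterations=10):
    """Cluster a graph by repeatedly giving each node the label that is
    most frequent among its neighbours. Each sweep touches every edge
    once, so the cost per iteration is linear in the graph size."""
    labels = {node: node for node in adj}       # start: every node its own cluster
    for _ in range(iterations):
        changed = False
        for node in sorted(adj):                # deterministic sweep order
            counts = {}
            for nbr in adj[node]:
                counts[labels[nbr]] = counts.get(labels[nbr], 0) + 1
            if counts:
                # highest count wins; ties go to the lexicographically smallest label
                best = min(counts, key=lambda l: (-counts[l], l))
                if best != labels[node]:
                    labels[node] = best
                    changed = True
        if not changed:
            break
    return labels

# hypothetical toy data: two dense word groups with no connection between them
adj = {
    "cat": {"dog", "mouse"}, "dog": {"cat", "mouse"}, "mouse": {"cat", "dog"},
    "bank": {"money", "loan"}, "money": {"bank", "loan"}, "loan": {"bank", "money"},
}
labels = label_propagation(adj)
```

On a word co-occurrence graph, the dense neighbourhoods that such a procedure discovers are exactly the semantically or syntactically related word groups mentioned above.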
Keywords: Degree Distribution · Language Data · Small World Graph · Graph Generation Model · Structure Discovery Process