The Design, Implementation, and Use of the Ngram Statistics Package

Banerjee, Satanjeev; Pedersen, Ted

doi:10.1007/3-540-36456-0_38

Conference paper
First Online: 01 January 2003

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2588)

Abstract

The Ngram Statistics Package (NSP) is a flexible and easy-to-use software tool that supports the identification and analysis of Ngrams, sequences of N tokens in online text. We have designed and implemented NSP to be easy to customize to particular problems and yet remain general enough to serve a broad range of needs. This paper provides an introduction to NSP while raising some general issues in Ngram analysis, and summarizes several applications where NSP has been successfully employed. NSP is written in Perl and is freely available under the GNU Public License.
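
As an illustration of what "sequences of N tokens" means in practice, the sketch below counts Ngrams in a short string. It is not NSP's code (NSP is written in Perl) and does not reproduce its interface. The function names `tokenize` and `count_ngrams`, the sample sentence, and the tokenization rule (lowercased runs of word characters) are all assumptions made for this example.

```python
from collections import Counter
import re


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simplistic rule)."""
    return re.findall(r"\w+", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens in the token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A list of k tokens holds k - n + 1 Ngrams (none if k < n).
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, chosen only to show a repeated bigram.
sample = "the cat sat on the mat and the cat slept"
bigrams = count_ngrams(tokenize(sample), 2)
for ngram, freq in bigrams.most_common(3):
    print(" ".join(ngram), freq)
```

Frequency counts of this kind are the raw material for the analysis the abstract mentions. How tokens are defined and which statistics are then computed over the counts are separate design decisions that this sketch leaves out.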

Author information

Authors and Affiliations

Carnegie Mellon University, 15213, Pittsburgh, PA, USA
Satanjeev Banerjee
University of Minnesota, 55812, Duluth, MN, USA
Ted Pedersen

Editor information

Editors and Affiliations

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Col. Zacatenco, CP 07738, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Banerjee, S., Pedersen, T. (2003). The Design, Implementation, and Use of the Ngram Statistics Package. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_38

DOI: https://doi.org/10.1007/3-540-36456-0_38
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00532-2
Online ISBN: 978-3-540-36456-6
eBook Packages: Springer Book Archive
