The Design, Implementation, and Use of the Ngram Statistics Package

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2588))

Abstract

The Ngram Statistics Package (NSP) is a flexible and easy-to-use software tool that supports the identification and analysis of Ngrams, sequences of N tokens in online text. We have designed and implemented NSP to be easy to customize to particular problems while remaining general enough to serve a broad range of needs. This paper introduces NSP, raises some general issues in Ngram analysis, and summarizes several applications in which NSP has been successfully employed. NSP is written in Perl and is freely available under the GNU General Public License.
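
To make the definition of an Ngram concrete, the following is a minimal sketch of counting Ngrams in a piece of text. It is written in Python for illustration only and is not NSP's implementation or interface (NSP itself is written in Perl). The function name `count_ngrams` and the tokenization rule (lowercasing and splitting on whitespace) are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n consecutive tokens in text.

    Hypothetical helper for illustration; not part of NSP.
    """
    # Assumption: tokens are lowercased, whitespace-separated strings.
    tokens = text.lower().split()
    # Zipping n shifted copies of the token list yields each window of n tokens.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)


bigrams = count_ngrams("the cat sat on the mat and the cat slept", 2)
print(bigrams.most_common(1))  # [(('the', 'cat'), 2)]
```

A text with fewer than N tokens yields no Ngrams, so the returned counter is empty in that case.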

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Banerjee, S., Pedersen, T. (2003). The Design, Implementation, and Use of the Ngram Statistics Package. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2003. Lecture Notes in Computer Science, vol 2588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36456-0_38

  • DOI: https://doi.org/10.1007/3-540-36456-0_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-00532-2

  • Online ISBN: 978-3-540-36456-6

  • eBook Packages: Springer Book Archive
