A Linear Algebra Approach to Language Identification

  • Conference paper
Principles of Digital Document Processing (PODDP 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1481)

Abstract

Identification of the language of documents has traditionally been accomplished using dictionaries or other such language sources. This paper presents a novel algorithm for identifying the language of documents using much less information about the language than traditional methods. In addition, if no information about the language of incoming documents is known, the algorithm groups the documents into language groups, despite the deficit of language knowledge. The algorithm is based on the vector space model of information retrieval and uses a matrix projection operator and the singular value decomposition to identify terms that distinguish between languages. Experimental results show that the algorithm works reasonably well.
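
The abstract names the ingredients (a vector space model, a projection operator, and the singular value decomposition) but gives no algorithmic detail. The following is only a minimal sketch of how those pieces can fit together, not the paper's actual method. It assumes whitespace tokenization, raw term counts normalized to unit length, and a small set of sample documents in one known language. All function names and the toy data are hypothetical.

```python
import numpy as np

def build_vocab(docs):
    """Map every distinct lowercase token to a row index (assumed tokenization: whitespace)."""
    terms = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(terms)}

def doc_vector(doc, vocab):
    """Unit-length term-count vector for one document (vector space model)."""
    x = np.zeros(len(vocab))
    for w in doc.lower().split():
        x[vocab[w]] += 1.0
    return x / np.linalg.norm(x)

def language_subspace(sample_docs, vocab, rank):
    """Orthonormal basis for the leading `rank` left singular vectors
    of the term-document matrix built from the sample documents."""
    A = np.column_stack([doc_vector(d, vocab) for d in sample_docs])
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank]

def residual_norm(x, U_k):
    """Norm of x after removing its projection onto span(U_k).
    The projection operator is P = U_k U_k^T; the residual is (I - P) x."""
    return float(np.linalg.norm(x - U_k @ (U_k.T @ x)))

# Toy data (hypothetical): three sample documents in one known language.
known = ["the cat sat on the mat",
         "the dog and the cat",
         "a dog sat on the mat"]
incoming_same = "the dog sat on the cat"
incoming_other = "le chat est sur le tapis"

vocab = build_vocab(known + [incoming_same, incoming_other])
U_k = language_subspace(known, vocab, rank=2)

res_en = residual_norm(doc_vector(incoming_same, vocab), U_k)
res_fr = residual_norm(doc_vector(incoming_other, vocab), U_k)

# A document sharing vocabulary with the samples leaves a small residual;
# one with no shared terms is untouched by the projection.
print(f"same language residual:  {res_en:.3f}")
print(f"other language residual: {res_fr:.3f}")
```

In this sketch a residual near 1 means the incoming document shares almost no terms with the known-language subspace, so a threshold on the residual could act as a crude language test. The choice of `rank` and of any threshold is an assumption here and would need tuning on real data.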

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mather, L.A. (1998). A Linear Algebra Approach to Language Identification. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_8

  • DOI: https://doi.org/10.1007/3-540-49654-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65086-7

  • Online ISBN: 978-3-540-49654-0

  • eBook Packages: Springer Book Archive
