Language Geometry Using Random Indexing

  • Conference paper
Quantum Interaction (QI 2016)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10106)

Abstract

Random Indexing is a simple implementation of Random Projections with a wide range of applications. It can solve a variety of problems with good accuracy without introducing much complexity. Here we demonstrate its use for identifying the language of text samples, based on a novel method of encoding letter N-grams into high-dimensional Language Vectors. Further, we show that the method is easily implemented and requires little computational power and space. As proof of the method’s statistical validity, we show its success in a language-recognition task. On a difficult data set of 21,000 short sentences from 21 different languages, we achieve 97.4% accuracy, comparable to state-of-the-art methods.
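The abstract does not spell out how letter N-grams are turned into Language Vectors, so the following is only a minimal sketch under stated assumptions: each letter gets a fixed random vector of +1/-1 entries, an N-gram is encoded by rotating each letter vector according to its position and multiplying elementwise, a text is the sum of its N-gram vectors, and languages are compared by cosine similarity. The dimensionality `DIM`, the trigram default, and all function names are illustrative choices, not taken from the paper.

```python
import numpy as np

DIM = 10_000  # assumed dimensionality; chosen for illustration only
rng = np.random.default_rng(0)
_letter_vectors = {}


def letter_vector(ch):
    """Return a fixed random +1/-1 vector for a character, created on first use."""
    if ch not in _letter_vectors:
        _letter_vectors[ch] = rng.choice(np.array([-1, 1], dtype=np.int8), size=DIM)
    return _letter_vectors[ch]


def ngram_vector(ngram):
    """Encode an N-gram: rotate each letter vector by its distance from
    the end of the N-gram, then multiply all of them elementwise."""
    n = len(ngram)
    out = np.ones(DIM, dtype=np.int64)
    for i, ch in enumerate(ngram):
        out *= np.roll(letter_vector(ch), n - 1 - i)
    return out


def text_vector(text, n=3):
    """Sum the vectors of all overlapping letter N-grams in the text."""
    total = np.zeros(DIM, dtype=np.int64)
    for i in range(len(text) - n + 1):
        total += ngram_vector(text[i:i + n])
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def identify(sample, language_vectors, n=3):
    """Return the label whose language vector is closest to the sample."""
    query = text_vector(sample, n)
    return max(language_vectors, key=lambda lang: cosine(query, language_vectors[lang]))
```

In this sketch a language vector is simply `text_vector` applied to a training text for that language, and `identify` picks the nearest one for a new sample. Memory stays fixed at one `DIM`-sized vector per language regardless of how much training text is summed in, which is consistent with the abstract's claim of modest space requirements.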

Acknowledgments

We thank Professor Bruno Olshausen for providing the setting for this work in his class on Neural Computation, and two anonymous reviewers for their comments that helped us improve the paper. Pentti Kanerva’s work was supported by Systems On Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA.

Author information

Corresponding author

Correspondence to Aditya Joshi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Joshi, A., Halseth, J.T., Kanerva, P. (2017). Language Geometry Using Random Indexing. In: de Barros, J., Coecke, B., Pothos, E. (eds) Quantum Interaction. QI 2016. Lecture Notes in Computer Science, vol 10106. Springer, Cham. https://doi.org/10.1007/978-3-319-52289-0_21

  • DOI: https://doi.org/10.1007/978-3-319-52289-0_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-52288-3

  • Online ISBN: 978-3-319-52289-0

  • eBook Packages: Computer Science, Computer Science (R0)
