Abstract
Statistical semantics methods are fairly controversial in the IR community, mostly because they are unstable and difficult to debug. At the same time, they are extremely tempting, perhaps in the same way that Artificial Intelligence was in the 1960s. Back then, it took a few decades for the hype to pass and for us to learn the real utility and limits of the great technologies that had been developed. This paper takes an exhaustive view of the performance and utility of one particular statistical semantics method, Random Indexing, in the context of difficult texts. Based on experiments totalling over a year of CPU time, we provide a global view of the behaviour of the method on a particularly challenging test collection built from patent data. Finally, we observe interesting patterns emerging in the semantic space created by the method, which we hypothesize to be the cause of the behaviour observed in the experiments.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Lupu, M. (2014). On the Usability of Random Indexing in Patent Retrieval. In: Hernandez, N., Jäschke, R., Croitoru, M. (eds) Graph-Based Representation and Reasoning. ICCS 2014. Lecture Notes in Computer Science, vol 8577. Springer, Cham. https://doi.org/10.1007/978-3-319-08389-6_17
DOI: https://doi.org/10.1007/978-3-319-08389-6_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08388-9
Online ISBN: 978-3-319-08389-6
eBook Packages: Computer Science (R0)