Fast k-NN Classifier for Documents Based on a Graph Structure

  • Fernando José Artigas-Fuentes
  • Reynaldo Gil-García
  • José Manuel Badía-Contelles
  • Aurora Pons-Porrata
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6419)

Abstract

This paper presents a fast k-nearest-neighbors (k-NN) classifier for documents. Documents are usually represented in a high-dimensional feature space, where terms are treated as features and the weight of each term reflects its importance in the document. Many approaches exist for finding the vicinity of an object, but their performance decreases drastically as the number of dimensions grows, which prevents their application to documents. The proposed method is based on a graph index structure with a fast search algorithm. Its high selectivity yields a classification quality similar to that of the exhaustive classifier while computing only a small number of distances. Our experimental results show that the method can be applied to problems of very high dimensionality, such as those found in Text Mining.
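
For orientation, the exhaustive classifier used as the point of comparison can be sketched as follows. This is a minimal illustration and not the proposed graph-based method: the abstract does not name the similarity measure, so cosine similarity over sparse term-weight vectors is an assumption, and the function names (`cosine`, `knn_classify`) and the toy documents are hypothetical. The second return value counts the similarity computations, which is the cost an index structure aims to reduce.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts).

    Assumption: the abstract does not specify the measure; cosine is a
    common choice for weighted term vectors.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_classify(query, training, k=3):
    """Exhaustive k-NN: compare the query with every training document.

    Returns (predicted_label, number_of_similarity_computations).
    """
    scored = [(cosine(query, doc), label) for doc, label in training]
    scored.sort(key=lambda s: s[0], reverse=True)  # most similar first
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0], len(scored)


# Hypothetical toy collection: (term weights, class label).
training = [
    ({"graph": 0.8, "index": 0.6}, "databases"),
    ({"graph": 0.5, "search": 0.5, "index": 0.7}, "databases"),
    ({"goal": 0.9, "match": 0.4}, "sports"),
    ({"team": 0.7, "match": 0.7}, "sports"),
]
label, n_computed = knn_classify({"graph": 1.0, "search": 0.3}, training, k=3)
print(label, n_computed)
```

The exhaustive version always computes one similarity per training document; a selective index would be expected to return a comparable neighborhood while evaluating far fewer of them.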

Keywords

nearest neighbor classifier; fast nearest neighbor search; text documents

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Fernando José Artigas-Fuentes (1)
  • Reynaldo Gil-García (1)
  • José Manuel Badía-Contelles (2)
  • Aurora Pons-Porrata (1)

  1. Center of Pattern Recognition and Data Mining, Universidad de Oriente, Santiago de Cuba, Cuba
  2. Computer Science and Engineering Department, Universitat Jaume I, Castelló, Spain
