Distributional Term Representations for Short-Text Categorization

  • Juan Manuel Cabrera
  • Hugo Jair Escalante
  • Manuel Montes-y-Gómez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7817)


Every day, millions of short texts are generated, and effective tools are needed to organize and retrieve them. Because these documents are very short and their representations are extremely sparse, standard text categorization methods are not effective when applied directly. In this work we propose using distributional term representations (DTRs) for short-text categorization. DTRs represent terms by means of contextual information, given by document occurrence and term co-occurrence statistics. They therefore allow us to build enriched document representations that help to overcome, to some extent, the small-length and high-sparsity issues. We report experimental results on three challenging collections, using a variety of classification methods. These results show that using DTRs improves classification performance in short-text categorization.
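To make the idea concrete, the following is a minimal sketch of how term vectors could be built from document occurrence and term co-occurrence counts, and how a short document could then be represented by combining the vectors of its terms. It is an illustration under simplifying assumptions, not the authors' implementation: it uses raw counts with no weighting scheme, sums term vectors and L2-normalizes the result, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict
import math


def document_occurrence_vectors(docs):
    """Map each term to a Counter over the indices of documents containing it."""
    vectors = defaultdict(Counter)
    for j, doc in enumerate(docs):
        for term in set(doc):
            vectors[term][j] += 1
    return vectors


def term_cooccurrence_vectors(docs):
    """Map each term to a Counter over the terms it shares documents with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = set(doc)
        for term in terms:
            # A term also co-occurs with itself once per containing document.
            for other in terms:
                vectors[term][other] += 1
    return vectors


def represent(doc, term_vectors):
    """Sum the vectors of the document's known terms and L2-normalize.

    Assumption: unweighted sum; a weighted combination is equally plausible.
    """
    total = Counter()
    for term in doc:
        if term in term_vectors:
            total.update(term_vectors[term])
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {k: v / norm for k, v in total.items()} if norm else {}


# Hypothetical toy corpus of tokenized short texts.
docs = [
    ["cheap", "flight", "deal"],
    ["flight", "delay", "airport"],
    ["cheap", "hotel", "deal"],
]
dor = document_occurrence_vectors(docs)
tcor = term_cooccurrence_vectors(docs)
enriched = represent(["cheap", "flight"], tcor)
```

In this toy example, a two-word text that would have only two non-zero entries in a bag-of-words vector gets non-zero weight on every term that co-occurs with its words, which is the sense in which such representations can reduce sparsity.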


Weighting Scheme, Sparse Representation, Text Categorization, External Resource, Latent Semantic Analysis
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Juan Manuel Cabrera (1)
  • Hugo Jair Escalante (1)
  • Manuel Montes-y-Gómez (1)
  1. Department of Computational Sciences, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Mexico
