Document Clustering by Relevant Terms: An Approach

  • Cecilia Reyes-Peña
  • Mireya Tovar Vidal
  • José de Jesús Lavalle Martínez
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1069)


In this work, an approach to document clustering based on relevant terms in an untagged medical text corpus is presented. First, a list of the documents containing each word is created. Then, for relevant term extraction, the frequency of each term is obtained in order to compute the word's weight in the corpus and in each document. Finally, the clusters are built by mapping the main concepts of an ontology to the relevant terms (subjects only), under the assumption that two words appearing in the same documents are related. Each resulting cluster has a category corresponding to an ontology concept. The clusters are compared with clusters produced by K-Means (assuming the K-Means clusters are well formed) using the Overlap Coefficient, obtaining 70% similarity among the clusters.
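Two of the building blocks named in the abstract, the per-word document list and the Overlap Coefficient, can be illustrated with a minimal sketch. The toy documents, the whitespace tokenization, and the function names `build_inverted_index` and `overlap_coefficient` are assumptions made for illustration only; the term weighting and the ontology mapping used by the authors are not reproduced here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: lowercase whitespace tokenization; the preprocessing
        # actually used in the paper is not described in the abstract.
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def overlap_coefficient(a, b):
    """Overlap Coefficient: |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical toy corpus, not taken from the paper.
docs = {
    "d1": "chronic asthma treatment",
    "d2": "asthma inhaler treatment",
    "d3": "diabetes insulin treatment",
}
index = build_inverted_index(docs)

# Terms that share documents get a high overlap; disjoint terms get zero.
asthma_vs_treatment = overlap_coefficient(index["asthma"], index["treatment"])
asthma_vs_diabetes = overlap_coefficient(index["asthma"], index["diabetes"])
```

The same `overlap_coefficient` function can be applied to two clusters represented as sets of document ids, which is one plausible way to compare a term-based cluster against a K-Means cluster.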


Document clustering, Relevant terms, Medical corpus



This work is supported by the Sectoral Research Fund for Education with the CONACyT project 257357, and partially supported by the VIEP-BUAP project.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Cecilia Reyes-Peña (1)
  • Mireya Tovar Vidal (1, corresponding author)
  • José de Jesús Lavalle Martínez (1)
  1. Faculty of Computer Science, Benemérita Universidad Autónoma de Puebla, Puebla, Mexico
