Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique

  • Conference paper
The 8th International Conference on Knowledge Management in Organizations

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

A common way to organize and label documents is to group similar documents into clusters. The number of clusters assumed in advance is often unreliable, because the grouping structure of the data is unknown before processing, and partitioning methods therefore do not predict that structure well. Hierarchical clustering addresses this problem by providing views of the data at different levels of abstraction, which makes it well suited to visualizing the generated concepts and interactively exploring large document collections. The method used to combine two clusters into a single cluster affects the quality of the clusters produced. This paper therefore studies various distance methods for clustering documents with hierarchical agglomerative clustering (HAC). Clusters very often include sub-clusters, and the hierarchical structure is a natural constraint on the underlying application domain. To manage and organize documents effectively, similar documents are merged to form clusters, and each document is represented by one or more concepts. In this paper, concepts that characterize English documents are generated using HAC. One advantage of hierarchical clustering is that overlapping clusters can be formed and concepts can be generated from the contents of each cluster. The quality of the clusters produced is also investigated using different distance measures.
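
The agglomerative procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the document representation, the distance measures, or the labeling rule, so the sketch assumes raw term-frequency vectors, cosine distance, three common linkage rules (single, complete, average), and labels taken from the most frequent terms in each cluster. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector (assumed stand-in for a real weighting scheme)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

# How pairwise document distances are combined into a cluster-to-cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}

def hac(vectors, k, linkage="average"):
    """Start with one cluster per document; merge the closest pair until k remain."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = combine([cosine_distance(vectors[a], vectors[b])
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def label(cluster, vectors, n_terms=1):
    """Label a cluster with its most frequent terms (an assumed labeling rule)."""
    total = Counter()
    for i in cluster:
        total.update(vectors[i])
    return [term for term, _ in total.most_common(n_terms)]

# Toy collection, invented for illustration only.
docs = [
    "stock market trading stock prices",
    "market prices stock investors",
    "football match goal football league",
    "league football players match",
]
vectors = [tf_vector(d) for d in docs]
clusters = hac(vectors, k=2, linkage="average")
for c in clusters:
    print(c, label(c, vectors))
```

Changing the `linkage` argument shows how the merge rule can alter the clusters on less cleanly separated data, which is the kind of effect the abstract says the paper investigates. Stopping at different values of `k` gives views of the same hierarchy at different levels of abstraction.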

Acknowledgments

This work has been supported by the Long Term Research Grant Scheme (LRGS) project funded by the Ministry of Higher Education (MoHE), Malaysia, under Grant No. LRGS/TD/2011/UiTM/ICT/04.

Author information

Corresponding author

Correspondence to Rayner Alfred.

Copyright information

© 2014 Springer Science+Business Media Dordrecht

About this paper

Cite this paper

Alfred, R., Fun, T.S., Tahir, A., On, C.K., Anthony, P. (2014). Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_21

  • DOI: https://doi.org/10.1007/978-94-007-7287-8_21

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-007-7286-1

  • Online ISBN: 978-94-007-7287-8

  • eBook Packages: Computer Science, Computer Science (R0)