Abstract
The most common way to organize and label documents is to group similar documents into clusters. The number of clusters assumed in advance is often unreliable, because the grouping structure of the data is unknown before processing, and partitioning methods therefore may not capture that structure well. Hierarchical clustering addresses this problem by providing views of the data at different levels of abstraction, which makes it well suited to visualizing the generated concepts and interactively exploring large document collections. Clusters very often include sub-clusters, and the hierarchical structure is a natural constraint of the underlying application domain. The method used to combine two clusters into a single cluster affects the quality of the clusters produced. In this paper, various distance measures are therefore studied for clustering documents with hierarchical agglomerative clustering (HAC). To manage and organize documents effectively, similar documents are merged to form clusters, and each document is represented by one or more concepts. Concepts that characterize English documents are generated by using hierarchical agglomerative clustering. One advantage of hierarchical clustering is that overlapping clusters can be formed and concepts can be generated from the contents of each cluster. The quality of the clusters produced under different distance measures is also investigated.
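The agglomerative procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hac`, `label_cluster`), the toy term-count vectors, the vocabulary, the choice of cosine and Euclidean distance, the three linkage rules, and the "top terms by summed weight" labeling heuristic are all assumptions made for illustration only.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# How pairwise document distances are combined into a cluster-to-cluster distance.
LINKAGES = {
    "single": min,                            # closest pair
    "complete": max,                          # farthest pair
    "average": lambda ds: sum(ds) / len(ds),  # mean over all pairs
}

def hac(vectors, distance=cosine_distance, linkage="average"):
    """Naive agglomerative clustering.

    Returns the merge history as a list of (cluster_a, cluster_b, distance),
    where each cluster is a tuple of document indices.
    """
    combine = LINKAGES[linkage]
    clusters = [(i,) for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            ds = [distance(vectors[i], vectors[j]) for i in a for j in b]
            d = combine(ds)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges

def label_cluster(members, vectors, vocab, k=2):
    """Hypothetical labeling heuristic: the k terms with the largest summed weight."""
    totals = [sum(vectors[i][t] for i in members) for t in range(len(vocab))]
    ranked = sorted(range(len(vocab)), key=lambda t: -totals[t])
    return [vocab[t] for t in ranked[:k]]

# Toy data (assumed): four documents as term counts over a four-word vocabulary.
vocab = ["cluster", "document", "gene", "protein"]
docs = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 3]]

merges = hac(docs, distance=cosine_distance, linkage="average")
for a, b, d in merges:
    merged = tuple(sorted(a + b))
    print(merged, round(d, 3), label_cluster(merged, docs, vocab))
```

Swapping `distance=euclidean_distance` or `linkage="single"` into the call shows how the merge order and merge distances depend on these choices, which is the kind of comparison the abstract describes. The pairwise search is recomputed at every step, so the sketch is only suitable for very small collections.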
Acknowledgments
This work has been supported by the Long Term Research Grant Scheme (LRGS) project funded by the Ministry of Higher Education (MoHE), Malaysia, under Grant No. LRGS/TD/2011/UiTM/ICT/04.
Copyright information
© 2014 Springer Science+Business Media Dordrecht
About this paper
Cite this paper
Alfred, R., Fun, T.S., Tahir, A., On, C.K., Anthony, P. (2014). Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_21
DOI: https://doi.org/10.1007/978-94-007-7287-8_21
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7286-1
Online ISBN: 978-94-007-7287-8
eBook Packages: Computer Science (R0)