Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique

Alfred, Rayner; Fun, Tan Soo; Tahir, Asni; On, Chin Kim; Anthony, Patricia

doi:10.1007/978-94-007-7287-8_21

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

The most common way to organize and label documents is to group similar documents into clusters. The assumed number of clusters is often unreliable, since the grouping structure of the data is unknown before processing, and partitioning methods therefore do not predict that structure very well. Hierarchical clustering addresses this problem by providing views of the data at different levels of abstraction, which makes it well suited for visualizing the generated concepts and interactively exploring large document collections. The method used to combine two clusters into a single cluster affects the quality of the clusters produced. For this reason, various distance methods are studied for clustering documents with hierarchical agglomerative clustering. Clusters very often include sub-clusters, and the hierarchical structure is indeed a natural constraint on the underlying application domain. In order to manage and organize documents effectively, similar documents are merged to form clusters. Each document is represented by one or more concepts. In this paper, concepts that characterize English documents are generated by using hierarchical agglomerative clustering. One advantage of hierarchical clustering is that overlapping clusters can be formed and concepts can be generated from the contents of each cluster. The quality of the clusters produced is also investigated by using different distance measures.
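
As an illustration of the general technique the abstract describes, the following is a minimal sketch of hierarchical agglomerative clustering over documents, with a selectable rule for measuring the distance between two clusters. It is not the authors' implementation. The term-frequency representation, the cosine distance, the three linkage rules (single, complete, average), the frequency-based labeling, and all function names (`tf_vector`, `cosine_distance`, `cluster_distance`, `hac`, `label`) are assumptions chosen for illustration. The sketch also produces non-overlapping clusters only.

```python
import math
from collections import Counter
from itertools import combinations

def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def cluster_distance(c1, c2, dist, linkage):
    """Distance between two clusters (lists of document indices)."""
    pairs = [dist[i][j] for i in c1 for j in c2]
    if linkage == "single":      # closest pair of documents
        return min(pairs)
    if linkage == "complete":    # farthest pair of documents
        return max(pairs)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage: {linkage}")

def hac(docs, k, linkage="average"):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    vecs = [tf_vector(d) for d in docs]
    n = len(docs)
    dist = [[cosine_distance(vecs[i], vecs[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per document
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(
                clusters[p[0]], clusters[p[1]], dist, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters

def label(docs, cluster, top=3):
    """Hypothetical labeling rule: the most frequent terms in the cluster."""
    counts = Counter()
    for i in cluster:
        counts.update(tf_vector(docs[i]))
    return [term for term, _ in counts.most_common(top)]

docs = [
    "cat dog pet",
    "dog pet cat",
    "stock market trade",
    "market trade stock price",
]
for linkage in ("single", "complete", "average"):
    for cluster in hac(docs, k=2, linkage=linkage):
        print(linkage, cluster, label(docs, cluster))
```

Stopping at a fixed `k` is only a convenience here. Recording every merge instead would yield the full hierarchy, which can then be inspected at any level of abstraction.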

Acknowledgments

This work has been supported by the Long Term Research Grant Scheme (LRGS) project funded by the Ministry of Higher Education (MoHE), Malaysia, under Grant No. LRGS/TD/2011/UiTM/ICT/04.

Author information

Authors and Affiliations

School of Engineering and Information Technology, Center of Excellence in Semantic Agents, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia
Rayner Alfred, Tan Soo Fun, Asni Tahir & Chin Kim On
Faculty of Environment, Society and Design, Department of Applied Computing, Lincoln University, Christchurch, New Zealand
Patricia Anthony

Corresponding author

Correspondence to Rayner Alfred.

Editor information

Editors and Affiliations

School of Computing, Staffordshire University, Stafford, United Kingdom
Lorna Uden
College of Management, National University of Kaohsiung, Kaohsiung, Taiwan
Leon S.L. Wang
Department of Computing Science and Control, Faculty of Science, Universidad Salamanca, Salamanca, Spain
Juan Manuel Corchado Rodríguez
National University of Kaohsiung, Kaohsiung, Taiwan
Hsin-Chang Yang
Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan
I-Hsien Ting

About this paper

Cite this paper

Alfred, R., Fun, T.S., Tahir, A., On, C.K., Anthony, P. (2014). Concepts Labeling of Document Clusters Using a Hierarchical Agglomerative Clustering (HAC) Technique. In: Uden, L., Wang, L., Corchado Rodríguez, J., Yang, HC., Ting, IH. (eds) The 8th International Conference on Knowledge Management in Organizations. Springer Proceedings in Complexity. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-7287-8_21

DOI: https://doi.org/10.1007/978-94-007-7287-8_21
Published: 06 September 2013
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-7286-1
Online ISBN: 978-94-007-7287-8
eBook Packages: Computer Science, Computer Science (R0)
