Various Document Clustering Tasks Using Word Lists

HaCohen-Kerner, Yaakov; Margaliot, Orr

doi:10.1007/978-3-642-45068-6_14

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Abstract

This research investigates whether word lists are appropriate features for clustering documents according to their authors, their countries of origin, or the historical periods in which they were written. We define three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (VFW). The application domain is articles on Jewish law written in Hebrew and Aramaic. The clustering experiments were carried out using the EM algorithm. To the best of our knowledge, clustering according to countries or periods is novel. The improvement rates in these tasks vary from 11.53% to 39.43%. The clustering tasks according to 2 or 3 authors achieved results above 95% and show superior improvement rates (between 15.61% and 56.51%); most of the improvements were achieved with FW and VFW. These findings are surprising and contradict the initial assumption that FFW is the prime word list for clustering tasks.
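
The three word lists described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how words are ranked or how many are kept, so the ranking by raw corpus count, the use of population variance over per-document relative frequencies, the list size `k`, the stopword set, and all function names are assumptions made for illustration.

```python
from collections import Counter
from statistics import pvariance


def relative_frequencies(tokens):
    """Map each word to its share of the document's tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return {word: n / total for word, n in counts.items()}


def build_word_lists(docs, stopwords, k):
    """Return (FW, FFW, VFW), each a list of at most k words.

    docs: list of token lists; stopwords: set of function words.
    Ranking criteria are assumptions, not taken from the paper.
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    # Descending count, then alphabetical order so ties are deterministic.
    ranked = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))

    fw = ranked[:k]                                      # function words included
    ffw = [w for w in ranked if w not in stopwords][:k]  # function words excluded

    # Variance of each word's relative frequency across documents.
    per_doc = [relative_frequencies(doc) for doc in docs]
    variance = {
        w: pvariance([freqs.get(w, 0.0) for freqs in per_doc])
        for w in corpus_counts
    }
    vfw = sorted(variance, key=lambda w: (-variance[w], w))[:k]
    return fw, ffw, vfw


def feature_vector(tokens, word_list):
    """Represent one document by the relative frequency of each listed word."""
    freqs = relative_frequencies(tokens)
    return [freqs.get(w, 0.0) for w in word_list]
```

Under these assumptions, each document becomes one numeric vector per word list, and those vectors could then be passed to an EM-based clusterer. A word that is frequent in every document (typically a function word) ranks high in FW but has low variance, so it tends to drop out of VFW.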

Author information

Authors and Affiliations

Dept. of Computer Science, Jerusalem College of Technology, 9116001, Jerusalem, Israel
Yaakov HaCohen-Kerner & Orr Margaliot

Editor information

Editors and Affiliations

Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Rafael E. Banchs, Min Zhang & Sheng Gao
Yahoo Labs, Avinguda Diagonal 177, 08018, Barcelona, Spain
Fabrizio Silvestri
Microsoft Research Asia, No. 5, Danling Street, Haidian District, 100080, Beijing, China
Tie-Yan Liu
Institute for Infocomm Research, Human Language Technology, 1 Fusionopolis Way #21-01, Connexis South, 138632, Singapore
Jun Lang

About this paper

Cite this paper

HaCohen-Kerner, Y., Margaliot, O. (2013). Various Document Clustering Tasks Using Word Lists. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_14

DOI: https://doi.org/10.1007/978-3-642-45068-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)
