Abstract
This research investigates whether word lists are appropriate features for clustering documents by their authors, by their countries of origin, or by the historical periods in which they were written. We define three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (VFW). The application domain is articles on Jewish law written in Hebrew and Aramaic. The clustering experiments were performed using the EM algorithm. To the best of our knowledge, clustering documents according to countries or periods is novel. The improvement rates in these tasks vary from 11.53% to 39.43%. The clustering tasks involving 2 or 3 authors achieved results above 95% and show superior improvement rates (between 15.61% and 56.51%); most of the improvements were achieved with FW and VFW. These findings are surprising and contradict the initial assumption that FFW is the prime word list for clustering tasks.
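The three word lists can be illustrated with a minimal sketch. It is not the authors' implementation: the function name build_word_lists, the toy English tokens, the stopword set, and the list size k are hypothetical, and the abstract does not say how the variance is computed. Here it is assumed to be the population variance of a word's relative frequency across documents, taken over all words.

```python
from collections import Counter
from statistics import pvariance


def build_word_lists(docs, stopwords, k):
    """Return (fw, ffw, vfw), each a list of at most k words.

    docs is a list of token lists, one per document (hypothetical input format).
    """
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)

    # Rank words by corpus frequency; ties are broken alphabetically.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    fw = ranked[:k]                                       # FW: stopwords kept
    ffw = [w for w in ranked if w not in stopwords][:k]   # FFW: stopwords removed

    # Assumption: VFW ranks words by the variance of their relative
    # frequency across documents.
    per_doc = [Counter(tokens) for tokens in docs]

    def rel_freqs(word):
        return [c[word] / len(t) if t else 0.0 for c, t in zip(per_doc, docs)]

    variance = {w: pvariance(rel_freqs(w)) for w in totals}
    vfw = sorted(totals, key=lambda w: (-variance[w], w))[:k]
    return fw, ffw, vfw


# Toy corpus, for illustration only.
docs = [
    ["the", "law", "the", "court"],
    ["the", "custom", "custom", "custom"],
    ["the", "law", "ruling", "the"],
]
fw, ffw, vfw = build_word_lists(docs, stopwords={"the"}, k=2)
print(fw, ffw, vfw)
```

Each document would then be represented by the frequencies of the words in one chosen list, and those vectors passed to a clustering algorithm such as EM.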
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
HaCohen-Kerner, Y., Margaliot, O. (2013). Various Document Clustering Tasks Using Word Lists. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_14
DOI: https://doi.org/10.1007/978-3-642-45068-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45067-9
Online ISBN: 978-3-642-45068-6
eBook Packages: Computer Science, Computer Science (R0)