
Various Document Clustering Tasks Using Word Lists

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8281)

Abstract

This research investigates whether word lists are appropriate features for clustering documents by author, by the documents’ countries of origin, or by the historical periods in which they were written. We define three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (VFW). The application domain is articles on Jewish law written in Hebrew and Aramaic. The clustering experiments were carried out using the EM algorithm. To the best of our knowledge, clustering documents according to countries or periods is novel. The improvement rates in these tasks vary from 11.53% to 39.43%. The clustering tasks according to 2 or 3 authors achieved results above 95% and show superior improvement rates (between 15.61% and 56.51%); most of the improvements were achieved with FW and VFW. These findings are surprising and contradict the initial assumption that FFW is the prime word list for clustering tasks.
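
As an illustration of the three word lists named in the abstract, the following is a minimal sketch in Python. It is not the authors' implementation. The function names, the `size` parameter, the stopword set, and the English placeholder tokens are all hypothetical, and the abstract does not say which quantity the variance is computed over; the sketch assumes it is the variance of each word's relative frequency across documents.

```python
from collections import Counter
from statistics import pvariance


def relative_frequencies(tokens):
    """Map each word to its share of the document's tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return {word: count / total for word, count in counts.items()}


def build_word_lists(documents, stopwords, size):
    """Return (FW, FFW, VFW), each holding at most `size` words.

    documents: list of token lists (one list per document)
    stopwords: set of function words to exclude from FFW
    """
    corpus_counts = Counter(token for doc in documents for token in doc)
    # Rank by descending corpus count; break ties alphabetically.
    ranked = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))

    # FW: most frequent words, function words included.
    fw = ranked[:size]
    # FFW: most frequent words after filtering out function words.
    ffw = [w for w in ranked if w not in stopwords][:size]

    # VFW (assumption): highest variance of per-document relative frequency.
    per_doc = [relative_frequencies(doc) for doc in documents]
    variance = {
        w: pvariance([freqs.get(w, 0.0) for freqs in per_doc])
        for w in corpus_counts
    }
    vfw = sorted(variance, key=lambda w: (-variance[w], w))[:size]
    return fw, ffw, vfw


if __name__ == "__main__":
    # Placeholder English tokens; the paper's corpus is Hebrew and Aramaic.
    docs = [
        "the law of the land is the law".split(),
        "the court and the ruling".split(),
        "a ruling of the court".split(),
    ]
    fw, ffw, vfw = build_word_lists(docs, {"the", "of", "and", "a", "is"}, 3)
    print("FW: ", fw)
    print("FFW:", ffw)
    print("VFW:", vfw)
```

Each document could then be represented by the frequencies of the words in one list, and those vectors passed to an EM-based clusterer; that step is omitted here.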




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

HaCohen-Kerner, Y., Margaliot, O. (2013). Various Document Clustering Tasks Using Word Lists. In: Banchs, R.E., Silvestri, F., Liu, TY., Zhang, M., Gao, S., Lang, J. (eds) Information Retrieval Technology. AIRS 2013. Lecture Notes in Computer Science, vol 8281. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45068-6_14


  • DOI: https://doi.org/10.1007/978-3-642-45068-6_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-45067-9

  • Online ISBN: 978-3-642-45068-6

  • eBook Packages: Computer Science, Computer Science (R0)
