Classifying Papers from Different Computer Science Conferences

  • Yaakov HaCohen-Kerner
  • Avi Rosenfeld
  • Maor Tzidkani
  • Daniel Nisim Cohen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8346)

Abstract

This paper analyzes which stylistic characteristics differentiate styles of writing, specifically different types of A-level computer science articles. To do so, we compared full papers using stylistic feature sets and a supervised machine learning method. We report on the success of this approach in identifying papers from the last 6 years of three conferences: SIGIR, ACL, and AAMAS. The approach achieves accuracies of 95.86%, 97.04%, 93.22%, and 92.14% for four classification experiments: (1) SIGIR / ACL, (2) SIGIR / AAMAS, (3) ACL / AAMAS, and (4) SIGIR / ACL / AAMAS, respectively. The Part of Speech (PoS) and Orthographic feature sets outperformed all others and were found to be key components in distinguishing different types of writing.
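
As a rough illustration of what an orthographic feature set could look like, the sketch below counts a few character classes and normalizes them by text length. The chosen punctuation marks, the feature names, and the function `orthographic_features` are assumptions made for this example; the abstract does not list the authors' actual features. Vectors of this kind would then be passed to a supervised learner.

```python
from collections import Counter

# Hypothetical set of marks; the paper's own orthographic features are not given here.
PUNCTUATION_MARKS = ".,;:!?()[]\"'-"


def orthographic_features(text: str) -> dict[str, float]:
    """Return character-class frequencies normalized by text length."""
    total = len(text) or 1  # avoid division by zero on empty input
    counts = Counter(text)
    # One feature per punctuation mark: occurrences divided by text length.
    features = {f"punct_{mark}": counts[mark] / total for mark in PUNCTUATION_MARKS}
    # Share of digit and uppercase characters in the text.
    features["digits"] = sum(ch.isdigit() for ch in text) / total
    features["uppercase"] = sum(ch.isupper() for ch in text) / total
    return features
```

Normalizing by length keeps papers of different sizes comparable, which matters when full papers rather than fixed-size samples are compared.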

Keywords

Classification and regression trees; Conference classification; Decision tree learning; Document classification; Feature sets; Text classification

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Yaakov HaCohen-Kerner
    • 1
  • Avi Rosenfeld
    • 2
  • Maor Tzidkani
    • 1
  • Daniel Nisim Cohen
    • 1
  1. Dept. of Computer Science, Jerusalem College of Technology, Jerusalem, Israel
  2. Department of Industrial Engineering, Jerusalem College of Technology, Jerusalem, Israel
