Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets

  • Conference paper
Discovery Science (DS 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4265)

Abstract

Text classification is an important and challenging research domain. This paper investigates identifying the historical period and ethnic origin of documents using stylistic feature sets. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present several interesting problems for stylistic classification. First, they include words from both languages. Second, Hebrew and Aramaic are morphologically richer than English. The classification is done using six different sets of stylistic features: quantitative features, orthographic features, topographic features, lexical features and vocabulary richness. Each set includes various baseline features, some of them formalized by us. SVM was chosen as the machine learning method because it has been very successful in text classification. The quantitative set was found to be very successful and superior to all other sets. Its features are domain-independent and language-independent. It will be interesting to apply these feature sets in general, and the quantitative set in particular, to other domains as well as to other languages.
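As a rough illustration of what domain-independent and language-independent stylistic features might look like, the sketch below computes a few simple counts and one vocabulary-richness measure for a single document. The abstract does not list the paper's actual features, so the function name `stylistic_features` and the four features chosen here (word count, average word length, average sentence length, type-token ratio) are assumptions made for illustration only.

```python
import re


def stylistic_features(text: str) -> dict:
    """Hypothetical stylistic features for one document (not the paper's exact set)."""
    # Crude sentence split on end-of-sentence punctuation; an assumption, not the paper's rule.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ matches Unicode word characters, so Hebrew and Aramaic tokens are kept intact.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    return {
        "num_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Type-token ratio: distinct words divided by total words (a vocabulary-richness measure).
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


features = stylistic_features("aa bbb. aa cccc dd.")
print(features)
```

Feature vectors of this kind, one per document, could then be passed to an SVM implementation for training and classification. None of the features relies on a dictionary or a morphological analyzer, which is what makes such a set portable across domains and languages.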

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D. (2006). Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_13

  • DOI: https://doi.org/10.1007/11893318_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46491-4

  • Online ISBN: 978-3-540-46493-8

  • eBook Packages: Computer Science, Computer Science (R0)
