Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets

HaCohen-Kerner, Yaakov; Beck, Hananya; Yehudai, Elchai; Mughaz, Dror

doi:10.1007/11893318_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4265)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Text classification is an important and challenging research domain. This paper investigates identifying the historical period and ethnic origin of documents using stylistic feature sets. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present several interesting problems for stylistic classification. Firstly, these documents include words from both languages. Secondly, Hebrew and Aramaic are morphologically richer than English. The classification is done using six different sets of stylistic features: quantitative features, orthographic features, topographic features, lexical features and vocabulary richness. Each set of features includes various baseline features, some of them formalized by us. SVM was chosen as the machine learning method because it has been very successful in text classification. The quantitative set was found to be very successful and superior to all other sets. Its features are domain-independent and language-independent. It will be interesting to apply these feature sets in general, and the quantitative set in particular, to other domains as well as to other.
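
To make the idea of domain-independent and language-independent stylistic features concrete, the following is a minimal Python sketch of a few length-based quantitative features and a type-token ratio as a vocabulary richness measure. The abstract does not list the individual features, so the function names (`quantitative_features`, `vocabulary_richness`), the particular features, and the sentence-splitting heuristic are all illustrative assumptions rather than the authors' definitions.

```python
import re


def quantitative_features(text: str) -> dict:
    """Compute a few simple length-based style features (illustrative only)."""
    # Crude sentence split on ., ! and ?; an assumption, not the paper's method.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # \w+ is Unicode-aware for str patterns, so Hebrew letters count as word characters.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    n_sents = len(sentences)
    return {
        "num_words": float(n_words),
        "num_sentences": float(n_sents),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_words_per_sentence": n_words / n_sents if n_sents else 0.0,
    }


def vocabulary_richness(text: str) -> float:
    """Type-token ratio: number of distinct words divided by total words."""
    words = re.findall(r"\w+", text.lower())
    return len(set(words)) / len(words) if words else 0.0
```

Feature vectors of this kind, one per document, could then be passed to any SVM implementation together with period or origin labels. Because the features depend only on word and sentence boundaries, the same code applies unchanged to other languages and domains.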

Author information

Authors and Affiliations

Department of Computer Science, Jerusalem College of Technology (Machon Lev), 21 Havaad Haleumi St., P.O.B. 16031, 91160, Jerusalem, Israel
Yaakov HaCohen-Kerner, Hananya Beck, Elchai Yehudai & Dror Mughaz
Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel
Dror Mughaz

Editor information

Editors and Affiliations

Jozef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Ljupčo Todorovski
University of Nova Gorica, Nova Gorica, Slovenia
Nada Lavrač
Meme Media Laboratory, Hokkaido University Sapporo, Kita 13, Nishi 8, Kita-ku, P.O. Box, 060-8628, Sapporo, Japan
Klaus P. Jantke

About this paper

Cite this paper

HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D. (2006). Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_13

DOI: https://doi.org/10.1007/11893318_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46491-4
Online ISBN: 978-3-540-46493-8
eBook Packages: Computer Science, Computer Science (R0)