Abstract
Text classification is an important and challenging research domain. This paper investigates identifying the historical period and ethnic origin of documents using stylistic feature sets. The application domain is Jewish Law articles written in Hebrew-Aramaic. Such documents present several interesting problems for stylistic classification. First, they include words from both languages. Second, Hebrew and Aramaic are morphologically richer than English. The classification is done using six different sets of stylistic features, including quantitative features, orthographic features, topographic features, lexical features and vocabulary richness. Each set includes various baseline features, some of them formalized by us. SVM was chosen as the machine learning method because it has been very successful in text classification. The quantitative set was found to be very successful and superior to all other sets. Its features are domain-independent and language-independent. It will be interesting to apply these feature sets in general, and the quantitative set in particular, to other domains as well as to other languages.
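To make the idea of domain-independent and language-independent quantitative features concrete, the following is a minimal sketch only. The abstract does not list the individual features, so the three length-based measures below, the function name `quantitative_features`, and the crude sentence and paragraph splitting rules are illustrative assumptions, not the authors' feature definitions.

```python
import re


def quantitative_features(text: str) -> dict:
    """Compute a few length-based stylistic features for one document.

    Hypothetical feature choices for illustration; not taken from the paper.
    """
    # Paragraphs: blocks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: split after '.', '!' or '?' followed by whitespace (a crude heuristic).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: whitespace-separated tokens (punctuation stays attached),
    # so no language-specific tokenizer or morphology is needed.
    words = text.split()
    n_words = len(words)
    return {
        "avg_chars_per_word": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "avg_words_per_sentence": n_words / len(sentences) if sentences else 0.0,
        "avg_words_per_paragraph": n_words / len(paragraphs) if paragraphs else 0.0,
    }
```

Because the function relies only on whitespace and basic punctuation, it can be applied unchanged to Hebrew-Aramaic or English text. The resulting values could then be arranged as a feature vector per document and passed to a classifier such as an SVM.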
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D. (2006). Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets. In: Todorovski, L., Lavrač, N., Jantke, K.P. (eds) Discovery Science. DS 2006. Lecture Notes in Computer Science, vol 4265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893318_13
DOI: https://doi.org/10.1007/11893318_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46491-4
Online ISBN: 978-3-540-46493-8
eBook Packages: Computer Science, Computer Science (R0)