Corpus-Based Study of Scientific Methodology: Comparing the Historical and Experimental Sciences

  • Shlomo Argamon
  • Jeff Dodick
Part of The Information Retrieval Series book series (INRE, volume 20)


This chapter studies the use of textual features based on systemic functional linguistics for genre-based text categorization. We describe feature sets that represent different types of conjunctions and modal assessment, which together partially indicate how different genres structure text and which classes of attitudes towards propositions they prefer. Examining which features are important for classification then enables analysis of large-scale rhetorical differences between genres. The specific domain we studied comprises scientific articles in the historical and experimental sciences (paleontology and physical chemistry, respectively). We applied the SMO learning algorithm, which with our feature set achieved over 83% accuracy in classifying articles according to field, even though no field-specific terms were used as features. The most highly weighted features for each field were consistent with hypothesized methodological differences between historical and experimental sciences, thus lending empirical support to the recent philosophical claim of multiple scientific methods.
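As a rough illustration of the kind of field-independent feature described above, the following minimal sketch computes, for one text, the share of tokens that fall into a few conjunction and modality classes. The class names and word lists are hypothetical and heavily abbreviated; the abstract does not give the chapter's actual feature sets, and the SMO classification step is not shown here.

```python
import re
from collections import Counter

# Hypothetical, abbreviated word lists chosen for illustration only.
# The chapter's actual systemic-functional feature sets are not given
# in this abstract.
FEATURE_CLASSES = {
    "conj_extension": {"and", "but", "or", "however"},
    "conj_enhancement": {"because", "so", "then", "therefore"},
    "modal_probability": {"may", "might", "probably", "possibly"},
    "modal_obligation": {"must", "should"},
}


def feature_vector(text):
    """Return, for each class, its share of all word tokens in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        for name, words in FEATURE_CLASSES.items():
            if tok in words:
                counts[name] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {name: counts[name] / total for name in FEATURE_CLASSES}


sample = "The fossils may be older, but the layer was probably disturbed."
vec = feature_vector(sample)
```

Vectors of this kind, computed per article, could then be passed to any linear classifier, whose per-feature weights would show which classes separate the two fields.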


Text classification; systemic functional linguistics; computational stylistics; philosophy of science; science education



Copyright information

© Springer 2006

Authors and Affiliations

  • Shlomo Argamon (1)
  • Jeff Dodick (2)
  1. Department of Computer Science, Illinois Institute of Technology, Chicago, USA
  2. Science Teaching Center, The Hebrew University of Jerusalem, Jerusalem, Israel
