Automatic Pragmatic Text Segmentation of Historical Letters

  • Conference paper

Abstract

In this investigation we aim to reduce the manual workload of pragmatic research by automatically processing a corpus of historical letters. We focus on two consecutive subtasks. The first is automatic text segmentation of the letters into formal and informal parts using a statistical n-gram based technique. The second is semantic labeling of the formal parts of the letters using supervised machine learning. The main stumbling block in our investigation is data sparsity, which stems from the small size of the data set and is aggravated by the spelling variation present in the historical letters. We try to address the latter problem with a text normalization step that combines dictionary lookup and edit distance. We achieve a micro-averaged F-score of 86% for the text segmentation task and 66.3% for the semantic labeling task. Even though these scores are not high enough to completely replace manual annotation with automatic annotation, our results are promising and demonstrate that an automatic approach based on such a small data set is feasible.
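The normalization step mentioned in the abstract (dictionary lookup followed by edit distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `edit_distance` and `normalize`, the `max_distance` threshold, and the tiny lexicon are assumptions made purely for illustration, and the example words are not taken from the paper's corpus.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(token: str, lexicon: set[str], max_distance: int = 1) -> str:
    """Map a historical spelling to a lexicon entry (hypothetical helper).

    Step 1: dictionary lookup; a known form is returned unchanged.
    Step 2: otherwise return the closest lexicon entry within
            max_distance edits, or the token itself if none qualifies.
    """
    if token in lexicon:
        return token
    best, best_dist = token, max_distance + 1
    for entry in sorted(lexicon):  # sorted so that ties break deterministically
        dist = edit_distance(token, entry)
        if dist < best_dist:
            best, best_dist = entry, dist
    return best


# Illustrative lexicon only; a real system would load a full dictionary.
lexicon = {"muito", "senhor", "merce"}
print(normalize("muyto", lexicon))
print(normalize("senhor", lexicon))
```

Keeping the distance threshold small is a deliberate choice in this sketch: with a sparse lexicon, a larger threshold would map unrelated words onto dictionary entries and add noise rather than reduce sparsity.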

Acknowledgements

We would like to thank Mariana Gomes, Ana Rita Guilherme and Leonor Tavares for the manual annotation. We are grateful to João Paulo Silvestre for sharing his electronic version of the Bluteau Dictionary and frequency counts. This work is funded by the Portuguese Science Foundation, FCT (Fundação para a Ciência e a Tecnologia).

Author information

Corresponding author

Correspondence to Iris Hendrickx.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hendrickx, I., Généreux, M., Marquilhas, R. (2011). Automatic Pragmatic Text Segmentation of Historical Letters. In: Sporleder, C., van den Bosch, A., Zervanou, K. (eds) Language Technology for Cultural Heritage. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20227-8_8

  • DOI: https://doi.org/10.1007/978-3-642-20227-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20226-1

  • Online ISBN: 978-3-642-20227-8

  • eBook Packages: Computer Science (R0)
