Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling

Schmidtler, Mauritius A. R.; Amtrup, Jan W.

doi:10.1007/978-1-84628-754-1_8


Abstract

Large organizations are increasingly confronted with the problem of capturing, processing, and archiving large amounts of data. The problem is especially cumbersome when the data is stored on paper, for several reasons. First, the weight, volume, and relative fragility of paper complicate handling and require specific, labor-intensive processes. Second, for automatic processing, the information on the pages must be digitized by means of Optical Character Recognition (OCR), which introduces a certain number of errors into the data retrieved from paper. Third, the identities of individual documents become blurred: in a stack of paper, the boundaries between documents are lost, or at least obscured to a large degree.



Author information

Authors and Affiliations

Kofax Image Products, 5465 Morehouse Dr, Suite 140, San Diego, CA, 92121, USA
Mauritius A. R. Schmidtler & Jan W. Amtrup


Editor information

Editors and Affiliations

Bellevue, WA, 98008, USA
Anne Kao BA, MA, MS, PhD & Stephen R. Poteet BA, MA, CPhil


About this chapter

Cite this chapter

Schmidtler, M.A.R., Amtrup, J.W. (2007). Automatic Document Separation: A Combination of Probabilistic Classification and Finite-State Sequence Modeling. In: Kao, A., Poteet, S.R. (eds) Natural Language Processing and Text Mining. Springer, London. https://doi.org/10.1007/978-1-84628-754-1_8


DOI: https://doi.org/10.1007/978-1-84628-754-1_8
Publisher Name: Springer, London
Print ISBN: 978-1-84628-175-4
Online ISBN: 978-1-84628-754-1
eBook Packages: Computer Science, Computer Science (R0)
