Token Identification Using HMM and PPM Models

Wen, Yingying; Witten, Ian H.; Wang, Dianhui

doi:10.1007/978-3-540-24581-0_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2903)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Hidden Markov models (HMMs) and prediction by partial matching (PPM) models have been used successfully in language processing tasks, including learning-based token identification. Most existing systems are domain- and language-dependent, which limits their retargetability and applicability. This paper investigates the effect of combining HMMs and PPM on token identification. We implement a system that bridges the two well-known methods through words that are new to the identification model. The system is fully domain- and language-independent: no code changes are necessary when applying it to other domains or languages, and the only required input is an annotated corpus. The system was tested on two corpora and achieved an overall F-measure of 69.02% for TCC and 76.59% for BIB. Although this performance is not as good as that obtained from a system with language-dependent components, the proposed system can handle a wide range of domains and languages. Dates are identified best, with 73% and 92% of tokens correctly identified for the two corpora, respectively. The system also performs reasonably well on people's names, with 68% correct tokens for TCC and 76% for BIB.
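The abstract reports results as F-measures. As a minimal sketch, assuming the standard definition that combines precision and recall (the abstract does not state which variant the authors use), the score can be computed from token counts as follows. The function name `f_measure` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_measure(correct: int, predicted: int, actual: int, beta: float = 1.0) -> float:
    """F-measure from token counts.

    correct   -- tokens the system identified correctly
    predicted -- tokens the system identified in total
    actual    -- tokens present in the annotated (gold) corpus
    beta      -- weight of recall relative to precision (1.0 = balanced)
    """
    if correct == 0 or predicted == 0 or actual == 0:
        return 0.0  # avoid division by zero; no correct tokens means a zero score
    precision = correct / predicted
    recall = correct / actual
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# precision = 60/80 = 0.75, recall = 60/100 = 0.60
print(f"{f_measure(correct=60, predicted=80, actual=100):.4f}")
```

With `beta=1.0` this is the harmonic mean of precision and recall, so a system scores well only when both are reasonably high.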

Author information

Authors and Affiliations

School of Computer Science and Software Engineering, Monash University, Clayton, Victoria, 3800, Australia
Yingying Wen
Department of Computer Science, The University of Waikato, Hamilton, New Zealand
Yingying Wen & Ian H. Witten
Department of Computer Science and Computer Engineering, La Trobe University, Victoria, 3086, Australia
Dianhui Wang

Editor information

Editors and Affiliations

Department of Computer Science, Australian National University, ACT 0200, Acton, Australia
Tamás (Tom) Domonkos Gedeon
Murdoch University,
Lance Chun Che Fung

About this paper

Cite this paper

Wen, Y., Witten, I.H., Wang, D. (2003). Token Identification Using HMM and PPM Models. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_15

DOI: https://doi.org/10.1007/978-3-540-24581-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20646-0
Online ISBN: 978-3-540-24581-0
eBook Packages: Springer Book Archive
