Optimal Stem Identification in Presence of Suffix List

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Abstract

Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming words using the suffixes of a language as linguistic information is, in comparison, an easier problem. In this work we considered stemming as the process of obtaining a minimum-size lexicon from an unannotated corpus by using a suffix set. We proved that the exact lexicon reduction problem is NP-hard and developed a polynomial-time approximation. We also proposed a probabilistic model for stemming that minimizes the stem distributional entropy. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.
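To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It assumes the lexicon reduction problem can be read as a covering problem (each word must be assigned one stem obtained by stripping a listed suffix, or the word itself), and applies the standard greedy covering heuristic to it. The function names `candidate_stems`, `greedy_lexicon`, and `stem_entropy`, the English toy words, and the suffix list are all hypothetical and chosen only for illustration. The entropy function merely evaluates the entropy of the stem distribution under a given assignment; it does not perform the minimization described in the paper.

```python
from collections import Counter
from math import log2


def candidate_stems(word, suffixes):
    """Return every stem obtainable by stripping one listed suffix, plus the word itself."""
    stems = {word}  # empty suffix: the word may serve as its own stem
    for suffix in suffixes:
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            stems.add(word[: -len(suffix)])
    return stems


def greedy_lexicon(words, suffixes):
    """Greedy covering heuristic (an assumption, not the paper's method):
    repeatedly pick the stem that accounts for the most unassigned words."""
    uncovered = set(words)
    candidates = {w: candidate_stems(w, suffixes) for w in uncovered}
    assignment = {}
    while uncovered:
        counts = Counter(s for w in uncovered for s in candidates[w])
        # Deterministic tie-break: most words covered, then longest stem, then alphabetical.
        best = min(counts, key=lambda s: (-counts[s], -len(s), s))
        for w in [w for w in uncovered if best in candidates[w]]:
            assignment[w] = best
            uncovered.discard(w)
    return assignment


def stem_entropy(assignment, freq):
    """Shannon entropy (in bits) of the stem distribution induced by an assignment."""
    totals = Counter()
    for word, stem in assignment.items():
        totals[stem] += freq[word]
    n = sum(totals.values())
    return -sum(c / n * log2(c / n) for c in totals.values())


# Toy data (hypothetical; the paper uses a Malayalam corpus and suffix set).
words = ["walk", "walks", "walked", "walking", "talk", "talks", "talked"]
suffixes = ["s", "ed", "ing"]
assignment = greedy_lexicon(words, suffixes)
lexicon = set(assignment.values())
entropy = stem_entropy(assignment, Counter(words))
```

On this toy input the greedy pass reduces seven word forms to the two-entry lexicon `walk` and `talk`. A greedy choice of this kind gives no optimality guarantee in general, which is consistent with the abstract's point that the exact problem is NP-hard.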

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Vasudevan, N., Bhattacharyya, P. (2012). Optimal Stem Identification in Presence of Suffix List. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_8

  • DOI: https://doi.org/10.1007/978-3-642-28604-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28603-2

  • Online ISBN: 978-3-642-28604-9

  • eBook Packages: Computer Science, Computer Science (R0)