Web-Drawn Corpus for Indian Languages: A Case of Hindi

Choudhary, Narayan

doi:10.1007/978-3-642-19403-0_36

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 139)

Abstract

Hindi text on the web has come of age since the advent of Unicode standards for Indic languages. Hindi content has been growing by leaps and bounds and is now easily accessible on the web at large. For linguists and Natural Language Processing practitioners, it could serve as a rich corpus for study. This paper examines how good a manually collected corpus from the web can be. I start with my observations on finding Hindi text and creating a representative corpus out of it. I then compare this corpus with another standard, manually crafted corpus and draw conclusions about what needs to be done to make such a web corpus more useful for studies in linguistics.

Author information

Authors and Affiliations

Jawaharlal Nehru University, New Delhi, India
Narayan Choudhary

Editor information

Editors and Affiliations

Department of Computer Science, Punjabi University, Patiala, India
Chandan Singh, Gurpreet Singh Lehal, Jyotsna Sengupta, Dharam Veer Sharma & Vishal Goyal

Cite this paper

Choudhary, N. (2011). Web-Drawn Corpus for Indian Languages: A Case of Hindi. In: Singh, C., Singh Lehal, G., Sengupta, J., Sharma, D.V., Goyal, V. (eds) Information Systems for Indian Languages. ICISIL 2011. Communications in Computer and Information Science, vol 139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19403-0_36

DOI: https://doi.org/10.1007/978-3-642-19403-0_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19402-3
Online ISBN: 978-3-642-19403-0
eBook Packages: Computer Science, Computer Science (R0)
