News Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites

Krilavičius, Tomas; Medelis, Žygimantas; Kapočiūtė-Dzikienė, Jurgita; Žalandauskas, Tomas

doi:10.1007/978-3-642-33308-8_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 319)

Included in the following conference series:

International Conference on Information and Software Technologies

Abstract

The amount of information that is created, used, or stored is growing exponentially, and the types of data sources are diverse. Most of this information is available as unstructured text, and a considerable part of it is available online, usually as Internet resources. It is too expensive, or even impossible, for humans to analyze all of these resources for the required information. Classical Information Technology techniques are not sufficient to process such amounts of information and render it in a form convenient for further analysis. Information Retrieval (IR) and Natural Language Processing (NLP) provide a number of instruments for information analysis and retrieval. In this paper we present a combined application of NLP and IR to Lithuanian media analysis. We demonstrate that a combination of IR and NLP tools, with appropriate changes, can be successfully applied to Lithuanian media texts.

Author information

Authors and Affiliations

Baltic Institute of Advanced Technology, Saulėtekio 15, Vilnius, Lithuania
Tomas Krilavičius, Jurgita Kapočiūtė-Dzikienė & Tomas Žalandauskas
UAB “Tokenmill”, Lithuania
Žygimantas Medelis

Editor information

Editors and Affiliations

Kaunas University of Technology, Studentu g. 50-313a, LT-51368, Kaunas, Lithuania
Tomas Skersys & Rimantas Butleris
Kaunas University of Technology, Studentu g. 50-309a, LT-51368, Kaunas, Lithuania
Rita Butkiene

About this paper

Cite this paper

Krilavičius, T., Medelis, Ž., Kapočiūtė-Dzikienė, J., Žalandauskas, T. (2012). News Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites. In: Skersys, T., Butleris, R., Butkiene, R. (eds) Information and Software Technologies. ICIST 2012. Communications in Computer and Information Science, vol 319. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33308-8_5

DOI: https://doi.org/10.1007/978-3-642-33308-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33307-1
Online ISBN: 978-3-642-33308-8
eBook Packages: Computer Science, Computer Science (R0)
