Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9469)

Abstract

In this paper, we address the task of identifying tweets that mention books (TMB) among tweets that contain strings identical to full book titles. Although this task can be treated as a kind of Named Entity Recognition, it is made harder by the fact that book titles often consist of ordinary expressions (such as “The Girl on the Train”). Furthermore, when tweets are gathered through a dictionary-based search, many of the tweets that contain a string matching a full book title are spam. However, given a complete list of book titles (e.g. from a library union catalogue or a book store's commercial bibliographic data), the task can be solved as text classification. We therefore propose a two-step pipeline consisting of spam filtering followed by TMB classification, both based on supervised learning with a small amount of labelled data. We constructed optimal classifiers by comparing combinations of four proven supervised learning methods with different features. Given the difficulty of the task, our pipeline performed well, achieving an F-score of about 0.7.
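
The two-step pipeline described above (spam filtering, then TMB classification) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the four learning methods or the features used, so the sketch assumes a small multinomial Naive Bayes classifier over whitespace tokens, and the training tweets, labels, and the names `NaiveBayes`, `identify_tmb`, and `f_score` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokens (an assumed, deliberately simple feature)."""
    return text.lower().split()


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing (assumed method)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def identify_tmb(tweets, spam_filter, tmb_classifier):
    """Step 1: discard tweets labelled 'spam'. Step 2: keep those labelled 'tmb'."""
    non_spam = [t for t in tweets if spam_filter.predict(t) != "spam"]
    return [t for t in non_spam if tmb_classifier.predict(t) == "tmb"]


def f_score(predicted, gold):
    """Harmonic mean of precision and recall over two collections of tweets."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labelled data: every tweet contains a title-like string.
spam_filter = NaiveBayes().fit(
    ["buy now free followers click the girl on the train",
     "click here free download the girl on the train",
     "just finished reading the girl on the train",
     "saw a girl on the train this morning"],
    ["spam", "spam", "ham", "ham"],
)
tmb_classifier = NaiveBayes().fit(
    ["just finished reading the girl on the train great novel",
     "the girl on the train is a gripping book",
     "saw a girl on the train this morning",
     "the girl on the train next to me was singing"],
    ["tmb", "tmb", "other", "other"],
)

tweets = [
    "free followers click now the girl on the train",
    "reading the girl on the train what a book",
    "a girl on the train was singing this morning",
]
result = identify_tmb(tweets, spam_filter, tmb_classifier)
print(result)
```

Running the spam filter first means the second classifier only has to separate genuine book mentions from ordinary uses of the same words, which is the division of labour the abstract describes. The toy data here is far too small to say anything about real performance.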

Author information

Corresponding author

Correspondence to Shuntaro Yada.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Yada, S., Kageura, K. (2015). Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods. In: Allen, R., Hunter, J., Zeng, M. (eds) Digital Libraries: Providing Quality Information. ICADL 2015. Lecture Notes in Computer Science, vol 9469. Springer, Cham. https://doi.org/10.1007/978-3-319-27974-9_30

  • DOI: https://doi.org/10.1007/978-3-319-27974-9_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27973-2

  • Online ISBN: 978-3-319-27974-9

  • eBook Packages: Computer Science, Computer Science (R0)
