Vector Space Models for the Classification of Short Messages on Social Network Services

  • Conference paper
Web Information Systems and Technologies (WEBIST 2013)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 189)

Abstract

In this chapter we review vector space models and propose a new one, based on the Jensen-Shannon divergence, for classifying ignored short messages on a social network service. We define ignored messages as published messages that received no interaction. Our goal is to classify messages that are about to be published as ignored, so that they can be discarded from the set of messages available to a recommender system. To evaluate our model, we conduct experiments comparing different models on a Twitter dataset with more than 13,000 Twitter accounts. Our best model obtained an average accuracy of 0.77, compared to 0.74 for a model from the literature. Similarly, it obtained an average precision of 0.74, compared to 0.58 for the second-best-performing model.
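As a rough illustration of the quantity the abstract names, the following sketch computes the Jensen-Shannon divergence between two term distributions. It is not the chapter's model, which the abstract does not specify; the function names, the toy vocabulary, and the choice of relative term frequencies as the distributions are assumptions made here for illustration only.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary):
    """Relative frequency of each vocabulary term in a tokenized message.

    Hypothetical helper: turning a message into a probability
    distribution this way is an assumption of this sketch.
    """
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        raise ValueError("message contains no vocabulary terms")
    return [counts[term] / total for term in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q
    from their midpoint distribution m = (p + q) / 2.

    With base-2 logarithms the result lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage with an invented vocabulary and two invented messages.
vocabulary = ["news", "breaking", "lunch", "coffee"]
message_a = term_distribution(["breaking", "news", "news"], vocabulary)
message_b = term_distribution(["coffee", "lunch", "news"], vocabulary)
divergence = js_divergence(message_a, message_b)
```

Unlike the plain Kullback-Leibler divergence, this measure is symmetric and stays finite when a term appears in only one of the two messages, which is common for short texts.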

Notes

  1. http://www.tweetstats.com/graphs/nytimes/zoom/2012/Sep

  2. These numbers are based on discussions in blog posts such as http://thenextweb.com/twitter/2012/01/07/interesting-fact-most-tweets-posted-are-approximately-30-characters-long/ and http://www.ayman-naaman.net/2010/04/21/how-many-characters-do-you-tweet/, which do not provide an average. In our own dataset, presented in Sect. 4.1, the average number of characters in a tweet is 84.

  3. Note that we do not normalize our tf-idf model by message length, since all messages tend to have similar lengths [18].

  4. https://dev.twitter.com/docs/api/1/get/statuses/sample

References

  1. Bell, R., Volinsky, C., Koren, Y.: Matrix factorization techniques for recommender systems. IEEE Comput. 42(8), 30–37 (2009)

  2. Chen, K., Chen, T., Zheng, G., Jin, O., Yao, E., Yu, Y.: Collaborative personalized tweet recommendation. In: Proceedings of the 35th international ACM SIGIR conference on Research and Development in Information Retrieval, pp. 661–670. SIGIR ’12, ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348372

  3. Chen, M., Jin, X., Shen, D.: Short text classification improved by learning multi-granularity topics. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI’11, vol. 3, pp. 1776–1781. AAAI Press (2011). http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-298

  4. Combarro, E., Montanes, E., Diaz, I., Ranilla, J., Mones, R.: Introducing a family of linear measures for feature selection in text categorization. IEEE Trans. Knowl. Data Eng. 17(9), 1223–1232 (2005)

  5. Dagan, I., Lee, L., Pereira, F.C.N.: Similarity-based models of word cooccurrence probabilities. Mach. Learn. 34(1–3), 43–69 (1999). doi:10.1023/A:1007537716579

  6. Díaz, I., Ranilla, J., Montañes, E., Fernández, J., Combarro, E.: Improving performance of text categorization by combining filtering and support vector machines. J. Am. Soc. Inf. Sci. Technol. 55(7), 579–592 (2004)

  7. Halawi, G., Dror, G., Gabrilovich, E., Koren, Y.: Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 1406–1414. ACM, New York (2012). http://doi.acm.org.zorac.aub.aau.dk/10.1145/2339530.2339751

  8. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600. ACM, Raleigh (2010). http://portal.acm.org/citation.cfm?id=1772690.1772751

  9. Lage, R., Durao, F., Dolog, P.: Towards effective group recommendations for microblogging users. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pp. 923–928. ACM, New York (2012). http://doi.acm.org/10.1145/2245276.2245456

  10. Lan, M., Tan, C.L., Low, H.B., Sung, S.Y.: A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW ’05, pp. 1032–1033. ACM, New York (2005). http://doi.acm.org.zorac.aub.aau.dk/10.1145/1062745.1062854

  11. Lewis, D.D.: Naive (Bayes) at forty: the independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)

  12. Lin, J., Mishne, G.: A study of “Churn” in tweets and real-time search queries. In: 6th International AAAI Conference on Weblogs and Social Media, May 2012. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4599

  13. Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Machine Learning-International Workshop then Conference, pp. 258–267. Morgan Kaufmann Publishers, Inc. (1999)

  14. Petrovic, S., Osborne, M., Lavrenko, V.: RT to win! predicting message propagation in twitter. In: 5th International AAAI Conference on Weblogs and Social Media, May 2011. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2754

  15. Robertson, S.E., Walker, S., Beaulieu, M., Willett, P.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: TREC, pp. 199–210 (1998)

  16. Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 377–386. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135834

  17. Sun, A.: Short text classification using very few words. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 1145–1146. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348511

  18. Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010). http://arxiv.org/abs/1003.1141. arXiv:1003.1141

  19. Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Machine Learning-International Workshop then Conference, pp. 412–420. Morgan Kaufmann Publishers, Inc. (1997)

  20. Yih, W.T., Goodman, J., Carvalho, V.R.: Finding advertising keywords on web pages. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 213–222. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135813

  21. Yih, W.T., Meek, C.: Improving similarity measures for short segments of text. In: Proceedings of the 22nd National Conference on Artificial Intelligence AAAI’07, vol. 2, pp. 1489–1494. AAAI Press (2007). http://dl.acm.org.zorac.aub.aau.dk/citation.cfm?id=1619797.1619884

Author information

Corresponding author

Correspondence to Peter Dolog.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lage, R., Dolog, P., Leginus, M. (2014). Vector Space Models for the Classification of Short Messages on Social Network Services. In: Krempels, KH., Stocker, A. (eds) Web Information Systems and Technologies. WEBIST 2013. Lecture Notes in Business Information Processing, vol 189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44300-2_13

  • DOI: https://doi.org/10.1007/978-3-662-44300-2_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44299-9

  • Online ISBN: 978-3-662-44300-2

  • eBook Packages: Computer Science, Computer Science (R0)
