Vector Space Models for the Classification of Short Messages on Social Network Services

Lage, Ricardo; Dolog, Peter; Leginus, Martin

doi:10.1007/978-3-662-44300-2_13


Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 189)

Included in the following conference series:

International Conference on Web Information Systems and Technologies


Abstract

In this chapter we review vector space models and propose a new one, based on the Jensen-Shannon divergence, for classifying ignored short messages on a social network service. We assume that ignored messages are published messages that received no interaction. Our goal is to predict which messages about to be published will be ignored, so that they can be discarded from the set of messages available to a recommender system. To evaluate our model, we conduct experiments comparing different models on a Twitter dataset with more than 13,000 Twitter accounts. Results show that our best model obtained an average accuracy of 0.77, compared to 0.74 for a model from the literature. Similarly, it obtained an average precision of 0.74, compared to 0.58 for the second-best-performing model.
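For readers unfamiliar with the Jensen-Shannon divergence named in the abstract, the following is a minimal sketch of its standard textbook definition applied to term distributions of two token lists. It is not the chapter's model: the function names, the whitespace tokenization, and the choice of base-2 logarithms are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary):
    """Relative term frequencies of `tokens` over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        raise ValueError("no token of the message occurs in the vocabulary")
    return [counts[term] / total for term in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence: the mean KL divergence to the midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative messages (invented for this sketch, not taken from the dataset).
message_a = "new blog post about vector space models".split()
message_b = "reading a post about recommender models".split()
vocabulary = sorted(set(message_a) | set(message_b))

p = term_distribution(message_a, vocabulary)
q = term_distribution(message_b, vocabulary)
print(jensen_shannon_divergence(p, q))
```

With base-2 logarithms the divergence is symmetric and bounded between 0 (identical distributions) and 1 (distributions with no shared terms), which makes it convenient as a dissimilarity between short messages.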


Notes

1. http://www.tweetstats.com/graphs/nytimes/zoom/2012/Sep
2. These numbers are based on discussions in blog posts such as http://thenextweb.com/twitter/2012/01/07/interesting-fact-most-tweets-posted-are-approximately-30-characters-long/ and http://www.ayman-naaman.net/2010/04/21/how-many-characters-do-you-tweet/, which do not provide an average. In our own dataset, presented in Sect. 4.1, the average number of characters in a tweet is 84.
3. Note that we do not normalize our tf-idf model by message length, since all messages tend to have similar sizes [18].
4. https://dev.twitter.com/docs/api/1/get/statuses/sample

References

Bell, R., Volinsky, C., Koren, Y.: Matrix factorization techniques for recommender systems. IEEE Comput. 42(8), 30–37 (2009)
Chen, K., Chen, T., Zheng, G., Jin, O., Yao, E., Yu, Y.: Collaborative personalized tweet recommendation. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 661–670. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348372
Chen, M., Jin, X., Shen, D.: Short text classification improved by learning multi-granularity topics. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI’11, vol. 3, pp. 1776–1781. AAAI Press (2011). http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-298
Combarro, E., Montanes, E., Diaz, I., Ranilla, J., Mones, R.: Introducing a family of linear measures for feature selection in text categorization. IEEE Trans. Knowl. Data Eng. 17(9), 1223–1232 (2005)
Dagan, I., Lee, L., Pereira, F.C.N.: Similarity-based models of word cooccurrence probabilities. Mach. Learn. 34(1–3), 43–69 (1999). doi:10.1023/A:1007537716579
Díaz, I., Ranilla, J., Montañes, E., Fernández, J., Combarro, E.: Improving performance of text categorization by combining filtering and support vector machines. J. Am. Soc. Inf. Sci. Technol. 55(7), 579–592 (2004)
Halawi, G., Dror, G., Gabrilovich, E., Koren, Y.: Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 1406–1414. ACM, New York (2012). http://doi.acm.org.zorac.aub.aau.dk/10.1145/2339530.2339751
Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600. ACM, Raleigh (2010). http://portal.acm.org/citation.cfm?id=1772690.1772751
Lage, R., Durao, F., Dolog, P.: Towards effective group recommendations for microblogging users. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pp. 923–928. ACM, New York (2012). http://doi.acm.org/10.1145/2245276.2245456
Lan, M., Tan, C.L., Low, H.B., Sung, S.Y.: A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW ’05, pp. 1032–1033. ACM, New York (2005). http://doi.acm.org.zorac.aub.aau.dk/10.1145/1062745.1062854
Lewis, D.D.: Naive (Bayes) at forty: the independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
Lin, J., Mishne, G.: A study of “Churn” in tweets and real-time search queries. In: 6th International AAAI Conference on Weblogs and Social Media, May 2012. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4599
Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Machine Learning-International Workshop then Conference, pp. 258–267. Morgan Kaufmann Publishers, INC (1999)
Petrovic, S., Osborne, M., Lavrenko, V.: RT to win! predicting message propagation in twitter. In: 5th International AAAI Conference on Weblogs and Social Media, May 2011. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2754
Robertson, S.E., Walker, S., Beaulieu, M., Willett, P.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: TREC, pp. 199–210 (1998)
Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 377–386. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135834
Sun, A.: Short text classification using very few words. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 1145–1146. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348511
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010). http://arxiv.org/abs/1003.1141. arXiv:1003.1141
Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Machine Learning-International Workshop then Conference, pp. 412–420. Morgan Kaufmann Publishers, INC. (1997)
Yih, W.T., Goodman, J., Carvalho, V.R.: Finding advertising keywords on web pages. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 213–222. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135813
Yih, W.T., Meek, C.: Improving similarity measures for short segments of text. In: Proceedings of the 22nd National Conference on Artificial Intelligence AAAI’07, vol. 2, pp. 1489–1494. AAAI Press (2007). http://dl.acm.org.zorac.aub.aau.dk/citation.cfm?id=1619797.1619884


Author information

Authors and Affiliations

Aalborg University, Aalborg, Denmark
Ricardo Lage, Peter Dolog & Martin Leginus
LIP6, University Pierre Et Marie Curie (Paris 6), Paris, France
Ricardo Lage


Corresponding author

Correspondence to Peter Dolog.

Editor information

Editors and Affiliations

RWTH Aachen University, Aachen, Germany
Karl-Heinz Krempels
Virtual Vehicle Research Center, Graz, Austria
Alexander Stocker


Cite this paper

Lage, R., Dolog, P., Leginus, M. (2014). Vector Space Models for the Classification of Short Messages on Social Network Services. In: Krempels, KH., Stocker, A. (eds) Web Information Systems and Technologies. WEBIST 2013. Lecture Notes in Business Information Processing, vol 189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44300-2_13


DOI: https://doi.org/10.1007/978-3-662-44300-2_13
Published: 25 July 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44299-9
Online ISBN: 978-3-662-44300-2
eBook Packages: Computer Science (R0)
