Abstract
In analyzing data from social and communication networks, we encounter the problem of classifying objects where there is explicit link structure amongst the objects. We study the problem of inferring the classification of all the objects from a labeled subset, using only link-based information between objects.
We abstract the above as a labeling problem on multigraphs with weighted edges. We present two classes of algorithms, based on local and global similarities. We then focus on multigraphs induced by blog data and apply our general algorithms to infer labels such as age, gender, and location associated with each blog, based only on the link structure among the blogs. We perform a comprehensive set of experiments with real, large-scale blog data sets and show that significant accuracy is possible from little or no non-link information, and that our methods scale to millions of nodes and edges.
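To make the problem setting concrete, the following is a minimal sketch of one generic local approach: weighted neighbor voting on a multigraph. It is not the algorithm from the paper, whose details the abstract does not give. The function name `propagate_labels`, the edge-triple format, the treatment of links as undirected, and the toy labels are all assumptions made for illustration.

```python
from collections import defaultdict


def propagate_labels(edges, seed_labels, max_iters=10):
    """Label unlabeled nodes by weighted vote of their labeled neighbors.

    edges: iterable of (u, v, weight) triples; parallel edges between the
           same pair are allowed (multigraph) and their weights are summed.
    seed_labels: dict mapping the initially labeled nodes to their labels.
    Assumption: links are treated as undirected.
    """
    # Collapse the multigraph into a weighted adjacency map.
    adj = defaultdict(lambda: defaultdict(float))
    for u, v, w in edges:
        adj[u][v] += w
        adj[v][u] += w

    labels = dict(seed_labels)
    for _ in range(max_iters):
        updates = {}
        for node in adj:
            if node in labels:
                continue  # a node keeps its label once it has one
            votes = defaultdict(float)
            for nbr, w in adj[node].items():
                if nbr in labels:
                    votes[labels[nbr]] += w
            if votes:
                # Heaviest label wins; ties go to the smallest label.
                updates[node] = max(sorted(votes), key=votes.get)
        if not updates:
            break  # nothing new could be labeled this round
        labels.update(updates)  # apply the whole round at once
    return labels


# Toy data (hypothetical): five blogs, two of them with known age groups.
edges = [
    ("a", "b", 1.0),
    ("a", "b", 1.0),  # a second, parallel link between a and b
    ("b", "c", 1.0),
    ("c", "d", 2.0),
    ("d", "e", 1.0),
]
seeds = {"a": "20s", "e": "30s"}
result = propagate_labels(edges, seeds)
print(result)
```

In this toy run, `b` takes its label from `a` and `d` from `e` in the first round; `c` is then decided by the heavier of its two links. A global method would instead use similarity between nodes that are not directly linked, which this sketch does not attempt.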
©ACM, 2007. This is a minor revision of the work published in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, http://doi.acm.org/10.1145/1348549.1348560
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bhagat, S., Cormode, G., Rozenbaum, I. (2009). Applying Link-Based Classification to Label Blogs. In: Zhang, H., et al. Advances in Web Mining and Web Usage Analysis. SNAKDD 2007. Lecture Notes in Computer Science, vol 5439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00528-2_6
DOI: https://doi.org/10.1007/978-3-642-00528-2_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00527-5
Online ISBN: 978-3-642-00528-2
eBook Packages: Computer Science, Computer Science (R0)