Automatic Annotation of a Dynamic Corpus by Label Propagation

  • Thomas Lansdall-Welfare
  • Ilias Flaounas
  • Nello Cristianini
Conference paper
Part of the Springer Proceedings in Mathematics & Statistics book series (PROMS, volume 30)


We address the problem of automatically annotating a large, constantly expanding corpus in which neither the dataset nor the set of possible labels is necessarily static, and where annotation must be efficient. The application is motivated by real-world scenarios in news content analysis and social-web content analysis. We investigate a method based on constructing a graph whose vertices are the documents and whose edges represent some notion of semantic similarity. On this graph, label propagation algorithms can efficiently assign labels to documents based on the annotations of their neighbours. We present experimental results on both the efficient construction of the graph and the propagation of the labels. We compare the effectiveness of several approaches to graph construction by building graphs of 800,000 vertices from the Reuters corpus, and show that relation-based classification is competitive with support vector machines, which can be considered state of the art. We also show that combining our relation-based approach with support vector machines improves on either method used alone.
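As a rough illustration of the idea described above (not the authors' implementation), the following Python sketch builds a small nearest-neighbour graph over bag-of-words documents and spreads seed labels along its edges. The function names, the toy documents, the cosine similarity measure, the choice of k and the weighted-vote rule are all assumptions made for this example. The brute-force all-pairs comparison is quadratic in the number of documents and would not scale to graphs of the size discussed in the paper.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_graph(docs, k):
    """Link each document to its k most similar other documents.

    Brute force over all pairs (an assumption for brevity); edges with
    zero similarity are dropped.
    """
    graph = {}
    for i, di in enumerate(docs):
        sims = [(cosine(di, dj), j) for j, dj in enumerate(docs) if j != i]
        sims.sort(reverse=True)
        graph[i] = [(j, s) for s, j in sims[:k] if s > 0.0]
    return graph


def propagate(graph, seed_labels, rounds=5):
    """Give each unlabelled vertex the label with the largest
    similarity-weighted vote among its already-labelled neighbours,
    repeating until nothing changes or `rounds` is exhausted."""
    labels = dict(seed_labels)
    for _ in range(rounds):
        updates = {}
        for v, nbrs in graph.items():
            if v in labels:
                continue
            votes = Counter()
            for u, s in nbrs:
                if u in labels:
                    votes[labels[u]] += s
            if votes:
                updates[v] = votes.most_common(1)[0][0]
        if not updates:
            break
        labels.update(updates)
    return labels


# Hypothetical toy corpus: documents 0 and 2 are annotated, the rest are not.
docs = [
    {"market": 1, "shares": 1, "bank": 1},
    {"market": 1, "shares": 1, "profit": 1},
    {"match": 1, "goal": 1, "team": 1},
    {"goal": 1, "team": 1, "league": 1},
    {"profit": 1, "dividend": 1, "quarter": 1},
]
graph = knn_graph(docs, k=2)
labels = propagate(graph, {0: "finance", 2: "sport"})
print(labels)
```

In this toy run, document 4 shares no terms with either seed and only receives a label in the second round, via document 1, which is the sense in which labels propagate through the graph rather than being assigned by direct comparison with annotated documents.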


Graph construction, Label propagation, Large scale, Text categorisation



I. Flaounas and N. Cristianini are supported by FP7 under grant agreement no. 231495 (ComPLACS Project). N. Cristianini is supported by a Royal Society Wolfson Research Merit Award. All authors are supported by the Pascal2 Network of Excellence.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Thomas Lansdall-Welfare (1, email author)
  • Ilias Flaounas (1)
  • Nello Cristianini (1)
  1. Intelligent Systems Laboratory, University of Bristol, Bristol, UK
