Machine Learning, Volume 92, Issue 2–3, pp 457–477

On using nearly-independent feature families for high precision and confidence



Consider learning tasks where the precision requirement is very high, for example a 99% precision requirement for a video classification application. We report that when very different sources of evidence such as text, audio, and video features are available, combining the outputs of base classifiers trained on each feature type separately, i.e., late fusion, can substantially increase the recall of the combination at high precisions, compared to the performance of a single classifier trained on all the feature types, i.e., early fusion, or compared to the individual base classifiers. We show how the probability of a joint false-positive mistake can be less—in some cases significantly less—than the product of individual probabilities of conditional false-positive mistakes (a NoisyOR combination). Our analysis highlights a simple key criterion for this boosted precision phenomenon and justifies referring to such feature families as (nearly) independent. We assess the relevant factors for achieving high precision empirically, and explore combination techniques informed by the analysis.
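The two combination rules the abstract alludes to can be illustrated with a minimal sketch. This is not the paper's implementation; the per-family scores below are hypothetical calibrated probabilities standing in for the text, audio, and video base classifiers.

```python
def noisy_or(probs):
    """NoisyOR late fusion: the probability that at least one
    base classifier signals the class, 1 - prod(1 - p_i),
    under a (near-)independence assumption across families."""
    combined = 1.0
    for p in probs:
        combined *= 1.0 - p
    return 1.0 - combined


def joint_false_positive(fp_rates):
    """Precision side: if the base classifiers' false-positive
    mistakes are (nearly) independent, requiring every feature
    family to fire bounds the joint false-positive probability
    by the product of the individual conditional rates."""
    joint = 1.0
    for f in fp_rates:
        joint *= f
    return joint


# Hypothetical calibrated scores from text, audio, and video classifiers:
print(noisy_or([0.6, 0.5, 0.4]))           # recall-boosting combination
print(joint_false_positive([0.05, 0.04]))  # joint rate far below either alone
```

The second function shows why near-independence matters for high-precision operating points: two individually mediocre false-positive rates of 5% and 4% yield a joint rate of only 0.2% when the mistakes are uncorrelated.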


Keywords: Classifier combination · Independent features · High precision · Late fusion · Early fusion · Ensembles · Multiple views · Supervised learning



Many thanks to Tomas Izzo, Kevin Murphy, Emre Sargin, Fernando Pereira, and Yoram Singer for discussions and pointers, and to the anonymous reviewers for their valuable feedback.



Copyright information

© The Author(s) 2013

Authors and Affiliations

1. Google Inc., Mountain View, USA
