Multi-task Feature Selection Using the Multiple Inclusion Criterion (MIC)

  • Paramveer S. Dhillon
  • Brian Tomasik
  • Dean Foster
  • Lyle Ungar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5781)


We address the problem of joint feature selection across multiple related classification or regression tasks. When selecting features for multiple tasks, one can usually "borrow strength" across tasks to obtain a more sensitive criterion for deciding which features to select. We propose a novel method, the Multiple Inclusion Criterion (MIC), which modifies stepwise feature selection to make it easier to select features that are helpful across multiple tasks. Our approach allows each feature to be added to none, some, or all of the tasks. MIC is most beneficial when selecting a small set of predictive features from a large pool of potential features, as is common in genomic and biological datasets. Experimental results on such datasets show that MIC usually outperforms competing multi-task learning methods, not only in accuracy but also by building simpler, more interpretable models.
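The abstract describes the approach only at a high level. The sketch below is a hypothetical illustration (not the authors' code) of the general idea: greedy forward selection where each candidate feature may be added to any subset of tasks, scored by an MDL-style trade-off in which the cost of naming the feature is paid once and shared across the tasks that use it. The penalty constants here (roughly 2 bits per coefficient, log2 p bits to name a feature, log2 k bits for the task subset) are simplified assumptions, not the paper's exact coding scheme.

```python
# Illustrative sketch of MIC-style multi-task stepwise selection.
# Assumption: penalty constants below are simplified stand-ins for the
# paper's MDL coding scheme, chosen only to show the shared-cost idea.
import numpy as np

def fit_rss(X, y, cols):
    """Residual sum of squares of a least-squares fit of y on X[:, cols]
    plus an intercept (intercept only if cols is empty)."""
    n = len(y)
    A = np.hstack([np.ones((n, 1)), X[:, cols]]) if cols else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def mic_select(X, Y, n_steps=5):
    """Greedy forward selection across tasks (columns of Y).
    Each step adds one feature to the subset of tasks where it most
    reduces an approximate total description length."""
    n, p = X.shape
    k = Y.shape[1]
    selected = [[] for _ in range(k)]  # features chosen per task
    for _ in range(n_steps):
        best = None
        for j in range(p):
            # Per-task gain (in bits) from adding feature j:
            # 0.5*n*log2(rss_old/rss_new), minus ~2 bits to code the coefficient.
            gains = []
            for t in range(k):
                if j in selected[t]:
                    gains.append(-np.inf)
                    continue
                old = fit_rss(X, Y[:, t], selected[t])
                new = fit_rss(X, Y[:, t], selected[t] + [j])
                gains.append(0.5 * n * np.log2(max(old, 1e-12) / max(new, 1e-12)) - 2.0)
            gains = np.array(gains)
            order = np.argsort(gains)[::-1]
            # The cost of naming feature j (~log2 p bits) and its task subset
            # (~log2 k bits) is paid once, shared by all tasks that use it.
            for m in range(1, k + 1):
                subset_gain = gains[order[:m]].sum() - np.log2(p) - np.log2(k)
                if best is None or subset_gain > best[0]:
                    best = (subset_gain, j, order[:m].copy())
        if best is None or best[0] <= 0:
            break  # no feature/task subset shortens the description
        _, j, tasks = best
        for t in tasks:
            selected[t].append(j)
    return selected
```

Because the feature-naming cost is shared, a feature that helps several tasks can clear the threshold even when its per-task evidence alone would not, which is the "borrowing strength" the abstract refers to.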





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Paramveer S. Dhillon (1)
  • Brian Tomasik (2)
  • Dean Foster (3)
  • Lyle Ungar (1)
  1. CIS Department, University of Pennsylvania, Philadelphia, U.S.A.
  2. Computer Science Department, Swarthmore College, U.S.A.
  3. Statistics Department, University of Pennsylvania, Philadelphia, U.S.A.
