Regression and subgroup detection for heterogeneous samples


Regression analysis of heterogeneous samples with subgroup structure is essential to the development of precision medicine. In practice, this task is often challenging because the subgroup labels are not known in advance. Detecting subgroups with similar characteristics therefore becomes critical, since it largely determines the accuracy of the regression analysis. In this article, we investigate a new framework for detecting subgroups that have similar characteristics in the feature space and similar treatment effects. The key idea is to incorporate K-means clustering into the regression framework of concave pairwise fusion, so that regression and subgroup detection can be performed simultaneously. Our method is specifically tailored to situations where the sample is not homogeneous, in the sense that the response variables in different domains of the feature space are generated through different mechanisms.
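
The two ingredients named above can be illustrated with a minimal sketch. It is not the estimator developed in the article: the abstract does not state how K-means is coupled to the fusion penalty, so the two pieces are shown separately. The objective follows a common form of concave pairwise fusion with subject-specific intercepts and a minimax concave penalty (MCP); the function names, the toy data, and the tuning values `lam` and `gamma` are all assumptions made for illustration.

```python
import math

def mcp_penalty(t, lam, gamma=3.0):
    """Minimax concave penalty (MCP) evaluated at |t|."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # constant once |t| exceeds gamma * lam

def fusion_objective(y, X, mu, beta, lam, gamma=3.0):
    """0.5 * sum_i (y_i - mu_i - x_i'beta)^2 + sum_{i<j} MCP(|mu_i - mu_j|)."""
    n = len(y)
    fit = 0.0
    for i in range(n):
        pred = mu[i] + sum(xij * bj for xij, bj in zip(X[i], beta))
        fit += 0.5 * (y[i] - pred) ** 2
    pen = sum(mcp_penalty(mu[i] - mu[j], lam, gamma)
              for i in range(n) for j in range(i + 1, n))
    return fit + pen

def kmeans(points, init_centers, n_iter=20):
    """Plain Lloyd's K-means from given initial centers."""
    centers = [list(c) for c in init_centers]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center in Euclidean distance.
        labels = [min(range(len(centers)),
                      key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: mean of each cluster (empty clusters keep their center).
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centers

# Hypothetical toy sample: two domains of a one-dimensional feature space,
# each with its own intercept, and a shared slope of 0.5.
X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
mu_true = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
beta = [0.5]
y = [m + 0.5 * x[0] for m, x in zip(mu_true, X)]

labels, centers = kmeans(X, init_centers=[[0.0], [5.0]])
obj_grouped = fusion_objective(y, X, mu_true, beta, lam=0.5)
obj_pooled = fusion_objective(y, X, [2.5] * 6, beta, lam=0.5)
print(labels, obj_grouped, obj_pooled)
```

On this toy sample the K-means labels recover the two feature-space domains, and the objective is lower at the two-intercept solution than at a single pooled intercept. Because the MCP is constant beyond `gamma * lam`, well-separated intercepts are not pulled toward each other, which is the usual motivation for a concave rather than a convex fusion penalty.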



The authors thank the AE and two anonymous reviewers for their helpful comments and valuable suggestions on earlier versions of this article. The authors also thank Professor Shujie Ma for her constructive comments on our work during the meeting at LICAS 2019. This research was supported by the Fundamental Research Funds for the Central Universities, the Beijing Natural Science Foundation (No. 1204031), and the National Natural Science Foundation of China (No. 11901013).

Author information



Corresponding author

Correspondence to Yanping Qiu.

About this article

Cite this article

Liang, B., Wu, P., Tong, X. et al. Regression and subgroup detection for heterogeneous samples. Comput Stat 35, 1853–1878 (2020).


Keywords

  • Concave fusion
  • Heterogeneous problem
  • K-means clustering
  • Regression
  • Subgroup detection