First Application of a Distance-Based Outlier Approach to Detect Highly Differentiated Genomic Regions Across Human Populations

Abstract

Genomic scans for positive selection or population differentiation are often used in evolutionary genetics to shortlist genetic loci with potentially adaptive biological functions. However, the vast majority of such tests rely on empirical ranking methods, which suffer from high false positive rates. In this work we computed a modified genetic distance on a 10,000 bp sliding window between sets of three samples each from the CHB, CEU and YRI populations of the 1000 Genomes Project. We then applied SolvingSet, a distance-based outlier detection method capable of mining hundreds of thousands of multivariate entries in a computationally efficient manner, to the average pairwise distances obtained in each window for each of the CHB-CEU, CHB-YRI and CEU-YRI population pairs, in order to compute the top-n genic windows exhibiting the highest outlier scores for the three distances. The outliers detected by this approach were screened for their biological significance and showed good overlap with previously known targets of differentiation and positive selection in human populations.
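To make the idea of a top-n distance-based outlier concrete, the following is a minimal sketch and not the SolvingSet algorithm itself. It assumes that each window is summarised by a vector of three average pairwise distances (CHB-CEU, CHB-YRI, CEU-YRI) and that the outlier score of a window is the sum of the Euclidean distances to its k nearest neighbours, a definition commonly used in the distance-based outlier literature. The function names, the parameter values and the toy data are all hypothetical, and the brute-force search is quadratic in the number of windows, which is exactly the cost that methods such as SolvingSet are designed to avoid.

```python
import heapq
import math


def knn_weight(index, points, k):
    """Outlier score (assumed definition): sum of the distances from
    points[index] to its k nearest neighbours, found by brute force."""
    dists = [
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    ]
    return sum(heapq.nsmallest(k, dists))


def top_n_outliers(points, k, n):
    """Return the n (index, score) pairs with the largest scores,
    ordered from the highest score downwards."""
    scored = [(knn_weight(i, points, k), i) for i in range(len(points))]
    return [(i, w) for w, i in heapq.nlargest(n, scored)]


# Hypothetical per-window vectors: (d_CHB_CEU, d_CHB_YRI, d_CEU_YRI).
# These numbers are invented for illustration, not taken from the chapter.
windows = [
    (0.10, 0.12, 0.11),
    (0.11, 0.10, 0.12),
    (0.09, 0.11, 0.10),
    (0.12, 0.13, 0.11),
    (0.90, 0.85, 0.12),  # far from every other window in the first two distances
    (0.10, 0.11, 0.13),
]

for idx, score in top_n_outliers(windows, k=2, n=2):
    print(f"window {idx}: score {score:.3f}")
```

In this toy input the window at index 4 ranks first because its two nearest neighbours are far away, whereas the remaining windows lie close to one another. On real data the choice of k and n, and whether the three distances should be rescaled before being combined, would need to follow the chapter's full text.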

Acknowledgements

This work has been partially supported by the Italian Ministry of Education, Universities and Research under PRIN Data-Centric Genomic Computing (GenData 2020) and by CINECA ISCRA project HIOXICGP. Luca Pagani would like to thank Guy Jacobs for his help with simulations. The authors declare no conflicts of interest.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Lodi, S., Angiulli, F., Basta, S., Luiselli, D., Pagani, L., Sartori, C. (2015). First Application of a Distance-Based Outlier Approach to Detect Highly Differentiated Genomic Regions Across Human Populations. In: Zazzu, V., Ferraro, M., Guarracino, M. (eds) Mathematical Models in Biology. Springer, Cham. https://doi.org/10.1007/978-3-319-23497-7_10
