Two-Step Verifications for Multi-instance Features Selection: A Machine Learning Approach

Classification in BioApps

Part of the book series: Lecture Notes in Computational Vision and Biomechanics (LNCVB, volume 26)

Abstract

Measuring multi-instance features is an important step in identifying the characteristics associated with different experimental events. In biological data processing, a set of critical factors is responsible for several diseases, and computational simulation can help to design cost-effective tools for drug design. Efficient simulation in turn depends on processing big data, because recent experiments generate huge amounts of related data. In the current work, noisy data were treated with three filtering techniques: cross-validated committees filtering (CVCF), iterative partitioning filtering (IPF) and ensemble filtering (EF), and the three approaches were compared. The filtered datasets were then normalized. Repeated application of three normalization techniques removed irregularities and structured the datasets, and the three normalization techniques were compared extensively. Once structured, the normalized datasets were transformed with three different transformation processes: rank transformation, nominal-to-binary transformation and Box-Cox transformation. To guard against false positive and false negative outcomes, four measures were considered: accuracy, specificity, sensitivity and F-measure. Accuracy relates to how precisely certain factors are detected; specificity allows the selection of the dominant factors; sensitivity is the proportion of actual positives that are correctly detected; and the F-measure is the harmonic mean of precision and sensitivity. The detailed experimental analysis included a comparison of four classifiers on a deoxyribonucleic acid (DNA) dataset.
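As a rough illustration of the transformation and evaluation steps named in the abstract, the following Python sketch implements rank transformation, nominal-to-binary (one-hot) encoding, a Box-Cox transform and the four evaluation measures from confusion-matrix counts. It is not the authors' implementation: the function names are hypothetical, "nominal to binary" is assumed to mean one-hot encoding, the Box-Cox parameter `lam` is supplied by hand rather than estimated from data, and the metric function assumes non-zero denominators.

```python
import math


def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def nominal_to_binary(labels):
    """One-hot encode nominal labels; columns follow sorted category order."""
    categories = sorted(set(labels))
    return [[1 if lab == c else 0 for c in categories] for lab in labels]


def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda (assumed given, not estimated)."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F-measure from confusion counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }


if __name__ == "__main__":
    print(rank_transform([10, 30, 20, 20]))
    print(nominal_to_binary(["A", "C", "G", "A"]))
    print(box_cox([1.0, 4.0], 0.5))
    print(binary_metrics(tp=40, fp=10, tn=45, fn=5))
```

In practice, library routines such as those in SciPy or scikit-learn would normally be preferred, since they estimate the Box-Cox parameter from the data and handle edge cases such as empty classes.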

Author information

Corresponding author

Correspondence to M. N. Y. Ali.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Ali, M.N.Y., Nimmy, S.F. (2018). Two-Step Verifications for Multi-instance Features Selection: A Machine Learning Approach. In: Dey, N., Ashour, A., Borra, S. (eds) Classification in BioApps. Lecture Notes in Computational Vision and Biomechanics, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-65981-7_7

  • DOI: https://doi.org/10.1007/978-3-319-65981-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-65980-0

  • Online ISBN: 978-3-319-65981-7

  • eBook Packages: Engineering, Engineering (R0)
