Two-Step Verifications for Multi-instance Features Selection: A Machine Learning Approach

Ali, M. N. Y.; Nimmy, S. F.

doi:10.1007/978-3-319-65981-7_7

Part of the book series: Lecture Notes in Computational Vision and Biomechanics (LNCVB, volume 26)

Abstract

Multi-instance feature measurement is an important step in identifying characteristics that are bound to various experimental events. In biological data processing, a set of critical factors is responsible for several diseases, and computational simulation can help to design an optimal tool for cost-effective drug design. Efficient simulation in turn depends on processing big data, and recent experiments generate huge amounts of related data. In the current work, noisy data were treated with three filtering techniques: cross-validated committees filtering (CVCF), iterative partitioning filtering (IPF) and ensemble filtering (EF), and the three filtering approaches were compared. The filtered datasets were then normalized: repeated application of three normalization techniques removed irregularities and structured the datasets, and the three normalization techniques were compared extensively. Once appropriately structured, the normalized datasets were transformed with three different transformation processes: rank transformation, nominal-to-binary transformation and Box-Cox transformation. To guard against false positive and false negative outcomes, certain key measures were considered: accuracy, specificity, sensitivity and F-measures. Accuracy of the experiments relates to the level of precise detection of certain factors; specificity allows the selection of the dominant factors; and sensitivity and F-measures are the ratio between the training and testing datasets. The detailed experimental analysis included a comparison study of four classifiers for the deoxyribonucleic acid (DNA) dataset.
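As a rough illustration of the transformation and evaluation steps named in the abstract, the sketch below implements the three transformations (rank, nominal-to-binary, Box-Cox) and the usual confusion-matrix measures in plain Python. It is not the authors' implementation: the function names, the fixed Box-Cox lambda, the tie-handling rule for ranks and the use of the standard textbook definitions of accuracy, sensitivity, specificity and F-measure are all assumptions made for this example. The abstract does not name the three normalization techniques, so none is shown.

```python
import math


def rank_transform(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def nominal_to_binary(labels):
    """Encode a nominal attribute as one 0/1 indicator column per category."""
    categories = sorted(set(labels))
    rows = [[1 if x == c else 0 for c in categories] for x in labels]
    return categories, rows


def box_cox(values, lam):
    """Box-Cox transform with a fixed lambda (assumed given, not estimated)."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def _ratio(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def confusion_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    sensitivity = _ratio(tp, tp + fn)   # also called recall
    specificity = _ratio(tn, tn + fp)
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    f_measure = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    print(rank_transform([10, 30, 20, 30]))
    print(nominal_to_binary(["A", "C", "A", "G"]))
    print(box_cox([1, 4, 9], 0.5))
    print(confusion_metrics(tp=40, fp=10, tn=45, fn=5))
```

The counts passed to `confusion_metrics` are made-up numbers for demonstration, not results from the chapter. In practice the Box-Cox lambda would usually be estimated from the data rather than fixed by hand.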

Author information

Authors and Affiliations

Department of Computer Science and Engineering, East West University, Dhaka, Bangladesh
M. N. Y. Ali
Department of Computer Science and Engineering, Notre Dame University, Dhaka, Bangladesh
S. F. Nimmy

Corresponding author

Correspondence to M. N. Y. Ali.

Editor information

Editors and Affiliations

Department of Information Technology, Techno India College of Technology, Kolkata, India
Nilanjan Dey
Faculty of Engineering, Tanta University, Tanta, Egypt
Amira S. Ashour
K.S. Institute of Technology, Bangalore, Karnataka, India
Surekha Borra

About this chapter

Cite this chapter

Ali, M.N.Y., Nimmy, S.F. (2018). Two-Step Verifications for Multi-instance Features Selection: A Machine Learning Approach. In: Dey, N., Ashour, A., Borra, S. (eds) Classification in BioApps. Lecture Notes in Computational Vision and Biomechanics, vol 26. Springer, Cham. https://doi.org/10.1007/978-3-319-65981-7_7

DOI: https://doi.org/10.1007/978-3-319-65981-7_7
Published: 14 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65980-0
Online ISBN: 978-3-319-65981-7
eBook Packages: Engineering, Engineering (R0)