Abstract
In this paper, we present three approaches to feature selection, ranging from a naïve Markov Chain Monte Carlo random walk algorithm to more refined methods such as simulated annealing and genetic algorithms. Textual data typically have thousands of dimensions in their feature space, which makes feature selection a crucial phase before the final classification. Legal documents were classified into eight categories using a simple document similarity measure based on term frequency and the nearest neighbour concept. With an average success rate of 76.4%, the random walk algorithm not only performed better than the simulated annealing and genetic algorithms but also matched the accuracy of support vector machines. Although these methods have commonly been used for selecting appropriate features in other fields, their use in text categorisation has not been satisfactorily investigated. To our knowledge, this is the first work to investigate their use in the legal domain. This generic text classification framework can be further enhanced by using an active learning methodology for the selection of training samples rather than following a passive learning approach.
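To make the idea concrete, the following is a minimal sketch of random walk feature selection wrapped around a nearest neighbour classifier that compares documents by term frequency. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the similarity formula, the proposal move or the stopping rule, so the overlap-based `tf_similarity`, the single-feature flip and the fixed number of `steps` are all hypothetical choices, as are the function names and the toy documents.

```python
import random
from collections import Counter


def tf_similarity(doc_a, doc_b, features):
    # Assumed similarity: overlap of term frequencies over the selected features.
    return sum(min(doc_a[t], doc_b[t]) for t in features)


def nn_accuracy(train, test, features):
    # Fraction of test documents whose most similar training document
    # carries the same label (1-nearest-neighbour classification).
    correct = 0
    for doc, label in test:
        _, predicted = max(
            train, key=lambda pair: tf_similarity(doc, pair[0], features)
        )
        correct += predicted == label
    return correct / len(test)


def random_walk_select(train, valid, vocabulary, steps=200, seed=0):
    # Naive random walk over feature subsets: every step flips one randomly
    # chosen term in or out of the subset, and the best subset seen so far
    # is remembered. No acceptance test is applied (an assumption).
    rng = random.Random(seed)
    vocab = sorted(vocabulary)
    current = {t for t in vocab if rng.random() < 0.5}
    best, best_score = set(current), nn_accuracy(train, valid, current)
    for _ in range(steps):
        current ^= {rng.choice(vocab)}  # toggle membership of one term
        score = nn_accuracy(train, valid, current)
        if score > best_score:
            best, best_score = set(current), score
    return best, best_score


# Toy, made-up documents with two categories (the paper uses eight).
train = [
    (Counter("contract breach damages".split()), "civil"),
    (Counter("murder trial jury".split()), "criminal"),
]
valid = [
    (Counter("breach of contract".split()), "civil"),
    (Counter("jury murder verdict".split()), "criminal"),
]
vocabulary = {t for doc, _ in train + valid for t in doc}

best_features, best_score = random_walk_select(train, valid, vocabulary)
print(sorted(best_features), best_score)
```

Simulated annealing would differ from this sketch mainly in the move rule: instead of always stepping to the flipped subset, a worse subset would be accepted only with a probability that shrinks as a temperature parameter is lowered.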
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Pudaruth, S., Soyjaudah, K.M.S., Gunputh, R.P. (2018). Markov Chain Monte Carlo Methods and Evolutionary Algorithms for Automatic Feature Selection from Legal Documents. In: Thampi, S., Mitra, S., Mukhopadhyay, J., Li, KC., James, A., Berretti, S. (eds) Intelligent Systems Technologies and Applications. ISTA 2017. Advances in Intelligent Systems and Computing, vol 683. Springer, Cham. https://doi.org/10.1007/978-3-319-68385-0_12
DOI: https://doi.org/10.1007/978-3-319-68385-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68384-3
Online ISBN: 978-3-319-68385-0
eBook Packages: Engineering, Engineering (R0)