
Markov Chain Monte Carlo Methods and Evolutionary Algorithms for Automatic Feature Selection from Legal Documents

  • Conference paper
Intelligent Systems Technologies and Applications (ISTA 2017)

Abstract

In this paper, we present three approaches to feature selection, ranging from a naïve Markov Chain Monte Carlo random walk algorithm to more refined methods such as simulated annealing and genetic algorithms. Textual data typically has thousands of dimensions in its feature space, which makes feature selection a crucial phase before the final classification. Legal documents were classified into eight categories using a simple document similarity measure based on term frequency and the nearest neighbour concept. With an average success rate of 76.4%, the random walk algorithm not only performed better than the simulated annealing and genetic algorithms but also matched the accuracy of support vector machines. Although these methods have commonly been used for selecting appropriate features in other fields, their use in text categorisation has not been satisfactorily investigated. To our knowledge, this is the first work that investigates their use in the legal domain. This generic text classification framework can be enhanced further by using an active learning methodology to select training samples rather than following a passive learning approach.
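The abstract gives only a high-level description, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. It assumes documents are lists of tokens, that similarity is the term-frequency overlap on the selected features, and that the random walk flips one feature per step and keeps the move when validation accuracy does not drop. All function names, the acceptance rule and the toy data are hypothetical.

```python
import random
from collections import Counter


def similarity(doc_a, doc_b, features):
    """Term-frequency overlap restricted to the selected features (assumed measure)."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    return sum(min(tf_a[t], tf_b[t]) for t in features)


def nearest_label(doc, train, features):
    """Label of the most similar training document (1-nearest neighbour)."""
    best = max(train, key=lambda pair: similarity(doc, pair[0], features))
    return best[1]


def accuracy(train, valid, features):
    """Fraction of validation documents whose nearest neighbour has the right label."""
    hits = sum(nearest_label(doc, train, features) == label for doc, label in valid)
    return hits / len(valid)


def random_walk_select(train, valid, vocabulary, steps=200, seed=0):
    """Naive random walk over feature subsets (assumed acceptance rule).

    Each step flips one randomly chosen feature in or out of the subset and
    keeps the move only if validation accuracy does not decrease.
    """
    rng = random.Random(seed)
    vocab = sorted(vocabulary)  # fixed order keeps runs reproducible
    current = {t for t in vocab if rng.random() < 0.5}
    current_score = accuracy(train, valid, current)
    for _ in range(steps):
        candidate = current ^ {rng.choice(vocab)}  # flip a single feature
        score = accuracy(train, valid, candidate)
        if score >= current_score:
            current, current_score = candidate, score
    return current, current_score


# Toy data, invented purely for illustration.
train = [
    (["contract", "breach", "the"], "civil"),
    (["murder", "trial", "the"], "criminal"),
]
valid = [
    (["breach", "contract", "damages"], "civil"),
    (["murder", "the", "sentence"], "criminal"),
]
vocabulary = {"contract", "breach", "murder", "trial", "the"}

selected, score = random_walk_select(train, valid, vocabulary)
print(sorted(selected), score)
```

Simulated annealing would differ mainly in the acceptance rule, occasionally accepting a worse subset with a probability that shrinks as a temperature parameter is lowered, while a genetic algorithm would evolve a population of subsets instead of walking from a single one.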




Author information


Corresponding author

Correspondence to S. Pudaruth.


Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Pudaruth, S., Soyjaudah, K.M.S., Gunputh, R.P. (2018). Markov Chain Monte Carlo Methods and Evolutionary Algorithms for Automatic Feature Selection from Legal Documents. In: Thampi, S., Mitra, S., Mukhopadhyay, J., Li, KC., James, A., Berretti, S. (eds) Intelligent Systems Technologies and Applications. ISTA 2017. Advances in Intelligent Systems and Computing, vol 683. Springer, Cham. https://doi.org/10.1007/978-3-319-68385-0_12


  • DOI: https://doi.org/10.1007/978-3-319-68385-0_12


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-68384-3

  • Online ISBN: 978-3-319-68385-0

  • eBook Packages: Engineering (R0)
