Markov Chain Monte Carlo Methods and Evolutionary Algorithms for Automatic Feature Selection from Legal Documents

Pudaruth, S.; Soyjaudah, K. M. S.; Gunputh, R. P.

doi:10.1007/978-3-319-68385-0_12

S. Pudaruth²⁰,
K. M. S. Soyjaudah²¹ &
R. P. Gunputh²²

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 683))

Included in the following conference series:

The International Symposium on Intelligent Systems Technologies and Applications

989 Accesses

Abstract

In this paper, we present three different approaches for feature selection, starting from a naïve Markov Chain Monte Carlo random walk algorithm to more refined methods like simulated annealing and genetic algorithms. It is typical for textual data to have thousands of dimensions in their feature space which makes feature selection a crucial phase before the final classification. Classification of legal documents into eight categories was performed via a simple document similarity measure based on term frequency and the nearest neighbour concept. With an average success rate of 76.4%, the random walk algorithm not only performed better than the simulated annealing and genetic algorithms but also matched the accuracy of support vector machines. Although these methods have commonly been used for selecting appropriate features in other fields, their use in text categorisation have not been satisfactorily investigated. And, to our knowledge, this is the first work which investigates their use in the legal domain. This generic text classification framework can further be enhanced by using an active learning methodology for the selection of training samples rather than following a passive learning approach.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Institutional subscriptions

References

Al-Maqaleh, B.M., Shahbazkia, H.: A genetic algorithm for discovering classification rules in data mining. Int. J. Comput. Appl. 41(18), 40–44 (2012)
Google Scholar
Andrieu, C., de Freitas, N., Doucet, A., Jordan, M.I.: An introduction to MCMC for machine learning. Mach. Learn. 50, 5–43 (2003)
Article MATH Google Scholar
Atkinson-Abutridy, J., Mellish, C., Aitken, S.: Combining information extraction with genetic algorithms for text mining. IEEE Intell. Syst. 19(3), 22–30 (2004)
Article Google Scholar
Bagheri, A., Saraee, M., Nadi, S.: PSA: a hybrid feature selection approach for Persian text classification. J. Comput. Secur. 1(4), 261–272 (2014)
Google Scholar
Bermejo, P., Gamez, J.A., Puerta, J.M.: A GRASP algorithm for fast hybrid filter-wrapper feature subset selection in high-dimensional datasets. Pattern Recogn. Lett. 32(5), 701–711 (2011)
Article Google Scholar
Borg, C.: Automatic Definition Extraction using Evolutionary Algorithms. Thesis (MSc), University of Malta, Malta (2009)
Google Scholar
Branavan, S.R.K., Silver, D., Barzilay, R.: Learning to win by reading manuals in a Monte Carlo framework. J. Artif. Intell. Res. 43, 661–704 (2012)
MATH Google Scholar
Browne, C., Powley, E., Whitehouse, D., Lucas, S., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)
Article Google Scholar
Buxey, G.M.: The vehicle scheduling problem and Monte Carlo simulation. J. Oper. Res. Soc. 30(6), 563–573 (1979)
Article MATH Google Scholar
Chen, H., Kim, J.: GANNET: a machine learning approach to document retrieval. J. Manag. Inf. Syst. 11(3), 7–41 (1994)
Article Google Scholar
Chen, H., Jiang, W., Li, C., Li, R.: A heuristic feature selection approach for text categorization by using chaos optimization and genetic algorithm. Math. Problems Eng. 2013, Article ID: 524017
Google Scholar
Cunningham, M., Tablan, B.: GATE: a framework and graphical development environment for robust NLP Tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL 2002), 7–12 July 2002, Philadelphia, Pennsylvania (2002)
Google Scholar
Desjardins, G., Godin, R., Proulx, R.: A genetic algorithm for text mining. WIT Trans. Inf. Commun. Technol. 35, 133–142 (2005)
Google Scholar
Diaconis, P.: The Markov chain Monte Carlo revolution. Bull. Am. Math. Soc. 46, 179–205 (2009)
Article MathSciNet MATH Google Scholar
Draminski, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J., Komorowski, J.: Monte Carlo feature selection for supervised classification. Bioinformatics 24(1), 110–117 (2008)
Article Google Scholar
Ebbert, M.T.W., Bastien, R.R.L., Boucher, K.M., Martin, M., Carrasco, E., Caballero, R., Stijleman, I.J., Bernard, P.S., Facelli, J.C.: Characterization of uncertainty in the classification of multivariate assays: application to PAM50 centroid-based genomic predictors for breast cancer treatment plans. J. Clin. Bioinform. 1, 37 (2011)
Article Google Scholar
Esbensen, H., Mazumder, P.: SAGA: a unification of the genetic algorithm with simulated annealing and its application to macro-cell placement. In: Proceedings of the 7th International Conference on VLSI Design, Calcutta, India, 5–8 January 1994, pp. 211–214 (1994)
Google Scholar
Figueroa, R.L., Zeng-Treitler, Q., Ngo, L.H., Goryachev, S., Wiechmann, E.P.: Active learning for clinical text classification: is it better than random sampling? J. Am. Med. Inform. Assoc. 19(5), 809–816 (2012)
Article Google Scholar
Gavrilis, D., Tsoulos, I.G., Dermatas, E.: Stochastic classification of scientific abstracts. In: Proceedings of the 6th Speech and Computer Conference, Patras, Greece (2005)
Google Scholar
Gavrilis, D., Tsoulos, I.G., Dermatas, E.: Neural recognition and genetic features selection for robust detection of E-mail spam. Adv. Artif. Intell. 3955, 498–501 (2006)
Google Scholar
Goncharov, Y., Okten, G., Shah, M.: Computation of the endogenous mortgage rates with randomized quasi-Monte Carlo simulations. Math. Comput. Model. 46(3–4), 459–481 (2007)
Article MathSciNet MATH Google Scholar
Gordon, M.: Probabilistic and genetic algorithms for document retrieval. Commun. ACM 31(10), 1208–1218 (1988)
Article Google Scholar
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
Article Google Scholar
Hassan, S., Mihalcea, R., Banea, C.: Random walk term weighting for improved text classification. Int. J. Semant. Comput. 1(4), 421–439 (2007)
Article Google Scholar
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1), 97–109 (1970)
Article MathSciNet MATH Google Scholar
Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Michigan (1975)
Google Scholar
Houghton, J., Siegel, M., Wirsch, A., Moulton, A., Madnick, S., Goldsmith, D.: A survey of methods for data inclusion in system dynamics models: methods, tools and applications. Massachusetts Institute of Technology, Cambridge, Working Paper CISL# 2013-03 (2014)
Google Scholar
Jovic, A., Brkic, K., Bogunovic, N.: A review of feature selection methods with applications. In: Proceedings of the 38th IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2015), Opatija, Croatia, 25–29 May 2015, pp. 1200–1205 (2015)
Google Scholar
Khalessizadeh, S.M., Zaefarian, R., Nasseri, S.H., Ardil, E.: Genetic mining: using genetic algorithm for topic based on concept distribution. In: Proceedings of the World Academy of Science, Engineering and Technology (2006)
Google Scholar
Khan, A., Baharudin, B., Lee, L., Khan, K.: A review of machine learning algorithms for text documents classification. J. Adv. Inf. Technol. 1(1), 4–20 (2010)
Google Scholar
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
Article MathSciNet MATH Google Scholar
Liang, F., Wong, W.H.: Evolutionary Monte Carlo: applications to Cp model sampling and change point problem. Stat. Sin. 10, 317–342 (2000)
MATH Google Scholar
Liu, X., Fu, H.: A hybrid algorithm for text classification problem. Electrical review, R. 88 NR 1b (2012)
Google Scholar
Martin, O., Otto, S.W., Felten, E.W.: Large-step Markov chains for the travelling salesman problem, p. 16. CSETech, Paper (1991)
MATH Google Scholar
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
Article Google Scholar
Mitchell, M.: An Introduction to Genetic Algorithms. MIT Press, Cambridge (1998)
MATH Google Scholar
Moncao, A.C.L., Camilo-JR, C.G., Queiroz, L.T., Rodrigues, C.L., Leitao-JR, P.S., Vincenzi, A.M.R.: Applying genetic algorithms to data selection for SQL mutation analysis. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013), Amsterdam, The Netherlands, 7–10 July 2013, pp. 207–208 (2013)
Google Scholar
Moshki, M., Kabiri, P., Mohebalhojeh, A.: Scalable feature selection in high-dimensional data based on GRASP. Appl. Artif. Intell. 29, 283–296 (2015)
Article Google Scholar
Pavlyshenko, B.: Genetic optimization of keywords subset in the classification analysis of texts authorship. J. Quant. Linguist. 21(4), 341–349 (2014)
Article Google Scholar
Pemantle, R.: A survey of random processes with reinforcement *. Prob. Surv. 4, 1–79 (2007)
Article MathSciNet MATH Google Scholar
Pietramala, A., Policicchio, V.L., Rullo, P., Sidhu, I.: A genetic algorithm for text classification rule induction. Lect. Notes Comput. Sci. 5212, 188–203 (2008)
Article Google Scholar
Pudaruth, S., Soyjaudah, K.M.S., Gunputh, R.P.: Categorisation of supreme court cases using multiple horizontal thesauri. Intell. Syst. Technol. Appl. 2, 355–368 (2016)
Google Scholar
Read, J., Martino, L., Luengo, D.: Efficient Monte Carlo methods for multi-dimensional learning with classifier chains. Pattern Recogn. 47, 1535–1546 (2014)
Article MATH Google Scholar
Rogers, B.C.: Using genetic algorithms for feature set selection in text mining. Thesis (MSc), Miami University, Oxford, Ohio (2013)
Google Scholar
Roy, N., Mccallum, A.: Toward optimal active learning through sampling estimation of error reduction. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 441–448 (2001)
Google Scholar
Sahin, I.E.: Online text categorization using genetic algorithms. Bilkent University, Turkey, Technical report, BU-CE-0704 (2007)
Google Scholar
Samad, S.A.: Random walk oversampling technique for minority class classification. Thesis (MSc), Tampere University of Technology (2013)
Google Scholar
Smith, R., Hussain, M.S.: Genetic algorithm sequential Monte Carlo methods for stochastic volatility and parameter estimation. In: Proceedings of the World Congress on Engineering (WCE 2012), London, UK, 4–6 July 2012, vol. 1 (2012)
Google Scholar
Song, W., Park, S.C.: Genetic algorithm for text clustering based on latent semantic indexing. Comput. Math Appl. 57, 1901–1907 (2009)
Article MATH Google Scholar
ter Braak, C.J.F.: A Markov Chain Monte Carlo version of the genetic algorithm differential evolution: easy Bayesian computing for real parameter spaces. Stat. Comput. 16(3), 239–249 (2006)
Article MathSciNet Google Scholar
Thomas, J.D., Sycara, K.: Integrating genetic algorithms and text learning for financial prediction. In: Proceedings of the Genetic and Evolutionary Computing Conference (GECCO), Las Vegas, Nevada, pp. 72–75
Google Scholar
Waad, B., Mufti, G.B, Liman, M.: A new feature selection technique applied to credit scoring data using a ranked aggregation approach based on: optimisation, genetic algorithm and similarity. In: Osei-Bryson, K., Barclay, C. (eds.) Knowledge Discovery Process And Methods To Enhance Organisational Performance, pp. 347–376. CRC Press, ‎Boca Raton (2014)
Google Scholar
Wang, R., Youssef, A.M., Elhakeem, A.K.: On some feature selection strategies for spam filter design. In: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering (CCECE 2006), Ottawa, Canada, 7–10 May 2006, pp. 2155–2158 (2006)
Google Scholar
Winands, M.H.M., Bjornsson, Y., Saito, J.T.: Monte Carlo tree search solver. In: Proceedings of the 6th International Conference on Computers and Games, pp. 25–36 (2008)
Google Scholar
WordNet: a lexical database for English. Princeton University (2017). https://wordnet.princeton.edu/wordnet/. Accessed 31 Jan 2017
Wu, J., Zheng, C., Chien, C.C., Zheng, L.: A comparative study of Monte Carlo simple genetic algorithm and noisy genetic algorithm for cost-effective sampling network design under uncertainty. Adv. Water Resour. 29, 899–911 (2006)
Article Google Scholar
Xiao, X.: Advanced Monte Carlo techniques: an approach for foreign exchange derivative pricing. Thesis (PhD), University of Manchester, UK (2007)
Google Scholar
Yang, C., Li, Y., Zhang, C., Hu, Y.: A fast KNN algorithm based on simulated annealing. In: Proceedings of the International Conference on Data Mining, Las Vegas, Nevada, 25–28 June 2007, pp. 46–51 (2007)
Google Scholar
Zhong, M., Shen, K., Seiferas, J.: The convergence-guaranteed random walk and its application in peer-to-peer networks. IEEE Trans. Comput. 57(5), 619–633 (2008)
Article MathSciNet MATH Google Scholar
Zhou, Y.: A random-walk based privacy-preserving access control for online social networks. Int. J. Adv. Comput. Sci. Appl. 7(2), 74–79 (2016)
Google Scholar
Zhu, F., Li, H., Yao, N., Zhu, H.: Text feature selection applied by improved SAA*. J. Comput. Inf. Syst. 11(17), 6419–6427 (2015)
Google Scholar
Zhu, H., Chen S., Pu, C., Liu, Y., Eguchi, K., Zhang, S.: Paralleling genetic annealing algorithm with OpenMP. In: Proceedings of the 2nd IEEE International Conference on Intelligent Networks and Intelligent Systems (ICINIS 2009), Tianjin, China, 1–3 November 2009
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Ocean Engineering and ICT, Faculty of Ocean Studies, University of Mauritius, Moka, Mauritius
S. Pudaruth
Department of Electrical and Electronic Engineering, Faculty of Engineering, University of Mauritius, Moka, Mauritius
K. M. S. Soyjaudah
Department of Law, Faculty of Law and Management, University of Mauritius, Moka, Mauritius
R. P. Gunputh

Authors

S. Pudaruth
View author publications
You can also search for this author in PubMed Google Scholar
K. M. S. Soyjaudah
View author publications
You can also search for this author in PubMed Google Scholar
R. P. Gunputh
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to S. Pudaruth .

Editor information

Editors and Affiliations

School of CS/IT, Indian Institute of Information Technology, Trivandrum, Kerala, India
Sabu M. Thampi
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Sushmita Mitra
Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, West Bengal, India
Jayanta Mukhopadhyay
Xiamen University, Xiamen, China
Kuan-Ching Li
Department of Electrical and Electronic, Nazarbayev University, Astana, Kazakhstan
Alex Pappachen James
Dipartimento di Ingegneria, Università degli Studi di Firenze, Firenze, Italy
Stefano Berretti

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pudaruth, S., Soyjaudah, K.M.S., Gunputh, R.P. (2018). Markov Chain Monte Carlo Methods and Evolutionary Algorithms for Automatic Feature Selection from Legal Documents. In: Thampi, S., Mitra, S., Mukhopadhyay, J., Li, KC., James, A., Berretti, S. (eds) Intelligent Systems Technologies and Applications. ISTA 2017. Advances in Intelligent Systems and Computing, vol 683. Springer, Cham. https://doi.org/10.1007/978-3-319-68385-0_12

Download citation

DOI: https://doi.org/10.1007/978-3-319-68385-0_12
Published: 21 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68384-3
Online ISBN: 978-3-319-68385-0
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics