Information extraction framework to build legislation network


This paper presents an information extraction process for building a dynamic legislation network from legal documents. Unlike supervised learning approaches, which require additional calculations, the approach applies information extraction methods that identify distinct expressions in legal text in order to extract network information. The study highlights the importance of data accuracy in network analysis and improves approximate string matching techniques to produce reliable network datasets with more than 98% precision and recall. The applications and complexity of the resulting dynamic legislation network are also discussed and critically examined.
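The idea of identifying distinct expressions and then resolving them with approximate string matching can be illustrated with a minimal sketch. It is not the authors' implementation: the citation pattern, the catalogue of titles, the distance threshold, and every function name below are assumptions made for illustration, and plain Levenshtein distance stands in for the improved matching technique the paper describes.

```python
import re

# Hypothetical catalogue of known Act titles (illustrative only).
KNOWN_ACTS = ["Companies Act 1993", "Crimes Act 1961", "Resource Management Act 1991"]

# Assumed citation expression: one or more capitalised words, then "Act" and a year.
CITATION = re.compile(r"((?:[A-Z][A-Za-z]*\s)+Act\s\d{4})")


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def best_match(candidate, known=KNOWN_ACTS, max_dist=2):
    """Return the closest known title, or None if it is further than max_dist."""
    title = min(known, key=lambda k: levenshtein(candidate.lower(), k.lower()))
    if levenshtein(candidate.lower(), title.lower()) <= max_dist:
        return title
    return None


def extract_edges(source, text):
    """Return (source, target) pairs for each citation resolved to a known title."""
    edges = []
    for match in CITATION.finditer(text):
        target = best_match(match.group(1))
        if target and target != source:
            edges.append((source, target))
    return edges
```

With this sketch, an OCR-damaged mention such as "Cornpanies Act 1993" would still resolve to "Companies Act 1993", because the two strings differ by two edits. The threshold trades precision against recall and would need tuning on real data.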



  1. Shepard’s Citations include a judicial history of cases and statutes.

  2. For more details about MetaLex, please refer to Boer et al. (2010).

  3. To estimate this error rate, a cluster sampling method is used to randomly choose ten sets of 30 entities. The rate of incorrectly matched entities is then observed by manually checking the samples.

  4. Time periods: before 1800, 1800–1850, 1850–1900, 1900–1950, 1950–2000, 2000–2018.

  5. To find the frequent words, the Textalyzer Python module is used. Frequent prepositions, conjunctions and articles are excluded from the analysis.

  6. Based on their connectivity (total degree).
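The sampling check in note 3 could be sketched as follows. This is only an illustration: the note does not say how clusters are formed, so the sketch assumes consecutive blocks of 30 entities, and `sample_clusters`, `estimated_error_rate`, and the `is_incorrect` callback (which stands in for the manual check) are hypothetical names.

```python
import random


def sample_clusters(entities, n_clusters=10, cluster_size=30, seed=0):
    """Split entities into consecutive full-size blocks and draw some at random.

    Forming clusters from consecutive blocks is an assumption of this sketch.
    """
    rng = random.Random(seed)
    clusters = [entities[i:i + cluster_size]
                for i in range(0, len(entities) - cluster_size + 1, cluster_size)]
    return rng.sample(clusters, n_clusters)


def estimated_error_rate(clusters, is_incorrect):
    """Share of sampled entities flagged as incorrectly matched.

    `is_incorrect` stands in for the manual check of each sampled entity.
    """
    checked = [entity for cluster in clusters for entity in cluster]
    return sum(1 for entity in checked if is_incorrect(entity)) / len(checked)
```

Ten clusters of 30 entities give 300 manually checked entities in total, so the observed rate is the number of incorrect matches divided by 300.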


  1. Albert R, Jeong H, Barabási A-L (2000) Error and attack tolerance of complex networks. Nature 406(6794):378

  2. Andersen PM, Hayes PJ, Huettner AK, Schmandt LM, Nirenburg IB, Weinstein SP (1992) Automatic extraction of facts from press releases to generate news stories. In: Proceedings of the third conference on applied natural language processing. Association for Computational Linguistics, pp 170–177

  3. Boer A, Hoekstra R, De Maat E, Vitali F, Palmirani M, Ratai B (2010) MetaLex (open XML interchange format for legal and legislative resources). Management Center, Akon

  4. Borgatti SP, Carley KM, Krackhardt D (2006) On the robustness of centrality measures under conditions of imperfect data. Soc Netw 28(2):124–136

  5. Butts CT (2003) Network inference, error, and informant (in) accuracy: a Bayesian approach. Soc Netw 25(2):103–140

  6. Canisius S, Sporleder C (2007) Bootstrapping information extraction from field books. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)

  7. Carlson A, Schafer C (2008) Bootstrapping information extraction from semi-structured web pages. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, Berlin, pp 195–210

  8. Casteigts A, Flocchini P, Quattrociocchi W, Santoro N (2012) Time-varying graphs and dynamic networks. Int J Parallel Emergent Distrib Syst 27(5):387–408

  9. Chiticariu L, Li Y, Reiss FR (2013) Rule-based information extraction is dead! long live rule-based information extraction systems! In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 827–832

  10. Cohen KB, Demner-Fushman D (2014) Biomedical natural language processing, vol 11. John Benjamins Publishing Company, Amsterdam

  11. Cohen W, Ravikumar P, Fienberg S (2003) A comparison of string metrics for matching names and records. In: KDD workshop on data cleaning and object consolidation, vol 3, pp 73–78

  12. Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176

  13. De Maat E, Winkels R, van Engers T (2006) Automated detection of reference structures in law. Frontiers in artificial intelligence and applications. IOS Press, Amsterdam, p 41

  14. EUR-Lex (2020) Access to European Union law. Accessed 10 Sept 2017

  15. Fowler JH, Johnson TR, Spriggs JF, Jeon S, Wahlbeck PJ (2007) Network analysis and the law: measuring the legal importance of precedents at the US supreme court. Polit Anal 15(3):324–346

  16. Freitag D (2000) Machine learning for information extraction in informal domains. Mach Learn 39(2–3):169–202

  17. Gultemen D, van Engers T (2013) Graph-based linking and visualization for legislation documents (glvd). In: Network analysis in law workshop, at ICAIL 2013: XIV international conference on AI and law, NAiL2013 ICAIL, Rome, Italy, 14 June

  18. Hafner CD (1978) An information retrieval system based on a computer model of legal knowledge. UMI Research Press, Ann Arbor, MI

  19. Hall PAV, Dowling GR (1980) Approximate string matching. ACM Comput Surv (CSUR) 12(4):381–402

  20. Hearst MA (1992) Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th conference on computational linguistics, vol 2. Association for Computational Linguistics, pp 539–545

  21. Humphries MD, Gurney K (2008) Network small-world-ness: a quantitative method for determining canonical network equivalence. PLoS ONE 3(4):e0002051

  22. Jurafsky D, Martin JH (2014) Speech and language processing, vol 3. Pearson, London

  23. Kartoun U (2017) Text nailing: an efficient human-in-the-loop text-processing method. Interactions 24(6):44–49

  24. Koniaris M, Anagnostopoulos I, Vassiliou Y (2017) Network analysis in the legal domain: a complex model for European Union legal sources. J Complex Netw 6(2):243–268

  25. Krallinger M, Leitner F, Rabal O, Vazquez M, Oyarzabal J, Valencia A (2013) Overview of the chemical compound and drug name recognition (chemdner) task. In: BioCreative challenge evaluation workshop, vol 2, p 2

  26. Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10:707–710

  27. McCallum A (2005) Information extraction: distilling structured data from unstructured text. Queue 3(9):4

  28. Mendelson E (2008) Abbyy finereader professional 9.0. PC Magazine

  29. Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv (CSUR) 33(1):31–88

  30. New Zealand Legal Information Institute (2020) Free access to legal information in New Zealand. Accessed 31 Oct 2018

  31. New Zealand Parliamentary Counsel Office (2020) The authoritative source of New Zealand legislation. Accessed 31 Oct 2018

  32. Niu Q, Zeng A, Fan Y, Di Z (2015) Robustness of centrality measures against network manipulation. Physica A 438:124–131

  33. Pasula H, Marthi B, Milch B, Russell SJ, Shpitser I (2003) Identity uncertainty and citation matching. In: Advances in neural information processing systems, pp 1425–1432

  34. Philips L (1990) Hanging on the metaphone. Comput Lang 7(12):39–43

  35. Sakhaee N (2018) Leginet New Zealand, first outcome of the new information extraction framework proposed to build legislation network. Published 21 Sept 2018

  36. Sakhaee N, Wilson M, Hendy S, Zakeri G (2017) Network analysis of New Zealand legislation. NZ Law J 10:332–337

  37. Sakhaee N, Wilson MC, Zakeri G (2016) New Zealand legislation network. In: Legal knowledge and information systems: JURIX 2016: the twenty-ninth annual conference, vol 294. IOS Press, p 199

  38. Tabak BM, Takami M, Rocha JMC, Cajueiro DO, Souza SRS (2014) Directed clustering coefficient as a measure of systemic risk in complex banking networks. Phys A Stat Mech Appl 394:211–216

  39. Tin CT, Jeffrey LC, Mark DT, Kenneth GY, Rachel E (2009) Information extraction from legal documents. In: 2009 eighth international symposium on natural language processing

  40. Trier OD, Jain AK, Taxt T et al (1996) Feature extraction methods for character recognition-a survey. Pattern Recognit 29(4):641–662

  41. Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1):191–211

  42. Watts DJ (2004) Small worlds: the dynamics of networks between order and randomness, vol 9. Princeton University Press, Princeton

  43. Watts DJ, Strogatz SH (1998) Collective dynamics of small-world networks. Nature 393(6684):440

  44. Winkler WE (1999) The state of record linkage and current research problems. Statistical Research Division, US Census Bureau, Suitland

  45. Zhang P, Koppaka L (2007) Semantics-based legal citation network. In: Proceedings of the 11th international conference on artificial intelligence and law. ACM, pp 123–130

  46. Zhang Y, Patrick J (2005) Paraphrase identification by text canonicalization. In: Proceedings of the Australasian language technology workshop, pp 160–166

Author information



Corresponding author

Correspondence to Neda Sakhaee.

Cite this article

Sakhaee, N., Wilson, M.C. Information extraction framework to build legislation network. Artif Intell Law 29, 35–58 (2021).


  • Optical character recognition
  • Information extraction
  • Named entity recognition
  • Relation extraction
  • Approximate string matching
  • Legislation network
  • Evaluation