A framework for crime data analysis using relationship among named entities


Many crime reports are available online on blogs and newswire sites. Although manually annotating this large volume of reports is tedious, taken together the reports give an overall picture of crime across the world. This motivates us to propose a framework for crime data analysis based on online reports. The method first extracts the crime reports and identifies the named entities in them. The sequence of context words between each consecutive pair of named entities is termed a crime vector, and it describes the relationship between the two entities. Feature vectors for each entity pair are generated from these crime vectors using the Word2Vec model. The paper considers three different types of named entity pairs to support the major crime data analysis tasks, and for each type the similarity between every two entity pairs is measured using their respective feature vectors. For each type, a separate weighted graph is built with entity pairs as vertices and the similarity score between two entity pairs as the weight of the corresponding edge. Infomap, a graph-based clustering algorithm, is then applied to obtain an optimal set of clusters of entity pairs and a representative entity pair for each cluster. Each cluster is labelled with the relationship, given by the crime vector, of its representative entity pair. In practice, not every entity pair in a cluster is contextually similar to the representative entity pair. The clusters are therefore further partitioned into subclusters using a WordNet-based path similarity measure, so that the entity pairs in each subcluster are more contextually similar to one another than those in the original cluster. These subclusters provide various statistical information about crime over the time period covered. The method is evaluated only on crime reports concerning crimes against women in India. The experimental results indicate that the method is effective for analysing crime data and outperforms the methods it is compared with.
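The first stages of the framework (crime vectors, feature vectors, and the weighted similarity graph) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the input is taken to be already tagged with named entities, a tiny hand-written embedding table stands in for a trained Word2Vec model, the context word vectors are combined by simple averaging, similarity is taken to be cosine similarity, and all function names and example tokens are hypothetical. For brevity the sketch builds a single graph, whereas the framework builds one graph per type of entity pair.

```python
import math
from itertools import combinations


def crime_vectors(tagged_tokens):
    """Collect the context words between each consecutive pair of named entities.

    tagged_tokens is a list of (token, label) tuples where label is None for
    ordinary words. Returns a list of ((entity1, entity2), context_words).
    """
    pairs = []
    prev_entity = None
    context = []
    for token, label in tagged_tokens:
        if label is not None:
            # Close the current pair only if some context words separate the entities.
            if prev_entity is not None and context:
                pairs.append(((prev_entity, token), list(context)))
            prev_entity = token
            context = []
        elif prev_entity is not None:
            context.append(token.lower())
    return pairs


def feature_vector(words, embeddings):
    """Average the embeddings of the known context words (an assumed aggregation)."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_graph(pairs, embeddings, threshold=0.0):
    """Return weighted edges {(pair_a, pair_b): similarity} above the threshold."""
    features = {}
    for pair, context in pairs:
        vec = feature_vector(context, embeddings)
        if vec is not None:
            features[pair] = vec
    edges = {}
    for a, b in combinations(features, 2):
        weight = cosine(features[a], features[b])
        if weight > threshold:
            edges[(a, b)] = weight
    return edges


# Toy 2-dimensional embeddings standing in for a trained Word2Vec model.
TOY_EMBEDDINGS = {
    "arrested": [1.0, 0.0],
    "detained": [0.9, 0.1],
    "in": [0.0, 1.0],
}

# Hypothetical pre-tagged sentences; the entity names are placeholders.
sentence_a = [("Police", "ORG"), ("arrested", None), ("PersonA", "PER"),
              ("in", None), ("CityX", "LOC")]
sentence_b = [("Officers", "ORG"), ("detained", None), ("PersonB", "PER")]

pairs = crime_vectors(sentence_a) + crime_vectors(sentence_b)
graph = similarity_graph(pairs, TOY_EMBEDDINGS)

for (a, b), weight in graph.items():
    print(a, b, round(weight, 3))
```

In this toy run the pairs whose context words are "arrested" and "detained" receive a high edge weight, which is the kind of signal a graph clustering step such as Infomap would then use to group entity pairs that share a relationship. The clustering and the WordNet-based subclustering are left out here because they depend on external libraries.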




Author information



Corresponding author

Correspondence to Priyanka Das.

Ethics declarations

Conflict of interest

The authors declare that this manuscript has no conflict of interest with any other published source and has not been published previously (partly or in full). No data have been fabricated or manipulated to support our conclusion.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Das, P., Das, A.K., Nayak, J. et al. A framework for crime data analysis using relationship among named entities. Neural Comput & Applic 32, 7671–7689 (2020). https://doi.org/10.1007/s00521-019-04150-8



  • Crime analysis
  • Online news
  • Entity recognition
  • Relation extraction
  • Paraphrase extraction
  • Graph-based clustering