Abstract
When the actors of a social network are not uniquely identifiable in the data, entity resolution in the form of actor identification becomes a critical part of the network construction process. Here we develop SAINT, a pipeline for supervised entity resolution that uses relational information to improve, or tune, the quality of the constructed network. The first phase of SAINT uses attribute-only entity resolution to create an initial social network. Relational information between actors, actor network properties, and other relational output of the first classification phase are then used in a second phase to improve the results of the original entity resolution. Compared with single-phase approaches, this two-phase approach is consistently superior in both recall and precision. Embedded within SAINT is a series of evaluation checkpoints designed to measure both the quality of the individual classifiers and their impact within the entire pipeline. Our evaluation results provide insight into the potential propagation of error and raise open research questions for further improvement of the individual classifiers within the pipeline. As the main application of the process is to improve actor identification in social networks, we characterise the impact that entity resolution has on the final constructed network. We compare the network constructed using SAINT with a ground truth network built using perfect entity resolution, and use global and local network measures to study the differences.
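The two-phase idea can be illustrated with a minimal sketch. It is not the SAINT implementation: the toy records, the thresholds, the function names, and the hand-written rules below are all assumptions made for illustration, and simple threshold rules stand in for the supervised classifiers the abstract describes. Phase one matches records on name similarity alone; phase two keeps a candidate match only if it is a near-exact attribute match or if neighbours of the two records were themselves matched in phase one.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy data: record id -> (observed name, ids of co-occurring records).
records = {
    "r1": ("Jon Smith", {"r3"}),
    "r2": ("John Smith", {"r4"}),
    "r3": ("Mary Jones", {"r1"}),
    "r4": ("Mary Jones", {"r2"}),
    "r5": ("Joan Smith", {"r6"}),
    "r6": ("Peter Brown", {"r5"}),
}
# Hypothetical ground truth: pairs of records that refer to the same actor.
true_matches = {frozenset(("r1", "r2")), frozenset(("r3", "r4"))}


def name_similarity(a, b):
    """Attribute-only similarity between two record names, in [0, 1]."""
    return SequenceMatcher(None, records[a][0], records[b][0]).ratio()


def attribute_phase(threshold=0.85):
    """Phase 1 (assumed rule): match pairs whose names are similar enough."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(records), 2)
        if name_similarity(a, b) >= threshold
    }


def relational_phase(candidates, exact=0.98):
    """Phase 2 (assumed rule): keep a candidate if it is a near-exact
    attribute match, or if some neighbour of one record is the same as,
    or was matched in phase 1 with, some neighbour of the other."""
    kept = set()
    for pair in candidates:
        a, b = sorted(pair)
        supported = any(
            na == nb or frozenset((na, nb)) in candidates
            for na in records[a][1]
            for nb in records[b][1]
        )
        if name_similarity(a, b) >= exact or supported:
            kept.add(pair)
    return kept


def precision_recall(predicted, truth):
    """Pairwise precision and recall of predicted matches."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


phase1 = attribute_phase()
phase2 = relational_phase(phase1)
print("phase 1 (precision, recall):", precision_recall(phase1, true_matches))
print("phase 2 (precision, recall):", precision_recall(phase2, true_matches))
```

In this toy data the name-only phase confuses "Joan Smith" with the two genuine "Smith" records, and the relational check removes those pairs because their neighbours do not match. The example shows a gain in precision only; it does not reproduce the chapter's evaluation.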
Copyright information
© 2013 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Farrugia, M., Hurley, N., Quigley, A. (2013). SAINT: Supervised Actor Identification for Network Tuning. In: Özyer, T., Erdem, Z., Rokne, J., Khoury, S. (eds) Mining Social Networks and Security Informatics. Lecture Notes in Social Networks. Springer, Dordrecht. https://doi.org/10.1007/978-94-007-6359-3_6
DOI: https://doi.org/10.1007/978-94-007-6359-3_6
Publisher Name: Springer, Dordrecht
Print ISBN: 978-94-007-6358-6
Online ISBN: 978-94-007-6359-3
eBook Packages: Computer Science, Computer Science (R0)