Abstract
In modern societies, control is based on information. In many countries, companies are now obliged to send all their invoices to the tax administration, and withholders and financial entities must report information that is used to offer prefilled tax returns. In Spain, the Tax Agency (AEAT) receives 180 million invoices per month and, within a few days at the end of January, must process more than 500 million records to prefill income tax forms. Hundreds of thousands of these records are not correctly identified by the provider and must be returned to the sender or stored as unidentified and analyzed afterwards. Traditionally, this process consumed considerable technical and human resources. AEAT has now been able to provide, for the first time, a real-time identification solution with a throughput high enough to meet its needs. It combines six algorithms built on three different ideas (n-grams, TF-IDF, and Monge-Elkan) and has surpassed all previous expectations.
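To make two of the three ideas named above concrete, the following is a minimal sketch of a Monge-Elkan score that uses a character n-gram (bigram Dice) similarity as its inner token measure. It is an illustration under stated assumptions, not the AEAT system: the function names, the `#` padding character, the choice of bigrams, and the sample names are all hypothetical, and the TF-IDF weighting mentioned in the abstract is left out.

```python
from typing import Set


def char_ngrams(token: str, n: int = 2) -> Set[str]:
    """Set of character n-grams of a token, padded with '#' (assumed padding)."""
    padded = f"#{token.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over the character n-gram sets of two tokens."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def monge_elkan(name_a: str, name_b: str) -> float:
    """Average, over the tokens of name_a, of the best match found in name_b."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(dice_similarity(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Hypothetical names: token order does not matter, typos lower the score.
    print(monge_elkan("GARCIA LOPEZ", "LOPEZ GARCIA"))
    print(monge_elkan("GARCIA LOPEZ", "GARSIA LOPES"))
```

Note that the score is not symmetric in general, because it averages over the tokens of the first argument only; a production system would have to decide how to handle that, for example by taking the score in both directions.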
Acknowledgements
This paper was supported by the Spanish Ministry of Economy and Competitiveness project MTM2017-86875-C3-3-R.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
González, I., Mateos, A. (2019). Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing. In: Torra, V., Narukawa, Y., Pasi, G., Viviani, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2019. Lecture Notes in Computer Science, vol 11676. Springer, Cham. https://doi.org/10.1007/978-3-030-26773-5_19
DOI: https://doi.org/10.1007/978-3-030-26773-5_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26772-8
Online ISBN: 978-3-030-26773-5
eBook Packages: Computer Science, Computer Science (R0)