Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing

González, Ignacio; Mateos, Alfonso

doi:10.1007/978-3-030-26773-5_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11676)

Included in the following conference series:

International Conference on Modeling Decisions for Artificial Intelligence

Abstract

In modern societies, control is based on information. In many countries, companies are now obliged to submit all their invoices to the tax administration, and withholders and financial entities must supply the information used to offer prefilled tax declarations. In Spain, the Tax Agency (AEAT) receives 180 million invoices per month and, within a few days at the end of January, must process more than 500 million records to prefill Income Tax forms. Hundreds of thousands of these records are not correctly identified by the provider and must either be returned to the sender or stored as unidentified and analyzed afterwards. Traditionally, this process consumed considerable technical and human resources. AEAT has, for the first time, been able to provide a real-time identification solution with a throughput high enough to meet its needs. It combines six algorithms built on three different ideas (n-grams, TF-IDF, and Monge-Elkan) and has surpassed all previous expectations.
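
The abstract does not describe the six algorithms, so the following is only a minimal sketch of how the three named ideas are commonly combined for name matching; it is not the AEAT implementation. It assumes character bigrams with a Dice coefficient as the inner token similarity, the standard Monge-Elkan average of best token matches, and a smoothed IDF weight per token. All function names, the `#` padding character, and the smoothing formula are illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(token, n=2):
    """Character n-grams of a token, padded with '#' (padding is an assumption)."""
    padded = f"#{token}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_ngram_similarity(a, b, n=2):
    """Dice coefficient over the n-gram multisets of two tokens, in [0, 1]."""
    grams_a, grams_b = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / (sum(grams_a.values()) + sum(grams_b.values()))


def monge_elkan(name_a, name_b, inner=dice_ngram_similarity):
    """Average, over tokens of name_a, of the best inner similarity in name_b.

    Note that this measure is not symmetric in its two arguments.
    """
    tokens_a, tokens_b = name_a.upper().split(), name_b.upper().split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(inner(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best) / len(tokens_a)


def idf_weights(corpus):
    """Smoothed inverse document frequency per token (hypothetical weighting)."""
    n_docs = len(corpus)
    doc_freq = Counter(tok for name in corpus for tok in set(name.upper().split()))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def weighted_monge_elkan(name_a, name_b, idf, default=1.0, inner=dice_ngram_similarity):
    """Monge-Elkan where rare tokens of name_a count more than common ones."""
    tokens_a, tokens_b = name_a.upper().split(), name_b.upper().split()
    if not tokens_a or not tokens_b:
        return 0.0
    weights = [idf.get(a, default) for a in tokens_a]
    best = [max(inner(a, b) for b in tokens_b) for a in tokens_a]
    return sum(w * s for w, s in zip(weights, best)) / sum(weights)
```

In this sketch, `monge_elkan("GONZALES IGNACIO", "IGNACIO GONZALEZ")` scores below 1 because of the misspelled surname but is unaffected by the token order, which is the property that makes Monge-Elkan attractive for person and company names. The IDF weighting lets a frequent surname contribute less to the score than a rare one.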

Acknowledgements

This paper was supported by the Spanish Ministry of Economy and Competitiveness project MTM2017-86875-C3-3-R.

Author information

Authors and Affiliations

Decision Analysis and Statistics Group, Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid, Madrid, Spain
Ignacio González & Alfonso Mateos

Corresponding author

Correspondence to Alfonso Mateos.

Editor information

Editors and Affiliations

Maynooth University, Maynooth, Ireland
Vicenç Torra
Tamagawa University, Machida, Tokyo, Japan
Yasuo Narukawa
University of Milano-Bicocca, Milan, Italy
Gabriella Pasi
University of Milano-Bicocca, Milan, Italy
Marco Viviani

Cite this paper

González, I., Mateos, A. (2019). Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing. In: Torra, V., Narukawa, Y., Pasi, G., Viviani, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2019. Lecture Notes in Computer Science, vol 11676. Springer, Cham. https://doi.org/10.1007/978-3-030-26773-5_19

DOI: https://doi.org/10.1007/978-3-030-26773-5_19
Published: 24 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26772-8
Online ISBN: 978-3-030-26773-5
eBook Packages: Computer Science, Computer Science (R0)
