Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11676)

Abstract

In modern societies, control is based on information. In many countries, companies are now obliged to report all their invoices to the tax administration, and withholders and financial entities must supply information that is used to offer prefilled tax declarations. In Spain, the Tax Agency (AEAT) receives 180 million invoices per month and, within a few days at the end of January, must process more than 500 million records to prefill income tax forms. Hundreds of thousands of these records are not correctly identified by the provider and must be returned to the sender or stored as unidentified and analyzed afterwards. Traditionally, this process consumed many technical and human resources. AEAT has, for the first time, been able to provide a real-time identification solution with enormous throughput that fulfils its needs. It combines six algorithms built on three different ideas (n-grams, TF-IDF, and Monge-Elkan) and has surpassed all previous expectations.
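To make two of the ideas named in the abstract concrete, the following is a minimal sketch of Monge-Elkan matching with a character n-gram similarity as the inner measure. It is not AEAT's implementation, which the abstract does not describe: the function names, the bigram default, the `#` padding, and the choice of the Dice coefficient are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Multiset of character n-grams of a lower-cased token.

    The token is padded with '#' (an illustrative choice) so that its
    first and last characters contribute as many n-grams as the others.
    """
    padded = f"{'#' * (n - 1)}{text.lower()}{'#' * (n - 1)}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


def monge_elkan(name_a, name_b, inner=ngram_similarity):
    """Monge-Elkan score: for each token of name_a, take its best match
    among the tokens of name_b, then average those best scores."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(inner(ta, tb) for tb in tokens_b) for ta in tokens_a]
    return sum(best) / len(best)


if __name__ == "__main__":
    # Token order does not matter, and small typos lower the score gradually.
    print(monge_elkan("Maria Garcia Lopez", "Garcia Lopez Maria"))
    print(monge_elkan("Maria Gonzalez", "Gonzales Maria"))
```

The score is asymmetric: it averages over the tokens of the first argument only, so a short name compared against a longer one that contains it scores higher than the reverse comparison. A TF-IDF weighting, the third idea in the abstract, could replace the plain average so that rare tokens count more than common ones; that step is not shown here.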

Notes

  1. https://www.ciat.org/la-factura-electronica-en-america-latina/

  2. https://www.agenciatributaria.es/AEAT.internet/en_gb/informativas.shtml

  3. http://sourceforge.net/projects/secondstring/; http://sourceforce.net/projects/simmetrics; http://geographiclib.sourceforge.net.

Acknowledgements

This paper was supported by the Spanish Ministry of Economy and Competitiveness project MTM2017-86875-C3-3-R.

Corresponding author

Correspondence to Alfonso Mateos.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

González, I., Mateos, A. (2019). Multilayer Identification: Combining N-Grams, TF-IDF and Monge-Elkan in Massive Real Time Processing. In: Torra, V., Narukawa, Y., Pasi, G., Viviani, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2019. Lecture Notes in Computer Science, vol 11676. Springer, Cham. https://doi.org/10.1007/978-3-030-26773-5_19

  • DOI: https://doi.org/10.1007/978-3-030-26773-5_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-26772-8

  • Online ISBN: 978-3-030-26773-5

  • eBook Packages: Computer Science, Computer Science (R0)