Erkennung von Duplikaten in Big Data am Fallbeispiel der digitalen Musiknutzung

Lindner, Tobias; Mandl, Peter; Bauer, Nikolai; Grimm, Markus

doi:10.1365/s40702-017-0387-1

Detection of Duplicates in Big Data in the Use Case of Digital Music Usage

Focus topic
Published: 11 January 2018

Volume 55, pages 581–600 (2018)
HMD Praxis der Wirtschaftsinformatik

Tobias Lindner¹, Peter Mandl¹ (ORCID: orcid.org/0000-0003-4508-7667), Nikolai Bauer¹ & Markus Grimm²

Abstract

No international standard currently specifies how a musical work is described. As a result, online platforms such as Spotify and Apple Music often store the same work with differing attributes. Although the collecting societies responsible for billing music usage digitised that process long ago, establishing whether two records refer to the same work remains difficult, and efficient algorithms for object identification are needed. In this article we compare several such algorithms, including Damerau-Levenshtein, Jaro-Winkler and Smith-Waterman, for the identification of musical works. Because these algorithms are computationally expensive, we adapted them for mass data processing in an Apache Hadoop cluster using MapReduce. Using an extensive comparison data set stored in distributed form with Apache HBase, we examined the most important algorithms with respect to the quality of duplicate detection and their performance. The results indicate that the very frequently used Levenshtein distance does not perform best. Other algorithms, such as the Jaro-Winkler distance, achieve better results in both matching quality and processing speed.
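To make the comparison named in the abstract more concrete, the following is a minimal single-machine sketch of two of the metrics, a normalised Levenshtein similarity and the Jaro-Winkler similarity, applied to work titles. It is not the authors' Hadoop implementation. The function names, the lower-casing and whitespace normalisation, and the threshold of 0.9 are assumptions chosen for illustration; the prefix scale of 0.1 and the prefix limit of 4 are the values commonly used for Jaro-Winkler.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using a two-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1] by dividing by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix chars."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


def is_duplicate(title_a: str, title_b: str,
                 metric: Callable[[str, str], float],
                 threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: the threshold is an assumed value."""
    norm_a = " ".join(title_a.lower().split())
    norm_b = " ".join(title_b.lower().split())
    return metric(norm_a, norm_b) >= threshold
```

For the textbook pair "MARTHA" and "MARHTA", `jaro_winkler` gives roughly 0.961, while `levenshtein_similarity` gives roughly 0.667, because plain Levenshtein counts the swapped letters as two substitutions. A call such as `is_duplicate("Yesterday", "yesterday ", jaro_winkler)` treats the two titles as the same work after normalisation.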

Notes

GEMA (Gesellschaft für musikalische Aufführungs- und mechanische Vervielfältigungsrechte) represents the copyrights of more than 70,000 composers, lyricists and music publishers in Germany. See https://www.gema.de/. Accessed: 5 December 2017.
A detailed description of the MapReduce approach can be found in the original publication (Dean and Ghemawat 2004).
Determining an optimal threshold for each similarity metric would be a further research task and is not considered further in this work.
QPI stands for QuickPath Interconnect, a system for communication between processors, and between processors and the chipset, in Intel processors.
See http://www.vmware.com. Accessed: 21 July 2017.

References

Amazon (2016) Amazon Elastic MapReduce (EMR). https://aws.amazon.com/de/elasticmapreduce/. Accessed 14 March 2016
Apel D, Behme W, Eberlei R, Merighi C (2010) Datenqualität erfolgreich steuern – Praxislösungen für Business-Intelligence-Projekte, 2nd, completely revised and expanded edn. Carl Hanser, München
Baeza-Yates R, Ribeiro-Neto B (1999) Modern information retrieval. Addison-Wesley, Harlow
Bergroth L, Hakonen H, Raita T (2000) A survey of longest common subsequence algorithms. In: SPIRE (String Processing and Information Retrieval), pp 39–48
Charras C, Lecroq T (2004) Handbook of exact string matching algorithms. King’s College Publications, London
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
Dean J, Ghemawat S (2004) MapReduce: simplified data processing on large clusters. Google labs. OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco
Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302
Discogs (2015) Database and marketplace for music on vinyl, CD, cassette and other formats. http://www.discogs.com. Accessed 2 Nov 2015
Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162:705–708
Hamming RW (1950) Error-detecting and error-correcting codes. Bell Syst Tech J 29(2):147–160
Jaccard P (1901) Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Des Sci Nat 37:547–579
Jaro MA (1989) Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J Am Stat Assoc 84(406):414–420
Jaro MA (1995) Probabilistic linkage of large public health data files. Stat Med 14(5–7):491–498
Levenshtein VI (1965) Binary codes capable of correcting deletions, insertions, and reversals. Dokl Akad Nauk SSSR 163(4):845–848 (in Russian; English translation in: Soviet Physics Doklady 10(8):707–710, 1966)
Mahout (2016) Apache mahout. https://mahout.apache.org. Accessed 15 Feb 2016
Mandl P, Bauer N, Döschl A, Grimm M, Wickertsheim L (2015) Die Verwertung von Online-Musiknutzungen – Herausforderungen für die IT. HMD Prax Wirtschaftsinform 53(1):126–138. https://doi.org/10.1365/s40702-015-0191-8
Monge AE, Elkan CP (1996) The field matching problem: algorithms and applications. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, pp 267–270
MusicBrainz (2015) MusicBrainz. https://www.musicbrainz.org. Accessed 2 Nov 2015
Naumann F, Herschel M (2010) An introduction to duplicate detection. Morgan and Claypool, San Rafael
Navarro G (1999) A guided tour to approximate string matching. ACM Comput Surv. https://doi.org/10.1145/375360.375365
Schnell R (2010) Record linkage from a technical point of view. In: German Data Forum (RatSWD) (ed) Building on progress: expanding the research infrastructure for the social, economic, and behavioral sciences, vol 1. Budrich UniPress, Opladen, pp 531–545
Schöning U (2001) Algorithmik, 13th edn. Spektrum Akademischer Verlag, Heidelberg
SimMetrics (2016) The SimMetrics library used in this work. https://github.com/Simmetrics/simmetrics. Accessed 8 Feb 2016
Singhal A (2001) Modern information retrieval: a brief overview. Bull IEEE Comput Soc Tech Comm Data Eng 24(4):35–43
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K Dan Videnskab Selsk 5(4):1–34
Strengholt B, Brobbel M (2013) Acceleration of the Smith-Waterman algorithm for DNA sequence alignment using an FPGA platform. Delft University of Technology, Delft
Winkler WE (1990) String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Section on Survey Research Methods (American Statistical Association), pp 354–359
Winkler WE, Thibaudeau Y (1991) An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Census. Technical report, US Bureau of the Census

Acknowledgements

This research was carried out as part of the research project MPI (Massively parallel Processing of Internet events), which was initiated by GEMA (Gesellschaft für musikalische Aufführungs- und mechanische Vervielfältigungsrechte) and the CCWI (Competence Center Wirtschaftsinformatik) at Hochschule München. The project deals with the massively parallel processing of music usage data.

Author information

Authors and Affiliations

Fakultät für Informatik und Mathematik, Competence Center Wirtschaftsinformatik, Hochschule für angewandte Wissenschaften München, Lothstraße 34, 80334 München, Germany
Tobias Lindner, Peter Mandl & Nikolai Bauer
IT4IPM – IT for Intellectual Property Management GmbH, Rosenheimer Straße 11, 81667 München, Germany
Markus Grimm

Corresponding author

Correspondence to Tobias Lindner.

About this article

Cite this article

Lindner, T., Mandl, P., Bauer, N. et al. Erkennung von Duplikaten in Big Data am Fallbeispiel der digitalen Musiknutzung. HMD 55, 581–600 (2018). https://doi.org/10.1365/s40702-017-0387-1

Received: 01 August 2017
Accepted: 09 December 2017
Published: 11 January 2018
Issue Date: June 2018
DOI: https://doi.org/10.1365/s40702-017-0387-1
