Big Data Competence Center ScaDS Dresden/Leipzig: Overview and selected research activities

  • Erhard Rahm
  • Wolfgang E. Nagel
  • Eric Peukert (Email author)
  • René Jäkel
  • Fabian Gärtner
  • Peter F. Stadler
  • Daniel Wiegreffe
  • Dirk Zeckzer
  • Wolfgang Lehner
Fachbeitrag

Abstract

Since its launch in October 2014, the Competence Center for Scalable Data Services and Solutions (ScaDS) Dresden/Leipzig has carried out collaborative research on Big Data methods and their use in challenging data science applications from different domains, leading to both general and application-specific solutions and services. In this article, we give an overview of the structure of the competence center, its primary goals, and its research directions. Furthermore, we outline selected research results on scalable data platforms, distributed graph analytics, data augmentation and integration, and visual analytics. We also briefly report on planned activities for the second funding period (2018-2021) of the center.

Keywords

Big Data, Data science, Data management

Notes

Acknowledgements

ScaDS Dresden/Leipzig is funded by the German Federal Ministry of Education and Research under grant BMBF 01IS14014B.

Copyright information

© Gesellschaft für Informatik e.V. and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  • Erhard Rahm (1)
  • Wolfgang E. Nagel (2)
  • Eric Peukert (1), Email author
  • René Jäkel (2)
  • Fabian Gärtner (1)
  • Peter F. Stadler (1)
  • Daniel Wiegreffe (1)
  • Dirk Zeckzer (1)
  • Wolfgang Lehner (2)

  1. Leipzig University, Leipzig, Germany
  2. Technische Universität Dresden, Dresden, Germany
