From Metadata Catalogs to Distributed Data Processing for Smart City Platforms and Services: A Study on the Interplay of CKAN and Hadoop

  • Robert Scholz
  • Nikolay Tcholtchev
  • Philipp Lämmel
  • Ina Schieferdecker
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 864)

Abstract

Smart Cities are emerging based on the idea of provisioning and processing large amounts of urban data for various use cases. Urban Data Platforms are usually employed to accumulate and expose large amounts of governmental (i.e. public sector), sensor, static and real-time data in order to enable the community to create valuable applications and services for future Smart Cities. Hitherto, the Open Data initiative has been seen as the key driver for providing large amounts of data within a city. Open Data platforms employ so-called data registries in order to keep track of the available datasets at various sources spread throughout the city, with CKAN currently being among the most popular data catalog software worldwide. With the emergence of frameworks for large-scale distributed computing and storage, such as Hadoop and its associated distributed file system (HDFS), there is an inherent need to bridge the worlds of metadata catalogs and distributed data processing towards the goal of providing sophisticated urban ICT services. The current paper constitutes a first attempt in this new field, by prototyping and evaluating components that enable the collaboration and interplay between CKAN and Hadoop/HDFS. This interplay is realized through extensions to CKAN and its harvesting process, and its benefits are demonstrated by accompanying case studies.

Keywords

Smart Cities, Open Data, Distributed processing, Hadoop, CKAN

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Robert Scholz (1)
  • Nikolay Tcholtchev (1)
  • Philipp Lämmel (1)
  • Ina Schieferdecker (1)
  1. Fraunhofer Institute for Open Communication Systems (FOKUS), Berlin, Germany
