Dataspace Management for Large Data Sets

Innovative Computing Trends and Applications

Abstract

In an ideal case, Big Data analysis will enable us to learn relevant and interesting facts from large interconnected data sets. Dataspace support platforms and dataspace management systems have been proposed to help analysts bring together data related to their interests. In this paper, we provide an example of such a platform. In addition to storing data and a description of its characteristics, the platform supports verifying the compatibility (and eventually the summarizability) of the underlying data. This helps analysts discover mistakes and prevents meaningless aggregations. As an example of utilizing the platform, we present a case of large data sets (tens of millions of observations), describe how the data sets can be used, and study the platform’s performance.

Notes

  1. http://comtrade.un.org.

  2. https://www.cia.gov/library/publications/the-world-factbook/.

  3. Niemi et al. [9] present an algorithm that determines whether additivity is correct in OLAP queries expressed in an MDX-like query language. However, in addition to the statistical scale and “eventness,” the algorithm uses “measure depends on dimension” information: for instance, a currency measure depends on the country dimension. This kind of information is harder to provide in a dataspace application.

  4. In their simple form, DAX-related expressions state that a value in table1/columna has a counterpart in table2/columnb, just as a country’s ISO3 code has a country name counterpart. The graphical tool lets users connect tables by their columns. For instance, the exporter in the trade data is given as an ISO3 code, which we connect with the ISO3 column of the country data.

  5. In more detail: the macro iterates over the data sources d_i and over each field f_j of each data source. The names of the data sources and the fields are assumed to be the same as in the dataspace system. If a data source field d_i.f_j is connected to another field d_k.f_l, the macro tries to find the following entry in the compatible.xml file: <compatible><file>di</file><field>fj</field><file>dk</file><field>fl</field></compatible>.

  6. https://comtrade.un.org/db/mr/rfCommoditiesList.aspx.
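
The lookup described in note 5 can be sketched roughly as follows. This is only an illustration in Python, not the authors' macro: the wrapping root element, the sample data source and field names, and the helper functions are all hypothetical. Only the shape of the <compatible> entry is taken from the note.

```python
import xml.etree.ElementTree as ET

# Hypothetical content of compatible.xml. The <compatibility> root element
# and the file/field names are assumptions made for this sketch; only the
# <compatible><file/><field/><file/><field/></compatible> shape is from note 5.
COMPATIBLE_XML = """
<compatibility>
  <compatible><file>trade</file><field>exporter</field><file>country</file><field>iso3</field></compatible>
</compatibility>
"""


def load_compatible_pairs(xml_text):
    """Return the set of ((file, field), (file, field)) pairs declared compatible."""
    root = ET.fromstring(xml_text)
    pairs = set()
    for entry in root.iter("compatible"):
        files = [(e.text or "").strip() for e in entry.findall("file")]
        fields = [(e.text or "").strip() for e in entry.findall("field")]
        # Skip malformed entries that do not name exactly two file/field pairs.
        if len(files) == 2 and len(fields) == 2:
            pairs.add(((files[0], fields[0]), (files[1], fields[1])))
    return pairs


def is_compatible(pairs, left, right):
    """Look up one connection in the order given, as the note describes."""
    return (left, right) in pairs


def check_connections(pairs, connections):
    """Return the connections that have no matching <compatible> entry."""
    return [(left, right) for left, right in connections
            if not is_compatible(pairs, left, right)]


if __name__ == "__main__":
    pairs = load_compatible_pairs(COMPATIBLE_XML)
    connections = [
        (("trade", "exporter"), ("country", "iso3")),
        (("trade", "value"), ("country", "name")),
    ]
    for left, right in check_connections(pairs, connections):
        print("No compatibility entry for", left, "->", right)
```

The sketch treats the lookup as directed, matching the order in which the entry is written in the note. Whether the actual macro also accepts the reversed pair is not stated there.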

References

  1. McFedries, P.: The coming data deluge. IEEE Spectrum. 48, 19 (2011)

  2. Chen, P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)

  3. Halevy, A., Franklin, M., Maier, D.: Principles of dataspace systems. In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2006

  4. Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23, 3–13 (2000)

  5. Chaudhuri, S., Dayal, U., Narasayya, V.: An overview of business intelligence technology. Commun. ACM. 54(8), 88–98 (2011)

  6. Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. ACM Sigmod Rec. 34(4), 27–33 (2005)

  7. Winston, W.: Microsoft Excel Data Analysis and Business Modeling, 5th edn, p. 864. Microsoft Press, Redmond (2016)

  8. Lenz, H.-J., Shoshani, A.: Summarizability in OLAP and statistical data bases. In: Proceedings of the Ninth International Conference on Scientific and Statistical Database Management, 1997

  9. Niemi, T., Niinimäki, M., Thanisch, P., Nummenmaa, J.: Detecting summarizability in OLAP. Data Knowl. Eng. 89, 1–20 (2014)

  10. Harinath, S., Pihlgren, R., Guang-Yeu Lee, D., Sirmon, J., Bruckner, R.R.: Professional Microsoft SQL Server 2012 Analysis Services with MDX and DAX. Wiley, Hoboken (2012)

  11. Dittrich, J.-P.: iMeMex: a platform for personal dataspace management. In: Proceedings of Workshops of International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006

  12. Mirza, H.T., Chen, L., Chen, G.: Practicability of dataspace systems. Int. J. Digital Content Technol. Appl. 4, 3 (2010)

  13. Moilanen, K., Niemi, T., Näppilä, T., Kuru, M.: A visual XML dataspace approach for satisfying ad hoc information needs. J. Assoc. Inf. Sci. Technol. 66(11), 2304–2320 (2015)

  14. Niinimaki, M., Niemi, T.: An ETL process for OLAP using RDF/OWL ontologies. J. Data Semantics. XIII, 97–119 (2009)

  15. Stevens, S.: On the theory of scales of measurement. Science. 103(2684), 677–680 (1946)

  16. Grinberg, M.: Flask Web Development: Developing Web Applications with Python. O’Reilly Media, Sebastopol (2014)

  17. Winston, W.: Microsoft Excel Data Analysis and Business Modeling. Microsoft Press, Redmond (2016)

  18. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.: The rise of “big data” on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)

  19. Cusumano, M.: Cloud computing and SaaS as new computing platforms. Commun. ACM. 53(4), 27–29 (2010)

Acknowledgments

The authors wish to thank COMTRADE for the access to their export data and Dr. Leslie Klieb for comments.

Author information

Corresponding author

Correspondence to Marko Niinimaki.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Niinimaki, M., Thanisch, P. (2019). Dataspace Management for Large Data Sets. In: Vasant, P., Litvinchev, I., Marmolejo-Saucedo, J. (eds) Innovative Computing Trends and Applications. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-03898-4_2

  • DOI: https://doi.org/10.1007/978-3-030-03898-4_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-03897-7

  • Online ISBN: 978-3-030-03898-4

  • eBook Packages: Engineering, Engineering (R0)
