Abstract
In the ideal case, Big Data analysis enables us to learn relevant and interesting facts from large, interconnected data sets. Dataspace support platforms and dataspace management systems have been proposed to help analysts bring together data related to their interests. In this paper, we present an example of such a platform. In addition to storing data and descriptions of its characteristics, the platform supports verifying the compatibility (and eventually the summarizability) of the underlying data. This helps analysts discover mistakes and prevents meaningless aggregations. As an example of using the platform, we present a case involving large data sets (tens of millions of observations), describe how the data sets can be used, and study the platform’s performance.
Notes
- 1.
- 2.
- 3.
Niemi et al. [9] present an algorithm that checks whether additivity is correct in OLAP queries expressed in an MDX-like query language. However, in addition to the statistical scale and “eventness,” the algorithm uses “measure depends on dimension” information: for instance, a currency measure depends on the country dimension. This kind of information is harder to provide in a dataspace application.
- 4.
In its simplest form, a DAX-related expression states that a value in table1/columna has a counterpart in table2/columnb, just as a country’s ISO3 code has a country-name counterpart. The graphical tool lets users connect tables by their columns. For instance, the exporter in the trade data is given as an ISO3 code, which we connect with the ISO3 column of the country data.
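The column-to-column connection described in this note can be illustrated outside the spreadsheet tool. The following is a minimal Python sketch, not the platform's DAX or graphical mechanism: the table contents, the column names (`exporter`, `iso3`, `name`), and the helpers `find_unmatched` and `join_names` are all hypothetical and serve only to show what it means for each exporter code in the trade data to have a counterpart in the country data.

```python
# Hypothetical miniature tables; the real data sets hold millions of rows.
trade = [
    {"exporter": "FIN", "value": 120.0},
    {"exporter": "CHE", "value": 75.5},
    {"exporter": "XXX", "value": 3.0},  # no counterpart in the country table
]
countries = [
    {"iso3": "FIN", "name": "Finland"},
    {"iso3": "CHE", "name": "Switzerland"},
]


def find_unmatched(rows, column, lookup_rows, lookup_column):
    """Return values of `column` that have no counterpart in `lookup_column`."""
    known = {r[lookup_column] for r in lookup_rows}
    return sorted({r[column] for r in rows} - known)


def join_names(rows, column, lookup_rows, lookup_column, name_column):
    """Attach the counterpart name to each row; None marks a missing counterpart."""
    names = {r[lookup_column]: r[name_column] for r in lookup_rows}
    return [dict(r, exporter_name=names.get(r[column])) for r in rows]


unmatched = find_unmatched(trade, "exporter", countries, "iso3")
joined = join_names(trade, "exporter", countries, "iso3", "name")
```

Listing the unmatched codes before joining is one way to surface connection mistakes early; whether the platform reports them in this form is not stated in the text.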
- 5.
In more detail: the macro iterates over the data sources d and over each field f of each data source. The names of the data sources and fields are assumed to be the same as in the dataspace system. If a field f_j of data source d_i is connected to a field f_l of another data source d_k, the macro tries to find the following entry in the compatible.xml file: <compatible><file>d_i</file><field>f_j</field><file>d_k</file><field>f_l</field></compatible>.
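The lookup described in this note can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' macro: the root element name `<compatibles>`, the sample source and field names, and the helper functions are hypothetical; only the layout of a `<compatible>` entry is taken from the text. The sketch also assumes that an entry applies only in the direction in which it is written.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents of compatible.xml; the root element <compatibles>
# is an assumption, only the <compatible> entry layout comes from the text.
COMPATIBLE_XML = """
<compatibles>
  <compatible><file>trade</file><field>exporter</field><file>country</file><field>iso3</field></compatible>
</compatibles>
"""


def load_compatible_pairs(xml_text):
    """Collect ((file, field), (file, field)) pairs from <compatible> entries."""
    pairs = set()
    for entry in ET.fromstring(xml_text.strip()).iter("compatible"):
        files = [e.text.strip() for e in entry.findall("file")]
        fields = [e.text.strip() for e in entry.findall("field")]
        if len(files) == 2 and len(fields) == 2:
            pairs.add(((files[0], fields[0]), (files[1], fields[1])))
    return pairs


def is_compatible(pairs, source_a, field_a, source_b, field_b):
    """True if the connection is listed, in the direction given (an assumption)."""
    return ((source_a, field_a), (source_b, field_b)) in pairs


def check_connections(connections, pairs):
    """Return the connections that have no matching <compatible> entry."""
    return [c for c in connections if not is_compatible(pairs, *c)]


pairs = load_compatible_pairs(COMPATIBLE_XML)
connections = [
    ("trade", "exporter", "country", "iso3"),  # listed in the XML
    ("trade", "value", "country", "iso3"),     # not listed
]
missing = check_connections(connections, pairs)
```

Loading the entries into a set once keeps each lookup constant-time, which matters little for a handful of connections but avoids re-parsing the file for every connected field pair.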
- 6.
References
McFedries, P.: The coming data deluge. IEEE Spectrum. 48, 19 (2011)
Chen, P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf. Sci. 275, 314–347 (2014)
Halevy, A., Franklin, M., Maier, D.: Principles of dataspace systems. In: Proceedings of the Twenty-Fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2006
Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23, 3–13 (2000)
Chaudhuri, S., Dayal, U., Narasayya, V.: An overview of business intelligence technology. Commun. ACM. 54(8), 88–98 (2011)
Franklin, M., Halevy, A., Maier, D.: From databases to dataspaces: a new abstraction for information management. ACM Sigmod Rec. 34(4), 27–33 (2005)
Winston, W.: Microsoft Excel Data Analysis and Business Modeling, 5th edn, p. 864. Microsoft Press, Redmond (2016)
Lenz, H.-J., Shoshani, A.: Summarizability in OLAP and statistical data bases. In: Proceedings of the Ninth International Conference on Scientific and Statistical Database Management, 1997
Niemi, T., Niinimäki, M., Thanisch, P., Nummenmaa, J.: Detecting summarizability in OLAP. Data Knowl. Eng. 89, 1–20 (2014)
Harinath, S., Pihlgren, R., Guang-Yeu Lee, D., Sirmon, J., Bruckner, R.R.: Professional Microsoft SQL Server 2012 Analysis Services with MDX and DAX. Wiley, Hoboken (2012)
Dittrich, J.-P.: iMeMex: a platform for personal dataspace management. In: Proceedings of Workshops of International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006
Mirza, H.T., Chen, L., Chen, G.: Practicability of dataspace systems. Int. J. Digital Content Technol. Appl. 4, 3 (2010)
Moilanen, K., Niemi, T., Näppilä, T., Kuru, M.: A visual XML dataspace approach for satisfying ad hoc information needs. J. Assoc. Inf. Sci. Technol. 66(11), 2304–2320 (2015)
Niinimaki, M., Niemi, T.: An ETL process for OLAP using RDF/OWL ontologies. J. Data Semantics. XIII, 97–119 (2009)
Stevens, S.: On the theory of scales of measurement. Science. 103(2684), 677–680 (1946)
Grinberg, M.: Flask Web Development: Developing Web Applications with Python. O’Reilly Media, Sebastopol (2014)
Winston, W.: Microsoft Excel Data Analysis and Business Modeling. Microsoft Press, Redmond (2016)
Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A., Khan, S.: The rise of “big data” on cloud computing: review and open research issues. Inf. Syst. 47, 98–115 (2015)
Cusumano, M.: Cloud computing and SaaS as new computing platforms. Commun. ACM. 53(4), 27–29 (2010)
Acknowledgments
The authors wish to thank COMTRADE for the access to their export data and Dr. Leslie Klieb for comments.
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Niinimaki, M., Thanisch, P. (2019). Dataspace Management for Large Data Sets. In: Vasant, P., Litvinchev, I., Marmolejo-Saucedo, J. (eds) Innovative Computing Trends and Applications. EAI/Springer Innovations in Communication and Computing. Springer, Cham. https://doi.org/10.1007/978-3-030-03898-4_2
DOI: https://doi.org/10.1007/978-3-030-03898-4_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03897-7
Online ISBN: 978-3-030-03898-4
eBook Packages: Engineering, Engineering (R0)