ReX: Extrapolating Relational Data in a Representative Way

Buda, Teodora Sandra; Cerqueus, Thomas; Murphy, John; Kristiansen, Morten

doi:10.1007/978-3-319-20424-6_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9147)

Included in the following conference series:

British International Conference on Databases

Abstract

Generating synthetic data is useful in multiple application areas (e.g., database testing, software testing). Nevertheless, existing synthetic data generators generally lack the mechanisms needed to produce realistic data unless the user supplies a complex set of inputs, such as the characteristics of the desired data. An automated and efficient technique is needed for generating realistic data. In this paper, we propose ReX, a novel extrapolation system targeting relational databases that aims to produce a representative extrapolated database given an original one and a natural scaling rate. Furthermore, we evaluate our system against an existing realistic scaling method, UpSizeR, by measuring the representativeness of the extrapolated database with respect to the original one, the accuracy of approximate query answering, the database size, and the performance of each system. Results show that our solution significantly outperforms the compared method for all considered dimensions.

Notes

1. Representative eXtrapolation System, https://github.com/tbuda/ReX.
2. http://comp.nus.edu.sg/~upsizer/#download.
3. lisp.vse.cz/pkdd99/Challenge/berka.htm.
4. tpc.org/tpch.
5. sqledit.com/dg, spawner.sourceforge.net, dgmaster.sourceforge.net, generatedata.com.
6. cse.ust.hk/graphgen.
7. ibmquestdatagen.sourceforge.net.

References

Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In: SIGMOD, pp. 685–696 (2011)
Binnig, C., Kossmann, D., Lo, E., Özsu, M.T.: QAGen: Generating query-aware test databases. In: SIGMOD, pp. 341–352 (2007)
Bruno, N., Chaudhuri, S.: Flexible database generators. In: VLDB, pp. 1097–1107 (2005)
Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M.: CoDS: a representative sampling method for relational databases. In: Decker, H., Lhotská, L., Link, S., Basl, J., Tjoa, A.M. (eds.) DEXA 2013, Part I. LNCS, vol. 8055, pp. 342–356. Springer, Heidelberg (2013)
Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M.: VFDS: Very fast database sampling system. In: IEEE IRI, pp. 153–160 (2013)
Chays, D., Shahid, J., Frankl, P.G.: Query-based test generation for database applications. In: DBTest, pp. 1–6 (2008)
Deng, Y., Frankl, P., Chays, D.: Testing database transactions with AGENDA. In: ICSE, pp. 78–87 (2005)
Gemulla, R., Rösch, P., Lehner, W.: Linked Bernoulli synopses: sampling along foreign keys. In: Ludäscher, B., Mamoulis, N. (eds.) SSDBM 2008. LNCS, vol. 5069, pp. 6–23. Springer, Heidelberg (2008)
Gray, J., Sundaresan, P., Englert, S., Baclawski, K., Weinberger, P.J.: Quickly generating billion-record synthetic databases. SIGMOD Rec. 23(2), 243–252 (1994)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD 11(1), 10–18 (2009)
Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator. SIGMOD Rec. 36(1), 19–24 (2007)
Houkjær, K., Torp, K., Wind, R.: Simple and realistic data generation. In: VLDB, pp. 1243–1246 (2006)
Lo, E., Cheng, N., Hon, W.-K.: Generating databases for query workloads. PVLDB 3(1–2), 848–859 (2010)
Olston, C., Chopra, S., Srivastava, U.: Generating example data for dataflow programs. In: SIGMOD, pp. 245–256 (2009)
Pei, Y., Zaïane, O.: A synthetic data generator for clustering and outlier analysis. Technical report (2006)
Ramesh, G., Zaki, M.J., Maniatty, W.A.: Distribution-based synthetic database generation techniques for itemset mining. In: IDEAS, pp. 307–316 (2005)
Stephens, J.M., Poess, M.: MUDD: a multidimensional data generator. In: WOSP, pp. 104–109 (2004)
Taneja, K., Zhang, Y., Xie, T.: MODA: Automated test generation for database applications via mock objects. In: ASE, pp. 289–292 (2010)
Tay, Y., Dai, B.T., Wang, D.T., Sun, E.Y., Lin, Y., Lin, Y.: UpSizeR: synthetically scaling an empirical relational database. Inf. Syst. 38(8), 1168–1183 (2013)

Acknowledgments

This work was supported, in part, by Science Foundation Ireland grant 10/CE/I1855 to Lero - the Irish Software Engineering Research Centre (www.lero.ie). The authors also acknowledge Dr. Nicola Stokes’ feedback.

Author information

Authors and Affiliations

Lero, Performance Engineering Lab, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
Teodora Sandra Buda & John Murphy
Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, 69621, Lyon, France
Thomas Cerqueus
IBM Collaboration Solutions, IBM Software Group, Dublin, Ireland
Morten Kristiansen

Corresponding author

Correspondence to Teodora Sandra Buda.

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, United Kingdom
Sebastian Maneth

About this paper

Cite this paper

Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M. (2015). ReX: Extrapolating Relational Data in a Representative Way. In: Maneth, S. (eds) Data Science. BICOD 2015. Lecture Notes in Computer Science, vol 9147. Springer, Cham. https://doi.org/10.1007/978-3-319-20424-6_10

DOI: https://doi.org/10.1007/978-3-319-20424-6_10
Published: 11 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20423-9
Online ISBN: 978-3-319-20424-6
eBook Packages: Computer Science, Computer Science (R0)
