ReX: Extrapolating Relational Data in a Representative Way

  • Conference paper
Data Science (BICOD 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9147)

Abstract

Generating synthetic data is useful in multiple application areas (e.g., database testing, software testing). Nevertheless, existing synthetic data generators generally lack the mechanisms needed to produce realistic data unless the user provides a complex set of inputs, such as the characteristics of the desired data. An automated and efficient technique is needed for generating realistic data. In this paper, we propose ReX, a novel extrapolation system targeting relational databases that aims to produce a representative extrapolated database given an original one and a natural scaling rate. Furthermore, we evaluate our system against an existing realistic scaling method, UpSizeR, by measuring the representativeness of the extrapolated database with respect to the original one, the accuracy of approximate query answering, the database size, and the performance of the two systems. Results show that our solution significantly outperforms the compared method on all considered dimensions.
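To make the idea of scaling a relational database by a natural rate more concrete, the following is a minimal sketch and not necessarily the algorithm used by ReX. It assumes a hypothetical two-table schema (a parent table and a child table linked by one foreign key) with non-empty tables and integer keys, and it scales both tables by a natural factor `s` by copying every tuple `s` times under fresh keys. The function names `extrapolate` and `fanout_histogram` are illustrative only. Under these assumptions, the number of children per parent keeps the same relative distribution after scaling, which is one simple notion of representativeness.

```python
from collections import Counter


def extrapolate(parents, children, s):
    """Scale a parent/child pair of tables by a natural factor s.

    parents:  list of integer primary keys of the parent table
    children: list of (child_pk, parent_fk) tuples
    Every original tuple is copied s times with fresh keys, so each copy
    of a parent keeps the same number of children as the original.
    This is an illustrative assumption, not the published ReX method.
    """
    if s < 1:
        raise ValueError("scaling rate must be a natural number >= 1")
    par_offset = max(parents) + 1               # keeps new parent keys unique
    chi_offset = max(pk for pk, _ in children) + 1  # same for child keys
    new_parents = [p + k * par_offset for k in range(s) for p in parents]
    new_children = [(c + k * chi_offset, fk + k * par_offset)
                    for k in range(s) for c, fk in children]
    return new_parents, new_children


def fanout_histogram(parents, children):
    """Count how many parents have 0, 1, 2, ... children."""
    per_parent = Counter(fk for _, fk in children)
    return Counter(per_parent.get(p, 0) for p in parents)


# Toy original database: parent 0 has two children, parent 1 has one, parent 2 has none.
parents = [0, 1, 2]
children = [(0, 0), (1, 0), (2, 1)]

big_parents, big_children = extrapolate(parents, children, 3)
orig_hist = fanout_histogram(parents, children)
scaled_hist = fanout_histogram(big_parents, big_children)
```

Comparing `orig_hist` with `scaled_hist` shows that every fan-out count is multiplied by the scaling rate, so the relative distribution is unchanged. A realistic system would also need to handle multiple foreign keys, non-numeric keys, and attribute value distributions, which this sketch leaves out.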

Notes

  1. Representative eXtrapolation System, https://github.com/tbuda/ReX.

  2. http://comp.nus.edu.sg/~upsizer/#download.

  3. lisp.vse.cz/pkdd99/Challenge/berka.htm.

  4. tpc.org/tpch.

  5. sqledit.com/dg, spawner.sourceforge.net, dgmaster.sourceforge.net, generatedata.com.

  6. cse.ust.hk/graphgen.

  7. ibmquestdatagen.sourceforge.net.

Acknowledgments

This work was supported, in part, by Science Foundation Ireland grant 10/CE/I1855 to Lero - the Irish Software Engineering Research Centre (www.lero.ie). The authors also acknowledge Dr. Nicola Stokes’ feedback.

Author information

Corresponding author

Correspondence to Teodora Sandra Buda.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Buda, T.S., Cerqueus, T., Murphy, J., Kristiansen, M. (2015). ReX: Extrapolating Relational Data in a Representative Way. In: Maneth, S. (ed.) Data Science. BICOD 2015. Lecture Notes in Computer Science, vol 9147. Springer, Cham. https://doi.org/10.1007/978-3-319-20424-6_10

  • DOI: https://doi.org/10.1007/978-3-319-20424-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-20423-9

  • Online ISBN: 978-3-319-20424-6

  • eBook Packages: Computer Science, Computer Science (R0)
