Deterministic Data Sampling Based on Neighborhood Analysis

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 297)

Abstract

The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample allows us to apply a wide variety of analytical methods whose direct application to the original data would be infeasible. Many sampling methods exist, serving different purposes and yielding different results. In this paper, we outline a simple, flexible, and generally applicable approach based on analyzing nearest neighbors. Its general applicability is illustrated in experiments with synthetic and real-world datasets. The properties of the resulting representative sample show that the approach preserves internal data structures (e.g. clusters and density) very well. The key technical characteristics of the approach are low complexity and high scalability.
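The abstract does not spell out the selection rule, so the following is only a rough illustration of how a nearest-neighbor analysis can drive a deterministic (non-random) sample. It is not the authors' algorithm. The ranking criterion (how often a point is another point's nearest neighbor), the tie-breaking by index, and the function names `nearest_neighbor` and `deterministic_sample` are all assumptions made for this sketch. The brute-force search is quadratic in the number of points and is not meant to reflect the low complexity claimed for the actual method.

```python
import math


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i].

    Ties are broken by the lowest index, so the result is deterministic.
    Assumes len(points) >= 2.
    """
    best, best_d = None, math.inf
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)  # Euclidean distance (Python 3.8+)
        if d < best_d:
            best, best_d = j, d
    return best


def deterministic_sample(points, size):
    """Pick `size` indices, ranked by how often each point is
    the nearest neighbor of some other point (an assumed criterion).
    """
    counts = [0] * len(points)
    for i in range(len(points)):
        counts[nearest_neighbor(points, i)] += 1
    # Higher count first; equal counts fall back to the lower index.
    ranked = sorted(range(len(points)), key=lambda i: (-counts[i], i))
    return sorted(ranked[:size])


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11)]
    print(deterministic_sample(data, 3))
```

Because no random numbers are involved, repeated runs on the same input always return the same indices, which is the sense of "deterministic" used in this sketch.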



Author information

Corresponding author

Correspondence to Sarka Zehnalova.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Zehnalova, S., Kudelka, M., Platos, J. (2014). Deterministic Data Sampling Based on Neighborhood Analysis. In: Pan, JS., Snasel, V., Corchado, E., Abraham, A., Wang, SL. (eds) Intelligent Data analysis and its Applications, Volume I. Advances in Intelligent Systems and Computing, vol 297. Springer, Cham. https://doi.org/10.1007/978-3-319-07776-5_6

  • DOI: https://doi.org/10.1007/978-3-319-07776-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07775-8

  • Online ISBN: 978-3-319-07776-5

  • eBook Packages: Engineering, Engineering (R0)
