Abstract
The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample makes it possible to apply a wide range of analytical methods that would be infeasible to run directly on the original data. Many sampling methods exist, serving different purposes and producing different results. In this paper, we outline a simple, flexible, and generally applicable approach based on analyzing nearest neighbors. We illustrate its general applicability in experiments with synthetic and real-world datasets. The properties of the resulting representative sample show that the approach preserves internal data structures (e.g., clusters and density) very well. Its key technical characteristics are low complexity and high scalability.
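The abstract does not spell out the selection rule, so the following is only an illustrative sketch of what a deterministic, neighbor-based sampler could look like. It is not the authors' algorithm. It assumes a simple "reverse nearest neighbor count" heuristic: each point votes for its single nearest neighbor, and the points with the most votes are kept. The function names `nearest_neighbor` and `reverse_nn_sample` are hypothetical, and the brute-force search is quadratic, so it does not reflect the low complexity the paper claims.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i].

    Ties are broken by the lowest index so the result is deterministic.
    """
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: (dist(points[i], points[j]), j),
    )


def reverse_nn_sample(points, size):
    """Keep the `size` points most often chosen as another point's nearest neighbor.

    Assumed heuristic for illustration only; requires at least two points.
    """
    votes = [0] * len(points)
    for i in range(len(points)):
        votes[nearest_neighbor(points, i)] += 1
    # Rank by descending vote count, then by index, so no randomness is involved.
    ranked = sorted(range(len(points)), key=lambda j: (-votes[j], j))
    return sorted(ranked[:size])


# Two small groups of points: three near the origin, two near (10, 10).
points = [(0, 0), (0, 1), (0, 2), (10, 10), (10, 11)]
sample = reverse_nn_sample(points, 3)
```

Because every tie is resolved by index, repeated calls on the same input return the same indices, which is the sense in which such a sampler is deterministic rather than random.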
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zehnalova, S., Kudelka, M., Platos, J. (2014). Deterministic Data Sampling Based on Neighborhood Analysis. In: Pan, JS., Snasel, V., Corchado, E., Abraham, A., Wang, SL. (eds) Intelligent Data analysis and its Applications, Volume I. Advances in Intelligent Systems and Computing, vol 297. Springer, Cham. https://doi.org/10.1007/978-3-319-07776-5_6
DOI: https://doi.org/10.1007/978-3-319-07776-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07775-8
Online ISBN: 978-3-319-07776-5
eBook Packages: Engineering, Engineering (R0)