Deterministic Data Sampling Based on Neighborhood Analysis

Zehnalova, Sarka; Kudelka, Milos; Platos, Jan

doi:10.1007/978-3-319-07776-5_6

Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 297)

Abstract

The volume of large-scale real-world data is growing rapidly, and with it the need to reduce that data to a representative sample. Such a sample makes it possible to apply a wide variety of analytical methods whose direct application to the original data would be infeasible. Many sampling methods exist, serving different purposes and producing different results. In this paper, we outline a simple, flexible, and generally applicable approach based on analyzing nearest neighbors. Its general applicability is illustrated in experiments with synthetic and real-world datasets. The properties of the resulting representative sample show that the approach preserves internal data structures (e.g., clusters and density) very well. The key technical characteristics of the approach are low complexity and high scalability.
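
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of how a deterministic, nearest-neighbor-based reduction might look. It is not the authors' method. The selection rule (keep the points most often chosen as someone's nearest neighbor, and let each kept point stand in for the points that chose it) and the names `nearest_neighbor` and `deterministic_sample` are assumptions made for this example. The brute-force neighbor search is quadratic in the number of points and is used only to keep the sketch self-contained.

```python
from math import dist


def nearest_neighbor(points, i):
    """Return the index of the point closest to points[i].

    Ties are resolved in favor of the lowest index, so the result
    never depends on a random choice.
    """
    best, best_d = None, float("inf")
    for j, q in enumerate(points):
        if j == i:
            continue
        d = dist(points[i], q)
        if d < best_d:
            best, best_d = j, d
    return best


def deterministic_sample(points):
    """Hypothetical reduction rule (an assumption, not the paper's algorithm).

    1. Find the nearest neighbor of every point.
    2. Count how often each point is chosen as a nearest neighbor.
    3. Visit points by decreasing count (index breaks ties); keep a point
       if nothing kept so far already represents it, then mark every point
       that chose it as represented.
    Returns the sorted indices of the kept points.
    """
    n = len(points)
    if n < 2:
        return list(range(n))

    nn = [nearest_neighbor(points, i) for i in range(n)]

    votes = [0] * n
    for j in nn:
        votes[j] += 1

    order = sorted(range(n), key=lambda i: (-votes[i], i))
    covered = [False] * n
    sample = []
    for i in order:
        if covered[i]:
            continue
        sample.append(i)
        covered[i] = True
        for k in range(n):
            if nn[k] == i:
                covered[k] = True
    return sorted(sample)


# Two small, well-separated groups of three points each.
points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(deterministic_sample(points))  # [0, 3]
```

Because every step uses a fixed ordering and a fixed tie-break, repeated runs on the same input return the same sample, which is the sense in which this sketch is deterministic. On the toy input above, one point from each group survives.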

Author information

Authors and Affiliations

VSB - Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava, Czech Republic
Sarka Zehnalova, Milos Kudelka & Jan Platos

Corresponding author

Correspondence to Sarka Zehnalova.

Editor information

Editors and Affiliations

National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan
Jeng-Shyang Pan
Department of Computer Science, Faculty of Elec. Eng. & Comp. Sci., VSB-Technical University of Ostrava, Ostrava-Poruba, Czech Republic
Vaclav Snasel
Departamento de Informática y Automática, Facultad de Biología, University of Salamanca, Salamanca, Spain
Emilio S. Corchado
Scientific Network for Innovation and Research Excellence, Machine Intelligence Research Labs (MIR Labs), Auburn, Washington, USA
Ajith Abraham
Department of Information Management, National University of Kaohsiung, Kaohsiung, Taiwan
Shyue-Liang Wang

About this paper

Cite this paper

Zehnalova, S., Kudelka, M., Platos, J. (2014). Deterministic Data Sampling Based on Neighborhood Analysis. In: Pan, JS., Snasel, V., Corchado, E., Abraham, A., Wang, SL. (eds) Intelligent Data analysis and its Applications, Volume I. Advances in Intelligent Systems and Computing, vol 297. Springer, Cham. https://doi.org/10.1007/978-3-319-07776-5_6

DOI: https://doi.org/10.1007/978-3-319-07776-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07775-8
Online ISBN: 978-3-319-07776-5
eBook Packages: Engineering, Engineering (R0)
