Abstract
The problem of identifying the most frequent items across multiple datasets has received considerable attention in recent years. The problem is already challenging when storage is a scarce resource; it becomes harder still when the data come from many independent sources and are dynamic, i.e., new items may appear and old ones disappear at any time. In this work, we present a novel approach to the problem that uses an existing gossip-based algorithm for identifying the k most frequent items over a distributed collection of datasets, in ways that cope with the dynamic nature of the data. The approach has been thoroughly evaluated through trace-based simulations and compared with state-of-the-art decentralized solutions, showing better precision at reduced communication overhead.
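To make the setting concrete, the following is a minimal sketch of how gossip can surface globally frequent items. It is not the algorithm evaluated in this paper: it assumes plain push-pull averaging of per-item counts, a static set of nodes, and full connectivity, and it ignores both storage limits and data dynamism. The function name `gossip_top_k` and its parameters are hypothetical.

```python
import random
from collections import Counter


def gossip_top_k(local_counts, k, rounds=30, seed=0):
    """Estimate the global top-k items with push-pull gossip averaging.

    local_counts: list of Counter, one per node (assumed input format).
    Returns one top-k list per node, ranked by that node's own estimates.
    Requires at least two nodes.
    """
    rng = random.Random(seed)
    n = len(local_counts)
    # Every node starts from its own local counts, stored as floats.
    estimates = [{item: float(c) for item, c in counts.items()}
                 for counts in local_counts]
    for _ in range(rounds):
        for i in range(n):
            # Pick a random peer different from node i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Push-pull exchange: both nodes adopt the per-item average.
            # The sum over all nodes is preserved, so each estimate
            # drifts toward (global count / n) for every item.
            merged = {}
            for item in estimates[i].keys() | estimates[j].keys():
                a = estimates[i].get(item, 0.0)
                b = estimates[j].get(item, 0.0)
                merged[item] = (a + b) / 2
            estimates[i] = merged
            estimates[j] = dict(merged)
    # Rank by estimated frequency, breaking ties by item name.
    return [sorted(est, key=lambda it: (-est[it], it))[:k]
            for est in estimates]


nodes = [
    Counter(a=5, b=1),
    Counter(a=4, c=3),
    Counter(b=2, c=6),
    Counter(a=1, d=1),
]
per_node_top2 = gossip_top_k(nodes, k=2)
print(per_node_top2)
```

Because every item's estimate at every node is scaled by the same factor 1/n, ranking by the averaged values gives the same order as ranking by the global totals once the estimates have converged. Handling items that appear or disappear over time would need additional machinery that this sketch does not attempt.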
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Guerrieri, A., Montresor, A., Velegrakis, Y. (2014). Top-k Item Identification on Dynamic and Distributed Datasets. In: Silva, F., Dutra, I., Santos Costa, V. (eds) Euro-Par 2014 Parallel Processing. Euro-Par 2014. Lecture Notes in Computer Science, vol 8632. Springer, Cham. https://doi.org/10.1007/978-3-319-09873-9_23
DOI: https://doi.org/10.1007/978-3-319-09873-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09872-2
Online ISBN: 978-3-319-09873-9
eBook Packages: Computer Science (R0)