Abstract
Suppose there are n users and m items, and the preference of each user for the items is revealed only upon probing, which takes time and is therefore costly. How can we quickly discover all the frequent items that are favored individually by at least a given number of users? This new problem not only has strong connections with several well-known problems, such as the frequent item mining problem, but it also finds applications in fields such as sponsored search and marketing surveys. Unlike traditional frequent item mining, however, our problem assumes no prior knowledge of users’ preferences, and thus obtaining the support of an item becomes costly. Although our problem can be settled naively by probing the preferences of all n users, the number of users is typically enormous, and each probing itself can incur a prohibitive cost. We present a sampling algorithm that drastically reduces the number of users that need to be probed to \(O(\log m)\), regardless of the number of users, as long as slight inaccuracy in the output is permitted. For reasonably sized input, our algorithm needs to probe only \(0.5\%\) of the users, whereas the naive approach needs to probe all of them.
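As a rough illustration of the idea, the sketch below is our own reconstruction from the analysis in the appendix, not the paper's listing: it draws the sample size suggested by Theorem 1, probes only the sampled users through a hypothetical `probe` callback, and keeps the items whose sampled support clears the \((p-\varepsilon /2)s\) threshold used in the proofs.

```python
import math
import random
from collections import Counter

def support_sampling(users, probe, p, eps, delta, m):
    """Hedged sketch of a support-sampling scheme (reconstruction, not
    the paper's code). `probe(user)` returns the set of items that user
    favors; `p` is the support threshold fraction, `eps`/`delta` the
    error/confidence parameters, and `m` the number of items."""
    # Sample size as in Theorem 1: the larger of the two case bounds.
    s = math.ceil(max(8 * p, 12 * (p - eps)) / eps**2 * math.log(m / delta))
    s = min(s, len(users))
    sample = random.sample(users, s)
    counts = Counter()
    for u in sample:
        counts.update(probe(u))   # the costly probe, done only s times
    # Keep items whose sampled support clears the (p - eps/2) threshold.
    return {t for t, c in counts.items() if c >= (p - eps / 2) * s}
```

With 1000 synthetic users, the sketch probes only the sampled few hundred yet still separates an item favored by 60% of users from one favored by 5%.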
Notes
- 1.
\(\varepsilon \) is the error parameter and \(\delta \) is the confidence parameter. They are fully discussed in Sect. 4.1.
References
About IMDb (2019). http://www.imdb.com/pressroom/about/
IMDb Charts (2019). http://www.imdb.com/chart/top
Netflix datasets (2019). www.netflixprize.com
Yahoo! datasets (2019). https://webscope.sandbox.yahoo.com/
Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
Chakrabarty, D., Zhou, Y., Lukose, R.: Budget constrained bidding in keyword auctions and online knapsack problems. In: Workshop on Sponsored Search Auctions, WWW 2007 (2007)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Dimitropoulos, X., Hurley, P., Kind, A.: Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Comput. Commun. Rev. 38(1), 5 (2008)
Dubhashi, D.P., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, New York (2009)
Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: CrowdDB: answering queries with crowd sourcing. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 61–72. ACM (2011)
Garfield, E.: Premature discovery or delayed recognition-why? Current Contents 21, 5–10 (1980)
Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29, 1–12 (2000)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963)
Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. (CSUR) 40(4), 11 (2008)
Kessler Faulkner, T., Brackenbury, W., Lall, A.: k-regret queries with nonlinear utilities. Proc. VLDB Endow. 8(13), 2098–2109 (2015)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346–357. VLDB Endowment (2002)
Peng, P., Wong, R.C.-W.: k-hit query: top-k query with probabilistic utility function. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)
Shanbhag, A., Pirk, H., Madden, S.: Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1557–1570. ACM (2018)
Solanki, S.K., Patel, J.T.: A survey on association rule mining. In: 2015 Fifth International Conference on Advanced Computing & Communication Technologies, pp. 212–216. IEEE (2015)
Thompson, S.K.: Sample size for estimating multinomial proportions. Am. Stat. 41(1), 42–46 (1987)
Thompson, S.K.: Sampling. Wiley, Hoboken (2012)
Walpole, R.E., Myers, R.H., Myers, S.L., Ye, K.: Probability and Statistics for Engineers and Scientists. Macmillan, New York (2011)
Zhang, W., Zhang, Y., Gao, B., Yu, Y., Yuan, X., Liu, T.-Y.: Joint optimization of bid and budget allocation in sponsored search. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1177–1185. ACM (2012)
Zhang, Z., Jin, C., Kang, Q.: Reverse k-ranks query. Proc. VLDB Endow. 7, 785–796 (2014)
Acknowledgement
The research is supported by HKRGC GRF 14205117.
Appendices
A Appendix: Probabilistic Inequalities
We outline several probabilistic inequalities that will be employed in our proofs.
Proposition 3
(Chernoff’s bound [9] and Hoeffding’s inequality [14]). Let \(X_1,X_2,\dots ,X_s\) be independent Bernoulli variables with \(\mathbf{Pr }\left[ X_i=1\right] =p\), where \(0<p<1\), for \(i=1,2,\dots ,s\). Let \(X=\sum _{i=1}^{s}X_i\) and \(\mathbf E \left[ X\right] =\mu \) such that \(\mu _\ell \le \mu \le \mu _h\), where \(\mu _\ell ,\mu _h\in \mathbb {R}\). Then, for any \(\varepsilon <1\), \(\mathbf{Pr }\left[ X\le (1-\varepsilon )\mu _\ell \right] \le e^{-\frac{\mu _\ell \varepsilon ^2}{2}} \quad \text {and}\quad \mathbf{Pr }\left[ X\ge (1+\varepsilon )\mu _h\right] \le e^{-\frac{\mu _h\varepsilon ^2}{3}}~.\) Moreover, for any \(t>0\), \(\mathbf{Pr }\left[ X\ge \mathbf E \left[ X\right] +t\right] \le e^{-\frac{2t^2}{s}}~.\)
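Since these inequalities are standard, a quick simulation can sanity-check them; the sketch below assumes the lower-tail Chernoff form \(\mathbf{Pr }\left[ X\le (1-\varepsilon )\mu \right] \le e^{-\mu \varepsilon ^2/2}\) and compares it against the empirical tail frequency of a Bernoulli sum.

```python
import math
import random

random.seed(1)
s, p, eps = 2000, 0.3, 0.2
mu = p * s                      # E[X] for X = sum of s Bernoulli(p) variables
trials = 1000

# Empirically estimate the lower-tail probability Pr[X <= (1 - eps) * mu].
hits = sum(
    sum(random.random() < p for _ in range(s)) <= (1 - eps) * mu
    for _ in range(trials)
)
empirical = hits / trials
chernoff = math.exp(-mu * eps ** 2 / 2)  # assumed lower-tail bound

# The empirical frequency should never exceed the theoretical bound.
print(empirical, chernoff)
```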
B Appendix: Proofs
Proof
(Lemma 1). Let X be the discovered support of a given frequent item \(t\) (i.e., an item whose support is at least pn). For \(i=1,2,\dots ,s\), let \(X_i\) be the indicator random variable such that \(X_i=1\) if the ith sampled user favors item \(t\), and \(X_i=0\) otherwise. We first note that \(X=\sum _{i=1}^{s}X_i\). In addition, since our algorithm samples users independently and uniformly at random, we have \(\mathbf E \left[ X_i\right] =\mathbf{Pr }\left[ X_i=1\right] =\frac{\text {support}(t)}{n}\ge \frac{pn}{n}=p~.\) So, we conclude that \(\mathbf E \left[ X\right] =\mathbf E \left[ \sum _{i=1}^{s}X_i\right] =\sum _{i=1}^{s}\mathbf E \left[ X_i\right] \ge ps.\)
Now, we bound the probability that item t is not returned. According to our algorithm, this event occurs if and only if the discovered support of item \(t\) is less than \((p-\varepsilon /2)s\): \(\mathbf{Pr }\left[ X<\left( p-\frac{\varepsilon }{2}\right) s\right] \le e^{-\frac{s\varepsilon ^2}{8p}}~.\)
The inequality above is established by Chernoff’s bound (Proposition 3). Now, without loss of generality, suppose that there are k frequent items in the item set \(\mathcal {T}\), where \(1\le k\le m\) is unknown. As will become clear shortly, the value of k is irrelevant to the analysis of our algorithm. By the union bound (a.k.a. Boole’s inequality), the probability that at least one of the k frequent items is not returned by our algorithm is therefore bounded above by \(ke^{-\frac{s\varepsilon ^2}{8p}}~.\) \(\square \)
Proof
(Lemma 2). Now, let Y be the discovered support of a given infrequent item (i.e., an item whose support is less than \((p-\varepsilon )n\)). By using the same line of argument for random variable X in Lemma 1, we can show that \(\mathbf E \left[ Y\right] <(p-\varepsilon )s\).
Now, we bound the probability that the given infrequent item is returned. By the design of our algorithm, this event occurs if and only if the discovered support of that item is at least \((p-\varepsilon /2)s\): \(\mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}\right) s\right] \le e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\)
The inequality is established by Chernoff’s bound (Proposition 3). Since there are k items whose support is at least pn, it follows that there are at most \(m-k\) items whose support is less than \((p-\varepsilon )n\) (because there are m items in total). By the union bound, the probability that our algorithm returns at least one of the infrequent items is at most \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\) \(\square \)
Proof
(Lemma 3). From Lemmas 1 and 2, we know that the probability that our algorithm fails to achieve Property P1 (resp. Property P2) of the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}}\) (resp. \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\)). By appealing to the union bound again, the probability that our algorithm fails to achieve the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~. \) We simplify the expression to identify the sample size s:
By referring to Eq. (3), we know that \(e^{-\frac{s\varepsilon ^2}{8p}}> e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\) in this case. Hence \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}} \le me^{-\frac{s\varepsilon ^2}{8p}}~.\)
Again by Eq. (3), in the other case we have \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}} \le me^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\)
Taking the maximum of both cases, we finish the proof. \(\square \)
Proof
(Theorem 1). We use the result of Lemma 3.
In this case, the probability that Support-Sampling fails is at most \(me^{-\frac{s\varepsilon ^2}{8p}}\). Setting this quantity to \(\delta \) and solving for s, we obtain \(s=\frac{8p}{\varepsilon ^2}\ln \frac{m}{\delta }~.\)
Similarly, the probability that Support-Sampling fails in this case is at most \(me^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\). Setting this quantity to \(\delta \) and solving for s, we obtain \(s=\frac{12(p-\varepsilon )}{\varepsilon ^2}\ln \frac{m}{\delta }~.\) Taking the maximum of both cases, we finish the proof. \(\square \)
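To illustrate Theorem 1 numerically, the small helper below (our own, not from the paper) evaluates the resulting sample size; note that s grows only logarithmically in m and does not depend on n at all, which is the source of the abstract's \(O(\log m)\) claim.

```python
import math

def sample_size(p, eps, delta, m):
    """Sample size per Theorem 1: the larger of the two case bounds."""
    s1 = 8 * p / eps**2 * math.log(m / delta)
    s2 = 12 * (p - eps) / eps**2 * math.log(m / delta)
    return math.ceil(max(s1, s2))

# s grows logarithmically in m; a thousandfold increase in m adds only
# a fixed number of extra samples.
for m in (10**3, 10**6, 10**9):
    print(m, sample_size(p=0.1, eps=0.05, delta=0.05, m=m))
```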
Proof
(Lemma 4). For \(i=1,2,\dots ,s\), let \(X_i\) and \(Z_i\) be independent Bernoulli random variables with means \(p_h\) and \(p_\ell /p_h\), respectively. Let \(Y_i:=X_iZ_i\), so that \(Y_i\le X_i\) with probability 1 and \(Y_i\) is a Bernoulli random variable with mean \(p_\ell \). Now, notice that \(X=\sum _{i=1}^{s}X_i\) and \(Y=\sum _{i=1}^{s}Y_i\). Since \(Y_i\le X_i\) for every i, it follows that \(Y\le X\) with probability 1. Therefore, the event \(Y\ge c\) implies the event \(X\ge c\), and consequently \(\mathbf{Pr }\left[ Y\ge c\right] \le \mathbf{Pr }\left[ X\ge c\right] \). \(\square \)
Proof
(Proposition 1). Let Y be the discovered support of a given potentially-frequent item \(t\) with \(\text {support}(t)=pn-r\), where \(\varepsilon n/2<r\le \varepsilon n\). Then, using the same argument as in Lemma 1, we can show that \(\mathbf E \left[ Y\right] =\left( \frac{pn-r}{n}\right) s=\left( p-\frac{r}{n}\right) s.\) Hence, the probability that item \(t\) is returned by Support-Sampling is given by \(\mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}\right) s\right] = \mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}+\frac{r}{n}-\frac{r}{n}\right) s\right] = \mathbf{Pr }\left[ Y\ge \left( p-\frac{r}{n}\right) s+\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] = \mathbf{Pr }\left[ Y\ge \mathbf E \left[ Y\right] +\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] .\) Now, since \(r>\frac{\varepsilon n}{2}\), it follows that \(\left( \frac{r}{n}-\frac{\varepsilon }{2}\right) s>0\). Hence, we can apply Hoeffding’s inequality (Proposition 3) to bound this probability: \(\mathbf{Pr }\left[ Y\ge \mathbf E \left[ Y\right] +\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] \le e^{- \frac{2\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) ^2s^2}{s} } = e^{-2s\left( \frac{r}{n}-\frac{\varepsilon }{2}\right) ^2}.\) \(\square \)
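The bound of Proposition 1 is easy to evaluate; the helper below (hypothetical, for illustration) shows how it weakens as \(r/n\) approaches \(\varepsilon /2\), i.e., as the item gets closer to being genuinely frequent.

```python
import math

def return_prob_bound(s, r_over_n, eps):
    """Upper bound from Proposition 1 on the probability that a
    potentially-frequent item with support pn - r is returned."""
    return math.exp(-2 * s * (r_over_n - eps / 2) ** 2)

# The closer r/n is to eps/2 (the item is nearly frequent), the weaker
# the bound; at r/n = eps it equals e^{-s * eps^2 / 2}.
s, eps = 5000, 0.05
for r_over_n in (0.03, 0.04, 0.05):
    print(r_over_n, return_prob_bound(s, r_over_n, eps))
```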
Proof
(Proposition 2). The proof is similar to that of Proposition 1 and is omitted due to the space constraint. \(\square \)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Lin, J.W.-H., Wong, R.C.-W. (2019). Frequent Item Mining When Obtaining Support Is Costly. In: Ordonez, C., Song, I.-Y., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_4
Print ISBN: 978-3-030-27519-8
Online ISBN: 978-3-030-27520-4