Frequent Item Mining When Obtaining Support Is Costly

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11708)


Abstract

Suppose there are n users and m items, and each user's preference for the items is revealed only upon probing, which takes time and is therefore costly. How can we quickly discover all the frequent items, that is, those favored individually by at least a given number of users? This new problem not only has strong connections with several well-known problems, such as the frequent item mining problem; it also finds applications in fields such as sponsored search and marketing surveys. Unlike traditional frequent item mining, however, our problem assumes no prior knowledge of users' preferences, so obtaining the support of an item is costly. Although our problem can be settled naively by probing the preferences of all n users, the number of users is typically enormous, and each probe can itself incur a prohibitive cost. We present a sampling algorithm that drastically reduces the number of users that must be probed to \(O(\log m)\), regardless of the number of users, as long as slight inaccuracy in the output is permitted. For reasonably sized input, our algorithm needs to probe only \(0.5\%\) of the users, whereas the naive approach needs to probe all of them.


Notes

  1. \(\varepsilon\) is the error parameter and \(\delta\) is the confidence parameter. They are fully discussed in Sect. 4.1.

References

  1. About IMDb (2019). http://www.imdb.com/pressroom/about/

  2. IMDb Charts (2019). http://www.imdb.com/chart/top

  3. Netflix datasets (2019). www.netflixprize.com

  4. Yahoo! datasets (2019). https://webscope.sandbox.yahoo.com/

  5. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)

  6. Chakrabarty, D., Zhou, Y., Lukose, R.: Budget constrained bidding in keyword auctions and online knapsack problems. In: Workshop on Sponsored Search Auctions, WWW 2007 (2007)

  7. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)

  8. Dimitropoulos, X., Hurley, P., Kind, A.: Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Comput. Commun. Rev. 38(1), 5 (2008)

  9. Dubhashi, D.P., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, New York (2009)

  10. Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: CrowdDB: answering queries with crowdsourcing. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 61–72. ACM (2011)

  11. Garfield, E.: Premature discovery or delayed recognition-why? Current Contents 21, 5–10 (1980)

  12. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)

  13. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29, 1–12 (2000)

  14. Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963)

  15. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4), 11 (2008)

  16. Kessler Faulkner, T., Brackenbury, W., Lall, A.: k-regret queries with nonlinear utilities. Proc. VLDB Endow. 8(13), 2098–2109 (2015)

  17. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346–357. VLDB Endowment (2002)

  18. Peng, P., Wong, R.C.-W.: k-hit query: top-k query with probabilistic utility function. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)

  19. Shanbhag, A., Pirk, H., Madden, S.: Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1557–1570. ACM (2018)

  20. Solanki, S.K., Patel, J.T.: A survey on association rule mining. In: 2015 Fifth International Conference on Advanced Computing & Communication Technologies, pp. 212–216. IEEE (2015)

  21. Thompson, S.K.: Sample size for estimating multinomial proportions. Am. Stat. 41(1), 42–46 (1987)

  22. Thompson, S.K.: Sampling. Wiley, Hoboken (2012)

  23. Walpole, R.E., Myers, R.H., Myers, S.L., Ye, K.: Probability and Statistics for Engineers and Scientists. Macmillan, New York (2011)

  24. Zhang, W., Zhang, Y., Gao, B., Yu, Y., Yuan, X., Liu, T.-Y.: Joint optimization of bid and budget allocation in sponsored search. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1177–1185. ACM (2012)

  25. Zhang, Z., Jin, C., Kang, Q.: Reverse k-ranks query. Proc. VLDB Endow. 7, 785–796 (2014)


Acknowledgement

This research is supported by HKRGC GRF 14205117.

Author information

Correspondence to Joe Wing-Ho Lin.


Appendices

A Appendix: Probabilistic Inequalities

We outline several probabilistic inequalities that will be employed in our proofs.

Proposition 3

(Chernoff’s bound [9] and Hoeffding’s inequality [14]). Let \(X_1,X_2,\dots,X_s\) be independent Bernoulli random variables with \(\mathbf{Pr}\left[X_i=1\right]=p\), where \(0<p<1\), for \(i=1,2,\dots,s\). Let \(X=\sum_{i=1}^{s}X_i\) and \(\mathbf{E}\left[X\right]=\mu\), with \(\mu_\ell\le\mu\le\mu_h\) for some \(\mu_\ell,\mu_h\in\mathbb{R}\). Then, for any \(\varepsilon<1\),

$$\mathbf{Pr}\left[X\ge(1+\varepsilon)\mu_h\right]\le e^{-\frac{\mu_h\varepsilon^2}{3}}, \qquad \mathbf{Pr}\left[X\le(1-\varepsilon)\mu_\ell\right]\le e^{-\frac{\mu_\ell\varepsilon^2}{2}},$$
(1)
$$\mathbf{Pr}\left[X\ge\mu+\varepsilon\right]\le e^{-\frac{2\varepsilon^2}{s}}, \qquad \mathbf{Pr}\left[X\le\mu-\varepsilon\right]\le e^{-\frac{2\varepsilon^2}{s}}~.$$
(2)
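
As a quick numerical sanity check (our own sketch; the parameter values below are illustrative assumptions, not from the paper), the tail probabilities in Eqs. (1) and (2) can be estimated by Monte Carlo and compared against the stated bounds:

```python
import math
import random

# Illustrative parameters (assumptions): s Bernoulli(p) trials.
s, p, eps = 1000, 0.1, 0.5
mu = p * s          # here mu_l = mu_h = mu, since every X_i has mean p
trials = 5000

upper = lower = 0
for _ in range(trials):
    x = sum(random.random() < p for _ in range(s))  # X = sum of the X_i
    upper += x >= (1 + eps) * mu
    lower += x <= (1 - eps) * mu

# Chernoff bounds, Eq. (1)
print("upper tail: empirical %.2e <= bound %.2e"
      % (upper / trials, math.exp(-mu * eps**2 / 3)))
print("lower tail: empirical %.2e <= bound %.2e"
      % (lower / trials, math.exp(-mu * eps**2 / 2)))

# Hoeffding, Eq. (2), with additive deviation t (here we pick t = eps * mu)
t = eps * mu
dev = sum(abs(sum(random.random() < p for _ in range(s)) - mu) >= t
          for _ in range(trials)) / trials
print("two-sided: empirical %.2e <= bound %.2e"
      % (dev, 2 * math.exp(-2 * t**2 / s)))
```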

B Appendix: Proofs

Proof

(Lemma 1). Let X be the discovered support of a given frequent item \(t\) (i.e., an item whose support is at least pn). For \(i=1,2,\dots,s\), let \(X_i\) be the indicator random variable such that \(X_i=1\) if the ith sampled user favors item \(t\), and \(X_i=0\) otherwise. We first note that \(X=\sum_{i=1}^{s}X_i\). In addition, since our algorithm samples users independently and uniformly at random, we have \(\mathbf{E}\left[X_i\right]=\mathbf{Pr}\left[X_i=1\right]=\text{support}(t)/n\ge p\) for each i. So, we conclude that \(\mathbf{E}\left[X\right]=\mathbf{E}\left[\sum_{i=1}^{s}X_i\right]=\sum_{i=1}^{s}\mathbf{E}\left[X_i\right]\ge ps.\)

Now, we bound the probability that item t is not returned. According to our algorithm, this event occurs if and only if the discovered support of item \(t\) is less than \((p-\varepsilon /2)s\):

$$\mathbf{Pr}\left[X<\left(p-\frac{\varepsilon}{2}\right)s\right] = \mathbf{Pr}\left[X<\left(1-\frac{\varepsilon}{2p}\right)ps\right] \le e^{-\frac{ps\left(\frac{\varepsilon}{2p}\right)^2}{2}} = e^{-\frac{s\varepsilon^2}{8p}}~.$$

The inequality above is established by Chernoff’s bound (Proposition 3). Now, without loss of generality, suppose that there are k frequent items in the item set \(\mathcal{T}\), where \(1\le k\le m\) is unknown. As will be clear shortly, the value of k is irrelevant to the analysis of our algorithm. By the union bound (a.k.a. Boole’s inequality), therefore, the probability that at least one of the k frequent items is not returned by our algorithm is bounded above by \(ke^{-\frac{s\varepsilon^2}{8p}}\).    \(\square\)
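
To see Lemma 1’s guarantee concretely, here is a hypothetical Python simulation (ours, not the paper’s; all parameter values are illustrative): it plants a single frequent item with support exactly pn, repeatedly draws s users uniformly at random, and compares the empirical probability of missing the item against the bound \(ke^{-\frac{s\varepsilon^2}{8p}}\) with \(k=1\).

```python
import math
import random

# Illustrative parameters (assumptions, not from the paper).
n, p, eps = 100_000, 0.10, 0.03
s = 2_000                       # number of probed users
support = int(p * n)            # frequent item with support exactly pn
threshold = (p - eps / 2) * s   # the algorithm's return threshold

trials, missed = 5_000, 0
for _ in range(trials):
    # Discovered support X: how many sampled users favor the item.
    x = sum(random.randrange(n) < support for _ in range(s))
    missed += x < threshold     # the frequent item is wrongly dropped

print("empirical miss rate :", missed / trials)
print("Lemma 1 bound (k=1) :", math.exp(-s * eps**2 / (8 * p)))
```

The bound is loose by design; with these numbers it evaluates to about 0.11, while the empirical miss rate typically comes out near 0.01.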

Proof

(Lemma 2). Let Y be the discovered support of a given infrequent item (i.e., an item whose support is less than \((p-\varepsilon)n\)). By the same line of argument as for the random variable X in Lemma 1, we can show that \(\mathbf{E}\left[Y\right]<(p-\varepsilon)s\).

Now, we bound the probability that the given infrequent item is returned. By the design of our algorithm, this event occurs if and only if the discovered support of that item is at least \((p-\varepsilon /2)s\):

$$\mathbf{Pr}\left[Y\ge\left(p-\frac{\varepsilon}{2}\right)s\right] = \mathbf{Pr}\left[Y\ge\left(1+\frac{\varepsilon}{2(p-\varepsilon)}\right)(p-\varepsilon)s\right] \le e^{-\frac{(p-\varepsilon)s\left[\frac{\varepsilon}{2(p-\varepsilon)}\right]^2}{3}} = e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}}~.$$

The inequality is established by Chernoff’s bound (Proposition 3). Since there are k items whose support is at least pn, it follows that there are at most \(m-k\) items whose support is less than \((p-\varepsilon )n\) (because there are m items). By the union bound, the probability that our algorithm returns at least one of the infrequent items is at most \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\)    \(\square \)

Proof

(Lemma 3). From Lemmas 1 and 2, we know that the probability that our algorithm fails to achieve Property P1 (resp. Property P2) of the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}}\) (resp. \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\)). By appealing to the union bound again, the probability that our algorithm fails to achieve the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~. \) We simplify the expression to identify the sample size s:

$$e^{-\frac{s\varepsilon^2}{8p}} \le e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}} \iff 12(p-\varepsilon)\ge 8p \iff p\ge 3\varepsilon~.$$
(3)

By referring to Eq. (3), we know that in the case \(p<3\varepsilon\), we have \(e^{-\frac{s\varepsilon^2}{8p}}> e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}}\). Hence

$$ke^{-\frac{s\varepsilon^2}{8p}} + (m-k)e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}} < ke^{-\frac{s\varepsilon^2}{8p}} + (m-k)e^{-\frac{s\varepsilon^2}{8p}} = me^{-\frac{s\varepsilon^2}{8p}}.$$

Again by Eq. (3), in the complementary case \(p\ge 3\varepsilon\) we have

$$ke^{-\frac{s\varepsilon^2}{8p}} + (m-k)e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}} \le ke^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}} + (m-k)e^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}} = me^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}}.$$

Taking the maximum of both cases, we finish the proof.   \(\square \)

Proof

(Theorem 1). We use the result of Lemma 3.

When \(p<3\varepsilon\), the probability that Support-Sampling fails is at most \(me^{-\frac{s\varepsilon^2}{8p}}\). By setting this quantity to \(\delta\) and solving for s, we have \(s=\frac{8p}{\varepsilon^2}\ln\frac{m}{\delta}~.\)

Similarly, when \(p\ge 3\varepsilon\), the probability that Support-Sampling fails is at most \(me^{-\frac{s\varepsilon^2}{12(p-\varepsilon)}}\). By setting this quantity to \(\delta\) and solving for s, we have \(s=\frac{12(p-\varepsilon)}{\varepsilon^2}\ln\frac{m}{\delta}~.\) Taking the maximum of both cases, we finish the proof.   \(\square\)
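
The following Python sketch (our reconstruction under the stated assumptions; it mirrors the sampling procedure analyzed in the proofs rather than reproducing the paper’s pseudocode) computes the Theorem 1 sample size and returns every item whose discovered support among the s probed users is at least \((p-\varepsilon/2)s\). The probe function is a hypothetical stand-in for the costly preference oracle.

```python
import math
import random

def sample_size(m: int, p: float, eps: float, delta: float) -> int:
    """Theorem 1: s = (max(8p, 12(p - eps)) / eps^2) * ln(m / delta)."""
    return math.ceil(max(8 * p, 12 * (p - eps)) / eps**2
                     * math.log(m / delta))

def support_sampling(n, items, probe, p, eps, delta):
    """Return all items whose discovered support is >= (p - eps/2) * s.

    `probe(u)` is a hypothetical, costly oracle returning the set of
    items favored by user u; it is called once per sampled user.
    """
    s = sample_size(len(items), p, eps, delta)
    counts = {t: 0 for t in items}
    for _ in range(s):                 # sample users uniformly at random,
        u = random.randrange(n)        # with replacement
        for t in probe(u):             # one costly probe per sampled user
            if t in counts:
                counts[t] += 1
    threshold = (p - eps / 2) * s
    return [t for t in items if counts[t] >= threshold]
```

For instance, with \(m=10^4\), \(p=0.1\), \(\varepsilon=0.03\), and \(\delta=0.01\), sample_size evaluates to roughly \(1.3\times 10^4\) probes, independent of n, which is the \(O(\log m)\) behavior (for fixed p, \(\varepsilon\), \(\delta\)) that the abstract refers to.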

Proof

(Lemma 4). For \(i=1,2,\dots,s\), let \(X_i\) and \(Z_i\) be independent Bernoulli random variables with means \(p_h\) and \(p_\ell/p_h\), respectively. Let \(Y_i:=X_iZ_i\), so that \(Y_i\le X_i\) with probability 1 and \(Y_i\) is a Bernoulli random variable with mean \(p_\ell\). Now, notice that \(X=\sum_{i=1}^{s}X_i\) and \(Y=\sum_{i=1}^{s}Y_i\). Also, since \(Y_i\le X_i\), it follows that \(Y\le X\) with probability 1. Therefore, the event \(Y\ge c\) implies the event \(X\ge c\). Consequently, we have \(\mathbf{Pr}\left[Y\ge c\right]\le \mathbf{Pr}\left[X\ge c\right]\).    \(\square\)

Proof

(Proposition 1). Let Y be the discovered support of a given potentially-frequent item \(t\) with \(\text{support}(t)=pn-r\), where \(\varepsilon n/2<r\le \varepsilon n\). Then, using the same argument as in Lemma 1, we can show that \(\mathbf{E}\left[Y\right]=\left(\frac{pn-r}{n}\right)s=\left(p-\frac{r}{n}\right)s\). Hence, the probability that item \(t\) is returned by Support-Sampling is

$$\mathbf{Pr}\left[Y\ge\left(p-\frac{\varepsilon}{2}\right)s\right] = \mathbf{Pr}\left[Y\ge\left(p-\frac{r}{n}\right)s+\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)s\right] = \mathbf{Pr}\left[Y\ge \mathbf{E}\left[Y\right]+\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)s\right].$$

Now, since \(r>\frac{\varepsilon n}{2}\), it follows that \(\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)s>0\). Hence, we can apply Hoeffding’s inequality (Proposition 3) to bound this probability:

$$\mathbf{Pr}\left[Y\ge \mathbf{E}\left[Y\right]+\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)s\right] \le e^{-\frac{2\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)^2 s^2}{s}} = e^{-2s\left(\frac{r}{n}-\frac{\varepsilon}{2}\right)^2}.$$

\(\square\)
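
As a concrete illustration of Proposition 1 (our numbers; all values are assumptions chosen to satisfy \(\varepsilon n/2<r\le \varepsilon n\)), the bound can be evaluated directly:

```python
import math

# Illustrative values: a potentially-frequent item with support pn - r.
n, p, eps, s = 100_000, 0.10, 0.03, 13_000
r = int(0.8 * eps * n)   # r = 2400, inside (eps*n/2, eps*n]

bound = math.exp(-2 * s * (r / n - eps / 2) ** 2)
print("Pr[item is returned] <=", bound)   # about 0.12
```

Items whose support falls just below the frequent threshold may thus still be returned with non-negligible probability, which is consistent with the \(\varepsilon\)-approximation guarantee: only items with support below \((p-\varepsilon)n\) are guaranteed to be filtered out (Lemma 2).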

Proof

(Proposition 2). The proof is similar to that of Proposition 1 and is omitted due to space constraints.   \(\square\)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Lin, J.W.-H., Wong, R.C.-W. (2019). Frequent Item Mining When Obtaining Support Is Costly. In: Ordonez, C., Song, I.-Y., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_4


  • DOI: https://doi.org/10.1007/978-3-030-27520-4_4


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27519-8

  • Online ISBN: 978-3-030-27520-4
