Abstract
Suppose there are n users and m items, and the preference of each user for the items is revealed only upon probing, which takes time and is therefore costly. How can we quickly discover all the frequent items that are favored individually by at least a given number of users? This new problem not only has strong connections with several well-known problems, such as the frequent item mining problem, but it also finds applications in fields such as sponsored search and marketing surveys. Unlike traditional frequent item mining, however, our problem assumes no prior knowledge of users’ preferences, and thus obtaining the support of an item becomes costly. Although our problem can be settled naively by probing the preferences of all n users, the number of users is typically enormous, and each probing itself can incur a prohibitive cost. We present a sampling algorithm that drastically reduces the number of users that need to be probed to \(O(\log m)\), regardless of the number of users, as long as slight inaccuracy in the output is permitted. For reasonably sized input, our algorithm needs to probe only \(0.5\%\) of the users, whereas the naive approach needs to probe all of them.
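As a rough illustration of the idea, the sketch below is our own reconstruction from the analysis in the appendix, not the paper's listing: it draws the sample size suggested by Theorem 1, probes only the sampled users through a hypothetical `probe` callback, and keeps the items whose sampled support clears the \((p-\varepsilon /2)s\) threshold used in the proofs.

```python
import math
import random
from collections import Counter

def support_sampling(users, probe, p, eps, delta, m):
    """Hedged sketch of a support-sampling scheme (reconstruction, not
    the paper's code). `probe(user)` returns the set of items that user
    favors; `p` is the support threshold fraction, `eps`/`delta` the
    error/confidence parameters, and `m` the number of items."""
    # Sample size as in Theorem 1: the larger of the two case bounds.
    s = math.ceil(max(8 * p, 12 * (p - eps)) / eps**2 * math.log(m / delta))
    s = min(s, len(users))
    sample = random.sample(users, s)
    counts = Counter()
    for u in sample:
        counts.update(probe(u))   # the costly probe, done only s times
    # Keep items whose sampled support clears the (p - eps/2) threshold.
    return {t for t, c in counts.items() if c >= (p - eps / 2) * s}
```

With 1000 synthetic users, the sketch probes only the sampled few hundred yet still separates an item favored by 60% of users from one favored by 5%.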
Notes
- 1.
\(\varepsilon \) is the error parameter and \(\delta \) is the confidence parameter. They are fully discussed in Sect. 4.1.
References
About IMDb (2019). http://www.imdb.com/pressroom/about/
IMDb Charts (2019). http://www.imdb.com/chart/top
Netflix datasets (2019). www.netflixprize.com
Yahoo! datasets (2019). https://webscope.sandbox.yahoo.com/
Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of 20th International Conference on Very Large Data Bases, VLDB, vol. 1215, pp. 487–499 (1994)
Chakrabarty, D., Zhou, Y., Lukose, R.: Budget constrained bidding in keyword auctions and online knapsack problems. In: Workshop on Sponsored Search Auctions, WWW 2007 (2007)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB Endow. 1(2), 1530–1541 (2008)
Dimitropoulos, X., Hurley, P., Kind, A.: Probabilistic lossy counting: an efficient algorithm for finding heavy hitters. ACM SIGCOMM Comput. Commun. Rev. 38(1), 5 (2008)
Dubhashi, D.P., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, New York (2009)
Franklin, M.J., Kossmann, D., Kraska, T., Ramesh, S., Xin, R.: CrowdDB: answering queries with crowd sourcing. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, pp. 61–72. ACM (2011)
Garfield, E.: Premature discovery or delayed recognition-why? Current Contents 21, 5–10 (1980)
Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM SIGMOD Rec. 29, 1–12 (2000)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58(301), 13–30 (1963)
Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. (CSUR) 40(4), 11 (2008)
Kessler Faulkner, T., Brackenbury, W., Lall, A.: k-regret queries with nonlinear utilities. Proc. VLDB Endow. 8(13), 2098–2109 (2015)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346–357. VLDB Endowment (2002)
Peng, P., Wong, R.C.-W.: k-hit query: top-k query with probabilistic utility function. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)
Shanbhag, A., Pirk, H., Madden, S.: Efficient top-k query processing on massively parallel hardware. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1557–1570. ACM (2018)
Solanki, S.K., Patel, J.T.: A survey on association rule mining. In: 2015 Fifth International Conference on Advanced Computing & Communication Technologies, pp. 212–216. IEEE (2015)
Thompson, S.K.: Sample size for estimating multinomial proportions. Am. Stat. 41(1), 42–46 (1987)
Thompson, S.K.: Sampling. Wiley, Hoboken (2012)
Walpole, R.E., Myers, R.H., Myers, S.L., Ye, K.: Probability and Statistics for Engineers and Scientists. Macmillan, New York (2011)
Zhang, W., Zhang, Y., Gao, B., Yu, Y., Yuan, X., Liu, T.-Y.: Joint optimization of bid and budget allocation in sponsored search. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1177–1185. ACM (2012)
Zhang, Z., Jin, C., Kang, Q.: Reverse k-ranks query. Proc. VLDB Endow. 7, 785–796 (2014)
Acknowledgement
The research is supported by HKRGC GRF 14205117.
Appendices
A Appendix: Probabilistic Inequalities
We outline several probabilistic inequalities that will be employed in our proofs.
Proposition 3
(Chernoff’s bound [9] and Hoeffding’s inequality [14]). Let \(X_1,X_2,\dots ,X_s\) be independent Bernoulli variables with \(\mathbf{Pr }\left[ X_i=1\right] =p\), where \(0<p<1\), for \(i=1,2,\dots ,s\). Let \(X=\sum _{i=1}^{s}X_i\) and \(\mathbf E \left[ X\right] =\mu \) such that \(\mu _\ell \le \mu \le \mu _h\), where \(\mu _\ell ,\mu _h\in \mathbb {R}\). Then, for any \(\varepsilon <1\), \(\mathbf{Pr }\left[ X\le (1-\varepsilon )\mu _\ell \right] \le e^{-\frac{\mu _\ell \varepsilon ^2}{2}} \quad \text {and}\quad \mathbf{Pr }\left[ X\ge (1+\varepsilon )\mu _h\right] \le e^{-\frac{\mu _h\varepsilon ^2}{3}}~.\) Moreover, for any \(t>0\), \(\mathbf{Pr }\left[ X\ge \mathbf E \left[ X\right] +t\right] \le e^{-\frac{2t^2}{s}}~.\)
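Since these inequalities are standard, a quick simulation can sanity-check them; the sketch below assumes the lower-tail Chernoff form \(\mathbf{Pr }\left[ X\le (1-\varepsilon )\mu \right] \le e^{-\mu \varepsilon ^2/2}\) and compares it against the empirical tail frequency of a Bernoulli sum.

```python
import math
import random

random.seed(1)
s, p, eps = 2000, 0.3, 0.2
mu = p * s                      # E[X] for X = sum of s Bernoulli(p) variables
trials = 1000

# Empirically estimate the lower-tail probability Pr[X <= (1 - eps) * mu].
hits = sum(
    sum(random.random() < p for _ in range(s)) <= (1 - eps) * mu
    for _ in range(trials)
)
empirical = hits / trials
chernoff = math.exp(-mu * eps ** 2 / 2)  # assumed lower-tail bound

# The empirical frequency should never exceed the theoretical bound.
print(empirical, chernoff)
```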
B Appendix: Proofs
Proof
(Lemma 1). Let X be the discovered support of a given frequent item \(t\) (i.e., an item whose support is at least pn). For \(i=1,2,\dots ,s\), let \(X_i\) be the indicator random variable such that \(X_i=1\) if the ith sampled user favors item \(t\), and \(X_i=0\) otherwise. We first note that \(X=\sum _{i=1}^{s}X_i\). In addition, since our algorithm samples users independently and uniformly at random, we have \(\mathbf E \left[ X_i\right] =\mathbf{Pr }\left[ X_i=1\right] =\frac{\text {support}(t)}{n}\ge \frac{pn}{n}=p~.\) So, we conclude that \(\mathbf E \left[ X\right] =\mathbf E \left[ \sum _{i=1}^{s}X_i\right] =\sum _{i=1}^{s}\mathbf E \left[ X_i\right] \ge ps.\)
Now, we bound the probability that item t is not returned. According to our algorithm, this event occurs if and only if the discovered support of item \(t\) is less than \((p-\varepsilon /2)s\): \(\mathbf{Pr }\left[ X<\left( p-\frac{\varepsilon }{2}\right) s\right] \le e^{-\frac{s\varepsilon ^2}{8p}}~.\)
The inequality above is established by Chernoff’s bound (Proposition 3). Now, without loss of generality, suppose that there are k frequent items in the item set \(\mathcal {T}\), where \(1\le k\le m\) is unknown. As will become clear shortly, the value of k is irrelevant to the analysis of our algorithm. By the union bound (a.k.a. Boole’s inequality), the probability that at least one of the k frequent items is not returned by our algorithm is therefore bounded above by \(ke^{-\frac{s\varepsilon ^2}{8p}}~.\) \(\square \)
Proof
(Lemma 2). Now, let Y be the discovered support of a given infrequent item (i.e., an item whose support is less than \((p-\varepsilon )n\)). By using the same line of argument for random variable X in Lemma 1, we can show that \(\mathbf E \left[ Y\right] <(p-\varepsilon )s\).
Now, we bound the probability that the given infrequent item is returned. By the design of our algorithm, this event occurs if and only if the discovered support of that item is at least \((p-\varepsilon /2)s\): \(\mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}\right) s\right] \le e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\)
The inequality is established by Chernoff’s bound (Proposition 3). Since there are k items whose support is at least pn, it follows that there are at most \(m-k\) items whose support is less than \((p-\varepsilon )n\) (because there are m items in total). By the union bound, the probability that our algorithm returns at least one of the infrequent items is at most \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\) \(\square \)
Proof
(Lemma 3). From Lemmas 1 and 2, we know that the probability that our algorithm fails to achieve Property P1 (resp. Property P2) of the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}}\) (resp. \((m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\)). By appealing to the union bound again, the probability that our algorithm fails to achieve the \(\varepsilon \)-approximation guarantee is at most \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~. \) We simplify the expression to identify the sample size s:
By referring to Eq. (3), we know that \(e^{-\frac{s\varepsilon ^2}{8p}}> e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\) in this case. Hence \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}} \le me^{-\frac{s\varepsilon ^2}{8p}}~.\)
Again by Eq. (3), in the other case we have \(ke^{-\frac{s\varepsilon ^2}{8p}} + (m-k)e^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}} \le me^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}~.\)
Taking the maximum of both cases, we finish the proof. \(\square \)
Proof
(Theorem 1). We use the result of Lemma 3.
In this case, the probability that Support-Sampling fails is at most \(me^{-\frac{s\varepsilon ^2}{8p}}\). Setting this quantity to \(\delta \) and solving for s, we obtain \(s=\frac{8p}{\varepsilon ^2}\ln \frac{m}{\delta }~.\)
Similarly, the probability that Support-Sampling fails in this case is at most \(me^{-\frac{s\varepsilon ^2}{12(p-\varepsilon )}}\). Setting this quantity to \(\delta \) and solving for s, we obtain \(s=\frac{12(p-\varepsilon )}{\varepsilon ^2}\ln \frac{m}{\delta }~.\) Taking the maximum of both cases, we finish the proof. \(\square \)
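To illustrate Theorem 1 numerically, the small helper below (our own, not from the paper) evaluates the resulting sample size; note that s grows only logarithmically in m and does not depend on n at all, which is the source of the abstract's \(O(\log m)\) claim.

```python
import math

def sample_size(p, eps, delta, m):
    """Sample size per Theorem 1: the larger of the two case bounds."""
    s1 = 8 * p / eps**2 * math.log(m / delta)
    s2 = 12 * (p - eps) / eps**2 * math.log(m / delta)
    return math.ceil(max(s1, s2))

# s grows logarithmically in m; a thousandfold increase in m adds only
# a fixed number of extra samples.
for m in (10**3, 10**6, 10**9):
    print(m, sample_size(p=0.1, eps=0.05, delta=0.05, m=m))
```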
Proof
(Lemma 4). For \(i=1,2,\dots ,s\), let \(X_i\) and \(Z_i\) be independent Bernoulli random variables with means \(p_h\) and \(p_\ell /p_h\), respectively. Let \(Y_i:=X_iZ_i\), so that \(Y_i\le X_i\) with probability 1 and \(Y_i\) is a Bernoulli random variable with mean \(p_\ell \). Now, notice that \(X=\sum _{i=1}^{s}X_i\) and \(Y=\sum _{i=1}^{s}Y_i\). Since \(Y_i\le X_i\) for every i, it follows that \(Y\le X\) with probability 1. Therefore, the event \(Y\ge c\) implies the event \(X\ge c\), and consequently \(\mathbf{Pr }\left[ Y\ge c\right] \le \mathbf{Pr }\left[ X\ge c\right] \). \(\square \)
Proof
(Proposition 1). Let Y be the discovered support of a given potentially-frequent item \(t\) with \(\text {support}(t)=pn-r\), where \(\varepsilon n/2<r\le \varepsilon n\). Then, using the same argument as in Lemma 1, we can show that \(\mathbf E \left[ Y\right] =\left( \frac{pn-r}{n}\right) s=\left( p-\frac{r}{n}\right) s.\) Hence, the probability that item \(t\) is returned by Support-Sampling is given by \(\mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}\right) s\right] = \mathbf{Pr }\left[ Y\ge \left( p-\frac{\varepsilon }{2}+\frac{r}{n}-\frac{r}{n}\right) s\right] = \mathbf{Pr }\left[ Y\ge \left( p-\frac{r}{n}\right) s+\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] = \mathbf{Pr }\left[ Y\ge \mathbf E \left[ Y\right] +\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] .\) Now, since \(r>\frac{\varepsilon n}{2}\), it follows that \(\left( \frac{r}{n}-\frac{\varepsilon }{2}\right) s>0\). Hence, we can apply Hoeffding’s inequality (Proposition 3) to bound this probability: \(\mathbf{Pr }\left[ Y\ge \mathbf E \left[ Y\right] +\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) s\right] \le e^{- \frac{2\left( \frac{r}{n}-\frac{ \varepsilon }{2}\right) ^2s^2}{s} } = e^{-2s\left( \frac{r}{n}-\frac{\varepsilon }{2}\right) ^2}.\) \(\square \)
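The bound of Proposition 1 is easy to evaluate; the helper below (hypothetical, for illustration) shows how it weakens as \(r/n\) approaches \(\varepsilon /2\), i.e., as the item gets closer to being genuinely frequent.

```python
import math

def return_prob_bound(s, r_over_n, eps):
    """Upper bound from Proposition 1 on the probability that a
    potentially-frequent item with support pn - r is returned."""
    return math.exp(-2 * s * (r_over_n - eps / 2) ** 2)

# The closer r/n is to eps/2 (the item is nearly frequent), the weaker
# the bound; at r/n = eps it equals e^{-s * eps^2 / 2}.
s, eps = 5000, 0.05
for r_over_n in (0.03, 0.04, 0.05):
    print(r_over_n, return_prob_bound(s, r_over_n, eps))
```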
Proof
(Proposition 2). The proof is similar to that of Proposition 1 and is omitted due to the space constraint. \(\square \)
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Lin, J.W.-H., Wong, R.C.-W. (2019). Frequent Item Mining When Obtaining Support Is Costly. In: Ordonez, C., Song, I.-Y., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_4
Print ISBN: 978-3-030-27519-8
Online ISBN: 978-3-030-27520-4