
Discrete Stochastic Search and Its Application to Feature-Selection for Deep Relational Machines

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning (ICANN 2019)

Abstract

We use a model for discrete stochastic search in which one or more objects (“targets”) are to be found by a search over n locations (“boxes”), where n is infinitely large. Each box has some probability of containing a target, resulting in a distribution H over boxes. We model the search for the targets as a stochastic procedure that draws boxes using some distribution S. We first derive a general expression for the expected number of misses \(\text {E}[Z]\) made by the search procedure in terms of H and S. We then obtain an expression for an optimal distribution \(S^{*}\) that minimises \(\text {E}[Z]\). This yields a relation between the entropy of H and the KL-divergence between H and \(S^{*}\), which induces a 2-partition of the boxes: those with H-probability greater than \(\frac{1}{n}\), and the rest. We use this result to devise a stochastic search procedure for the practical situation in which H is unknown. We present results from simulations that agree with the theoretical predictions, and demonstrate that the expected number of misses by the optimal seeker decreases as the entropy of H decreases, with the maximum obtained for uniform H. Finally, we demonstrate applications of this stochastic search procedure under a coarse assumption about H. The theoretical results and the procedure are applicable to stochastic search over any aspect of machine learning that involves a discrete search space: for example, choice over features, structures or discretized parameter selection. In this work, the procedure is used to select features for Deep Relational Machines (DRMs), which are Deep Neural Networks (DNNs) defined in terms of domain-specific knowledge and built with features selected from a large, potentially infinite, attribute space. Empirical results obtained across over 70 real-world datasets show that using the stochastic search procedure results in significantly better performance than the state-of-the-art.


Notes

  1. We note that a similar argument is used in [7] to identify possibly good solutions in discrete-event simulations, and is proposed for use in Inductive Logic Programming (ILP) in [17]. Neither explicitly relates this to a distribution model, as is done here.

  2. https://www.cancer.gov/.

References

  1. Abadi, M., Agarwal, A., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/

  2. Ando, H.Y., Dehaspe, L., Luyten, W., Van Craenenbroeck, E., Vandecasteele, H., Van Meervelt, L.: Discovering H-bonding rules in crystals with inductive logic programming. Mol. Pharm. 3(6), 665–674 (2006). https://doi.org/10.1021/mp060034z


  3. Blum, A.: Learning boolean functions in an infinite attribute space. Mach. Learn. 9(4), 373–386 (1992). https://doi.org/10.1007/BF00994112


  4. Chollet, F., et al.: Keras (2015). https://keras.io

  5. Dash, T., Srinivasan, A., Vig, L., Orhobor, O.I., King, R.D.: Large-scale assessment of deep relational machines. In: Riguzzi, F., Bellodi, E., Zese, R. (eds.) ILP 2018. LNCS (LNAI), vol. 11105, pp. 22–37. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99960-9_2


  6. Fog, A.: Sampling methods for Wallenius' and Fisher's noncentral hypergeometric distributions. Commun. Stat. Simul. Comput. 37(2), 241–257 (2008). https://doi.org/10.1080/03610910701790236


  7. Ho, Y.C., Zhao, Q.C., Jia, Q.S.: Ordinal Optimization: Soft Optimization for Hard Problems. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-68692-9


  8. Kelly, F.: On optimal search with unknown detection probabilities. J. Math. Anal. Appl. 88(2), 422–432 (1982)


  9. King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.J.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. U.S.A. 93(1), 438–42 (1996). https://doi.org/10.1073/pnas.93.1.438


  10. King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. 93(1), 438–442 (1996). https://doi.org/10.1073/pnas.93.1.438


  11. Kinga, D., Adam, J.B.: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR), vol. 5 (2015)


  12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. Lidbetter, T., Lin, K.: Searching for multiple objects in multiple locations. arXiv preprint arXiv:1710.05332 (2017)

  14. Lodhi, H.: Deep relational machines. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.) ICONIP 2013. LNCS, vol. 8227, pp. 212–219. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-42042-9_27


  15. Muggleton, S., De Raedt, L.: Inductive logic programming: theory and methods. J. Logic Program. 19, 629–679 (1994). https://doi.org/10.1016/0743-1066(94)90035-3


  16. Ruckle, W.H.: A discrete search game. In: Raghavan, T.E.S., Ferguson, T.S., Parthasarathy, T., Vrieze, O.J. (eds.) Theory and Decision Library, pp. 29–43. Springer, Netherlands (1991). https://doi.org/10.1007/978-94-011-3760-7_4


  17. Srinivasan, A.: A study of two probabilistic methods for searching large spaces with ILP. Technical report PRG-TR-16-00, Oxford University Computing Laboratory, Oxford (2000)


  18. Stone, L.D.: Theory of Optimal Search, vol. 118. Elsevier, Amsterdam (1976)


  19. Subelman, E.J.: A hide-search game. J. Appl. Probab. 18(3), 628–640 (1981). https://doi.org/10.2307/3213317


  20. Van Craenenbroeck, E., Vandecasteele, H., Dehaspe, L.: Dmax's functional group and ring library (2002). https://dtai.cs.kuleuven.be/software/dmax/

  21. Vig, L., Srinivasan, A., Bain, M., Verma, A.: An investigation into the role of domain-knowledge on the use of embeddings. In: Lachiche, N., Vrain, C. (eds.) ILP 2017. LNCS (LNAI), vol. 10759, pp. 169–183. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78090-0_12


  22. Šourek, G., Aschenbrenner, V., Železný, F., Kuželka, O.: Lifted relational neural networks. In: Proceedings of the 2015 International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches (COCO 2015), CEUR Workshop Proceedings, vol. 1583, pp. 52–60. CEUR-WS.org, Aachen, Germany (2015). http://dl.acm.org/citation.cfm?id=2996831.2996838


Acknowledgments

The second author (A.S.) is a Visiting Professorial Fellow, School of CSE, UNSW Sydney. This work is partially supported by DST-SERB grant EMR/2016/002766, Government of India.

Author information

Correspondence to Tirtharaj Dash or Ashwin Srinivasan.


Appendix: Proofs

Proof of Lemma 1

Proof

The ideal case is \(\mathrm {E}[Z]=0\); that is, on average, the search opens the correct box k on its first attempt. Now \(\mathrm {P}(Z=0 \text{ and the ball is in box } k) = h_k s_k = h_k (1-s_k)^0 s_k\). Since the ball can be in any of the n boxes, \(\mathrm {P}(Z=0) = \sum _{k=1}^{n} h_k (1-s_k)^0 s_k\). More generally, for \(Z=j\), the search opens wrong boxes j times before opening the correct one, so \(\mathrm {P}(Z=j) = \sum _{k=1}^{n} h_k (1-s_k)^j s_k\). The expected number of misses can now be computed:

$$\begin{aligned} \begin{aligned} \mathrm {E}[Z]|_{H,S}&= \sum _{j=0}^{\infty }{j\times \mathrm {P}(Z=j)}\\&= \sum _{j=0}^{\infty }{j\sum _{k=1}^{n} h_k (1-s_k)^j s_k} \\ \end{aligned} \end{aligned}$$

Swapping the summations over j and k, and using the identity \(\sum _{j=0}^{\infty }{j x^j} = \frac{x}{(1-x)^2}\) for \(0 \le x < 1\) (here \(x = 1-s_k\)), we get

$$\begin{aligned} \begin{aligned} \mathrm {E}[Z]|_{H,S}&= \sum _{k=1}^{n} h_k s_k \sum _{j=0}^{\infty }{j (1-s_k)^j} \\&= \sum _{k=1}^{n}{h_k s_k \frac{1-s_k}{s_k ^ 2}} \end{aligned} \end{aligned}$$

This simplifies to:

$$\begin{aligned} \mathrm {E}[Z]|_{H,S} = \sum _{k=1}^{n}{\frac{h_k}{s_k}} - 1 \end{aligned}$$
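
As a sanity check of Lemma 1, the closed form can be compared against a simulation. The following sketch is our own illustration (the function names are ours, not from the paper): it draws the box holding the ball from H and then samples boxes from S, with replacement, until the ball is found, counting the misses.

```python
import numpy as np

def expected_misses_closed_form(h, s):
    """Lemma 1: E[Z] = sum_k h_k / s_k - 1."""
    return np.sum(h / s) - 1.0

def expected_misses_simulated(h, s, trials=50_000, seed=0):
    """Monte Carlo estimate of E[Z] under hider distribution h and seeker distribution s."""
    rng = np.random.default_rng(seed)
    n, misses = len(h), 0
    for _ in range(trials):
        target = rng.choice(n, p=h)          # where the ball is hidden
        while rng.choice(n, p=s) != target:  # keep drawing boxes from S until it is found
            misses += 1
    return misses / trials

h = np.array([0.5, 0.3, 0.1, 0.1])
s = np.array([0.4, 0.3, 0.2, 0.1])
print(expected_misses_closed_form(h, s))   # exact value from the closed form
print(expected_misses_simulated(h, s))     # should be close to the exact value
```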

Proof of Lemma 2

Proof

This extends Lemma 1 (Expected Cost of Misses by the Seeker) to the general case of multiple (K) stationary hiders. The number of ways the K hiders can be placed in n boxes, one hider per box, is \(^nP_K\); let \(\mathsf {P}(n,K)\) denote the set of all such permutations. For example, \(\mathsf {P}(3,2) = \{(1,2), (1,3), (2,1), (2,3), (3,1), (3,2)\}\).

The K hiders hide according to a particular choice \(\sigma (i) \in \mathsf {P}(n,K)\) with probability \(h_{\sigma (i)}^{(1)} h_{\sigma (i)}^{(2)} \dots h_{\sigma (i)}^{(K)}\), where \(h_{\sigma (i)}^{(k)}\) denotes the H-probability of the box in the kth position of \(\sigma (i)\). Analogously, a single draw by the seeker finds one of these hiders with probability \(s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\), and misses all of them with probability \(1 - \left( s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\right) \). The probability of j such misses followed by a find is therefore \(\left\{ 1-\left( s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\right) \right\} ^j \left( s_{\sigma (i)}^{(1)} + \dots + s_{\sigma (i)}^{(K)}\right) \). The expected number of misses for this multiple-hider formulation is then given by

$$\begin{aligned} \text {E}[Z] = \sum _{i, \sigma (i) \in \mathsf {P}(n,K)} \prod _{k=1}^{K} h_{\sigma (i)}^{(k)} \sum _{j = 0}^{\infty }{j \left( 1-\sum _{k=1}^{K}s_{\sigma (i)}^{(k)}\right) ^j \sum _{k=1}^{K}s_{\sigma (i)}^{(k)}} \end{aligned}$$

This further simplifies to

$$\begin{aligned} \text {E}[Z]|_{H,S} = \sum _{i, \sigma (i) \in \mathsf {P}(n,K)} \prod _{k=1}^{K} h_{\sigma (i)}^{(k)} \left( \frac{1}{\sum _{k=1}^{K}s_{\sigma (i)}^{(k)}} - 1 \right) \end{aligned}$$
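
For small n and K, the closed form in Lemma 2 can be evaluated directly by enumerating \(\mathsf {P}(n,K)\). The sketch below is our own illustration (the function name is ours); it simply transcribes the final expression, and with \(K=1\) it reduces to the Lemma 1 quantity \(\sum _i h_i/s_i - 1\).

```python
import itertools
import numpy as np

def expected_misses_multi(h, s, K):
    """Lemma 2: sum over ordered K-permutations sigma of boxes of
    prod_k h[sigma_k] * (1 / sum_k s[sigma_k] - 1)."""
    total = 0.0
    for sigma in itertools.permutations(range(len(h)), K):
        prob_hide = np.prod([h[k] for k in sigma])   # probability of this hiding arrangement
        prob_find = sum(s[k] for k in sigma)         # probability one draw finds some hider
        total += prob_hide * (1.0 / prob_find - 1.0)
    return total

h = np.array([0.5, 0.3, 0.1, 0.1])
s = np.sqrt(h) / np.sqrt(h).sum()
print(expected_misses_multi(h, s, K=2))
print(expected_misses_multi(h, s, K=1), np.sum(h / s) - 1.0)  # K=1 matches Lemma 1
```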

Proof of Theorem 1

Proof

The problem can be posed as a constrained optimisation problem in which the objective function that is to be minimized is

$$ f = \sum _{i=1}^{n}{\frac{h_i}{s_i}} - 1 $$

Our objective is to minimize the function f given any hider distribution H. Let \(\mathbf {\nabla } f = \left( \dfrac{\partial f}{\partial s_1}, \dfrac{\partial f}{\partial s_2}, \ldots , \dfrac{\partial f}{\partial s_n}\right) \). In this problem, \( \mathbf {\nabla } f = \left( -\frac{h_1}{s_1^2}, -\frac{h_2}{s_2^2}, \ldots , -\frac{h_n}{s_n^2} \right) \). Computing the second partial derivatives \(\frac{\partial ^2 f}{\partial s_i^2}\), written below as the vector \(\mathbf {\nabla }^2 f\), we get

$$ \mathbf {\nabla }^2 f = \mathbf {\nabla } (\mathbf {\nabla } f) = 2\left( \frac{h_1}{s_1^3}, \frac{h_2}{s_2^3}, \dots , \frac{h_n}{s_n^3}\right) $$

Since \(h_i \ge 0\) and \(s_i > 0\) for all i, these second derivatives are all non-negative. Moreover, the mixed partial derivatives \(\frac{\partial ^2 f}{\partial s_i \partial s_j}\) vanish for \(i \ne j\), so the Hessian of f is diagonal with non-negative entries and hence positive semidefinite. Therefore f is convex on the region where all \(s_i > 0\).
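
As an aside, the convexity claim is easy to spot-check numerically: for random points x, y in the interior of the simplex and random \(t \in [0,1]\), the inequality \(f(tx + (1-t)y) \le t f(x) + (1-t) f(y)\) should hold. The sketch below is our own check, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h, s):
    """Objective of Theorem 1 (up to the additive constant -1)."""
    return np.sum(h / s)

n = 5
h = rng.dirichlet(np.ones(n))
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
    t = rng.uniform()
    assert f(h, t * x + (1 - t) * y) <= t * f(h, x) + (1 - t) * f(h, y) + 1e-9
print("convexity inequality held at all sampled points")
```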

Proof of Theorem 2

Proof

We will write \(\text {E}[Z]|_{H,S}\) as a function of S, i.e. f(S). Our objective is to minimise \(f(S) = \sum _{i=1}^{n}{\frac{h_i}{s_i}}\) subject to the constraint \(\sum _{i=1}^{n}s_i = 1\). Introducing a Lagrange multiplier \(\lambda \), the corresponding unconstrained problem is to find stationary points of

$$ g(S,\lambda ) = \sum _{i=1}^{n}{\frac{h_i}{s_i}} + \lambda \left( 1 - \sum _{i=1}^{n}{s_i} \right) $$

To obtain the optimal values of S and \(\lambda \), we set \(\frac{\partial g}{\partial s_i} = 0\) for \(i = 1,\ldots ,n\), and \(\frac{\partial g}{\partial \lambda } = 0\). This gives \(-\frac{h_i}{s_i^2} - \lambda = 0\) and \(\sum _{i=1}^{n}s_i = 1\). From the first condition, \(\lambda = -\frac{h_i}{s_i^2} < 0\), and so \(s_i = \frac{\sqrt{h_i}}{\sqrt{-\lambda }}\) for all i. Substituting this into \(\sum _{i=1}^{n}s_i = 1\) gives \(\sqrt{-\lambda } = \sum _{j=1}^{n}\sqrt{h_j}\). We thus obtain the desired optimal seeker distribution \(S^*\): \(s^*_i = \frac{\sqrt{h_i}}{\sum _{j=1}^{n}{\sqrt{h_j}}},~\forall i \in \{1,\ldots ,n\}\).
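
A brief numerical illustration of Theorem 2 (our own sketch, not from the paper): for a given H, the seeker \(S^*\) with \(s^*_i \propto \sqrt{h_i}\) yields a lower expected number of misses than a uniform or an arbitrary seeker, and its value matches \(\left( \sum _i \sqrt{h_i}\right) ^2 - 1\) (Corollary 2).

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_misses(h, s):
    return np.sum(h / s) - 1.0               # Lemma 1

h = rng.dirichlet(np.ones(6))                # an arbitrary hider distribution H
s_star = np.sqrt(h) / np.sqrt(h).sum()       # optimal seeker S* (Theorem 2)
s_unif = np.full_like(h, 1.0 / len(h))       # uniform seeker: always gives n - 1
s_rand = rng.dirichlet(np.ones(6))           # some other seeker

print(expected_misses(h, s_star))            # smallest of the three
print(expected_misses(h, s_unif))
print(expected_misses(h, s_rand))
print(np.sqrt(h).sum() ** 2 - 1.0)           # equals the S* value (Corollary 2)
```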

Proof of Corollary 1

Proof

Here H is uniform, i.e. \(h_i = \frac{1}{n}\) for all i. For any seeker distribution S with \(s_i>0\) for all i, we have \(\text {E}[Z]|_{H,S} = \frac{1}{n}\sum _{i=1}^n{\frac{1}{s_i}} - 1 \ge \frac{n}{\sum _{i=1}^n{s_i}} - 1\) by the AM-HM inequality, and the denominator is 1 because S is a distribution; so \(\text {E}[Z]|_{H,S} \ge n-1\), with equality exactly when S is uniform. So \(S^*\) must be the uniform distribution and, in this case, \(\text {E}[Z]|_{H,S^*} = \sum _{i=1}^{n}{\frac{1/n}{1/n}} - 1 = \sum _{i=1}^{n}1 - 1 = n - 1.\)

Proof of Corollary 2

Proof

The proof is as follows:

$$\begin{aligned} \sum _{i=1}^{n}\frac{h_i}{s^*_i} = \sum _{i=1}^{n}\frac{h_i}{\left( \frac{\sqrt{h_i}}{\sum _{j=1}^{n}{\sqrt{h_j}}}\right) } = \sum _{i=1}^{n}\sqrt{h_i}\sum _{j=1}^{n}{\sqrt{h_j}} = \left( \sum _{i=1}^{n}{\sqrt{h_i}}\right) ^2 \end{aligned}$$

Hence \(\text {E}[Z]|_{H,S^*} = \sum _{i=1}^{n}\frac{h_i}{s^*_i} - 1 = \left( \sum _{i=1}^{n}\sqrt{h_i}\right) ^2 - 1\), and the result follows.

Proof of Theorem 3

Proof

The KL-divergence between the two distributions H and \(S^*\) is defined as:

$$\begin{aligned} \text {KLD}(H\Vert S^*)&= \sum _{i=1}^n{h_i\log _2{\frac{h_i}{s^*_i}}} \\&= \sum _{i=1}^n{h_i\log _2{h_i}} - \sum _{i=1}^n{h_i\log _2{\frac{\sqrt{h_i}}{\sum _{j=1}^n {\sqrt{h_j}}}}} ~~\text {(using Theorem}~2)\\&= \frac{1}{2}\sum _{i=1}^n{h_i\log _2{h_i}} + \log _2\left( \sum _{j=1}^n{\sqrt{h_j}}\right) \left( \sum _{i=1}^n h_i\right) \\&= -\frac{1}{2}Entropy(H) + \log _2\left( \sum _{j=1}^n{\sqrt{h_j}}\right) \\&= -\frac{1}{2}Entropy(H) + \log _2\left( \text {E}[Z]|_{H,S^*} + 1\right) ^{\frac{1}{2}} ~~\text {(using Corollary}~2)\\&= \frac{1}{2}\left[ -Entropy(H) + \log _2\left( \text {E}[Z]|_{H,S^*} + 1\right) \right] \end{aligned}$$

Simplifying, we get:

$$ \text {E}[Z]|_{H,S^*} = 2^{2\text {KLD}(H||S^*)+Entropy(H)} - 1 $$
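
The identity in Theorem 3 can also be verified numerically. The sketch below (our own addition) computes Entropy(H), \(\text {KLD}(H\Vert S^*)\) and \(\text {E}[Z]|_{H,S^*}\) for a random H and checks that the relation holds.

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.dirichlet(np.ones(8))                 # hider distribution H
s_star = np.sqrt(h) / np.sqrt(h).sum()        # optimal seeker S* (Theorem 2)

entropy = -np.sum(h * np.log2(h))             # Entropy(H), in bits
kld = np.sum(h * np.log2(h / s_star))         # KLD(H || S*)
ez = np.sum(h / s_star) - 1.0                 # E[Z]|_{H,S*} (Lemma 1)

# Theorem 3: E[Z]|_{H,S*} = 2**(2*KLD + Entropy) - 1
print(ez, 2 ** (2 * kld + entropy) - 1.0)     # the two numbers should agree
```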

Proof of Lemma 3

Proof

The probability that a randomly drawn box is not in the U partition is \((1-p)\). The probability that in a sample of s boxes, none are from the U partition is \((1-p)^s\), and therefore the probability that there is at least 1 box amongst the s from the U partition is \(1 - (1-p)^s\). We want this probability to be at least \(\alpha \). That is:

$$ 1-(1-p)^s \ge \alpha $$

Rearranging to \((1-p)^s \le 1-\alpha \), taking logarithms, and noting that \(\mathrm {log}(1-p) < 0\) (which reverses the inequality), it follows that

$$ s \ge \frac{\mathrm {log}(1-\alpha )}{\mathrm {log}(1-p)} $$
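
In practice, Lemma 3 tells us how many boxes to draw so that, with probability at least \(\alpha \), at least one of them lies in the U partition. A minimal sketch, with a function name of our own choosing:

```python
import math

def sample_size(p, alpha):
    """Smallest integer s with 1 - (1 - p)**s >= alpha (Lemma 3)."""
    return math.ceil(math.log(1.0 - alpha) / math.log(1.0 - p))

# Example: if boxes in the U partition make up p = 1% of all boxes, then about
# 459 draws give a 99% chance of sampling at least one of them.
print(sample_size(p=0.01, alpha=0.99))  # -> 459
```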


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, T., Srinivasan, A., Joshi, R.S., Baskar, A. (2019). Discrete Stochastic Search and Its Application to Feature-Selection for Deep Relational Machines. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning. ICANN 2019. Lecture Notes in Computer Science, vol. 11728. Springer, Cham. https://doi.org/10.1007/978-3-030-30484-3_3


  • DOI: https://doi.org/10.1007/978-3-030-30484-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30483-6

  • Online ISBN: 978-3-030-30484-3

  • eBook Packages: Computer Science, Computer Science (R0)
