
Discrete Stochastic Search and Its Application to Feature-Selection for Deep Relational Machines

  • Conference paper
  • In: Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning (ICANN 2019)

Abstract

We use a model for discrete stochastic search in which one or more objects (“targets”) are to be found by a search over n locations (“boxes”), where n is infinitely large. Each box has some probability of containing a target, resulting in a distribution H over boxes. We model the search for the targets as a stochastic procedure that draws boxes using some distribution S. We first derive a general expression for the expected number of misses \(\text {E}[Z]\) made by the search procedure in terms of H and S. We then obtain an expression for an optimal distribution \(S^{*}\) that minimises \(\text {E}[Z]\). This yields a relation between the entropy of H and the KL-divergence between H and \(S^{*}\), which induces a 2-partition of the boxes: those with H-probability greater than \(\frac{1}{n}\), and the rest. We use this result to devise a stochastic search procedure for the practical situation in which H is unknown. We present results from simulations that agree with the theoretical predictions, and demonstrate that the expected number of misses by the optimal seeker decreases as the entropy of H decreases, with the maximum obtained for uniform H. Finally, we demonstrate applications of this stochastic search procedure under a coarse assumption about H. The theoretical results and the procedure are applicable to stochastic search over any aspect of machine learning that involves a discrete search space: for example, choice over features, structures or discretized parameter selection. In this work, the procedure is used to select features for Deep Relational Machines (DRMs), which are Deep Neural Networks (DNNs) defined in terms of domain-specific knowledge and built with features selected from a large, potentially infinite, attribute space. Empirical results obtained across over 70 real-world datasets show that using the stochastic search procedure results in significantly better performance than the state-of-the-art.


Notes

  1. We note that a similar argument is used in [7] to identify possibly good solutions in discrete-event simulations, and is proposed for use in Inductive Logic Programming (ILP) in [17]. Neither explicitly relates this to a distribution model, as is done here.

  2. https://www.cancer.gov/.

References

  1. Abadi, M., Agarwal, A., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/

  2. Ando, H.Y., Dehaspe, L., Luyten, W., Van Craenenbroeck, E., Vandecasteele, H., Van Meervelt, L.: Discovering H-bonding rules in crystals with inductive logic programming. Mol. Pharm. 3(6), 665–674 (2006). https://doi.org/10.1021/mp060034z


  3. Blum, A.: Learning boolean functions in an infinite attribute space. Mach. Learn. 9(4), 373–386 (1992). https://doi.org/10.1007/BF00994112


  4. Chollet, F., et al.: Keras (2015). https://keras.io

  5. Dash, T., Srinivasan, A., Vig, L., Orhobor, O.I., King, R.D.: Large-scale assessment of deep relational machines. In: Riguzzi, F., Bellodi, E., Zese, R. (eds.) ILP 2018. LNCS (LNAI), vol. 11105, pp. 22–37. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99960-9_2


  6. Fog, A.: Sampling methods for Wallenius' and Fisher's noncentral hypergeometric distributions. Commun. Stat. Simul. Comput. 37(2), 241–257 (2008). https://doi.org/10.1080/03610910701790236


  7. Ho, Y.C., Zhao, Q.C., Jia, Q.S.: Ordinal Optimization: Soft Optimization for Hard Problems. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-68692-9


  8. Kelly, F.: On optimal search with unknown detection probabilities. J. Math. Anal. Appl. 88(2), 422–432 (1982)


  9. King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.J.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. U.S.A. 93(1), 438–42 (1996). https://doi.org/10.1073/pnas.93.1.438


  10. King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. 93(1), 438–442 (1996). https://doi.org/10.1073/pnas.93.1.438


  11. Kinga, D., Adam, J.B.: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR), vol. 5 (2015)


  12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  13. Lidbetter, T., Lin, K.: Searching for multiple objects in multiple locations. arXiv preprint arXiv:1710.05332 (2017)

  14. Lodhi, H.: Deep relational machines. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.) ICONIP 2013. LNCS, vol. 8227, pp. 212–219. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-42042-9_27


  15. Muggleton, S., De Raedt, L.: Inductive logic programming: theory and methods. J. Logic Program. 19, 629–679 (1994). https://doi.org/10.1016/0743-1066(94)90035-3


  16. Ruckle, W.H.: A discrete search game. In: Raghavan, T.E.S., Ferguson, T.S., Parthasarathy, T., Vrieze, O.J. (eds.) Theory and Decision Library, pp. 29–43. Springer, Netherlands (1991). https://doi.org/10.1007/978-94-011-3760-7_4


  17. Srinivasan, A.: A study of two probabilistic methods for searching large spaces with ILP. Technical report PRG-TR-16-00, Oxford University Computing Laboratory, Oxford (2000)


  18. Stone, L.D.: Theory of Optimal Search, vol. 118. Elsevier, Amsterdam (1976)


  19. Subelman, E.J.: A hide-search game. J. Appl. Probab. 18(3), 628–640 (1981). https://doi.org/10.2307/3213317


  20. Van Craenenbroeck, E., Vandecasteele, H., Dehaspe, L.: Dmax's functional group and ring library (2002). https://dtai.cs.kuleuven.be/software/dmax/

  21. Vig, L., Srinivasan, A., Bain, M., Verma, A.: An investigation into the role of domain-knowledge on the use of embeddings. In: Lachiche, N., Vrain, C. (eds.) ILP 2017. LNCS (LNAI), vol. 10759, pp. 169–183. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78090-0_12


  22. Šourek, G., Aschenbrenner, V., Železný, F., Kuželka, O.: Lifted relational neural networks. In: Proceedings of the 2015 International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches (COCO 2015), CEUR Workshop Proceedings, vol. 1583, pp. 52–60. CEUR-WS.org, Aachen, Germany (2015). http://dl.acm.org/citation.cfm?id=2996831.2996838


Acknowledgments

The second author (A.S.) is a Visiting Professorial Fellow, School of CSE, UNSW Sydney. This work is partially supported by DST-SERB grant EMR/2016/002766, Government of India.

Author information

Correspondence to Tirtharaj Dash or Ashwin Srinivasan.


Appendix: Proofs

Proof of Lemma 1

Proof

The ideal case is \(\mathrm {E}[Z]=0\); that is, on average, the search opens the correct box k on its first attempt. Now \(\mathrm {P}(Z=0 \text{ and the ball is in box } k) = h_k s_k = h_k (1-s_k)^0 s_k\). Since the ball can be in any of the n boxes, \(\mathrm {P}(Z=0) = \sum _{k=1}^{n} h_k (1-s_k)^0 s_k\). More generally, for \(Z=j\), the search opens wrong boxes j times before opening the correct one, so \(\mathrm {P}(Z=j) = \sum _{k=1}^{n} h_k (1-s_k)^j s_k\). The expected number of misses can now be computed:

$$\begin{aligned} \begin{aligned} \mathrm {E}[Z]|_{H,S}&= \sum _{j=0}^{\infty }{j\times \mathrm {P}(Z=j)}\\&= \sum _{j=0}^{\infty }{j\sum _{k=1}^{n} h_k (1-s_k)^j s_k} \\ \end{aligned} \end{aligned}$$

Swapping the summations over j and k, and using the identity \(\sum _{j=0}^{\infty }{j x^j} = \frac{x}{(1-x)^2}\) for \(0 \le x < 1\) (here \(x = 1-s_k\)), we get

$$\begin{aligned} \begin{aligned} \mathrm {E}[Z]|_{H,S}&= \sum _{k=1}^{n} h_k s_k \sum _{j=0}^{\infty }{j (1-s_k)^j} \\&= \sum _{k=1}^{n}{h_k s_k \frac{1-s_k}{s_k ^ 2}} \end{aligned} \end{aligned}$$

This simplifies to:

$$\begin{aligned} \mathrm {E}[Z]|_{H,S} = \sum _{k=1}^{n}{\frac{h_k}{s_k}} - 1 \end{aligned}$$
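
As a sanity check of Lemma 1, the closed form can be compared against a simulation. The following sketch is our own illustration (the function names are ours, not from the paper): it draws the box holding the ball from H and then samples boxes from S, with replacement, until the ball is found, counting the misses.

```python
import numpy as np

def expected_misses_closed_form(h, s):
    """Lemma 1: E[Z] = sum_k h_k / s_k - 1."""
    return np.sum(h / s) - 1.0

def expected_misses_simulated(h, s, trials=50_000, seed=0):
    """Monte Carlo estimate of E[Z] under hider distribution h and seeker distribution s."""
    rng = np.random.default_rng(seed)
    n, misses = len(h), 0
    for _ in range(trials):
        target = rng.choice(n, p=h)          # where the ball is hidden
        while rng.choice(n, p=s) != target:  # keep drawing boxes from S until it is found
            misses += 1
    return misses / trials

h = np.array([0.5, 0.3, 0.1, 0.1])
s = np.array([0.4, 0.3, 0.2, 0.1])
print(expected_misses_closed_form(h, s))   # exact value from the closed form
print(expected_misses_simulated(h, s))     # should be close to the exact value
```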

Proof of Lemma 2

Proof

This extends Lemma 1 (Expected Cost of Misses by the Seeker) to the general case of multiple (K) stationary hiders. The number of ways the K hiders can be placed in n boxes, one hider per box, is \(^nP_K\); let \(\mathsf {P}(n,K)\) denote the set of all such permutations. For example, \(\mathsf {P}(3,2) = \{(1,2), (1,3), (2,1), (2,3), (3,1), (3,2)\}\).

The K hiders hide according to a particular choice \(\sigma (i) \in \mathsf {P}(n,K)\) with probability \(h_{\sigma (i)}^{(1)} h_{\sigma (i)}^{(2)} \dots h_{\sigma (i)}^{(K)}\), where \(h_{\sigma (i)}^{(k)}\) denotes the H-probability of the box in the kth position of \(\sigma (i)\). Analogously, a single draw by the seeker finds one of these hiders with probability \(s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\), and misses all of them with probability \(1 - \left( s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\right) \). The probability of j such misses followed by a find is therefore \(\left\{ 1-\left( s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \dots + s_{\sigma (i)}^{(K)}\right) \right\} ^j \left( s_{\sigma (i)}^{(1)} + \dots + s_{\sigma (i)}^{(K)}\right) \). The expected number of misses for this multiple-hider formulation is then given by

$$\begin{aligned} \text {E}[Z] = \sum _{i, \sigma (i) \in \mathsf {P}(n,K)} \prod _{k=1}^{K} h_{\sigma (i)}^{(k)} \sum _{j = 0}^{\infty }{j \left( 1-\sum _{k=1}^{K}s_{\sigma (i)}^{(k)}\right) ^j \sum _{k=1}^{K}s_{\sigma (i)}^{(k)}} \end{aligned}$$

This further simplifies to

$$\begin{aligned} \text {E}[Z]|_{H,S} = \sum _{i, \sigma (i) \in \mathsf {P}(n,K)} \prod _{k=1}^{K} h_{\sigma (i)}^{(k)} \left( \frac{1}{\sum _{k=1}^{K}s_{\sigma (i)}^{(k)}} - 1 \right) \end{aligned}$$
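
For small n and K, the closed form in Lemma 2 can be evaluated directly by enumerating \(\mathsf {P}(n,K)\). The sketch below is our own illustration (the function name is ours); it simply transcribes the final expression, and with \(K=1\) it reduces to the Lemma 1 quantity \(\sum _i h_i/s_i - 1\).

```python
import itertools
import numpy as np

def expected_misses_multi(h, s, K):
    """Lemma 2: sum over ordered K-permutations sigma of boxes of
    prod_k h[sigma_k] * (1 / sum_k s[sigma_k] - 1)."""
    total = 0.0
    for sigma in itertools.permutations(range(len(h)), K):
        prob_hide = np.prod([h[k] for k in sigma])   # probability of this hiding arrangement
        prob_find = sum(s[k] for k in sigma)         # probability one draw finds some hider
        total += prob_hide * (1.0 / prob_find - 1.0)
    return total

h = np.array([0.5, 0.3, 0.1, 0.1])
s = np.sqrt(h) / np.sqrt(h).sum()
print(expected_misses_multi(h, s, K=2))
print(expected_misses_multi(h, s, K=1), np.sum(h / s) - 1.0)  # K=1 matches Lemma 1
```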

Proof of Theorem 1

Proof

The problem can be posed as a constrained optimisation problem in which the objective function that is to be minimized is

$$ f = \sum _{i=1}^{n}{\frac{h_i}{s_i}} - 1 $$

Our objective is to minimize the function f given any hider distribution H. Let \(\mathbf {\nabla } f = \left( \dfrac{\partial f}{\partial s_1}, \dfrac{\partial f}{\partial s_2}, \ldots , \dfrac{\partial f}{\partial s_n}\right) \). In this problem, \( \mathbf {\nabla } f = \left( -\frac{h_1}{s_1^2}, -\frac{h_2}{s_2^2}, \ldots , -\frac{h_n}{s_n^2} \right) \). Computing the second partial derivatives \(\frac{\partial ^2 f}{\partial s_i^2}\), written below as the vector \(\mathbf {\nabla }^2 f\), we get

$$ \mathbf {\nabla }^2 f = \mathbf {\nabla } (\mathbf {\nabla } f) = 2\left( \frac{h_1}{s_1^3}, \frac{h_2}{s_2^3}, \dots , \frac{h_n}{s_n^3}\right) $$

Since \(h_i \ge 0\) and \(s_i > 0\) for all i, these second derivatives are all non-negative. Moreover, the mixed partial derivatives \(\frac{\partial ^2 f}{\partial s_i \partial s_j}\) vanish for \(i \ne j\), so the Hessian of f is diagonal with non-negative entries and hence positive semidefinite. Therefore f is convex on the region where all \(s_i > 0\).
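
As an aside, the convexity claim is easy to spot-check numerically: for random points x, y in the interior of the simplex and random \(t \in [0,1]\), the inequality \(f(tx + (1-t)y) \le t f(x) + (1-t) f(y)\) should hold. The sketch below is our own check, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(h, s):
    """Objective of Theorem 1 (up to the additive constant -1)."""
    return np.sum(h / s)

n = 5
h = rng.dirichlet(np.ones(n))
for _ in range(1000):
    x, y = rng.dirichlet(np.ones(n)), rng.dirichlet(np.ones(n))
    t = rng.uniform()
    assert f(h, t * x + (1 - t) * y) <= t * f(h, x) + (1 - t) * f(h, y) + 1e-9
print("convexity inequality held at all sampled points")
```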

Proof of Theorem 2

Proof

We will write \(\text {E}[Z]|_{H,S}\) as a function of S, i.e. f(S). Our objective is to minimise \(f(S) = \sum _{i=1}^{n}{\frac{h_i}{s_i}}\) subject to the constraint \(\sum _{i=1}^{n}s_i = 1\). Introducing a Lagrange multiplier \(\lambda \), the corresponding unconstrained problem is to find stationary points of

$$ g(S,\lambda ) = \sum _{i=1}^{n}{\frac{h_i}{s_i}} + \lambda \left( 1 - \sum _{i=1}^{n}{s_i} \right) $$

To obtain the optimal values of S and \(\lambda \), we set \(\frac{\partial g}{\partial s_i} = 0\) for \(i = 1,\ldots ,n\), and \(\frac{\partial g}{\partial \lambda } = 0\). This gives \(-\frac{h_i}{s_i^2} - \lambda = 0\) and \(\sum _{i=1}^{n}s_i = 1\). From the first condition, \(\lambda = -\frac{h_i}{s_i^2} < 0\), and so \(s_i = \frac{\sqrt{h_i}}{\sqrt{-\lambda }}\) for all i. Substituting this into \(\sum _{i=1}^{n}s_i = 1\) gives \(\sqrt{-\lambda } = \sum _{j=1}^{n}\sqrt{h_j}\). We thus obtain the desired optimal seeker distribution \(S^*\): \(s^*_i = \frac{\sqrt{h_i}}{\sum _{j=1}^{n}{\sqrt{h_j}}},~\forall i \in \{1,\ldots ,n\}\).
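
A brief numerical illustration of Theorem 2 (our own sketch, not from the paper): for a given H, the seeker \(S^*\) with \(s^*_i \propto \sqrt{h_i}\) yields a lower expected number of misses than a uniform or an arbitrary seeker, and its value matches \(\left( \sum _i \sqrt{h_i}\right) ^2 - 1\) (Corollary 2).

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_misses(h, s):
    return np.sum(h / s) - 1.0               # Lemma 1

h = rng.dirichlet(np.ones(6))                # an arbitrary hider distribution H
s_star = np.sqrt(h) / np.sqrt(h).sum()       # optimal seeker S* (Theorem 2)
s_unif = np.full_like(h, 1.0 / len(h))       # uniform seeker: always gives n - 1
s_rand = rng.dirichlet(np.ones(6))           # some other seeker

print(expected_misses(h, s_star))            # smallest of the three
print(expected_misses(h, s_unif))
print(expected_misses(h, s_rand))
print(np.sqrt(h).sum() ** 2 - 1.0)           # equals the S* value (Corollary 2)
```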

Proof of Corollary 1

Proof

Here H is uniform, i.e. \(h_i = \frac{1}{n}\) for all i. For any seeker distribution S with \(s_i>0\) for all i, we have \(\text {E}[Z]|_{H,S} = \frac{1}{n}\sum _{i=1}^n{\frac{1}{s_i}} - 1 \ge \frac{n}{\sum _{i=1}^n{s_i}} - 1\) by the AM-HM inequality, and the denominator is 1 because S is a distribution; so \(\text {E}[Z]|_{H,S} \ge n-1\), with equality exactly when S is uniform. So \(S^*\) must be the uniform distribution and, in this case, \(\text {E}[Z]|_{H,S^*} = \sum _{i=1}^{n}{\frac{1/n}{1/n}} - 1 = \sum _{i=1}^{n}1 - 1 = n - 1.\)

Proof of Corollary 2

Proof

The proof is as follows:

$$\begin{aligned} \sum _{i=1}^{n}\frac{h_i}{s^*_i} = \sum _{i=1}^{n}\frac{h_i}{\left( \frac{\sqrt{h_i}}{\sum _{j=1}^{n}{\sqrt{h_j}}}\right) } = \sum _{i=1}^{n}\sqrt{h_i}\sum _{j=1}^{n}{\sqrt{h_j}} = \left( \sum _{i=1}^{n}{\sqrt{h_i}}\right) ^2 \end{aligned}$$

Hence \(\text {E}[Z]|_{H,S^*} = \sum _{i=1}^{n}\frac{h_i}{s^*_i} - 1 = \left( \sum _{i=1}^{n}\sqrt{h_i}\right) ^2 - 1\), and the result follows.

Proof of Theorem 3

Proof

The KL-divergence between the two distributions H and \(S^*\) is defined as:

$$\begin{aligned} \text {KLD}(H\Vert S^*)&= \sum _{i=1}^n{h_i\log _2{\frac{h_i}{s^*_i}}} \\&= \sum _{i=1}^n{h_i\log _2{h_i}} - \sum _{i=1}^n{h_i\log _2{\frac{\sqrt{h_i}}{\sum _{j=1}^n {\sqrt{h_j}}}}} ~~\text {(using Theorem}~2)\\&= \frac{1}{2}\sum _{i=1}^n{h_i\log _2{h_i}} + \log _2\left( \sum _{j=1}^n{\sqrt{h_j}}\right) \left( \sum _{i=1}^n h_i\right) \\&= -\frac{1}{2}Entropy(H) + \log _2\left( \sum _{j=1}^n{\sqrt{h_j}}\right) \\&= -\frac{1}{2}Entropy(H) + \log _2\left( \text {E}[Z]|_{H,S^*} + 1\right) ^{\frac{1}{2}} ~~\text {(using Corollary}~2)\\&= \frac{1}{2}\left[ -Entropy(H) + \log _2\left( \text {E}[Z]|_{H,S^*} + 1\right) \right] \end{aligned}$$

Simplifying, we get:

$$ \text {E}[Z]|_{H,S^*} = 2^{2\text {KLD}(H||S^*)+Entropy(H)} - 1 $$
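
The identity in Theorem 3 can also be verified numerically. The sketch below (our own addition) computes Entropy(H), \(\text {KLD}(H\Vert S^*)\) and \(\text {E}[Z]|_{H,S^*}\) for a random H and checks that the relation holds.

```python
import numpy as np

rng = np.random.default_rng(2)
h = rng.dirichlet(np.ones(8))                 # hider distribution H
s_star = np.sqrt(h) / np.sqrt(h).sum()        # optimal seeker S* (Theorem 2)

entropy = -np.sum(h * np.log2(h))             # Entropy(H), in bits
kld = np.sum(h * np.log2(h / s_star))         # KLD(H || S*)
ez = np.sum(h / s_star) - 1.0                 # E[Z]|_{H,S*} (Lemma 1)

# Theorem 3: E[Z]|_{H,S*} = 2**(2*KLD + Entropy) - 1
print(ez, 2 ** (2 * kld + entropy) - 1.0)     # the two numbers should agree
```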

Proof of Lemma 3

Proof

The probability that a randomly drawn box is not in the U partition is \((1-p)\). The probability that in a sample of s boxes, none are from the U partition is \((1-p)^s\), and therefore the probability that there is at least 1 box amongst the s from the U partition is \(1 - (1-p)^s\). We want this probability to be at least \(\alpha \). That is:

$$ 1-(1-p)^s \ge \alpha $$

Rearranging to \((1-p)^s \le 1-\alpha \), taking logarithms, and noting that \(\mathrm {log}(1-p) < 0\) (which reverses the inequality), it follows that

$$ s \ge \frac{\mathrm {log}(1-\alpha )}{\mathrm {log}(1-p)} $$
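
In practice, Lemma 3 tells us how many boxes to draw so that, with probability at least \(\alpha \), at least one of them lies in the U partition. A minimal sketch, with a function name of our own choosing:

```python
import math

def sample_size(p, alpha):
    """Smallest integer s with 1 - (1 - p)**s >= alpha (Lemma 3)."""
    return math.ceil(math.log(1.0 - alpha) / math.log(1.0 - p))

# Example: if boxes in the U partition make up p = 1% of all boxes, then about
# 459 draws give a 99% chance of sampling at least one of them.
print(sample_size(p=0.01, alpha=0.99))  # -> 459
```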


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Dash, T., Srinivasan, A., Joshi, R.S., Baskar, A. (2019). Discrete Stochastic Search and Its Application to Feature-Selection for Deep Relational Machines. In: Tetko, I., Kůrková, V., Karpov, P., Theis, F. (eds) Artificial Neural Networks and Machine Learning – ICANN 2019: Deep Learning. ICANN 2019. Lecture Notes in Computer Science, vol. 11728. Springer, Cham. https://doi.org/10.1007/978-3-030-30484-3_3


  • DOI: https://doi.org/10.1007/978-3-030-30484-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30483-6

  • Online ISBN: 978-3-030-30484-3

  • eBook Packages: Computer Science, Computer Science (R0)
