# Encyclopedia of Social Network Analysis and Mining

Living Edition | Editors: Reda Alhajj, Jon Rokne

# Probabilistic Logic and Relational Models

• Manfred Jaeger
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-7163-9_157-1

## Glossary

First-Order Predicate Logic

Formal system of mathematical logic that supports reasoning about structures consisting of a domain on which certain relations and functions are defined.

Bayesian Network

A graphical representation for the joint probability distribution of a set of random variables. A Bayesian network is specified by a directed acyclic graph whose nodes are the random variables and the conditional probability distributions of the random variables given their parents in the graph.

Markov Network

A graphical representation for the joint probability distribution of a set of random variables. A Markov network is specified by an undirected graph whose nodes are the random variables and potential functions defined on the cliques of the graph.

Horn Clause

A special class of “if-then” expressions in first-order predicate logic, where the if condition is a conjunction of atomic statements and the then conclusion is a single atomic statement.

Domain

A set of specific entities (also referred to as objects, nodes, etc.) for which a (probabilistic) model is to be constructed. In the context of probabilistic logic models, domains are mostly assumed to be finite.

Factorization

Representation of a joint probability distribution of multiple random variables as a product of functions (factors) that each depend only on a subset of the random variables.

## Definition

A probabilistic logic model (PLM) is a statistical model for relationally structured data. PLMs are specified in formal probabilistic logic modeling languages (PLMLs), which are accompanied by general algorithmic tools for specifying, analyzing, and learning probabilistic models. Elements of first-order logic syntax and semantics are used to define probability spaces and distributions.

## Introduction

Probabilistic logic modeling languages provide tools to construct probabilistic models for richly structured data, specifically graph and network data, or, generally, data contained in a relational database. Their logic-based representation languages support model specifications at a high level of abstraction and generality, which facilitates the adaptation of a single model to different data domains.

A PLML consists of syntax and semantics of the representation language, inference algorithms for computing the answers to probabilistic queries, and, in most cases, statistical methods for learning a model from data.

PLMs are distinguished from probabilistic logics (e.g., Nilsson 1986; Halpern 1990) in that the latter do not define a unique probabilistic model. Instead, they provide logic-based representation languages that can be used to formulate constraints on a set of possible models. Inference from such a probabilistic logic knowledge base then amounts to inferring properties that hold for all admissible models. For simple probabilistic queries that ask for the probability of a specific proposition, this means that usually only an interval of possible values can be derived, whereas a PLM defines a unique value.
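The contrast between intervals and unique values can be illustrated with a minimal sketch. It assumes a hypothetical knowledge base that constrains only P(A) and P(B); the numbers 0.7 and 0.6 and the function names are illustrative and not taken from the cited works. The admissible models then only bound P(A ∧ B), whereas a fully specified model (here, one that assumes independence) fixes a single value.

```python
def conjunction_bounds(p_a: float, p_b: float) -> tuple[float, float]:
    """Tightest interval for P(A and B) over all distributions that
    satisfy the two marginal constraints P(A) = p_a and P(B) = p_b."""
    lower = max(0.0, p_a + p_b - 1.0)
    upper = min(p_a, p_b)
    return lower, upper


def independent_model(p_a: float, p_b: float) -> float:
    """One fully specified model: A and B independent, so P(A and B) is unique."""
    return p_a * p_b


# Hypothetical constraints P(A) = 0.7 and P(B) = 0.6.
interval = conjunction_bounds(0.7, 0.6)
point = independent_model(0.7, 0.6)
```

The single value produced by the independent model lies inside the interval, as it must for any model that satisfies the constraints.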

## Key Points

A PLML is a uniform representation and inference framework for a wide spectrum of probabilistic modeling tasks in different application areas. For example, models for community detection, link prediction, or information diffusion in social network analysis, localization and navigation in robotics, or protein secondary structure prediction in bioinformatics can all be built using the same high-level, interpretable PLML.

Fundamental analysis and learning tasks are supported by general purpose algorithms. PLMLs, thus, enable data analysis by only formulating different model hypotheses (a process that can also be automated), without the need to design suitable inference routines on a case-by-case basis. They are ideally suited for exploratory analysis and rapid prototyping. The high level of flexibility often comes at the cost that for a concrete model designed for a specific application the generic algorithms do not obtain the best possible computational performance. Model-specific customization and optimization of the inference routines may then still be required in order to handle large-scale problems on a PLM platform.

## Historical Background

The development of probabilistic logic modeling languages is based on two main foundations: inductive logic programming and probabilistic graphical models. Inductive logic programming is an area of machine learning that is concerned with learning concept definitions from examples, where both the hypothesis space of concept definitions and the examples as well as available background knowledge are expressed by formulas in first-order logic, specifically by formulas from the restricted class of Horn clauses. Originally purely qualitative, inductive logic programming approaches were extended with quantitative, probabilistic elements to also deal with uncertainty and noisy data (Sato 1995; Muggleton 1996).

Probabilistic graphical models, such as Bayesian networks and Markov networks, provide efficient inference and learning techniques for probability distributions over multivariate state spaces. However, they lack flexibility in the sense that even small changes in the underlying state space will usually require a complete reconstruction of the graphical model. Research in artificial intelligence aimed to provide more adaptable and reusable models based on representation languages that express probabilistic dependencies at a higher level of abstraction than graphical models. From probabilistic knowledge bases in such representation languages one can then automatically generate concrete graphical models for specific state spaces. Logic-based representations, especially Horn clauses, also played an important role in the design of these knowledge-based model construction approaches (Breese 1992; Breese et al. 1994; Haddawy 1994).

In a related development in a more purely statistical context, the BUGS framework was developed to support compact specifications and generic inference modules for complex Bayesian statistical models (Gilks et al. 1994). BUGS model specifications are based on elements of imperative programming languages rather than logic-based knowledge representation. In terms of functionality and semantics, they are nevertheless closely related to PLMLs.

The probabilistic Horn abduction framework (Poole 1993) already provided a clear synthesis of logic programming and graphical models. However, only from around 2001 onward (Kersting and De Raedt 2001) were the research lines of probabilistic inductive logic programming and knowledge-based model construction more broadly merged. Techniques originating in inductive logic programming have had a lasting influence on current methods for learning the logical, qualitative dependency structure of probabilistic logic models. Techniques from probabilistic graphical models provide the main tools for numerical computations, both in learning the numerical parameters of a model and in performing probabilistic inference.

## PLM Semantics

Semantically, most PLMs can be understood as a generalization of the classical Erdős-Rényi random graph model. A random graph model defines for a given finite set of nodes N a probability distribution over all graphs over N. Probabilistic logic models extend this in several ways: first, they define probability distributions not only for a single binary edge relation but for multiple random attributes and relations of arbitrary arities. Second, they can allow the definition of this distribution to be conditioned on a set of already fixed and known attributes and relations. Third, they are not limited to Boolean attributes and relations, but can also define distributions over multivalued and, to a limited extent, also numeric attributes and relations.
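As a point of reference for the generalizations that follow, the base case can be sketched in a few lines. This is an illustrative sketch of the Erdős-Rényi model with a single edge probability `p`; the node names and function names are assumptions made for the example.

```python
from itertools import combinations


def graph_probability(nodes, edges, p):
    """Probability of one undirected graph over `nodes` when every
    candidate edge is included independently with probability p."""
    present = {frozenset(edge) for edge in edges}
    prob = 1.0
    for pair in combinations(nodes, 2):
        prob *= p if frozenset(pair) in present else 1.0 - p
    return prob


def all_graphs(nodes):
    """Enumerate every undirected graph over `nodes` as a tuple of edges."""
    pairs = list(combinations(nodes, 2))
    for size in range(len(pairs) + 1):
        yield from combinations(pairs, size)


# Hypothetical node set N with three nodes: 2**3 = 8 possible graphs.
NODES = ["n1", "n2", "n3"]
total = sum(graph_probability(NODES, g, 0.3) for g in all_graphs(NODES))
```

Summing over all graphs gives 1, which confirms that the model is a distribution over the finite set of graphs on the fixed node set.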

Formally, if $$D=\{d_1,\dots,d_n\}$$ is a finite domain, then an n-ary relation on D is a subset of $$D^n$$. If n = 1, one usually speaks of an attribute, rather than a relation. In most cases, the domain will be partitioned into sub-domains of objects of different types. In predicate logic, relations are represented by relation symbols. An atom is a relation symbol followed by a list of arguments, which can be object identifiers or variables. A ground atom is an atom with only identifiers as arguments.

As an example, a bibliographic domain may contain entities $$D_P=\{p_1,\dots,p_k\}$$ of type person and $$D_A=\{a_1,\dots,a_l\}$$ of type article, so that $$D=D_P\cup D_A$$. A binary relation on D is author, taking one argument of type person and one argument of type article, i.e., $$\mathit{author}\subseteq D_P\times D_A$$. Examples of atoms then are author(X, Y), author(X, a3), and author(p1, a3), where X, Y are variable symbols and the last of these atoms is ground. The set of all ground atoms that can be formed using a given set of relation symbols and identifiers from a given domain is called the Herbrand base. A ground atom represents a Boolean variable, which in a purely logical context is a propositional variable and in a probabilistic context a Boolean random variable. A truth-value assignment to all atoms in the Herbrand base is a Herbrand interpretation. The strict limitation to Boolean variables is loosened in many PLMLs, so that ground atoms representing multivalued variables are also allowed. Thus, a ground atom department(p1) could represent an attribute with possible values CS, statistics, and engineering, and a ground atom association(p4, a2) could represent a multivalued relation with possible values author, editor, and reviewer.
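The construction of the Herbrand base can be sketched as follows. The domain sizes, the `SIGNATURES` table, and the function name are assumptions made for illustration; the sketch treats every atom as Boolean and respects the argument types of each relation.

```python
from itertools import product

# Hypothetical typed domain: two persons and three articles.
DOMAIN = {"person": ["p1", "p2"], "article": ["a1", "a2", "a3"]}

# Relation symbol -> tuple of argument types (assumed signatures).
SIGNATURES = {
    "department": ("person",),
    "author": ("person", "article"),
    "cites": ("article", "article"),
}


def herbrand_base(domain, signatures):
    """All ground atoms: each relation symbol applied to every
    type-correct tuple of identifiers from the domain."""
    atoms = []
    for relation, arg_types in signatures.items():
        for args in product(*(domain[t] for t in arg_types)):
            atoms.append((relation, args))
    return atoms


BASE = herbrand_base(DOMAIN, SIGNATURES)
# With Boolean atoms, each Herbrand interpretation is one truth assignment.
NUM_INTERPRETATIONS = 2 ** len(BASE)
```

For this small domain the base has 2 + 6 + 9 = 17 ground atoms, so even a toy example already has 2^17 Boolean Herbrand interpretations.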

In many cases, some structural information expressed by fully known input relations is available for a given domain. Variously also referred to as context, predefined, skeleton, or logical relations, these input relations are not subject to probabilistic modeling. Relations that are not fully known, and for which a PLM then provides a probabilistic model, are probabilistic relations. If $$H^{\mathrm{inp}}$$ denotes a Herbrand interpretation for the input relations over the domain D, and $$H^{\mathrm{prob}}$$ a Herbrand interpretation for the probabilistic relations, then a PLM defines conditional probability distributions
$$P\left({H}^{\mathrm{prob}}|{H}^{\mathrm{inp}}, D\right).$$
(1)
Figure 1 shows a structured input domain $$H^{\mathrm{inp}}$$ consisting of persons, articles, and the author relation between persons and articles. Also assumed to be known is a linear ordering ≺ on the articles corresponding to their temporal order of publication. Probabilistic relations on this domain could be an unobserved department attribute for persons and a binary cites relation between articles. Figure 2 shows one possible Herbrand interpretation $$H^{\mathrm{prob}}$$ for these probabilistic relations. A PLM will define a probability distribution over all such $$H^{\mathrm{prob}}$$.

In most cases, a PLM will be a generic model that defines the conditional distribution (1) for a general class of admissible input structures $$H^{\mathrm{inp}}$$. Thus, a PLM for modeling probabilistic cites and department relations would be applicable to any input domain structured like the example in Fig. 1.

Other typical examples for PLMs are models for genetic traits in a pedigree, or models for biomolecular data. In the first case, input domains may consist of family trees defined by a set of individuals and input relations mother and father. The PLM can then define a probability distribution for probabilistic relations modeling uncertain genetic properties and relationships among the individuals. For biomolecular data, an input domain may consist of a protein given by its constituent amino acids and its known linear sequence structure as an input relation. A probabilistic relation can be a binary proximity or contact relation representing which amino acids are close in the three-dimensional folding of the protein.

A PLM can also be constructed more specifically for a single input domain. This is the usual scenario when a PLM models relations in a specific database. A popular example is the Internet Movie Database, consisting of entities of types movie, actor, director, etc., and known relations such as in_cast and directed_by. Probabilistic relations modeled by a PLM can be any relations that are not fully known, or that one may want to predict for new entities added to the database.

The explicit distinction between input and probabilistic relations is not made in all PLMLs. When the distinction is not made, then all relations, in principle, are probabilistic, and the model simply defines distributions $$P({H}^{\mathrm{prob}}|D)$$ for a class of domains D. Complete knowledge of a distinguished subset of probabilistic relations for a specific input domain can still be integrated into the PLM by conditioning $$P({H}^{\mathrm{prob}}|D)$$ on the observed values of the known relations.

According to the PLM semantics as $$P({H}^{\mathrm{prob}}|{H}^{\mathrm{inp}}, D)$$, the set D of domain entities is always part of the known input, and the probability model $$P({H}^{\mathrm{prob}}|{H}^{\mathrm{inp}}, D)$$ is only a distribution over the finite set of different $$H^{\mathrm{prob}}$$ for a fixed D. Within this semantic framework, it is not possible to express uncertainty about the domain itself, i.e., how many or what entities exist in the domain D. Probabilistic logic models that go beyond the probabilistic Herbrand interpretation semantics in the narrow sense of (1) are probabilistic programming languages that define distributions over infinite outcome spaces defined by program executions (Muggleton 1996; Pfeffer 2001; Goodman et al. 2008; Ng et al. 2008), or infinite classes of Herbrand interpretations with varying domains (Milch et al. 2005). The added expressivity of these types of models comes at the cost of less semantic transparency, because semantics are usually described in procedural terms, whereas semantics for PLMs in the narrower sense can be given in a declarative manner. Furthermore, the added complexity of probabilistic programming languages often leaves stochastic simulation as the only viable inference technique.

## Modeling Languages

Probabilistic logic modeling languages are formal representation languages that typically incorporate some elements of predicate logic, or relational database design. The plethora of existing modeling frameworks can be classified along several dimensions: directed vs. undirected probabilistic models, discriminative vs. generative, and logic oriented vs. database oriented. The most fundamental distinction is between directed and undirected models, which can be understood as procedural and descriptive modeling approaches, respectively.

### Directed Models

In the procedural approach, the model specification corresponds to a direct, constructive specification of a sampling process for Herbrand interpretations. We note that procedural in this sense is to be distinguished from generative models in the usual sense of machine learning: generative there refers to any model that defines a full joint distribution of all random variables, in contrast to discriminative models that only define a conditional distribution for distinguished target variables, given the values of a different set of predictor variables.

The most common procedural approach is to define the distribution over Herbrand interpretations in terms of prior and conditional probabilities for the random variables in the Herbrand base. An interpretation can then be sampled by iteratively sampling truth values of ground atoms. If $$\alpha_1,\dots,\alpha_n$$ is an enumeration of the Herbrand base for $$H^{\mathrm{prob}}$$, the model is based on a factorization of (1) as
$$\begin{array}{ll} P\left({\alpha}_1,\dots, {\alpha}_n|{H}^{\mathrm{inp}}, D\right)& ={\prod}_{i=1}^n P\left({\alpha}_i|{\alpha}_1,\dots, {\alpha}_{i-1},{H}^{\mathrm{inp}}, D\right)\hfill \\ {}& ={\prod}_{i=1}^n P\left({\alpha}_i| pa\left({\alpha}_i\right),{H}^{\mathrm{inp}}, D\right),\hfill \end{array}$$
(2)
where $$pa(\alpha_i)\subseteq\{\alpha_1,\dots,\alpha_{i-1}\}$$ is the parent set of $$\alpha_i$$ in a graphical model representation. Concrete modeling languages enable the specification of the factors $$P(\alpha_i| pa(\alpha_i),{H}^{\mathrm{inp}}, D)$$ at a generic level that abstracts from the individual ground Herbrand atoms $$\alpha_i$$.

#### Rule-Based Representations

A widely used representation approach is based on probabilistic logic rules. The general style of this form of representation is illustrated in Table 1. It consists of a set of rules, where each rule specifies a probabilistic dependency of the probabilistic atom on the left side of the rule (also known as the head of the rule) on the probabilistic atoms on the right (the body of the rule). The dependencies may be restricted by constraints in the input relations.

Table 1: A rule-based PLM

department(P) = Stats $$\overset{0.3}{\leftarrow }$$

department(P) = CS $$\overset{0.5}{\leftarrow }$$

department(P) = Eng $$\overset{0.2}{\leftarrow }$$

cites(A, A′) $$\overset{0.05}{\leftarrow }$$ cites(A″, A′) | A′ ≺ A″ ≺ A

cites(A, A′) $$\overset{0.02}{\leftarrow }$$ department(P) = Stats, department(P′) = Stats | author(P, A), author(P′, A′)

cites(A, A′) $$\overset{0.01}{\leftarrow }$$ department(P) = CS, department(P′) = CS | author(P, A), author(P′, A′)

cites(A, A′) $$\overset{0.01}{\leftarrow }$$ department(P) = Eng, department(P′) = Eng | author(P, A), author(P′, A′)

combine(cites) = noisy-or

The first three rules of Table 1 specify the probability distribution for the 3-valued department attribute. The following rules specify the distribution for the cites relation. According to the first cites rule, the probability that an article A′ is cited by an article A increases if there are other articles A″ already citing A′. This rule is constrained via the input relation ≺ to apply only to articles A, A′, and A″ in the correct temporal sequence. The remaining three rules stipulate that the probability that A cites A′ increases if the authors of the papers belong to the same department, where the level of this increase can be different for different departments.

A ground instance of a rule is obtained by substituting concrete elements from an input domain for the logical variables in the rule. A ground rule instance is a partial specification of the factor $$P(\alpha_i| pa(\alpha_i),{H}^{\mathrm{inp}}, D)$$, where $$\alpha_i$$ is the ground head of the rule. The specifications from multiple rule groundings with head $$\alpha_i$$ have to be combined using a combining rule. In the example of Table 1, multiple rules for the cites relation are combined using the noisy-or function, which means that each applicable ground rule with head $$\alpha_i$$ is interpreted as describing an independent potential cause for $$\alpha_i$$ to be true.
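The noisy-or combination can be sketched as follows. The rule labels 0.05 and 0.01 are taken from Table 1; the particular grounding (two earlier articles already citing the target and one pair of authors who are both in CS) is a hypothetical scenario chosen for illustration.

```python
from math import prod


def noisy_or(cause_probs):
    """Probability that the head atom is true when each applicable ground
    rule (with a satisfied body) is an independent potential cause."""
    return 1.0 - prod(1.0 - p for p in cause_probs)


# Hypothetical grounding for a head atom such as cites(a5, a1):
# two ground instances of the 0.05 rule have true bodies (two earlier
# citers of a1), and one ground instance of the CS rule (0.01) applies.
p_cites = noisy_or([0.05, 0.05, 0.01])

# With no applicable ground rule there is no cause, so the probability is 0.
p_no_cause = noisy_or([])
```

Each additional applicable ground rule can only increase the resulting probability, which matches the reading of the rules in Table 1 as independent causes.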

A set of probabilistic rules defines a direct dependency relation on the probabilistic atoms of an input domain according to $$\alpha_j\in pa(\alpha_i)$$ if there is a ground instance of a rule with $$\alpha_i$$ as the head and $$\alpha_j$$ in the body. This dependency relation, together with the quantitative specifications in the rules and the combining rule, gives a definition of (1) in the form (2), which can be represented by a Bayesian network. Figure 3 shows a part of the network defined by the PLM of Table 1 for the input domain of Fig. 1.

To obtain a well-defined distribution via the factorization (2), the dependency relation on ground atoms must be acyclic, which, in turn, usually implies certain constraints on admissible input domains for a rule-based model. For the PLM of Table 1, for example, the dependency relation on probabilistic atoms will be acyclic whenever ≺ is a linear order, but may not be acyclic otherwise.
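The acyclicity condition can be checked mechanically on the ground dependency relation. The following sketch grounds only the first cites rule of Table 1 over a hypothetical set of four articles and tests for cycles with the standard-library `graphlib` module; the helper names and the deliberately ill-formed relation `not_an_order` are assumptions made for the example.

```python
from graphlib import CycleError, TopologicalSorter
from itertools import permutations

ARTICLES = ["a1", "a2", "a3", "a4"]  # hypothetical input domain


def linear_order(x, y):
    """Temporal order given by list position: a1 before a2 before ..."""
    return ARTICLES.index(x) < ARTICLES.index(y)


def not_an_order(x, y):
    """Ill-formed input relation: every article 'precedes' every other."""
    return x != y


def cites_dependencies(articles, precedes):
    """Ground cites(A, A') <- cites(A'', A') | A' < A'' < A and return
    a map from each head atom to the set of its parent atoms."""
    parents = {}
    for a, a_target in permutations(articles, 2):
        head = ("cites", a, a_target)
        parents[head] = {
            ("cites", a_mid, a_target)
            for a_mid in articles
            if a_mid not in (a, a_target)
            and precedes(a_target, a_mid)
            and precedes(a_mid, a)
        }
    return parents


def is_acyclic(parents):
    """True if the ground atoms admit a topological (sampling) order."""
    try:
        tuple(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False
```

When the check succeeds, the topological order returned by `static_order()` is also a valid order in which to sample the ground atoms according to the factorization (2).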

Modeling languages that essentially use rule-based representations include probabilistic knowledge bases (Ngo and Haddawy 1997), relational Bayesian networks (Jaeger 1997), and Bayesian logic programs (Kersting and De Raedt 2001).

#### Graphical Representations

There are several approaches for graph-based representations of PLMs. Network fragments (Laskey and Mahoney 1997; Laskey 2008) is the PLML most closely linked to probabilistic graphical models. The PLM here is represented by means of network templates that are parameterized with logical variables. Similar to the grounding operation of probabilistic logic rules, templates are instantiated by substituting elements from a concrete input domain for the variables. The collection of ground network fragments so obtained then is connected to a ground network as in Fig. 3.

A different approach to graphical representations is derived from a database perspective; it specifies PLMs in terms of probabilistic extensions of entity relationship diagrams. Probabilistic relational models in the sense of Friedman et al. (1999) first introduced this approach, which was subsequently refined and generalized in many ways, and found its most mature expression in the directed acyclic probabilistic entity relationship (DAPER) model (Heckerman et al. 2007).

Figure 4 shows as an example a DAPER representation of a citation model similar to the model of Table 1. In this diagram, boxes represent entity classes, diamonds represent relationship classes, and ovals represent attributes that can be associated with entities or relations. Dashed edges represent fixed associations, and solid arrows represent probabilistic dependencies. An input domain is given in the DAPER model by a full specification of all entity and relationship classes (in this context called a skeleton structure). Only attributes can be probabilistic in the DAPER framework. A probabilistic relation such as cites is modeled by means of an exists attribute on the objects of the corresponding relationship class. Thus, cites(a1, a2) = true is encoded as exists(cites(a1, a2)) = true. In order for all cites relations to be possibly true, all the cites(a1, a2) objects must exist in the underlying relationship class according to the skeleton structure. This condition is indicated by the label full in the graphical representation. Logical expressions on the edges are used to define the exact structure of the probabilistic dependencies among the attributes for different objects.

The DAPER model does not include a specific representation language for the specification of the conditional probability distributions for the probabilistic attributes. Any suitable way of defining these distributions in the form of tables or functions may be used. As for rule-based PLMs, combination functions may be used to handle many-to-one dependencies. Due to their orientation toward databases, probabilistic relational and DAPER models are somewhat more adapted toward also modeling numerical attributes than the logic-oriented, rule-based approaches. Many-to-one dependencies on numerical attributes are usually defined in terms of an aggregate such as mean or max of multiple numerical parents.

#### Independent Choices

While the rule-based approaches discussed above use syntax inspired by logic programming, they are semantically rooted in probabilistic graphical models. Another type of PLML, represented by Prism (Sato 1995), independent choice logic (ICL) (Poole 2008), and ProbLog (Kimmig et al. 2011), also derives its semantics from the theory of logic programs. We refer to these as IC models.

An IC model is given as a set L of logic program clauses labeled with probabilities, where different types of IC models impose different restrictions on the structure of L. The probability labels represent inclusion probabilities for the clauses in a random subset $$L'\subseteq L$$: each clause of L is included in L′ independently of other clauses, with a probability according to its label. Every subset L′ then has a probability P(L′) of being the outcome of this selection process. L′ is a logic program that has a unique least Herbrand model LHM(L′), i.e., the Herbrand interpretation in which all the program clauses are satisfied, and a minimal number of ground atoms are assigned the value true.

The probability distribution over logic programs L′ then leads to a probability distribution over all Herbrand interpretations by
$$P\left({H}^{\mathrm{prob}}\right)=\sum_{L^{\prime }:{H}^{\mathrm{prob}}=\mathrm{LHM}\left({L}^{\prime}\right)} P\left({L}^{\prime}\right).$$
(3)

IC models typically do not use a distinguished set of input relations defining $$H^{\mathrm{inp}}$$. Any known structure can be specified within L by ground atoms with selection probability 1.

Table 2 shows a ProbLog example given in Kimmig et al. (2011). The model defines a random graph over a domain of nodes a, …, e. It is given by insertion probabilities for certain candidate edges and the specification of the path relation. As in this example, it is common in IC models that probabilities less than one are only assigned to atomic facts, whereas proper rules are a fixed part of the model, i.e., included with probability 1 in the subset L′. The path relation in LHM(L′) is the transitive closure of the edge atoms included in L′.

Table 2: A ProbLog model

0.8 :: edge(a, c)

0.6 :: edge(b, c)

0.7 :: edge(a, b)

0.9 :: edge(c, d)

0.8 :: edge(c, e)

0.5 :: edge(e, d)

1.0 :: path(X, Y) ← edge(X, Y)

1.0 :: path(X, Y) ← edge(X, Z), path(Z, Y)

The last clause of Table 2 leads to a cyclic dependency of path atoms, and therefore the model cannot be directly compiled into a directed network representation as in Fig. 3. The model can still be understood as a constructive sampling process for Herbrand interpretations, where now the role that in rule-based representations is played by the acyclic dependency condition on ground atoms is filled by the iterative construction process whose least fixed point defines the least Herbrand model.
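The semantics (3) can be made concrete by brute-force enumeration for the model of Table 2. The sketch below is plain Python, not ProbLog itself, and the function names are assumptions; it enumerates all 2^6 subsets of the probabilistic edge facts, computes the path atoms of the least Herbrand model by fixed-point iteration, and sums the probabilities of the subsets in which a query atom holds.

```python
from itertools import product

# Probabilistic facts of Table 2: edge -> inclusion probability.
PROB_FACTS = {
    ("a", "c"): 0.8, ("b", "c"): 0.6, ("a", "b"): 0.7,
    ("c", "d"): 0.9, ("c", "e"): 0.8, ("e", "d"): 0.5,
}


def least_herbrand_paths(edges):
    """path atoms of the least Herbrand model: apply the two path
    clauses until no new atom can be derived (least fixed point)."""
    paths = set(edges)  # path(X, Y) <- edge(X, Y)
    changed = True
    while changed:
        changed = False
        for x, z in edges:  # path(X, Y) <- edge(X, Z), path(Z, Y)
            for z2, y in list(paths):
                if z == z2 and (x, y) not in paths:
                    paths.add((x, y))
                    changed = True
    return paths


def query_probability(source, target):
    """P(path(source, target)): total probability of all subsets L'
    whose least Herbrand model contains the query atom."""
    facts = list(PROB_FACTS)
    total = 0.0
    for included in product([False, True], repeat=len(facts)):
        weight = 1.0
        for fact, keep in zip(facts, included):
            weight *= PROB_FACTS[fact] if keep else 1.0 - PROB_FACTS[fact]
        chosen = {fact for fact, keep in zip(facts, included) if keep}
        if (source, target) in least_herbrand_paths(chosen):
            total += weight
    return total
```

For example, `query_probability("a", "c")` combines the direct edge and the route via b, giving 1 − (1 − 0.8)(1 − 0.7 · 0.6) = 0.884. Exhaustive enumeration is exponential in the number of probabilistic facts and is only meant to mirror the definition.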

### Undirected Models

In descriptive modeling approaches, a probability distribution over the space of Herbrand interpretations $$H^{\mathrm{prob}}$$ is defined by assigning weights to certain features of interpretations. The probability of an interpretation then is the normalized product of its feature weights.

For a particular input domain D, such a feature-based representation corresponds to a factorization of the joint probability distribution of the ground atoms $$\alpha_1,\dots,\alpha_n$$ in the Herbrand base as
$$P\left({\alpha}_1,\dots, {\alpha}_n\right)=\frac{1}{Z}\prod_{i=1}^l{w}_i^{\varPhi_i\left({\alpha}_{i_1},\dots, {\alpha}_{i_{k_i}}\right)}$$
(4)
$$=\frac{1}{Z} \exp \left(\sum_{i=1}^l \log \left({w}_i\right){\varPhi}_i\left({\alpha}_{i_1},\dots, {\alpha}_{i_{k_i}}\right)\right).$$
(5)

Here the $$\varPhi_i$$ are 0/1-valued feature functions that may depend on any subset $${\alpha}_{i_1},\dots, {\alpha}_{i_{k_i}}$$ of atoms, $$w_i$$ is a nonnegative weight associated with $$\varPhi_i$$, and Z is the normalizing constant. Often, the log-linear version (5) is used, so that feature weights can be arbitrary reals $${w}_i^{\prime }= \log \left({w}_i\right)$$. The factorization (4) corresponds to a Markov network whose nodes are the ground atoms $$\alpha_1,\dots,\alpha_n$$ and in which two atoms are connected by an edge if they appear together as arguments of one of the feature functions.

Feature-based PLMLs are closely related to exponential random graph models (Robins et al. 2007). They generalize standard exponential random graph models in that they define distributions over multi-relational structures. More importantly, they provide precise, expressive formal representation languages for defining feature functions $$\varPhi$$ by means of logical formulas $$\phi(X_1,\dots,X_k)$$. Substitution of domain elements $$a_h$$ for the logical variables $$X_h$$ leads to the ground feature functions $$\varPhi(\alpha_1,\dots,\alpha_k)$$.

Table 3 gives an example of a feature-based model specification in the form of a Markov logic network (Richardson and Domingos 2006), with weights set according to the log-linear factorization form (5). The first weighted formula $$\phi_1(A, A')$$ expresses the hard constraint that cites must be consistent with the temporal order ≺. A grounding of this formula by substituting domain articles a3, a7 for the variables A, A′ gives the ground feature Φ(cites(a3, a7), a3 ≺ a7). Herbrand interpretations in which this feature is true are assigned an infinite negative weight and hence probability zero. The second weighted formula is a soft constraint that adds a small negative weight for all true cites atoms and thereby makes interpretations with dense cites relations less probable. The third formula is a soft constraint that increases the probability of cites relations in which the same article A′ is cited by multiple other articles. The last two weighted formulas express that coauthors P, P′ of an article A are more likely to both belong to a CS department than for one to belong to CS and the other to Stats.

Table 3: A Markov logic network

−∞ : cites(A, A′) ∧ A ≺ A′

−0.5 : cites(A, A′)

0.8 : cites(A, A′) ∧ cites(A″, A′)

1.2 : department(P) = CS ∧ department(P′) = CS ∧ author(P, A) ∧ author(P′, A)

−0.6 : department(P) = CS ∧ department(P′) = Stats ∧ author(P, A) ∧ author(P′, A)
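The log-linear form (5) can be illustrated by grounding the first three formulas of Table 3 over a small domain and normalizing by brute force. This is a sketch under stated assumptions, not an implementation of any Markov logic system: the three-article domain is hypothetical, the department formulas are omitted, cites atoms are restricted to pairs of distinct articles, and the variables of a formula are bound to distinct articles.

```python
import math
from itertools import permutations, product

ARTICLES = ["a1", "a2", "a3"]  # hypothetical domain, ordered a1 < a2 < a3


def precedes(x, y):
    return ARTICLES.index(x) < ARTICLES.index(y)


# Herbrand base: cites atoms over ordered pairs of distinct articles.
ATOMS = [("cites", x, y) for x, y in permutations(ARTICLES, 2)]


def log_weight(world, w_single=-0.5, w_shared=0.8):
    """Sum of formula weights times their numbers of true groundings."""
    # Hard constraint (weight -inf): cites(A, A') and A precedes A'.
    if any(world[atom] and precedes(atom[1], atom[2]) for atom in ATOMS):
        return -math.inf
    n_single = sum(world[atom] for atom in ATOMS)
    n_shared = sum(
        world[("cites", x, y)] and world[("cites", z, y)]
        for x, z, y in permutations(ARTICLES, 3)
    )
    return w_single * n_single + w_shared * n_shared


def distribution(**weights):
    """Normalized probabilities of all Herbrand interpretations."""
    worlds = [dict(zip(ATOMS, values))
              for values in product([False, True], repeat=len(ATOMS))]
    scores = [math.exp(log_weight(world, **weights)) for world in worlds]
    z = sum(scores)  # normalizing constant Z
    return [(world, score / z) for world, score in zip(worlds, scores)]


def marginal(atom, **weights):
    return sum(p for world, p in distribution(**weights) if world[atom])
```

In this sketch, interpretations that violate the temporal order receive probability zero, and switching the third formula off (`w_shared=0.0`) makes the remaining cites atoms independent with marginal 1 / (1 + e^0.5). With the third formula active, the marginal of cites(a2, a1) increases because it can share the target a1 with cites(a3, a1).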

The main strength of feature-based specifications lies in their ability to model mutual dependencies that are not easily factored into an acyclic dependency structure and to construct a model in a modular fashion from a list of constraints, without the need to specify, as in rule-based approaches, a combination function that combines disparate model components into a coherent conditional probability specification. A disadvantage is that the weights attached to features have no easily understood meaning and can be difficult to calibrate to obtain a probability distribution with the correct or desired properties.

### Discriminative Models

All PLMLs mentioned so far are used to construct generative models in the sense that they define a probability distribution over full Herbrand interpretations. Discriminative PLMs are designed to solve specific classification tasks consisting of the prediction of a class attribute, or class relation. Most attention here has been given to logic-relational extensions of decision tree classifiers. As for decision tree classifiers for conventional attribute-value data, there is only a small step from purely qualitative class-label prediction models to quantitative estimation models for the posterior class-label probability distribution. Qualitative, logic-relational decision tree models have been developed in the inductive logic programming line of research (Blockeel and De Raedt 1998). Probabilistic relational decision trees were introduced in Neville et al. (2003).

Figure 5 shows a probabilistic relational decision tree loosely following the presentation of Neville et al. (2003). In this tree, the estimate for the probability that article A cites article A′ is based on two features: the number of previous articles A″ already citing A′ and whether the authors of A and A′ belong to the same department.

### Inference

PLMs can be used for a wide range of predictive (classification) and descriptive (clustering) inference tasks. As usual for probabilistic models, prediction is performed by computing the posterior probability distribution of unobserved random variables given the values of observed variables, and clustering is performed by predicting a special hidden, or latent, variable. Even though some forms of clustering (e.g., community detection) can be very relevant for the type of relational data modeled with PLMs, it is so far predictive tasks that have received the most attention.

Based on the PLM semantics as distributions over Herbrand interpretations, one can identify particular types of predictive inference tasks. The simplest form is individual classification, i.e., predicting the value of a target relation for individual entities in the domain given complete or partial observations of probabilistic attributes and relations. Formally, this means computing, for a single ground probabilistic query atom $$\alpha^q$$ from the Herbrand base for $$H^{\mathrm{prob}}$$, the posterior distribution
$$P\left({\alpha}^q|{\wedge}_i{\alpha}_i^e={v}_i,{H}^{\mathrm{inp}}, D\right),$$
(6)
where the $$v_i$$ are the observed values of the ground probabilistic evidence atoms $${\alpha}_i^e$$. For example, in the domain of Fig. 1, it may have been observed that $${\alpha}_1^e$$: department(p3) = CS, $${\alpha}_2^e$$: department(p2) = Stats, and $${\alpha}_3^e$$: cites(a8, a5) = true. Conditional on this evidence, one can compute the posterior probability of $$\alpha^q$$: department(p4). Generative PLMs typically support predictions conditioned on arbitrary partial observations of probabilistic relations. Discriminative models, in contrast, may require observations of a fixed set of evidence atoms. Thus, for example, the model in Fig. 5 requires for the prediction of cites(a5, a1) the values of all atoms cites(a2, a1), cites(a3, a1), and cites(a4, a1).

In the probability distributions modeled by PLMs, attributes and class labels of different but relationally connected entities will often be dependent. In individual classification, information provided by observed class labels can be exploited for the prediction of an unknown class label. The dependencies between class labels can be exploited further by performing collective classification, where prediction of the class labels is performed simultaneously for all entities with unobserved class label. Formally, if $$H^{\mathrm{class}}$$ denotes (Herbrand) interpretations of the class relation, then collective classification is performed by computing
$$\underset{H^{\mathrm{class}}}{ \arg \max } P\left({H}^{\mathrm{class}}|{\wedge}_i{\alpha}_i^e={v}_i,{H}^{\mathrm{inp}}, D\right),$$
(7)
where the $${\alpha}_i^e={v}_i$$ are as in (6). Collective classification also is able to take into account the dependencies between unobserved class labels of different entities.
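The two query types (6) and (7) can be illustrated by brute-force enumeration on a very small ground model. The following is a minimal sketch under invented assumptions: the three persons, the `same_venue` relation, and all probability values are hypothetical and not taken from the model of Table 1. Enumeration is only feasible because the toy domain is tiny.

```python
from itertools import product

# Hypothetical toy parameters (assumptions for illustration only).
DEPTS = ("CS", "Stats")
PRIOR = {"CS": 0.6, "Stats": 0.4}
# P(same_venue(a, b) = true | a and b are in the same department?)
P_SAME_VENUE = {True: 0.8, False: 0.2}

PERSONS = ("p1", "p2", "p3")
EVIDENCE_DEPT = {"p1": "CS"}                                  # observed class label
EVIDENCE_LINKS = {("p1", "p2"): True, ("p2", "p3"): True}     # observed relation


def joint(dept):
    """Joint probability of a full department assignment and the observed links."""
    prob = 1.0
    for person in PERSONS:
        prob *= PRIOR[dept[person]]
    for (a, b), observed in EVIDENCE_LINKS.items():
        p_true = P_SAME_VENUE[dept[a] == dept[b]]
        prob *= p_true if observed else 1.0 - p_true
    return prob


def completions():
    """Yield every department assignment that agrees with the evidence."""
    hidden = [p for p in PERSONS if p not in EVIDENCE_DEPT]
    for values in product(DEPTS, repeat=len(hidden)):
        yield {**EVIDENCE_DEPT, **dict(zip(hidden, values))}


def posterior(query_person):
    """Individual classification as in (6): posterior of one query atom."""
    scores = {d: 0.0 for d in DEPTS}
    for dept in completions():
        scores[dept[query_person]] += joint(dept)
    total = sum(scores.values())
    return {d: s / total for d, s in scores.items()}


def collective_classification():
    """Collective classification as in (7): jointly most probable labels."""
    return max(completions(), key=joint)


print(posterior("p3"))
print(collective_classification())
```

In this sketch the observed label of p1 influences the prediction for p3 only through the unobserved label of p2, which is the kind of dependency that collective classification resolves jointly.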

Depending on the nature of the relation being predicted, special types of prediction tasks can be distinguished. When the predicted variable is a binary relation, one often speaks of link prediction. An example is the prediction of the cites relation. When a predicted binary relation represents an identity relation, one speaks of entity resolution. For example, if the entities in the domain are bibliographic records for scientific articles (rather than the articles themselves), then the binary relation same_as between records stands for the fact that both records refer to the same article.

#### Techniques

When the distribution defined by a PLM can be represented by a directed or undirected probabilistic graphical model as depicted in Fig. 3, then inference can be performed using standard inference techniques for probabilistic graphical models in such a support network. An individual prediction query (6) corresponds to computing a single-node posterior distribution in the network. A collective classification task corresponds to a maximum a posteriori (MAP) hypothesis inference task, i.e., the computation of the jointly most probable configuration of a specific subset of network nodes. Clustering tasks, too, can be solved by a MAP inference computing the most probable configuration of a latent cluster attribute.

An alternative that is usually computationally more tractable than MAP inference is to compute the most probable explanation (MPE), which means that instead of (7) one computes
$$\underset{H^{\mathrm{prob}}}{ \arg \max } P\left({H}^{\mathrm{prob}}|{\wedge}_i{\alpha}_i^e={v}_i,{H}^{\mathrm{inp}}, D\right).$$
(8)

Thus, instead of computing the most probable joint configuration of only the selected set of class (or cluster) variables, one computes the most probable configuration of all unobserved probabilistic relations. The value for $${H}^{\mathrm{class}}$$ induced by the solution of (8) may be different from the one obtained by (7).
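That (7) and (8) can disagree is easy to see on a hand-made example. The sketch below assumes a hypothetical posterior over one class atom, department(p4), and one other unobserved atom, cites(a1, a2); the four probability values are invented for illustration.

```python
# Hypothetical posterior, already conditioned on the evidence.
# Keys: (value of department(p4), value of cites(a1, a2)).
POSTERIOR = {
    ("CS", True): 0.38,
    ("CS", False): 0.02,
    ("Stats", True): 0.30,
    ("Stats", False): 0.30,
}


def map_class(posterior):
    """MAP as in (7): sum out the non-class atom, then maximize."""
    marginal = {}
    for (dept, _cites), prob in posterior.items():
        marginal[dept] = marginal.get(dept, 0.0) + prob
    return max(marginal, key=marginal.get)


def mpe_class(posterior):
    """MPE as in (8): maximize over all unobserved atoms, then read off the class."""
    dept, _cites = max(posterior, key=posterior.get)
    return dept


print(map_class(POSTERIOR), mpe_class(POSTERIOR))
```

Here the single most probable joint configuration assigns CS, although Stats has the larger marginal probability, so the class value induced by the MPE differs from the MAP solution.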

Figure 3 shows a part of a support network for answering queries for the PLM of Table 1 and domain from Fig. 1. Even though the number of nodes in a support network is polynomial in the size of the input domain, exact inference on the support network typically is intractable for input domains of realistic size. Approximate inference techniques, notably sampling approaches such as Markov chain Monte Carlo simulation or importance sampling, then have to be used.
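As a rough illustration of sampling-based inference, the following sketch applies likelihood weighting, a simple form of importance sampling, to a hypothetical three-person ground model. The model, the `same_venue` links, and all numbers are assumptions made for this example; realistic support networks are far larger and usually call for more careful proposal distributions or MCMC.

```python
import random

# Hypothetical toy parameters (assumptions for illustration only).
PRIOR_CS = 0.6                       # P(department(p) = CS)
P_LINK_SAME, P_LINK_DIFF = 0.8, 0.2  # P(same_venue = true | same / different dept)


def link_likelihood(dept_a, dept_b):
    """Likelihood of an observed same_venue(a, b) = true link."""
    return P_LINK_SAME if dept_a == dept_b else P_LINK_DIFF


def estimate_posterior_cs(n_samples, seed=0):
    """Estimate P(department(p3) = CS | evidence) by likelihood weighting.

    Evidence: department(p1) = CS, same_venue(p1, p2) = true,
    same_venue(p2, p3) = true. The constant weight contributed by the
    observed department(p1) cancels in the ratio and is therefore omitted.
    """
    rng = random.Random(seed)
    d1 = "CS"
    weighted_cs = 0.0
    total_weight = 0.0
    for _ in range(n_samples):
        # Sample the unobserved atoms from their prior distributions.
        d2 = "CS" if rng.random() < PRIOR_CS else "Stats"
        d3 = "CS" if rng.random() < PRIOR_CS else "Stats"
        # Weight the sample by the likelihood of the observed links.
        weight = link_likelihood(d1, d2) * link_likelihood(d2, d3)
        total_weight += weight
        if d3 == "CS":
            weighted_cs += weight
    return weighted_cs / total_weight


print(estimate_posterior_cs(20000))
```

For this toy model the exact posterior can still be computed by enumeration, which makes it possible to check that the estimate converges to the right value.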

The computation of (conditional) probabilities (6) can also be reduced to a weighted model counting problem, i.e., the computation of the sum of weights of all models of a propositional theory, where each model has a weight defined by its truth assignment to the propositional variables (Chavira and Darwiche 2008). The reduction can be based on a given support network. Alternatively, one may also directly compile a PLM query into a data structure used for solving weighted model counting problems, especially when the usual support network construction is not possible, as in some IC frameworks (Fierens et al. 2011).
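The reduction to weighted model counting can be sketched with a brute-force counter over a very small propositional theory. The theory below (two independent probabilistic facts `f1`, `f2` and a derived atom `q` defined as their disjunction) and its weights are hypothetical; practical systems compile the theory into a tractable data structure rather than enumerating all truth assignments.

```python
from itertools import product


def wmc(variables, weights, constraints):
    """Sum the weights of all truth assignments that satisfy every constraint.

    weights[v][b] is the weight of assigning truth value b to variable v;
    the weight of a model is the product of these literal weights.
    """
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        if all(constraint(world) for constraint in constraints):
            weight = 1.0
            for v in variables:
                weight *= weights[v][world[v]]
            total += weight
    return total


# Hypothetical theory: q holds exactly when f1 or f2 holds.
VARIABLES = ["f1", "f2", "q"]
WEIGHTS = {
    "f1": {True: 0.3, False: 0.7},
    "f2": {True: 0.5, False: 0.5},
    "q": {True: 1.0, False: 1.0},   # derived atom carries neutral weights
}
THEORY = [lambda w: w["q"] == (w["f1"] or w["f2"])]


def conditional(query, evidence):
    """P(query | evidence) as a ratio of two weighted model counts."""
    numerator = wmc(VARIABLES, WEIGHTS, THEORY + [evidence, query])
    denominator = wmc(VARIABLES, WEIGHTS, THEORY + [evidence])
    return numerator / denominator


p_q = wmc(VARIABLES, WEIGHTS, THEORY + [lambda w: w["q"]])
p_f1_given_q = conditional(lambda w: w["f1"], lambda w: w["q"])
print(p_q, p_f1_given_q)
```

A conditional probability as in (6) is obtained as the ratio of two counts, one for the theory conjoined with evidence and query, and one for the theory conjoined with the evidence alone.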

The inference techniques mentioned so far all operate on the level of ground instances of the PLM, i.e., the high-level representation language is used only to construct low-level models such as the one depicted in Fig. 3, which are entirely defined in terms of ground atoms. The inference methods used on these low-level models do not take advantage of symmetries in the ground model that are due to the fact that it is constructed out of generic rules, and therefore many of its ground atoms are indistinguishable. Lifted inference techniques for PLMs are developed with the goal of leveraging these symmetries and performing basic inference operations jointly for groups of indistinguishable atoms (Poole 2003; de Salvo Braz et al. 2005). In certain cases this can reduce inference complexity from exponential to linear in the size of the domain. Lifted versions have been developed for a variety of exact and approximate inference methods, including variable elimination (Poole 2003; de Salvo Braz et al. 2005; Milch et al. 2008) and weighted model counting (Gogate and Domingos 2011; Van den Broeck et al. 2011). Theoretical limitations for lifted inference are given by lower complexity bounds for probabilistic inference in PLMs (Jaeger 2000).
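The basic idea of exploiting indistinguishable atoms can be conveyed by a deliberately simple counting argument. The sketch below is not an implementation of any of the cited lifted algorithms; it assumes n independent, interchangeable ground atoms that are each true with a hypothetical probability p, and asks for the probability that at least k of them are true. The ground computation enumerates all 2^n worlds, whereas the lifted computation only distinguishes worlds by the number of true atoms.

```python
from itertools import product
from math import comb


def ground_prob_at_least(n, p, k):
    """Enumerate all 2**n truth assignments of the n ground atoms."""
    total = 0.0
    for world in product([False, True], repeat=n):
        n_true = sum(world)
        if n_true >= k:
            total += p ** n_true * (1.0 - p) ** (n - n_true)
    return total


def lifted_prob_at_least(n, p, k):
    """Group worlds by their count of true atoms: at most n + 1 terms."""
    return sum(
        comb(n, n_true) * p ** n_true * (1.0 - p) ** (n - n_true)
        for n_true in range(k, n + 1)
    )


# Both computations agree on a small domain ...
print(ground_prob_at_least(10, 0.3, 4), lifted_prob_at_least(10, 0.3, 4))
# ... but only the lifted one remains feasible for a larger domain.
print(lifted_prob_at_least(200, 0.3, 70))
```

The number of terms in the lifted sum grows linearly with the domain size, which mirrors the exponential-to-linear reduction mentioned above for this particular, highly symmetric case.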

## Learning

A PLM is learned from data consisting of observations of input domains and their probabilistic relations. Thus, in a very general form, training data for PLMs can be written as
$$\mathcal{D}=\left(\left({H}_1^{\mathrm{prob}},{H}_1^{\mathrm{inp}},{D}_1\right),\dots, \left({H}_N^{\mathrm{prob}},{H}_N^{\mathrm{inp}},{D}_N\right)\right).$$

Often N = 1, i.e., a model is learned from a single relational structure. For example, a PLM for bibliographic data may be learned from a single bibliographic database, such as graphically depicted jointly in Figs. 1 and 2. In this case, the observed random variables, i.e., the probabilistic Herbrand atoms, do not usually obey any assumptions of being independent and/or identically distributed (IID). The fact that learning is not based on IID data often is seen as a main distinguishing feature of PLM learning as opposed to more traditional machine learning scenarios.

When N > 1, the individual observations $$\left({H}_i^{\mathrm{prob}},{H}_i^{\mathrm{inp}},{D}_i\right)$$ will typically be assumed to be independent. This case is further subdivided into the scenario where $$\left({H}_1^{\mathrm{inp}},{D}_1\right)=\ldots =\left({H}_N^{\mathrm{inp}},{D}_N\right)$$, i.e., the data consists of multiple observations of the probabilistic relations for a fixed input domain, and the scenario where $$\left({H}_1^{\mathrm{inp}},{D}_1\right)\ne \ldots \ne \left({H}_N^{\mathrm{inp}},{D}_N\right)$$, i.e., one observes multiple input domains. An example of the first scenario is that $$\left({H}^{\mathrm{inp}}, D\right)$$ represents a fixed social network with a set of nodes D and a social link relation defined by $${H}^{\mathrm{inp}}$$. The $${H}_i^{\mathrm{prob}}$$ may then contain time-stamped observations of different messages $${m}_i$$ that are propagated through the network, which can be encoded as ground atoms has_message(d, $${m}_i$$, t), where d ∈ D and t is a time stamp. From this data, a PLM for information diffusion could then be learned.

A typical example for the second scenario is biomolecular data, where data cases correspond to different molecules, which are described by their constituent atoms D i and known structural properties represented as $${H}_i^{\mathrm{inp}}$$. The probabilistic relations observed in $${H}_i^{\mathrm{prob}}$$ encode uncertain biochemical or structural properties of the molecule.

From such data PLMs for predicting chemical reactivity or (secondary) structure of a molecule could be learned.

Like most types of probabilistic models, a PLM M can be decomposed into a qualitative model structure S and numerical parameters θ:
$$M=\left( S,\boldsymbol{\theta} \right).$$
(9)

For rule-based, IC, or feature-based models, S consists of the set of logical formulas used in the model, and θ comprises the probability or weight parameters. For models based on graphical representations, S consists of the graph structure as in Fig. 4, and θ comprises parameters needed to specify the local probability distributions.

Based on a model decomposition (9), one distinguishes parameter and structure learning. Parameter learning is the task of fitting the parameter vector θ given a model structure S, where S may either be a fixed, manually designed structure, or a current candidate structure within a structure learning procedure. For parameter learning, most methods of statistical machine learning can be adapted to the PLM context, either based on maximizing the likelihood $$P\left({H}^{\mathrm{prob}}|\boldsymbol{\theta}, S,{H}^{\mathrm{inp}}, D\right)$$ or, in Bayesian approaches, the posterior probability $$P\left(\boldsymbol{\theta}|{H}^{\mathrm{prob}}, S,{H}^{\mathrm{inp}}, D\right)$$. Like probabilistic inference, optimization of these score functions usually is performed on the basis of ground instances of the PLM, using existing learning techniques for such ground models. Thus, for example, in the case of directed models, one may construct a support Bayesian network and apply parameter learning techniques for Bayesian networks. These techniques have to be slightly adapted, however: seen as a conventional Bayesian network, each node in a support network like the one of Fig. 3 has its own parameter vector, which can only be learned from data containing multiple observations of the node and its parents in the network. PLM parameter learning from a single observation of the nodes is enabled by parameter tying: in the PLM-induced support network, many parameters in different nodes are known to be identical, since they are equal to (functions of) the original parameters in the PLM. Thus, for example, all the nodes department(p1), … , department(p5) share the same parameters $${\theta}_1=P\left(\mathrm{department}\left({p}_i\right)=\mathrm{CS}\right)$$, $${\theta}_2=P\left(\mathrm{department}\left({p}_i\right)=\mathrm{Stats}\right)$$, and $${\theta}_3=P\left(\mathrm{department}\left({p}_i\right)=\mathrm{Eng}\right)$$. Thereby, a single observation of values for all nodes can be sufficient to estimate the model parameters.
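The effect of parameter tying can be shown with a minimal sketch for the department example. The observed values below are invented for illustration. Because all five nodes share one categorical distribution, the maximum likelihood estimate of the tied parameters is simply the relative frequency of each value across the nodes of a single observed structure.

```python
from collections import Counter

# Hypothetical single observation of the tied nodes department(p1..p5).
OBSERVED = {
    "p1": "CS",
    "p2": "Stats",
    "p3": "CS",
    "p4": "Eng",
    "p5": "CS",
}
VALUES = ("CS", "Stats", "Eng")


def tied_ml_estimate(observed, values):
    """Maximum likelihood estimate of a categorical parameter shared by all nodes."""
    counts = Counter(observed.values())
    n = len(observed)
    return {value: counts[value] / n for value in values}


theta = tied_ml_estimate(OBSERVED, VALUES)
print(theta)
```

Without tying, each of the five nodes would have its own parameter vector and a single observed value, from which no meaningful estimate could be obtained.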

Structure learning usually is performed by local search methods that generate candidate structures $${S}_1,\dots, {S}_i,{S}_{i+1},\dots$$, where $${S}_{i+1}$$ is obtained from $${S}_i$$ by a small, local model modification. In modeling languages based on collections of logical formulas, such local modifications typically consist of addition/deletion of an atomic expression from a formula. Candidate structures are evaluated using a score function
$$\mathrm{score}\left( S,{H}^{\mathrm{prob}},{H}^{\mathrm{inp}}\right)=\mathrm{score}\left( P\left({H}^{\mathrm{prob}}|{\boldsymbol{\theta}}^{\ast }, S,{H}^{\mathrm{inp}}, D\right), S\right),$$
where $${\boldsymbol{\theta}}^{\ast }$$ is the parameter vector fitted by parameter learning for the structure S. The score is increasing in the likelihood $$P\left({H}^{\mathrm{prob}}|{\boldsymbol{\theta}}^{\ast }, S,{H}^{\mathrm{inp}}, D\right)$$ and decreasing in the model complexity of S. Alternatively, Bayesian scores can be used that incorporate a prior distribution over model structures.
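One possible instance of such a score is a BIC-style penalized log-likelihood. The sketch below compares two hypothetical candidate structures for a link relation: one with a single link probability, and one in which the link probability depends on whether two entities belong to the same department. The counts are invented, and treating the ground atoms as the observations that determine the penalty term is a simplifying assumption, since relational data is generally not IID.

```python
from math import log


def max_log_likelihood(groups):
    """Log-likelihood at the fitted parameters.

    Each group is a pair (n_true, n_false) of ground atoms that share one
    Bernoulli parameter; its maximum likelihood estimate is n_true / n.
    """
    loglik = 0.0
    for n_true, n_false in groups:
        n = n_true + n_false
        for count in (n_true, n_false):
            if count:  # a zero count contributes nothing to the sum
                loglik += count * log(count / n)
    return loglik


def bic_style_score(groups):
    """Penalized score: increasing in likelihood, decreasing in parameter count."""
    n_obs = sum(n_true + n_false for n_true, n_false in groups)
    n_params = len(groups)
    return max_log_likelihood(groups) - 0.5 * n_params * log(n_obs)


# Hypothetical data: 20 candidate links, 10 within and 10 across departments.
S0 = [(10, 10)]          # one shared link probability
S1 = [(8, 2), (2, 8)]    # link probability depends on same-department feature

print(bic_style_score(S0), bic_style_score(S1))
```

In a local search procedure, a function of this kind would be evaluated for each neighboring structure, and the search would move to the best-scoring candidate.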

## Key Applications

As mentioned above, a key feature of PLMLs is their generality and flexibility, which leads to a very broad range of possible applications, especially in a prototype development phase. In the context of social network analysis, PLMs have mostly been applied to prediction tasks, including individual and collective classification, and link prediction.

## Future Directions

The development of lifted inference techniques that could provide scalable inference for rich classes of PLMs is an active research area. Various ways of extending the expressivity of PLMLs are also a topic of current research. This includes the step from PLMLs in the somewhat narrower sense to probabilistic programming languages, integration of numerical random attributes and relations into PLMLs, and the extension of PLMLs to decision support models by integrating utility and decision variables.

## References

1. Blockeel H, De Raedt L (1998) Top-down induction of first-order logical decision trees. Artif Intell 101(1–2):285–297
2. Breese JS (1992) Construction of belief and decision networks. Comput Intell 8(4):624–647
3. Breese JS, Goldman RP, Wellman MP (1994) Introduction to the special section on knowledge-based construction of probabilistic decision models. IEEE Trans Syst Man Cybern 24(11):1577–1579
4. Chavira M, Darwiche A (2008) On probabilistic inference by weighted model counting. Artif Intell 172:772–799
5. de Salvo Braz R, Amir E, Roth D (2005) Lifted first-order probabilistic inference. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), pp 1319–1325
6. Fierens D, Van den Broeck G, Thon I, Gutmann B, De Raedt L (2011) Inference in probabilistic logic programs using weighted CNF’s. In: Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011). AUAI Press, Corvallis
7. Friedman N, Getoor L, Koller D, Pfeffer A (1999) Learning probabilistic relational models. In: Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99)
8. Gilks WR, Thomas A, Spiegelhalter DJ (1994) A language and program for complex bayesian modelling. Statistician 43(1):169–177
9. Gogate V, Domingos P (2011) Probabilistic theorem proving. In: Proceedings of the 27th Conference of Uncertainty in Artificial Intelligence (UAI-11). AUAI Press, Corvallis
10. Goodman ND, Mansinghka VK, Roy D, Bonawitz K, Tenenbaum JB (2008) Church: a language for generative models. In: Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-08). AUAI Press, Corvallis
11. Haddawy P (1994) Generating Bayesian networks from probability logic knowledge bases. In: Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence (UAI-94). Morgan Kaufmann, San Francisco, pp 262–269
12. Halpern J (1990) An analysis of first-order logics of probability. Artif Intell 46:311–350
13. Heckerman D, Meek C, Koller D (2007) Probabilistic entity-relationship models, PRMs, and plate models. In: Getoor L, Taskar B (eds) Introduction to statistical relational learning. MIT Press, Cambridge, MA
14. Jaeger M (1997) Relational bayesian networks. In: Geiger D, Shenoy PP (eds) Proceedings of the 13th Conference of Uncertainty in Artificial Intelligence (UAI-97). Morgan Kaufmann, San Francisco, pp 266–273
15. Jaeger M (2000) On the complexity of inference about probabilistic relational models. Artif Intell 117:297–308
16. Kersting K, De Raedt L (2001) Towards combining inductive logic programming and bayesian networks. In: Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-2001). Springer, Berlin, Heidelberg, pp 118–131
17. Kimmig A, Demoen B, De Raedt L, Santos Costa V, Rocha R (2011) On the implementation of the probabilistic logic programming language problog. Theory Pract Logic Program 11(2–3):235–262
18. Laskey KB (2008) Mebn: a language for first-order bayesian knowledge bases. Artif Intell 172(2–3):140–178. doi:10.1016/j.artint.2007.09.006
19. Laskey KB, Mahoney SM (1997) Network fragments: representing knowledge for constructing probabilistic models. In: Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI–97). Morgan Kaufmann Publishers, San Francisco, pp 334–341
20. Milch B, Marthi B, Russell S, Sontag D, Ong D, Kolobov A (2005) Blog: probabilistic logic with unknown objects. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI-05), pp 1352–1359
21. Milch B, Zettlemoyer LS, Kersting K, Haimes M, Kaelbling LP (2008) Lifted probabilistic inference with counting formulas. In: Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI-08). AAAI Press, Menlo Park
22. Muggleton S (1996) Stochastic logic programs. In: De Raedt L (ed) Advances in Inductive Logic Programming. IOS Press, Washington, DC, pp 254–264
23. Neville J, Jensen D, Friedland L, Hay M (2003) Learning relational probability trees. In: Proceedings of The 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-03). ACM, New York
24. Ng KS, Lloyd JW, Uther WTB (2008) Probabilistic modelling, inference and learning using logical theories. Ann Math Artif Intell 54(1–3):159–205
25. Ngo L, Haddawy P (1997) Answering queries from context-sensitive probabilistic knowledge bases. Theor Comput Sci 171:147–177
26. Nilsson N (1986) Probabilistic logic. Artif Intell 28:71–88
27. Pfeffer A (2001) IBAL: a probabilistic rational programming language. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01)
28. Poole D (1993) Probabilistic horn abduction and Bayesian networks. Artif Intell 64:81–129
29. Poole D (2003) First-order probabilistic inference. In: Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03)
30. Poole D (2008) The independent choice logic and beyond. In: De Raedt L, Frasconi P, Kersting K, Muggleton S (eds) Probabilistic inductive logic programming, lecture notes in artificial intelligence, vol 4911. Springer, Berlin, pp 222–243
31. Richardson M, Domingos P (2006) Markov logic networks. Mach Learn 62(1–2):107–136
32. Robins G, Pattison P, Kalish Y, Lusher D (2007) An introduction to exponential random graph (p*) models for social networks. Soc Networks 29(2):173–191
33. Sato T (1995) A statistical learning method for logic programs with distribution semantics. In: Proceedings of the 12th International Conference on Logic Programming (ICLP’95). MIT Press, Cambridge, pp 715–729
34. Van den Broeck G, Taghipour N, Meert W, Davis J, De Raedt L (2011) Lifted probabilistic inference by first-order knowledge compilation. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI-11)

1. De Raedt L (2008) Logical and relational learning. Springer, Berlin
2. De Raedt L, Frasconi P, Kersting K, Muggleton S (eds) (2008) Probabilistic inductive logic programming, lecture notes in artificial intelligence, vol 4911. Springer, Berlin
3. Getoor L, Taskar B (eds) (2007) Introduction to statistical relational learning. MIT Press, Cambridge, MA