# Encyclopedia of Social Network Analysis and Mining

Living Edition
| Editors: Reda Alhajj, Jon Rokne

# Probability Matrices

• Andrew Marchese
• Vasileios Maroulas
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4614-7163-9_158-1

## Glossary

Aperiodic Markov Chain

A Markov chain in which the period of every state is one

Discrete Time Markov Chain

A sequence of random variables $$X={\left\{{X}_n\right\}}_{n=1}^{\infty }$$ taking values in some state space S such that the probability of $${X}_n$$ moving to any state depends only upon its current state and the time n

Irreducible Markov Chain

A Markov chain in which there is a positive probability of moving from any state to any other state in a finite amount of time

Positive Recurrent Markov Chain

A Markov chain is called positive recurrent if for every state $$i\in S$$, $${\sum}_{n=1}^{\infty } n{f}_{ii}^{(n)}<\infty$$, where $${f}_{ii}^{(n)}= P\left( \inf \left\{ m\ge 1: {X}_m = i \mid {X}_0 = i\right\} = n\right)$$

Probability Matrix

A matrix p describing the probability of traveling between states in a Discrete Time Markov Chain

## Definition

This section provides the definition of a probability matrix in the context of Markov chains (Allen 2003). Given a discrete time homogeneous Markov chain $$X={\left\{{X}_n\right\}}_{n=1}^{\infty }$$, the probability matrix (sometimes called the transition matrix or the probability transition matrix) is a matrix p(i,  j) such that for any states $$i, j, {i}_0, \dots, {i}_{n-1}$$
$$P\left({X}_{n+1}= j \mid {X}_n= i,\ {X}_{n-1}={i}_{n-1}, \dots,\ {X}_0={i}_0\right) = p\left( i, j\right)$$

In other words, the matrix p contains the probability of moving from state i to state j at any time n, regardless of the previous states of the Markov chain.

## Introduction

This section provides an introduction to Markov chains and basic facts and results about Markov chains and their associated probability matrices (Allen 2003). Suppose we have a sequence of random variables $${\left\{{X}_n\right\}}_{n=1}^{\infty }$$ taking values in some state space S, which may be finite or infinite. The sequence of random variables satisfies the Markov property if for all $$n\in {\mathbb{Z}}^{+}$$ and all states $${i}_n, {i}_{n-1}, \dots, {i}_0\in S$$
$$P\left({X}_n={i}_n|{X}_{n-1}={i}_{n-1},\dots, {X}_0={i}_0\right)= P\left({X}_n={i}_n|{X}_{n-1}={i}_{n-1}\right)$$
This means the probability of movement from the current state does not depend on any previous values of the sequence, and so a sequence satisfying the Markov property has no memory. As stated above, if a sequence of random variables satisfies such a property, it is called a Markov chain, and the associated probabilities
$$P\left({X}_n={i}_n|{X}_{n-1}={i}_{n-1}\right)= p\left({i}_{n-1}, {i}_n\right)$$
can be stored in a matrix p called the probability matrix. It is possible for the probabilities in a Markov chain to depend on time n. Unless stated otherwise, we assume the Markov chain is homogeneous. That is, the Markov chain probabilities do not depend on the time n.
Note that it can be shown that the multistep probability corresponds to powers of the probability matrix. In particular
$$P\left({X}_{n+ m}={i}_{n+ m}|{X}_n={i}_n\right)={p}^m\left({i}_n,\ {i}_{n+ m}\right).$$
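The multistep identity above can be checked numerically. Below is a minimal sketch (using NumPy; the 2-state matrix is purely illustrative) comparing an entry of $${p}^2$$ against an explicit sum over intermediate states:

```python
import numpy as np

# A hypothetical 2-state chain used only for illustration.
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Two-step probabilities are entries of the matrix square.
p2 = np.linalg.matrix_power(p, 2)

# Summing over the intermediate state gives the same number:
# paths 1 -> 1 -> 2 and 1 -> 2 -> 2.
two_step = p[0, 0] * p[0, 1] + p[0, 1] * p[1, 1]
print(p2[0, 1], two_step)
```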

A natural question arising at this point is “What is the behavior of p m as m → ∞?” After defining some key properties of Markov chains, this question will be revisited.

## Key Points

We will focus on basic properties of Markov chains and probability matrices. Using these definitions, we introduce key results related to Markov chains such as existence of a “stationary” or equilibrium distribution of states and convergence to this distribution. This notion of stationary distribution is important in the study of Markov chains and in applications such as webpage ranking systems (Page et al. 1998).

## Historical Background

Markov chains get their name from the Russian mathematician Andrey Andreyevich Markov (Hayes 2014). Late in his life, Markov began developing notions regarding chains of probabilities. In 1913, Markov published a paper describing how a poem could be analyzed using a chain-like structure. He examined every pair of consecutive letters in a 20,000-letter sample of text, recording whether each pair was vowel-vowel, vowel-consonant/consonant-vowel, or consonant-consonant. He then argued that vowel-vowel pairs occurred far less frequently than chance would predict, and so the letters of a language are not independent. Markov also proved that in a small, two-state Markov chain with all positive probabilities, the powers of the probability matrix converge. In the next section, we will see that this result holds in much greater generality.

In the time since Markov’s introduction of this probability object, Markov chains have been used for a wide variety of applications. These applications range from internet search applications (Page et al. 1998) to musical applications (McAlpine et al. 1999). Later on, an in-depth description will be given of Google’s PageRank algorithm as well as an application of Markov chains to machine learning.

## Probability Matrices

### Definition 1

A Markov chain $$\left\{{X}_n\right\}$$ with state space S is said to be irreducible if
$$\forall\ {i}_1,{i}_2\in S,\ \exists\ M\in {\mathbb{Z}}^{+}\ \mathrm{s}.\mathrm{t}.\ P\left({X}_{1+ M}={i}_2 \mid {X}_1={i}_1\right)>0$$

This means a Markov chain is irreducible if it is possible to travel from any state to any other state in a finite number of time steps. The notion of irreducibility will be a very important notion in a main result below.

Another important notion is the idea of periodicity.

### Definition 2

Let $$\left\{{X}_n\right\}$$ be a Markov chain with probability matrix p and state space S. For a state $$i\in S$$, consider all $$n\in {\mathbb{Z}}^{+}$$ for which $${p}^n\left( i, i\right)>0$$. The period of i is defined as
$$k(i)=\mathrm{g}.\mathrm{c}.\mathrm{d}.\left\{ n \mid {p}^n\left( i, i\right)>0,\ n\in {\mathbb{Z}}^{+}\right\}$$

If k(i) > 1, i is periodic of period k(i). Otherwise, if k(i) = 1, i is said to be aperiodic.
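Under this definition, the period can be estimated numerically by truncating the defining set. The sketch below (Python/NumPy; the `period` helper and the cutoff `max_n` are illustrative choices of ours) applies this to the cyclic matrix $${P}_2$$ from Example 3 below, whose states all have period 3:

```python
import numpy as np
from math import gcd
from functools import reduce

def period(p, i, max_n=50):
    """Estimate the period of state i as the g.c.d. of all n <= max_n
    with p^n(i, i) > 0 (a truncation of the defining set)."""
    returns = []
    pn = np.eye(len(p))
    for n in range(1, max_n + 1):
        pn = pn @ p                 # pn now holds p^n
        if pn[i, i] > 0:
            returns.append(n)
    return reduce(gcd, returns) if returns else 0   # 0: no return observed

# The 3-cycle chain from Example 3 below: state 1 -> 2 -> 3 -> 1.
p2 = np.array([[0., 1., 0.],
               [0., 0., 1.],
               [1., 0., 0.]])
periods = [period(p2, i) for i in range(3)]
print(periods)
```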

The following theorem is immediate.

### Theorem 1

If p(i,  i) > 0, then k(i) = 1.

Proof If $$p\left( i, i\right)>0$$, then $$1\in \left\{ n \mid {p}^n\left( i, i\right)>0,\ n\in {\mathbb{Z}}^{+}\right\}$$, and so the g.c.d. of this set is 1.

We now consider the notion of when a state will be first visited, or what the probability is that a state will ever be visited.

### Definition 3

Define
$${T}_{i, j}= \inf \left\{ m\ge 1:{X}_m= j|{X}_0= i\right\}$$
and
$${f}_{i j}^{(n)}= P\left({T}_{i, j}= n\right)$$
$${f}_{ij}^{(n)}$$ is the probability that if the Markov chain starts at $${X}_0= i$$, the first time state j is reached is at time n. Notice that $${f}_{ii}^{(n)}$$ can be thought of as the probability that the Markov chain returns to state i for the first time in n steps.

A state i ∈ S is transient if $${\sum}_{m=1}^{\infty }{f}_{ii}^{(m)}<1$$ and is recurrent if $${\sum}_{m=1}^{\infty }{f}_{ii}^{(m)}=1$$.

Given this definition, it makes sense to consider $$E\left[{T}_{ii}\right]$$, the expected return time to a state.

### Definition 4

The mean recurrence time for a recurrent state i ∈ S is defined as
$$E\left[{T}_{ii}\right]=\sum_{n=1}^{\infty }{ n f}_{ii}^{(n)}$$

A state is called positive recurrent if this expectation is finite and null recurrent otherwise.

In the case of a finite state space, positive recurrence can be shown by checking for irreducibility.

### Theorem 2

A finite, irreducible Markov chain is positive recurrent.

We now consider the notion of stationary distributions, which relate to the limiting behavior of Markov chains under certain conditions.

### Definition 5

Given a Markov chain X with state space $$S=\left\{{i}_1,\dots \right\}$$, a stationary distribution is a row vector π such that
$$\pi p=\pi$$
where $$\pi =\left[{\pi}_1,\dots \right]$$, $$0\le {\pi}_i\le 1$$, and $${\sum}_{m=1}^{\infty }{\pi}_m=1$$. Note this will be a finite sum if the state space is finite.

The stationary distribution can be thought of as an equilibrium of a Markov chain, as demonstrated by the following Theorem (Serfozo 2009).

### Theorem 3

Let X be an irreducible Markov chain. Then X has a stationary distribution if and only if all of its states are positive recurrent. In this case, the stationary distribution is unique and is of the form
$${\pi}_j=\frac{1}{\mu_j}$$
where $${\mu}_j={E}_j\left[\min \left\{ n\ge 1 \mid {X}_n= j\right\}\right]$$ is the mean return time to state j given that the chain starts at j.
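Theorem 3 can also be checked by simulation. The sketch below (Python/NumPy; the `mean_return_time` helper is our own illustrative construction) uses the matrix from Example 5, whose stationary distribution works out to $$\pi =\left[\frac{11}{34}, \frac{15}{34}, \frac{8}{34}\right]$$, so the estimate should land near $$1/{\pi}_1 = 34/11\approx 3.09$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# The 3-state chain from Example 5 below.
p = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4]])

def mean_return_time(p, j, trials=10000):
    """Estimate E[T_jj] by repeatedly simulating the chain from state j
    and averaging the number of steps until it first returns to j."""
    total = 0
    for _ in range(trials):
        state, n = j, 0
        while True:
            state = rng.choice(len(p), p=p[state])
            n += 1
            if state == j:
                break
        total += n
    return total / trials

est = mean_return_time(p, 0)
print(est)  # should be close to 34/11, i.e. 1 / pi_1
```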

Under certain conditions, we can now characterize the limiting behavior of the probability matrix p m as m approaches infinity (Allen 2003).

### Theorem 4

Suppose a Markov chain X with state space S and probability matrix p is irreducible, positive recurrent, and aperiodic. Then there exists a unique positive stationary distribution π such that
$$\underset{m\to \infty }{ \lim }{p}^m\left( i, j\right)={\pi}_j$$

The result above shows that every row of the probability matrix converges to the stationary distribution in the limit, explaining the limiting behavior of the chain.
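As a numerical illustration of Theorem 4 (a sketch in NumPy; the 2-state matrix is hypothetical but irreducible, aperiodic, and positive recurrent), a high power of p has both rows flattened onto π:

```python
import numpy as np

# A hypothetical 2-state chain satisfying the hypotheses of Theorem 4.
p = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# By Theorem 4, every row of p^m approaches pi as m grows.
pm = np.linalg.matrix_power(p, 200)
print(pm)  # both rows are approximately pi = [2/3, 1/3]
```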

## Examples

### Example 1

Suppose the weather on each day is represented as a Markov chain $$X={\left\{{X}_n\right\}}_{n=1}^{\infty },$$ with state 1 indicating a clear day and state 2 indicating a rainy day. Assume the following probabilities:

$$\begin{array}{l} P\left({X}_1=\mathrm{Clear} \mid {X}_0=\mathrm{Clear}\right)=.8\\ {} P\left({X}_1=\mathrm{Rainy} \mid {X}_0=\mathrm{Clear}\right)=.2\\ {} P\left({X}_1=\mathrm{Clear} \mid {X}_0=\mathrm{Rainy}\right)=.6\\ {} P\left({X}_1=\mathrm{Rainy} \mid {X}_0=\mathrm{Rainy}\right)=.4\end{array}$$
This can be represented as the probability matrix P:
$$P=\left[\begin{array}{cc} .8 & .2\\ {} .6 & .4\end{array}\right]$$

### Example 2

Given the Markov chain presented in Example 1 above, what is the probability it will be clear in 3 days if it is clear today? To answer this, compute $${P}^3$$:

$${P}^3=\left[\begin{array}{cc} .752 & .248\\ {} .744 & .256\end{array}\right]$$
We see that using this matrix,
$$P\left({X}_3=\mathrm{Clear}|{X}_0=\mathrm{Clear}\right)=.752$$
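This computation can be verified with a few lines of NumPy:

```python
import numpy as np

# The weather matrix from Example 1; cube it to get 3-step probabilities.
P = np.array([[0.8, 0.2],
              [0.6, 0.4]])
P3 = np.linalg.matrix_power(P, 3)
print(P3)  # entry (0, 0) is P(X_3 = Clear | X_0 = Clear)
```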

### Example 3

Consider the two Markov chains X = {X n } and Y = {Y n } given by probability matrices P 1 and P 2, respectively.

$${P}_1= \left[\begin{array}{ccc} .1 & .9 & 0\\ {} .5 & .5 & 0\\ {} .3 & .3 & .4\end{array}\right]\kern2em {P}_2= \left[\begin{array}{ccc} 0 & 1 & 0\\ {}0 & 0 & 1\\ {}1 & 0 & 0\end{array}\right]$$

For the Markov chain X, note that states 1 and 2 can only ever get to each other, and so it is impossible to reach state 3 from states 1 and 2. Thus, the Markov chain X is not irreducible.

On the other hand, the Markov chain Y is irreducible. Although state 1 cannot reach state 3 directly, it can get there in finitely many steps (two), and similarly for every other pair of states.

### Example 4

Suppose a Markov chain X has probability matrix
$$P= \left[\begin{array}{ccc} .1 & .6 & .3\\ {} .5 & .4 & .1\\ {} .3 & .3 & .4\end{array}\right]$$
It is easy to see that $${f}_{11}^{(1)}=.1$$. But what about $${f}_{11}^{(2)}$$? The Markov chain must first leave state 1 and then return in one step. There are two ways for this to happen: one way is to go first from 1 to 2 and then back from 2 to 1, and the other way is to go from 1 to 3 and then 3 to 1. This gives
$${f}_{11}^{(2)}= p\left(1,2\right) p\left(2,1\right)+ p\left(1,3\right) p\left(3,1\right)=(.6)(.5)+(.3)(.3)=.39$$
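The same first-return probabilities can be computed programmatically via the standard recursion $${f}_{ij}^{(n)}={\sum}_{k\ne j} p\left( i, k\right){f}_{kj}^{(n-1)}$$ with $${f}_{ij}^{(1)}= p\left( i, j\right)$$ (the helper below is an illustrative sketch, not taken from the source):

```python
import numpy as np

# The matrix from Example 4.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4]])

def first_return(P, i, j, n):
    """f_ij^(n): probability the chain started at i first hits j at step n."""
    if n == 1:
        return P[i, j]
    # To first hit j at step n, step to some k != j, then first-hit j in n-1.
    return sum(P[i, k] * first_return(P, k, j, n - 1)
               for k in range(len(P)) if k != j)

f2 = first_return(P, 0, 0, 2)
print(f2)  # matches f_11^(2) = .39 computed above
```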

### Example 5

Consider a Markov chain X with probability matrix
$$P= \left[\begin{array}{ccc} .1 & .6 & .3\\ {} .5 & .4 & .1\\ {} .3 & .3 & .4\end{array}\right]$$
The stationary distribution is $$\pi =\left[{\pi}_1, {\pi}_2, {\pi}_3\right]$$ such that
$$\pi P=\pi \iff \pi P-\pi =0\iff \pi \left( P- I\right)=0$$
This amounts to finding the normalized left eigenvector of P corresponding to the eigenvalue of 1. We can do this manually by solving the system of equations generated by
$$\pi \left[\begin{array}{ccc} -.9 & .6 & .3\\ {} .5 & -.6 & .1\\ {} .3 & .3 & -.6\end{array}\right] = 0$$
which, together with the normalization condition, yields
$$\begin{array}{r} -.9{\pi}_1+.5{\pi}_2+.3{\pi}_3=0\\ {} .6{\pi}_1-.6{\pi}_2+.3{\pi}_3=0\\ {} .3{\pi}_1+.1{\pi}_2-.6{\pi}_3=0\\ {}{\pi}_1+{\pi}_2+{\pi}_3=1\end{array}$$
which gives a solution of $$\pi =\left[\frac{11}{34},\ \frac{15}{34},\ \frac{8}{34}\right]$$.
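The hand computation can be confirmed numerically by extracting the left eigenvector of P for eigenvalue 1, i.e., a right eigenvector of $${P}^{T}$$ (a sketch in NumPy):

```python
import numpy as np

# The matrix from Example 5.
P = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.4, 0.1],
              [0.3, 0.3, 0.4]])

# The stationary distribution is the left eigenvector of P for eigenvalue 1,
# normalized so its entries sum to 1.
vals, vecs = np.linalg.eig(P.T)
v = vecs[:, np.argmin(np.abs(vals - 1))].real
pi = v / v.sum()
print(pi)  # ~ [11/34, 15/34, 8/34]
```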

## Key Applications

In this section the PageRank algorithm (Page et al. 1998) will be described in terms of a Markov chain. The PageRank algorithm was developed as a method for ranking the importance of webpages in a search result. First we explain the model of the web used below.

For each webpage u, $${F}_u$$ will be the set of all pages u links to, and $${B}_u$$ will be the set of all pages linking to u. Let $${N}_u=\left|{F}_u\right|$$. For example, in Fig. 2, for u = 4, $${F}_4=\left\{1\right\}$$, $${B}_4=\left\{1,2,3\right\}$$, and so $${N}_4=1$$. The goal of the PageRank algorithm is to determine how much importance each webpage should be given. Intuitively, webpage 4 should be important since it has many links going to it, but in turn, webpage 1 should also be important because the highly referenced webpage 4 links to it.
In order to account for this, PageRank was introduced as follows: For each page u, define
$$R(u)= c\sum_{v\in {B}_u}\frac{R(v)}{N_v}+ cE(u)$$
where c is a normalizing factor, maximized so that $$\left|\right| R\left|\right|{}_1=1$$; E(u) is the entry of a vector E corresponding to a web surfer jumping to page u at random from any other page; and for a vector $$v=\left({v}_1,\dots, {v}_n\right)$$, $$\left|\right| v\left|\right|{}_1={\sum}_{i=1}^n\left|{v}_i\right|$$. The addition of E is part of the “random surfer model.” For now, think of E as a uniform distribution over all webpages.

Another way to view this is in matrix form. Define a matrix A such that $${A}_{u, v}=\frac{1}{N_u}$$ if there is a link from u to v and 0 otherwise. In other words, a web user on page u will randomly click on a link on that page. With this definition, R can be equivalently defined as R = c(AR + E), or, since $$\left|\right| R\left|\right|{}_1=1$$, it can be written as $$R = c\left( A+ E\times \mathbf{1}\right) R$$, where $$\mathbf{1}$$ is a vector of all 1’s. In this notation, it is clear that R is an eigenvector of $$c\left( A+ E\times \mathbf{1}\right)$$ with eigenvalue 1.

Now we relate this to Markov chains and probability matrices. Let the states of the Markov chain be the webpages, with transitions between the states driven by links and the random surfer model as above. Does this Markov chain have a stationary distribution? If so, how can it be computed?

Referring back to Theorem 4, the powers of the probability matrix converge to a stationary distribution in the limit if the Markov chain is irreducible, positive recurrent, and aperiodic. First we check irreducibility. Since the random surfer model adds in a chance to go to any website from any other website, the chain is irreducible. The random surfer model also adds in a chance that from any given website u, you will randomly go back to u, and so the chain is aperiodic. Since this state space is finite, this Markov chain is also positive recurrent (because it is irreducible, by Theorem 2).

Since all of these properties are fulfilled, the PageRank Markov chain has a stationary distribution, and for the probability matrix p of this Markov chain, the rows of $${p}^n$$ converge to this stationary distribution as n → ∞. This means the distribution can be estimated in an iterative fashion (Page et al. 1998).
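A minimal sketch of this iteration is given below, on a hypothetical four-page web loosely mirroring the structure described above (three pages all linking to one hub, which links back to the first page). It uses the now-conventional damping form with a constant d = 0.85 in place of the paper’s c and E; this is an assumption of the sketch, not the original formulation:

```python
import numpy as np

# A hypothetical 4-page web: links[u] lists the pages u links to (F_u above).
links = {0: [3], 1: [3], 2: [3], 3: [0]}
n = 4

# A[u, v] = 1/N_u if u links to v, else 0.
A = np.zeros((n, n))
for u, outs in links.items():
    for v in outs:
        A[u, v] = 1.0 / len(outs)

d = 0.85                   # damping: weight of link-following vs. random jump
p = d * A + (1 - d) / n    # the random-surfer probability matrix

# Power iteration: since p^m converges, R @ p^m converges to pi.
R = np.full(n, 1.0 / n)
for _ in range(100):
    R = R @ p
print(R)  # page 3 (the heavily linked-to hub) ranks highest
```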

In the original paper, Google cofounder Larry Page suggests using this stationary distribution to rank search results, and PageRank was an influential factor in the ranking of Google search results and in Google’s early success.

### Text Authorship Classification

The next application involves the machine learning problem of assigning author attribution to pieces of text (Khmelev and Tweedie 2001). In particular, suppose there is a set of authors $${A}_0,\dots, {A}_k$$, each with a set of works $${N}_i=\left({w}_0^i,\dots,\ {w}_{\left|{N}_i\right|}^i\right)$$ for 0 ≤ i ≤ k. Given a new text $$\widehat{w}$$, how can it be determined which author produced the work?

One method of classification uses Markov chains. Suppose the state space is the set of letters of the alphabet together with the blank space. For two arbitrary letters, say x and y, there is a transition from state x to state y whenever an x is immediately followed by a y in the text.

For example, consider the string
$$\mathrm{asd}\ \mathrm{ddss}\ \mathrm{aas}$$
Notice that the letter a is followed by the letter a once, the letter s twice, the letter d 0 times, and a space 0 times. Doing this for each letter gives us a frequency matrix:
$$\begin{array}{l}\kern1em \begin{array}{cccc} a & s & d & \text{-}\end{array}\\ {}\left[\begin{array}{cccc} 1 & 2 & 0 & 0\\ {}0 & 1 & 1 & 1\\ {}0 & 1 & 1 & 1\\ {}1 & 0 & 1 & 0\end{array}\right]\begin{array}{c} a\\ {} s\\ {} d\\ {}\text{-}\end{array}\end{array}$$
Dividing each row by the total number of occurrences of that letter (ignoring the last letter in the string, which has no successor), we can turn this into a probability matrix:
$$\begin{array}{l}\kern1em \begin{array}{cccc} a & s & d & \text{-}\end{array}\\ {}\left[\begin{array}{cccc} \frac{1}{3} & \frac{2}{3} & 0 & 0\\ {}0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ {}0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ {}\frac{1}{2} & 0 & \frac{1}{2} & 0\end{array}\right]\begin{array}{c} a\\ {} s\\ {} d\\ {}\text{-}\end{array}\end{array}$$
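The frequency and probability matrices above can be reproduced with a short script (the symbol ordering a, s, d, space matches the displays above):

```python
import numpy as np

# Reconstruct the bigram matrices for the example string.
text = "asd ddss aas"
symbols = ['a', 's', 'd', ' ']          # space plays the role of '-' above
idx = {ch: i for i, ch in enumerate(symbols)}

# Count every adjacent pair of symbols in the text.
counts = np.zeros((4, 4))
for x, y in zip(text, text[1:]):
    counts[idx[x], idx[y]] += 1

# Normalize each row to obtain the probability matrix.
probs = counts / counts.sum(axis=1, keepdims=True)
print(probs)
```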

The classification scheme is constructed as follows: for each author $${A}_i$$, a 27 × 27 probability matrix $${p}_{A_i}$$ is constructed from all of the known works $${N}_i$$, as in the example above. Then a new work $$\widehat{w}$$ is considered, and it is compared against each $${p}_{A_i}$$ to determine how likely it is that the work was created by that author. A ranking is then produced in order of likelihood.

In this paper (Khmelev and Tweedie 2001), only bigrams are considered, that is, only sequences of two letters. These bigrams were enough to produce excellent accuracy on sample texts. While more detail can be used (considering the previous two or three or more letters instead), this increases the matrix size exponentially.

## Future Directions

Markov chains and probability matrices can be extended to hidden Markov models (HMMs). In this type of model, the states themselves cannot be observed; instead, at each time one observes a quantity whose distribution depends on the hidden state. These models have become increasingly prevalent in machine learning applications, especially in the fields of speech recognition (Gales and Young 2007) and gene prediction (Testa et al. 2015).

## References

1. Allen LJS (2003) An introduction to stochastic processes with applications to biology. Pearson/Prentice Hall, Upper Saddle River
2. Gales M, Young S (2007) The application of hidden Markov models in speech recognition. Found Trends Signal Process 1(3):195–304
3. Hayes B (2014) First links in the Markov chain. Am Sci 101(2):92
4. Khmelev DV, Tweedie FJ (2001) Using markov chains for identification of writers. Lit Linguist Comput 16(4):299–307
5. McAlpine K, Miranda E, Hoggar S (1999) Making music with algorithms: a case-study system. Comput Music J 23(2):19–30
6. Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. Technical Report, Stanford University
7. Serfozo R (2009) Basics of applied stochastic processes. Springer, Heidelberg
8. Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC Genomics 16:170