Probability Matrices
Synonyms
Glossary
 Aperiodic Markov Chain

A Markov chain in which the period of every state is one
 Discrete Time Markov Chain

A sequence of random variables \( X={\left\{{X}_n\right\}}_{n=1}^{\infty } \) taking values in some state space S such that the probability of \( X_n \) moving to any state depends only upon its current state and the time n
 Irreducible Markov Chain

A Markov chain in which there is a positive probability of moving from any state to any other state in a finite amount of time
 Positive Recurrent Markov Chain

A Markov chain is called positive recurrent if for every state \( i\in S \), \( {\sum}_{n=1}^{\infty } n{f}_{ii}^{(n)}<\infty \), where \( {f}_{ii}^{(n)}= P\left(\inf \left\{ m\ge 1:{X}_m= i\right\}= n\mid {X}_0= i\right) \)
 Probability Matrix

A matrix p describing the probability of traveling between states in a Discrete Time Markov Chain
Definition
The probability matrix p of a Discrete Time Markov Chain X is the matrix whose entry \( p\left(i,j\right)= P\left({X}_{n+1}= j\mid {X}_n= i\right) \) gives the probability of moving from state i to state j. In other words, the matrix p contains the probability of moving from state i to state j at any time n, regardless of the previous states of the Markov chain.
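Concretely, a probability matrix is a row-stochastic matrix: row i lists the probabilities p(i, j) and must sum to one. A minimal sketch, with assumed illustrative numbers:

```python
# Illustrative 3-state probability matrix; the specific numbers are assumptions.
p = [[0.5, 0.3, 0.2],  # row i holds the probabilities of moving from state i
     [0.1, 0.6, 0.3],  # to states 0, 1, 2 respectively
     [0.4, 0.4, 0.2]]

def is_stochastic(matrix, tol=1e-9):
    """Check that every row is a probability distribution:
    nonnegative entries summing to 1 (up to floating-point tolerance)."""
    return all(abs(sum(row) - 1.0) < tol and all(x >= 0 for x in row)
               for row in matrix)

print(is_stochastic(p))  # → True
```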
Introduction
A natural question arising at this point is “What is the behavior of \( p^m \) as m → ∞?” After defining some key properties of Markov chains, this question will be revisited.
Key Points
We will focus on basic properties of Markov chains and probability matrices. Using these definitions, we introduce key results related to Markov chains such as existence of a “stationary” or equilibrium distribution of states and convergence to this distribution. This notion of stationary distribution is important in the study of Markov chains and in applications such as webpage ranking systems (Page et al. 1998).
Historical Background
Markov chains get their name from the Russian mathematician Andrey Andreyevich Markov (Hayes 2014). Late in his life, Markov began developing notions regarding chains of probabilities. In 1913, Markov published a paper describing how a poem could be analyzed using a chain-like structure. He counted every pair of letters in a sample of 20,000 letters, recording whether each pair was vowel-vowel, vowel-consonant/consonant-vowel, or consonant-consonant. He then argued that vowel-vowel pairs occurred far less frequently than chance would predict, and so letter probabilities in language are not independent. Markov also proved that in a small, two-state Markov chain with all positive transition probabilities, the powers of the probability matrix converge to some limit. In the next section, we will see that this result holds in much more general settings.
In the time since Markov’s introduction of this probability object, Markov chains have been used for a wide variety of applications. These applications range from internet search applications (Page et al. 1998) to musical applications (McAlpine et al. 1999). Later on, an in-depth description will be given of Google’s PageRank algorithm as well as an application of Markov chains to machine learning.
Probability Matrices
Definition 1
A Markov chain is irreducible if for every pair of states \( i,j\in S \) there exists some \( n\ge 1 \) with \( {p}^n\left(i,j\right)>0 \). This means a Markov chain is irreducible if it is possible to travel from any state to any other state in a finite number of time steps. The notion of irreducibility will be important in a main result below.
Another important notion is the idea of periodicity.
Definition 2
For a state \( i\in S \), define \( k(i)=\gcd \left\{n\in {\mathbb{Z}}^{+}:{p}^n\left(i,i\right)>0\right\} \). If k(i) > 1, i is periodic with period k(i). Otherwise, if k(i) = 1, i is said to be aperiodic.
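A minimal sketch of computing the period, under the assumption that truncating the set \( \{n : p^n(i,i) > 0\} \) at a finite bound is adequate for small examples:

```python
from math import gcd

# k(i) is the gcd of { n : p^n(i,i) > 0 }; we truncate that set at max_n.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def period(p, i, max_n=50):
    k, power = 0, p
    for n in range(1, max_n + 1):
        if power[i][i] > 0:
            k = gcd(k, n)  # gcd(0, n) == n, so the first hit initializes k
        power = matmul(power, p)
    return k

# Deterministic 2-cycle: state 0 returns to itself only at even times, so k(0) = 2.
cycle = [[0.0, 1.0],
         [1.0, 0.0]]
print(period(cycle, 0))  # → 2

# Since p(0,0) > 0 here, Theorem 1 below gives k(0) = 1 (state 0 is aperiodic).
lazy = [[0.5, 0.5],
        [1.0, 0.0]]
print(period(lazy, 0))  # → 1
```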
The following theorem is immediate.
Theorem 1
If p(i, i) > 0, then k(i) = 1.
Proof If p(i, i) > 0, then \( 1\in \left\{n\in {\mathbb{Z}}^{+}:{p}^n\left(i,i\right)>0\right\} \), and so the GCD of this set is 1.
We now consider when a state will first be revisited, and the probability that it will ever be revisited.
Definition 3
A state i ∈ S is transient if \( {\sum}_{m=1}^{\infty }{f}_{ii}^{(m)}<1 \) and is recurrent if \( {\sum}_{m=1}^{\infty }{f}_{ii}^{(m)}=1 \).
Given this definition, it makes sense to consider \( E\left[{T}_{ii}\right] \), the expected return time to state i.
Definition 4
A state is called positive recurrent if this expectation is finite and null recurrent otherwise.
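As a sketch, \( E[T_{ii}] \) can be estimated by simulating the chain and averaging observed return times; the two-state matrix below is an assumption chosen for illustration:

```python
import random

# Two-state chain with assumed probabilities, used only for illustration.
p = [[0.9, 0.1],
     [0.5, 0.5]]

def sample_return_time(p, i, rng):
    """Run the chain from state i until it returns to i; return the step count."""
    state, steps = i, 0
    while True:
        state = rng.choices(range(len(p)), weights=p[state])[0]
        steps += 1
        if state == i:
            return steps

rng = random.Random(0)  # fixed seed for reproducibility
trials = 20000
estimate = sum(sample_return_time(p, 0, rng) for _ in range(trials)) / trials

# This chain's stationary distribution is (5/6, 1/6), and for a positive
# recurrent state the mean return time equals 1/pi(i), so E[T_00] = 1.2.
print(round(estimate, 2))
```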
In the case of a finite state space, positive recurrence can be shown by checking for irreducibility.
Theorem 2
A finite, irreducible Markov chain is positive recurrent.
We now consider the notion of stationary distributions, which relate to the limiting behavior of Markov chains under certain conditions.
Definition 5
A probability distribution π on S satisfying πp = π is called a stationary distribution of the Markov chain. The stationary distribution can be thought of as an equilibrium of a Markov chain, as demonstrated by the following Theorem (Serfozo 2009).
Theorem 3
If a Markov chain with probability matrix p has stationary distribution π and \( X_1 \) is distributed according to π, then \( X_n \) is distributed according to π for every \( n\ge 1 \).
Under certain conditions, we can now characterize the limiting behavior of the probability matrix p ^{ m } as m approaches infinity (Allen 2003).
Theorem 4
If a Markov chain is irreducible, positive recurrent, and aperiodic with stationary distribution π, then \( \lim_{m\to \infty }{p}^m\left(i,j\right)=\pi (j) \) for all states \( i,j\in S \).
The result above shows that the probability matrix will converge to the stationary distribution in the limit, explaining the limiting behavior of the chain.
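This convergence can be seen numerically; the two-state matrix below is an assumed example, and the stationary distribution quoted in the comments was computed by hand from πp = π:

```python
# Powers of an assumed irreducible, aperiodic two-state probability matrix;
# each row of p^m approaches the stationary distribution as m grows.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

p = [[0.9, 0.1],
     [0.5, 0.5]]

power = p
for _ in range(49):  # compute p^50
    power = matmul(power, p)

# Solving pi * p = pi by hand gives pi = (5/6, 1/6) for this matrix,
# and both rows of p^50 are already indistinguishable from it.
for row in power:
    print([round(x, 4) for x in row])  # → [0.8333, 0.1667]
```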
Examples
Example 1
Suppose the weather on each day is represented as a Markov chain \( X={\left\{{X}_n\right\}}_{n=1}^{\infty }, \) with state 1 indicating a clear day and state 2 indicating a rainy day. Assume the following probabilities:
Example 2
Given the Markov chain presented in Example 1 above, what is the probability it will be clear in 3 days if it is clear today? To answer this, compute \( P^3 \):
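Since the transition probabilities of Example 1 are not reproduced here, the sketch below uses an assumed two-state matrix (rows and columns ordered clear, rainy) purely to illustrate the computation:

```python
# Assumed stand-in for the Example 1 matrix; rows/columns are (clear, rainy).
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

P = [[0.8, 0.2],
     [0.4, 0.6]]

P3 = matmul(matmul(P, P), P)  # P^3

# P3[0][0] is the probability of a clear day 3 days from now, given clear today.
print(round(P3[0][0], 3))  # → 0.688
```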
Example 3
Consider the two Markov chains \( X=\left\{{X}_n\right\} \) and \( Y=\left\{{Y}_n\right\} \) given by probability matrices \( {P}_1 \) and \( {P}_2 \), respectively.
For the Markov chain X, note that states 1 and 2 can only ever get to each other, and so it is impossible to reach state 3 from states 1 and 2. Thus, the Markov chain X is not irreducible.
On the other hand, the Markov chain Y is irreducible. Although state 1 cannot get to state 3 directly, it can get there in finitely many (2) steps, and similarly for every other pair of states.
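Irreducibility can be checked mechanically by asking whether every state reaches every other state through positive-probability transitions (a strong-connectivity check). The matrices below are assumed stand-ins matching the structure described above, not the actual entries of P1 and P2, and states are 0-indexed in code (the text's states 1, 2, 3 are indices 0, 1, 2):

```python
def reachable(p, start):
    """Set of states reachable from `start` via positive-probability steps."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        for j, prob in enumerate(p[i]):
            if prob > 0 and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

def is_irreducible(p):
    return all(len(reachable(p, i)) == len(p) for i in range(len(p)))

# The first two states only lead to each other, so the third is unreachable.
P1 = [[0.5, 0.5, 0.0],
      [0.5, 0.5, 0.0],
      [0.2, 0.3, 0.5]]

# The first state cannot reach the third directly, but can in two steps.
P2 = [[0.5, 0.5, 0.0],
      [0.0, 0.5, 0.5],
      [0.5, 0.0, 0.5]]

print(is_irreducible(P1), is_irreducible(P2))  # → False True
```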
Example 4
Example 5
Key Applications
Google PageRank Algorithm
In this section the PageRank algorithm (Page et al. 1998) will be described in terms of a Markov chain. The PageRank algorithm was developed as a method for ranking the importance of webpages in a search result. First we explain the model of the web used below.
Another way to view this is in matrix form. Define a matrix A such that \( {A}_{u,v}=\frac{1}{N_u} \) if there is a link from u to v and 0 otherwise. In other words, a web user on page u will click uniformly at random on a link on that page. With this definition, R can be equivalently defined as \( R= c\left( AR+ E\right) \), or, since \( {\left\Vert R\right\Vert}_1=1 \), it can be written as \( R= c\left(A+ E\times \mathbf{1}\right)R \), where \( \mathbf{1} \) is a row vector of all 1’s. In this notation, it is clear that R is an eigenvector of \( c\left(A+ E\times \mathbf{1}\right) \) with eigenvalue 1.
Now we relate this to Markov chains and probability matrices. Let the states of the Markov chain be the webpages, with transitions between the states driven by links and the random surfer model as above. Does this Markov chain have a stationary distribution? If so, how can it be computed?
Referring back to Theorem 4, the powers of the probability matrix converge to a stationary distribution in the limit if the Markov chain is irreducible, positive recurrent, and aperiodic. First we check irreducibility. Since the random surfer model adds in a chance to go to any website from any other website, the chain is irreducible. The random surfer model also adds in a chance that from any given website u, you will randomly go back to u, and so the chain is aperiodic. Since this state space is finite, this Markov chain is also positive recurrent (because it is irreducible, by Theorem 2).
Since all of these properties are fulfilled, the PageRank Markov chain has a stationary distribution, and the powers \( p^n \) of its probability matrix converge to this stationary distribution as n → ∞. This means the distribution can be estimated in an iterative fashion (Page et al. 1998).
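The iterative estimation can be sketched as power iteration on a tiny, invented four-page web. The damping constant c = 0.85 and the uniform teleport term standing in for the E vector are assumptions for illustration, not values taken from the text:

```python
# A tiny 4-page web; which pages link to which is invented for illustration.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, c = 4, 0.85  # damping constant c is an assumed value

# A[u][v] = 1/N_u if there is a link from u to v, as in the text.
A = [[0.0] * n for _ in range(n)]
for u, outs in links.items():
    for v in outs:
        A[u][v] = 1 / len(outs)

# Power iteration: repeatedly redistribute rank along links, mixing in a
# uniform teleport term (one common way of realizing the E vector).
R = [1 / n] * n
for _ in range(100):
    R = [c * sum(A[u][v] * R[u] for u in range(n)) + (1 - c) / n
         for v in range(n)]

print([round(x, 3) for x in R])  # page 2, with the most inbound rank, scores highest
```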
In his original paper, Google founder Larry Page suggests using this stationary distribution to rank search results, and PageRank was an influential factor in the ranking of Google search results and in Google’s early success.
Text Authorship Classification
The next application involves the machine learning problem of assigning author attribution to pieces of text (Khmelev and Tweedie 2001). In particular, suppose there is a set of authors \( {A}_0,\dots, {A}_k \), each with a set of works \( {N}_i=\left({w}_0^i,\dots, {w}_{\left|{N}_i\right|}^i\right) \) for 0 ≤ i ≤ k. Given a new text \( \widehat{w} \), how can it be determined which author produced the work?
One method of classification uses Markov chains. Suppose the state space is the set of letters of the alphabet together with the blank space, giving 27 states. For two arbitrary letters, say x and y, there is a transition from state x to state y whenever an x is followed by a y in the text.
The classification scheme is constructed as follows: for each author \( {A}_i \), a 27 × 27 probability matrix \( {p}_{A_i} \) is estimated as above from all of the known works \( {N}_i \). Then a new work \( \widehat{w} \) is compared against each \( {p}_{A_i} \) to determine how likely it is that the work was created by that author, and a ranking is produced in order of likelihood.
Khmelev and Tweedie (2001) consider only bigrams, that is, sequences of two letters. These bigrams were enough to produce excellent accuracy on sample texts. Longer contexts can be used (conditioning on the previous two, three, or more letters instead), but this increases the matrix size exponentially.
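A toy sketch of the scheme, with invented miniature "works" standing in for each author's corpus; the add-one smoothing used to keep all transition probabilities positive is an added assumption, not a detail from the cited paper:

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 27 states: 26 letters plus space
IDX = {ch: i for i, ch in enumerate(ALPHABET)}

def transition_matrix(text):
    """Estimate a 27x27 probability matrix from bigram counts.
    Add-one smoothing (an added assumption) keeps all probabilities positive."""
    counts = [[1] * 27 for _ in range(27)]
    clean = [ch for ch in text.lower() if ch in IDX]
    for x, y in zip(clean, clean[1:]):
        counts[IDX[x]][IDX[y]] += 1
    return [[c / sum(row) for c in row] for row in counts]

def log_likelihood(text, p):
    """Log-probability of the text's bigram sequence under matrix p."""
    clean = [ch for ch in text.lower() if ch in IDX]
    return sum(math.log(p[IDX[x]][IDX[y]]) for x, y in zip(clean, clean[1:]))

# Toy corpora standing in for each author's known works N_i.
works = {"A0": "the quick brown fox jumps over the lazy dog " * 20,
         "A1": "zzz buzz fuzz jazz pizzazz " * 20}
models = {a: transition_matrix(t) for a, t in works.items()}

# Rank the candidate authors of a new text by likelihood, most likely first.
new_text = "a lazy dog jumps over a fox"
ranking = sorted(models, key=lambda a: log_likelihood(new_text, models[a]),
                 reverse=True)
print(ranking)  # → ['A0', 'A1']
```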
Future Directions
Markov chains and probability matrices can be incorporated into a Hidden Markov Model (HMM). In this type of model, the states themselves cannot be observed; instead, an observation related to the underlying Markov chain is seen at each time step. These models have become increasingly prevalent in machine learning applications, especially in the fields of speech recognition (Gales and Young 2007) and gene prediction (Testa et al. 2015).
Cross-References
References
 Allen LJS (2003) An introduction to stochastic processes with applications to biology. Pearson/Prentice Hall, Upper Saddle River
 Gales M, Young S (2007) The application of hidden Markov models in speech recognition. Found Trends Signal Process 1(3):195–304
 Hayes B (2014) First links in the Markov chain. Sci Am 101(2):92
 Khmelev DV, Tweedie FJ (2001) Using Markov chains for identification of writers. Lit Linguist Comput 16(4):299–307
 McAlpine K, Miranda E, Hoggar S (1999) Making music with algorithms: a case-study system. Comput Music J 23(2):19–30
 Page L, Brin S, Motwani R, Winograd T (1998) The PageRank citation ranking: bringing order to the web. Technical Report, Stanford University
 Serfozo R (2009) Basics of applied stochastic processes. Springer, Heidelberg
 Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC Genomics 16(170)