Encyclopedia of Systems and Control

Living Edition
| Editors: John Baillieul, Tariq Samad

Markov Chains and Ranking Problems in Web Search

  • Hideaki Ishii
  • Roberto Tempo
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4471-5102-9_135-1

Abstract

Markov chains are stochastic processes whose states change according to transition probabilities determined only by the state at the previous time step. They have been crucial for modeling large-scale systems with random behavior in various fields such as control, communications, biology, optimization, and economics. In this entry, we focus on their recent application to the area of search engines, namely, the PageRank algorithm employed at Google, which provides a measure of importance for each page in the web. We present several lines of research carried out with control theoretic tools such as aggregation, distributed randomized algorithms, and PageRank optimization. Due to the large size of the web, computational issues are the underlying motivation of these studies.

Keywords

Markov chains; World wide web; Search engines; PageRank; Aggregation; Distributed randomized algorithms; Optimization

Introduction

For various real-world large-scale dynamical systems, reasonable models describing highly complex behaviors can be expressed as stochastic systems, and one of the most well-studied classes of such systems is that of Markov chains. A characteristic feature of Markov chains is that their behavior does not carry any memory. That is, the probability distribution of the current state of a chain is completely determined by the state at the previous time step and does not depend on the states prior to that step (Norris 1997; Kumar and Varaiya 1986).

Recently, Markov chains have gained renewed interest due to the extremely successful applications in the area of web search. The search engine of Google has been employing an algorithm known as PageRank to assist the ranking of search results. This algorithm models the network of web pages as a Markov chain whose states represent the pages that web surfers with various interests visit in a random fashion. The objective is to find an order among the pages according to their popularity and importance, and this is done by focusing on the structure of hyperlinks among pages.

In this entry, we first provide a brief overview on the basics of Markov chains and then introduce the problem of PageRank computation. We proceed to provide further discussions on control theoretic approaches dealing with PageRank problems. The topics covered include aggregated Markov chains, distributed randomized algorithms, and Markov decision problems for link optimization.

Markov Chains

In the simplest form, a Markov chain takes its states in a finite state space with transitions in the discrete-time domain. The transition from one state to another is characterized completely by the underlying probability distribution.

Let \(\mathcal{X}\) be a finite set given by \(\mathcal{X} :=\{ 1,2,\ldots,n\}\), which is called the state space. Consider a stochastic process \(\{X_{k}\}_{k=0}^{\infty }\) taking values on this set \(\mathcal{X}\). Such a process is called a Markov chain if it exhibits the following Markov property:
$$\begin{array}{rlrlrl} &\mathrm{Prob}{\left \{X_{k+1} = j\vert X_{k} = i_{k},\ X_{k-1} = i_{k-1},\ldots,X_{0} = i_{0}\right \}} =\mathrm{ Prob}{\left \{X_{k+1} = j\vert X_{k} = i_{k}\right \}},&& \end{array}$$
where \(\mathrm{Prob}\{ \cdot \vert \cdot \}\) denotes the conditional probability and \(k \in \mathbb{Z}_{+}\). That is, the state at the next time step depends only on the current state and not those of previous times.
Here, we consider the homogeneous case where the transition probability is constant over time. Thus, we have for each pair \(i,j \in \mathcal{X}\), the probability that the chain goes from state j to state i at time k expressed as
$$p_{ij} :=\mathrm{ Prob}{\left \{X_{k} = i\vert X_{k-1} = j\right \}},\ \ k \in \mathbb{Z}_{+}.$$
In matrix form, \(P := (p_{ij})\) is called the transition probability matrix of the chain. All entries of P are nonnegative, and for each j, the entries of the jth column of P sum to 1, i.e., \(\sum _{i=1}^{n}p_{ij} = 1\) for \(j \in \mathcal{X}\). In this respect, the matrix P is (column) stochastic (Horn and Johnson 1985).
In this entry, we assume that the Markov chain is ergodic, meaning that for any pair of states, the chain can make a transition from one to the other over time. In this case, the chain and the matrix P are called irreducible. This property is known to imply that P has a simple eigenvalue of 1. Thus, there exists a unique steady state probability distribution \(\pi \in {\mathbb{R}}^{n}\) given by
$$\pi = P\pi,\ \ {\mathbf{1}}^{T}\pi = 1,\ \ \pi _{ i} > 0,\ \forall i \in \mathcal{X},$$
where \(\mathbf{1} \in {\mathbb{R}}^{n}\) denotes the vector whose entries are all equal to one. Note that in this distribution π, all entries are positive.
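As a small illustration of the steady state equation above, the following sketch approximates π for a hypothetical three-state chain by repeatedly multiplying a probability vector by P. The matrix values, the function name, and the iteration count are assumptions chosen for illustration; the chain used here is irreducible and also aperiodic, which is what makes this simple iteration converge.

```python
def stationary_distribution(P, iterations=2000):
    """Approximate pi = P pi for a column-stochastic matrix P by repeated multiplication."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform probability vector
    for _ in range(iterations):
        pi = [sum(P[i][j] * pi[j] for j in range(n)) for i in range(n)]
    return pi


# Hypothetical chain: P[i][j] is the probability of moving from state j to state i.
P = [
    [0.5, 0.25, 0.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.25, 0.5],
]

# Each column sums to one, so P is column stochastic.
column_sums = [sum(P[i][j] for i in range(3)) for j in range(3)]

pi = stationary_distribution(P)  # approximately [0.25, 0.5, 0.25]
```

Solving π = Pπ by hand for this matrix gives π = (1/4, 1/2, 1/4), which has all entries positive, as stated above.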

Ranking in Search Engines: PageRank Algorithm

At Google, PageRank is used to quantify the importance of each web page based on the hyperlink structure of the web (Brin and Page 1998; Langville and Meyer 2006). A page is considered important if (i) many pages have links pointing to it, (ii) the pages containing those links are themselves important, and (iii) those pages have only a limited number of outgoing links. Intuitively, these requirements are reasonable. For a web page, its incoming links can be viewed as votes supporting the page, and the weight of each vote depends on the importance of the page casting it as well as on the number of votes that page casts. Even if a minor page (with low PageRank) has many outgoing links, its contribution to the linked pages will not be substantial.

An interesting way to explain PageRank is through the random surfer model: The random surfer starts from a randomly chosen page. Each time the surfer visits a page, he/she follows a hyperlink in that page chosen at random with uniform probability. Hence, if the current page i has \(n_{i}\) outgoing links, then one of them is picked with probability \(1/n_{i}\). If the current page has no outgoing link (e.g., a PDF document), the surfer will use the back button. This process is repeated. The PageRank value of a page represents the probability of the surfer visiting the page. It is thus higher for pages visited more often by the surfer.

It is now clear that PageRank is obtained by describing the random surfer model as a Markov chain and then finding its stationary distribution. First, we express the network of web pages as the directed graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\), where \(\mathcal{V} =\{ 1,2,\ldots,n\}\) is the set of nodes corresponding to web page indices while \(\mathcal{E}\subset \mathcal{V}\times \mathcal{V}\) is the set of edges for links among pages. Node i is connected to node j by an edge, i.e., \((i,j) \in \mathcal{E}\), if page i has an outgoing link to page j.

Let \(x_{i}(k)\) be the probability that the random surfer is visiting page i at time k, and let x(k) be the vector containing all \(x_{i}(k)\). Given the initial distribution x(0), which is a probability vector, i.e., \(\sum _{i=1}^{n}x_{i}(0) = 1\), the evolution of x(k) can be expressed as
$$x(k + 1) = Ax(k). \tag{1}$$
The link matrix \(A = (a_{ij}) \in {\mathbb{R}}^{n\times n}\) is given by \(a_{ij} = 1/n_{j}\) if \((j,i) \in \mathcal{E}\) and 0 otherwise, where \(n_{j}\) is the number of outgoing links of page j. Note that this matrix A is the transition probability matrix of the random surfer. Clearly, it is stochastic, and thus x(k) remains a probability vector so that \(\sum _{i=1}^{n}x_{i}(k) = 1\) for all k.
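The construction of the link matrix and one step of the recursion (1) can be sketched as follows. The four-page edge list, the 0-based page indices, and the function names are assumptions made for illustration; the sketch also assumes that every page has at least one outgoing link and that no link is listed twice.

```python
def link_matrix(n, edges):
    """Return A with A[i][j] = 1/n_j if page j links to page i, and 0 otherwise."""
    out_degree = [0] * n
    for source, _target in edges:
        out_degree[source] += 1
    A = [[0.0] * n for _ in range(n)]
    for source, target in edges:
        A[target][source] = 1.0 / out_degree[source]
    return A


def step(A, x):
    """One step x(k+1) = A x(k) of the random surfer recursion."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]


# Hypothetical web: each pair (i, j) means that page i has an outgoing link to page j.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0), (3, 2)]
A = link_matrix(4, edges)

x0 = [0.25, 0.25, 0.25, 0.25]  # uniform initial distribution
x1 = step(A, x0)               # [0.375, 0.125, 0.5, 0.0], still summing to one
```

In this example no page links to page 3, so its probability drops to zero after one step; this already hints at the reducibility of link matrices built from real web data.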
As mentioned above, PageRank is the stationary distribution of the process (1) under the assumption that the limit exists. Hence, the PageRank vector is given by \({x}^{{\ast}} :=\lim _{k\rightarrow \infty }x(k)\). In other words, it is the solution of the linear equation
$${x}^{{\ast}} = A{x}^{{\ast}},\ \ {x}^{{\ast}}\in {[0,1]}^{n},\ \ {\mathbf{1}}^{T}{x}^{{\ast}} = 1. \tag{2}$$
Notice that the PageRank vector \({x}^{{\ast}}\) is a nonnegative unit eigenvector for the eigenvalue 1 of A. Such a vector exists since the matrix A is stochastic, but may not be unique; the reason is that A is a reducible matrix since in the web, not every pair of pages can be connected by simply following links. To resolve this issue, a slight modification is necessary in the random surfer model.
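The non-uniqueness can be seen in a small worked example. Consider a hypothetical web of four pages in which pages 1 and 2 link only to each other and pages 3 and 4 link only to each other. The link matrix and a family of solutions of (2) are then
$$A = \begin{bmatrix} 0 & 1 & 0 & 0\\ 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 \end{bmatrix},\qquad {x}^{{\ast}} = {\left [\frac{c}{2},\ \frac{c}{2},\ \frac{1 - c}{2},\ \frac{1 - c}{2}\right ]}^{T},\ \ c \in [0,1].$$
Every value of c in [0, 1] satisfies all three conditions in (2), so the link structure alone does not single out one ranking in this example.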

The idea of the teleportation model is that the random surfer, after a while, becomes bored and stops following the hyperlinks. At such an instant, the surfer “jumps” to another page that need not be directly connected to the one currently being visited. This page can in fact be completely unrelated in its domain and/or contents. All n pages in the web have the same probability \(1/n\) of being reached by a jump.

The probability of making such a jump is denoted by \(m \in (0,1)\). The original transition probability matrix A is now replaced with the modified one \(M \in {\mathbb{R}}^{n\times n}\) defined by
$$M := (1 - m)A + \frac{m} {n}\mathbf{1}{\mathbf{1}}^{T}. \tag{3}$$
For the value of m, we take m = 0.15 as reported in the original algorithm in Brin and Page (1998). Notice that M is a positive stochastic matrix. By Perron’s theorem (Horn and Johnson 1985), the eigenvalue 1 is of multiplicity 1 and is the unique eigenvalue with the maximum modulus. Further, the corresponding eigenvector is positive. Hence, we redefine the vector \({x}^{{\ast}}\) in (2) by using M instead of A as follows:
$${x}^{{\ast}} = M{x}^{{\ast}},\ \ {x}^{{\ast}}\in {[0,1]}^{n},\ \ {\mathbf{1}}^{T}{x}^{{\ast}} = 1.$$
Due to the large dimension of the link matrix M, the computation of \({x}^{{\ast}}\) is difficult. The solution employed in practice is based on the power method given by
$$x(k + 1) = Mx(k) = (1 - m)Ax(k) + \frac{m} {n} \mathbf{1}, \tag{4}$$
where the initial vector \(x(0) \in {\mathbb{R}}^{n}\) is a probability vector. The second equality above follows from the fact \({\mathbf{1}}^{T}x(k) = 1\) for \(k \in \mathbb{Z}_{+}\). For implementation, the form on the far right-hand side is important, using only the sparse matrix A and not the dense matrix M. This method asymptotically finds the PageRank vector, since \(x(k) \rightarrow {x}^{{\ast}}\) as \(k \rightarrow \infty\).
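A minimal sketch of the power method (4) is given below. It works directly on the list of links, so that only the nonzero entries of A are touched and the dense matrix M is never formed. The edge list, the stopping tolerance, the iteration cap, and the function name are assumptions for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(n, edges, m=0.15, tol=1e-12, max_iter=1000):
    """Iterate x(k+1) = (1 - m) A x(k) + (m / n) 1 using only the links in `edges`."""
    out_degree = [0] * n
    for source, _target in edges:
        out_degree[source] += 1
    x = [1.0 / n] * n  # x(0) is the uniform probability vector
    for _ in range(max_iter):
        new_x = [m / n] * n  # teleportation term (m / n) 1
        for source, target in edges:
            # page `source` passes a share 1/n_source of its value along each link
            new_x[target] += (1.0 - m) * x[source] / out_degree[source]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return x


# Hypothetical web: each pair (i, j) means that page i has an outgoing link to page j.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0), (3, 2)]
x_star = pagerank(4, edges)
```

Page 3 has no incoming links in this example, so its value settles at exactly m/n = 0.0375, while all other entries are larger; all entries are positive, in line with Perron's theorem.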

Aggregation Methods for Large-Scale Markov Chains

In dealing with large-scale Markov chains, it is often desirable to predict their dynamic behaviors from reduced-order models that are more computationally tractable. This enables us, for example, to analyze the system performance at a macroscale with some approximation under different operating conditions. Aggregation refers to partitioning or grouping the states so that the states in each group can be treated as a whole. The technique of aggregation is especially effective for Markov chains possessing sparse structures with strong interactions among states in the same group and weak interactions among states in different groups. Such methods have been extensively studied, motivated by applications in queueing networks, power systems, etc. (Meyer 1989).

In the context of the PageRank problem, such sparse interconnection can be expressed in the link matrix A with a block-diagonal structure (after some coordinate change, if necessary). The entries of the matrix A are dense along its diagonal in blocks, and those outside the blocks take small values. More concretely, we write
$$A = I + B +\epsilon C, \tag{5}$$
where B is a block-diagonal matrix given as \(B =\mathrm{ diag}(B_{11},B_{22},\ldots,B_{NN})\); \(B_{ii}\) is the \(\tilde{n}_{i} \times \tilde{ n}_{i}\) matrix corresponding to the ith group with \(\tilde{n}_{i}\) member pages for \(i = 1,2,\ldots,N\); and ε is a small positive parameter. Here, the non-diagonal entries of \(B_{ii}\) are the same as those in the same diagonal block of A, but the diagonal entries are chosen such that \(I + B_{ii}\) becomes stochastic and thus take nonpositive values. Thus, both B and C have column sums equal to zero. A small ε indicates that the states can be aggregated into N groups with strong interactions within the groups and weak connections among different groups. This class of Markov chains is known as nearly completely decomposable. In general, however, it is difficult to uniquely determine the form (5) for a given chain.
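For a given grouping of the states, the splitting in (5) can be sketched as follows. The four-state matrix, the grouping, and the function name are assumptions for illustration; the sketch returns the product εC as a single matrix E rather than separating ε from C, since that separation is not unique.

```python
def decompose(A, groups):
    """Split a column-stochastic A as A = I + B + E, with B block diagonal and E = eps * C."""
    n = len(A)
    B = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j and groups[i] == groups[j]:
                B[i][j] = A[i][j]  # off-diagonal entries inside a block are copied from A
        # the diagonal entry makes column j of I + B sum to one, so it is nonpositive
        B[j][j] = -sum(B[i][j] for i in range(n) if i != j)
    E = [
        [A[i][j] - (1.0 if i == j else 0.0) - B[i][j] for j in range(n)]
        for i in range(n)
    ]
    return B, E


# Hypothetical chain with two groups {0, 1} and {2, 3} that interact only weakly.
A = [
    [0.0, 1.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.1],
    [0.1, 0.0, 0.0, 0.9],
    [0.0, 0.0, 1.0, 0.0],
]
groups = [0, 0, 1, 1]
B, E = decompose(A, groups)
```

In this example the largest entry of E in absolute value is 0.1, which plays the role of ε, and the columns of both B and E sum to zero.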

To exploit the sparse structure in the computation of stationary probability distributions, one approach is to carry out decomposition or aggregation of the chains. The basic approach here is (i) to compute the local stationary distributions for \(I + B_{ii}\), (ii) to find the global stationary distribution for a chain representing the group interactions, and (iii) to finally use the obtained vectors to compute the exact or approximate distribution for the entire chain; for details, see Meyer (1989). Interpreting such methods from a control theoretic viewpoint, Phillips and Kokotovic (1981) and Aldhaheri and Khalil (1991) developed singular perturbation approaches. These methods lead to the two-time scale decomposition of (controlled) Markov chain recursions.

In the case of PageRank computation, sparsity is a relevant property since it is well known that many links in the web are intra-host ones, connecting pages within the same domains or directories. However, in the real web, it is easy to find pages that have only a few outlinks, but some of them are external ones. Such pages will prevent the link matrix from having small ε when decomposed in the form (5). Hence, the general aggregation methods outlined above are not directly applicable.

An aggregation-based method suitable for PageRank computation is proposed in Ishii et al. (2012). There, the sparsity in the web is expressed by the limited number of external links pointing towards pages in other groups. For each page i, the node parameter \(\delta _{i} \in [0,1]\) is given by
$$\delta _{i} := \frac{\# \text {external outgoing links}} {\# \text{total outgoing links}}.$$
Note that smaller \(\delta _{i}\) implies sparser networks. In this approach, for a given bound δ, the condition \(\delta _{i} \leq \delta\) is imposed only in the case page i belongs to a group consisting of multiple members. Thus, a page forming a group by itself is not required to satisfy the condition. This means that we can regroup the pages by first identifying pages that violate this condition in the initial groups and then separating each of them into a group of its own. By repeating these steps, it is always possible to obtain groups for a given web. Once the grouping is settled, an aggregation-based algorithm can be applied, which computes an approximated PageRank vector. A characteristic feature is the tradeoff between the accuracy in PageRank computation and the node parameter δ. More accurate computation requires a larger number of groups and thus a smaller δ.
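The regrouping step described above can be sketched as follows. This is only an illustration of the grouping procedure, not of the aggregation-based PageRank algorithm itself; the six-page web, the initial groups, the bound 0.4, and the function names are assumptions.

```python
def node_parameters(out_links, group_of):
    """delta_i = (# external outgoing links) / (# total outgoing links) for each page i."""
    delta = {}
    for page, targets in out_links.items():
        external = sum(1 for t in targets if group_of[t] != group_of[page])
        delta[page] = external / len(targets)
    return delta


def regroup(out_links, group_of, bound):
    """Move pages violating delta_i <= bound into single-page groups until none remain."""
    group_of = dict(group_of)
    next_id = max(group_of.values()) + 1
    while True:
        delta = node_parameters(out_links, group_of)
        sizes = {}
        for g in group_of.values():
            sizes[g] = sizes.get(g, 0) + 1
        # the condition is imposed only on pages in groups with multiple members
        violators = [p for p in group_of
                     if sizes[group_of[p]] > 1 and delta[p] > bound]
        if not violators:
            return group_of
        for p in violators:
            group_of[p] = next_id
            next_id += 1


# Hypothetical web: out_links[i] lists the pages that page i links to.
out_links = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [4], 4: [3], 5: [0, 1, 3]}
initial_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

delta = node_parameters(out_links, initial_groups)  # delta[5] = 2/3, delta[2] = 1/3
final_groups = regroup(out_links, initial_groups, bound=0.4)
```

Here only page 5 violates the bound, so it becomes a group by itself while the remaining groups are kept. The loop is repeated because separating a page turns the links pointing to it into external ones, which can raise the node parameters of other pages; it terminates because each pass increases the number of groups.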

Distributed Randomized Computation

For large-scale computation, distributed algorithms can be effective by employing multiple processors to compute in parallel. There are several methods of constructing algorithms to find stationary distributions of large Markov chains. In this section, motivated by the current literature on multi-agent systems, sequential distributed randomized approaches of gossip type are described for the PageRank problem.

In gossip-type distributed algorithms, nodes make decisions and transmit information to their neighbors in a random fashion. That is, at any time instant, each node decides whether to communicate or not depending on a random variable. The random property is important to make the communication asynchronous so that simultaneous transmissions resulting in collisions can be avoided. Moreover, there is no need for any centralized decision maker or fixed order among pages.

More precisely, each page \(i \in \mathcal{V}\) is equipped with a random process \(\eta _{i}(k) \in \{ 0,1\}\) for \(k \in \mathbb{Z}_{+}\). If at time k, \(\eta _{i}(k)\) is equal to 1, then page i broadcasts its information to its neighboring pages connected by outgoing links. All pages involved at this time renew their values based on the latest available data. Here, \(\eta _{i}(k)\) is assumed to be an independent and identically distributed (i.i.d.) random process, and its probability distribution is given by \(\mathrm{Prob}\{\eta _{i}(k) = 1\} =\alpha\), \(k \in \mathbb{Z}_{+}\). Hence, all pages are given the same probability α to initiate an update.

One of the proposed randomized approaches is based on the so-called asynchronous iteration algorithms for distributed computation of fixed points in the field of numerical analysis (Bertsekas and Tsitsiklis 1989). The distributed update recursion is given as
$$\check{x}(k + 1) =\check{ M}_{\eta _{1}(k),\ldots,\eta _{n}(k)}\check{x}(k), \tag{6}$$
where the initial state \(\check{x}(0)\) is a probability vector and the distributed link matrices \(\check{M}_{p_{1},\ldots,p_{n}}\) are given as follows: Its (i, j)th entry is equal to \((1 - m)a_{ij} + m/n\) if \(p_{i} = 1\); 1 if \(p_{i} = 0\) and \(i = j\); and 0 otherwise. Clearly, these matrices keep the rows of the original link matrix M in (3) for pages initiating updates. Other pages just keep their previous values. Thus, these matrices are not stochastic. From this update recursion, the PageRank \({x}^{{\ast}}\) is probabilistically obtained (in the mean square sense and with probability one), where the convergence speed is exponential in time k. Note that in this scheme (6), due to the way the distributed link matrices are constructed, each page needs to know which pages have links pointing towards it. This implies that popular pages linked by a number of pages must have extra memory to keep the data of such links.
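The update recursion (6) can be simulated on a small example as sketched below. The four-page link matrix, the update probability 0.5, the number of steps, the random seed, and the function names are assumptions for illustration. Since the distributed link matrices are not stochastic, the sum of the entries is not preserved along the iteration; the sketch therefore rescales the final iterate to sum to one before comparing it with the centralized power method, and this rescaling is a choice of the sketch rather than part of (6).

```python
import random


def distributed_update(M, x, eta):
    """Apply the matrix in (6): page i uses row i of M if eta[i] == 1, else keeps x[i]."""
    n = len(x)
    return [
        sum(M[i][j] * x[j] for j in range(n)) if eta[i] == 1 else x[i]
        for i in range(n)
    ]


def run_distributed(M, alpha, steps, seed=0):
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    n = len(M)
    x = [1.0 / n] * n
    for _ in range(steps):
        eta = [1 if rng.random() < alpha else 0 for _ in range(n)]  # i.i.d. decisions
        x = distributed_update(M, x, eta)
    return x


# Hypothetical four-page web: column-stochastic link matrix A and M from (3).
m, n = 0.15, 4
A = [
    [0.0, 0.0, 1.0, 0.5],
    [0.5, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
M = [[(1.0 - m) * A[i][j] + m / n for j in range(n)] for i in range(n)]

# Reference value from the centralized power method x(k+1) = M x(k).
x_star = [1.0 / n] * n
for _ in range(500):
    x_star = [sum(M[i][j] * x_star[j] for j in range(n)) for i in range(n)]

x_check = run_distributed(M, alpha=0.5, steps=3000)
total = sum(x_check)
x_dist = [v / total for v in x_check]  # rescaling added in this sketch
```

Each row used in an update involves the values of all pages with links pointing to page i, which reflects the remark above that pages need to know their incoming links in this scheme.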

Another recently developed approach (Ishii and Tempo 2010; Zhao et al. 2013) has several notable differences from the asynchronous iteration approach above. First, the pages need to transmit their states only over their outgoing links; the information on such links is by default available locally, and thus, pages are not required to have the extra memory regarding incoming links. Second, it employs stochastic matrices in the update as in the centralized scheme; this aspect is utilized in the convergence analysis. As a consequence, it is established that the PageRank vector \({x}^{{\ast}}\) is computed in a probabilistic sense through the time average of the states \(x(0),\ldots,x(k)\) given by \(y(k) := 1/(k + 1)\sum _{\ell=0}^{k}x(\ell)\). The convergence speed in this case is of order \(1/k\).
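The time average does not require storing the past states. Directly from its definition, y(k) satisfies the recursion
$$y(k + 1) = \frac{k + 1} {k + 2}\,y(k) + \frac{1} {k + 2}\,x(k + 1),\ \ y(0) = x(0),$$
which follows by writing \((k + 2)\,y(k + 1) =\sum _{\ell=0}^{k}x(\ell) + x(k + 1) = (k + 1)\,y(k) + x(k + 1)\). Each entry of y(k) can thus be updated from the corresponding entry of the current state alone.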

PageRank Optimization via Hyperlink Designs

For owners of websites, it is of particular interest to raise the PageRank values of their web pages. Especially in the area of e-business, this can be critical for increasing the number of visitors to their sites. The values of PageRank can be affected by changing the structure of hyperlinks in the owned pages. Based on the random surfer model, intuitively, it makes sense to arrange the links so that surfers will stay within the domain of the owners as long as possible.

PageRank optimization problems have been rigorously considered in, for example, de Kerchove et al. (2008) and Fercoq et al. (2013). In general, these are combinatorial optimization problems since they deal with the question of where to place hyperlinks, and thus the computation for solving them can be prohibitive especially when the web data is large. However, Fercoq et al. (2013) have shown that the problem can be solved in polynomial time. In what follows, we discuss a simplified discrete version of the problem setup of this work.

Consider a subset \(\mathcal{V}_{0} \subset \mathcal{V}\) of web pages over which a webmaster has control. The objective is to maximize the total PageRank of the pages in this set \(\mathcal{V}_{0}\) by choosing the outgoing links from these pages. Each page \(i \in \mathcal{V}_{0}\) may have constraints such as links that must be placed within the page and those that are not allowed. All other links, i.e., those that one can decide to have or not, are the design parameters. Hence, the PageRank optimization problem can be stated as
$$\max {\left \{U({x}^{{\ast}},M) :\ {x}^{{\ast}} = M{x}^{{\ast}},\ {x}^{{\ast}}\in {[0,1]}^{n},\ {\mathbf{1}}^{T}{x}^{{\ast}} = 1,\ M \in \mathcal{M}\right \}},$$
where U is the utility function \(U({x}^{{\ast}},M) :=\sum _{i\in \mathcal{V}_{0}}x_{i}^{{\ast}}\) and \(\mathcal{M}\) represents the set of admissible link matrices in accordance with the constraints introduced above.
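For very small instances, the discrete problem can be explored by exhaustive enumeration of the optional links, as in the sketch below. This brute-force search is exponential in the number of optional links and is not the polynomial-time method of Fercoq et al. (2013); it only illustrates the problem statement. The four-page web, the obligatory and optional links, and the function names are assumptions for illustration.

```python
from itertools import combinations


def pagerank(n, out_links, m=0.15, iterations=200):
    """Power method x(k+1) = (1 - m) A x(k) + (m / n) 1 on an adjacency dictionary."""
    x = [1.0 / n] * n
    for _ in range(iterations):
        new_x = [m / n] * n
        for source, targets in out_links.items():
            for target in targets:
                new_x[target] += (1.0 - m) * x[source] / len(targets)
        x = new_x
    return x


def best_design(n, base_links, optional, controlled, m=0.15):
    """Try every subset of optional links and return (best utility, chosen links)."""
    best_utility, best_choice = -1.0, ()
    for size in range(len(optional) + 1):
        for chosen in combinations(optional, size):
            links = {page: list(targets) for page, targets in base_links.items()}
            for source, target in chosen:
                links[source].append(target)
            x = pagerank(n, links, m)
            value = sum(x[i] for i in controlled)  # U = total PageRank of controlled pages
            if value > best_utility:
                best_utility, best_choice = value, chosen
    return best_utility, best_choice


# Hypothetical web: pages 0 and 1 are controlled; their links to each other are obligatory.
base_links = {0: [1], 1: [0], 2: [0, 3], 3: [2]}
optional = [(0, 2), (1, 3)]  # links the webmaster may add or leave out
utility, chosen = best_design(4, base_links, optional, controlled=[0, 1])
```

In this example the best design uses none of the optional links, with a total PageRank of about 0.81 for the controlled pages, which agrees with the intuition that surfers should be kept within the owner's domain as long as possible.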

In Fercoq et al. (2013), an extended continuous problem is also studied where the set \(\mathcal{M}\) of link matrices is a polytope of stochastic matrices and a more general utility function is employed. The motivation for such a problem comes from having weighted links so that webmasters can determine which links should be placed in a more visible location inside their pages to increase clicks on those hyperlinks. Both discrete and continuous problems are shown to be solvable in polynomial time by modeling them as constrained Markov decision processes with ergodic rewards (see, e.g., Puterman 1994).

Summary and Future Directions

Markov chains form one of the simplest classes of stochastic processes but have been found powerful in their capability to model large-scale complex systems. In this entry, we introduced them mainly from the viewpoint of PageRank algorithms in the area of search engines and with a particular emphasis on recent works carried out based on control theoretic tools. Computational issues will remain in this area as major challenges, and further studies will be needed. As we have observed in PageRank-related problems, it is important to pay careful attention to structures of the particular problems.

Bibliography

  1. Aldhaheri R, Khalil H (1991) Aggregation of the policy iteration method for nearly completely decomposable Markov chains. IEEE Trans Autom Control 36:178–187
  2. Bertsekas D, Tsitsiklis J (1989) Parallel and distributed computation: numerical methods. Prentice-Hall, Englewood Cliffs
  3. Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30:107–117
  4. de Kerchove C, Ninove L, Van Dooren P (2008) Influence of the outlinks of a page on its PageRank. Linear Algebra Appl 429:1254–1276
  5. Fercoq O, Akian M, Bouhtou M, Gaubert S (2013) Ergodic control and polyhedral approaches to PageRank optimization. IEEE Trans Autom Control 58:134–148
  6. Horn R, Johnson C (1985) Matrix analysis. Cambridge University Press, Cambridge
  7. Ishii H, Tempo R (2010) Distributed randomized algorithms for the PageRank computation. IEEE Trans Autom Control 55:1987–2002
  8. Ishii H, Tempo R, Bai EW (2012) A web aggregation approach for distributed randomized PageRank algorithms. IEEE Trans Autom Control 57:2703–2717
  9. Kumar P, Varaiya P (1986) Stochastic systems: estimation, identification, and adaptive control. Prentice Hall, Englewood Cliffs
  10. Langville A, Meyer C (2006) Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton
  11. Meyer C (1989) Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Rev 31:240–272
  12. Norris J (1997) Markov chains. Cambridge University Press, Cambridge
  13. Phillips R, Kokotovic P (1981) A singular perturbation approach to modeling and control of Markov chains. IEEE Trans Autom Control 26:1087–1094
  14. Puterman M (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
  15. Zhao W, Chen H, Fang H (2013) Convergence of distributed randomized PageRank algorithms. IEEE Trans Autom Control 58:3255–3259

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. Tokyo Institute of Technology, Yokohama, Japan
  2. CNR-IEIIT, Politecnico di Torino, Torino, Italy