How to use martingales

Quick description

An important general principle in probability is that a sum of bounded independent random variables, each of mean 0, has a very small probability of having a large modulus, where "large" means substantially larger than its standard deviation. For example, if you toss a coin 1000 times, then the probability that the number of heads does not lie between 400 and 600 is tiny. This principle can be proved rigorously – see Example 3 of bounding probabilities by expectations. Also, and this is the main theme of this article, the assumption of independence can be weakened in a way that makes the principle much more widely applicable.

Prerequisites

Basic concepts of probability, graph theory, metric spaces.

This article could use additional contributions. More examples wanted.

Example 1

Let $S_n$ be the symmetric group on $n$ elements. There is a natural graph with $S_n$ as its vertex set, where two permutations $\rho$ and $\sigma$ are joined if and only if $\rho\sigma^{-1}$ is a transposition, or equivalently $\rho=\tau\sigma$ for some transposition $\tau.$ This graph gives us a natural metric on $S_n$ : the distance between two permutations $\rho$ and $\sigma$ is the smallest $k$ such that we can find $k$ transpositions $\tau_1,\dots,\tau_k$ with $\rho=\tau_1\dots\tau_k\sigma.$

Now let $S_n\rightarrow\mathbb{R}$ be a function that is 1-Lipschitz with respect to this distance. This is equivalent to the assumption that $|\phi(\rho)-\phi(\sigma)|\leq 1$ whenever there is a transposition $\tau$ such that $\rho=\tau\sigma.$ (An obvious example of such a function is the function $\phi$ where $\phi(\rho)$ is defined to be the smallest $k$ such that $\rho$ is a product of $k$ transpositions.) Since every permutation can be written as a product of at most $n-1$ transpositions and an $n$ -cycle needs exactly $n-1$ transpositions, the diameter of $S_n$ is $n-1$ , so $|\phi(\rho)-\phi(\sigma)|$ cannot exceed $n-1.$ We shall show that $\phi$ is "highly concentrated", in the sense that for almost all pairs $(\rho,\sigma)$ $|\phi(\rho)-\phi(\sigma)|$ is far smaller than $n$ – it is rarely much more than $\sqrt{n}.$

General discussion

Before we get on to the proof, we need to state Azuma's inequality, and before we do that we need to say what a martingale is. The latter is simple. Let $Z_0,Z_1,\dots,Z_n$ be a sequence of random variables. They are a martingale if for every $k$ the expectation of $Z_k$ given the values of $Z_0,Z_1,\dots,Z_{k-1}$ is $Z_{k-1}.$

It is perhaps clearer to express this in terms of the martingale difference sequence $X_1,\dots,X_n,$ where $X_k=Z_k-Z_{k-1}$ for each $k$ . Then the martingale condition is that the expectation of $X_k$ given the values of $Z_0$ and $X_1,\dots,X_{k-1}$ is always $0$ . An obvious example is when the $X_i$ are independent random variables with mean $0$ , but as we shall see there is an easy way of generating many more examples than this.

Suppose that each variable $X_i$ takes values between $-1$ and $1$ . Then Azuma's inequality states that the probability that $|X_1+\dots+X_n|\geq t$ is at most $2\exp(-t^2/4n).$ More generally, if $X_i$ takes values between $-\lambda_i$ and $\lambda_i$ , then the probability that $|X_1+\dots+X_n|\geq t$ is at most $2\exp(-t^2/4\sum \lambda_i^2).$

Now let us return to our example, and see how we can define a martingale in a very natural way.

Example 1, continued

Recall that we have a function $\phi$ defined on $S_n$ , with the property that if $\rho$ and $\sigma$ are partitions such that $\rho\sigma^{-1}$ is a transposition, then $|\phi(\rho)-\phi(\sigma)|\leq 1.$ In order to define a martingale difference sequence, we choose a random permutation $\rho$ and "reveal information about $\rho$ little by little".

What does this mean? Well, we look at the values of $\rho(1)$ , $\rho(2)$ , $\rho(3)$ , and so on, and at the $k$ th stage we define $Z_k$ to be the expected value of $\phi(\rho)$ given the values $\rho(1),\dots,\rho(k).$ In other words, $Z_0$ is just $\mathbb{E}\phi,$ $Z_1$ is the expected value of $\mathbb{E}\phi(\sigma)$ given that $\sigma(1)=\rho(1),$ and in general $Z_k$ is the expected value of $\mathbb{E}\phi(\sigma)$ given that $\sigma(i)=\rho(i)$ for every $i\in\{1,2,\dots,k\}.$

What can we say about $X_k=Z_k-Z_{k-1}$ ? We would like to prove an upper bound for the maximum absolute value that $X_k$ can take, and we would also like to prove that the expectation of $X_k$ given the values of $X_1,\dots,X_{k-1}$ is $0$ .

$X_k$ is the amount by which the expectation of $\phi(\sigma)$ can change if we add to our existing knowledge of the values of $\sigma(1),\dots,\sigma(k-1)$ the knowledge that $\sigma(k)=\rho(k).$ To get an upper bound for this, let us define $A_r$ , for each $r$ not in the set $\{\rho(1),\dots,\rho(k-1)\}$ , to be the set of all $\sigma$ such that $\sigma(i)=\rho(i)$ for every $i<k$ and $\sigma(k)=r.$ Now let us pick $s$ and $t$ and define a bijection $\beta_{st}$ between $A_s$ and $A_t$ as follows. Given a permutation $\sigma\in A_s$ , let $\beta_{st}(\sigma)=\tau$ agree with $\sigma$ except that $\tau(k)=t$ and $\tau(\sigma^{-1}(t))=s.$ To put this another way, $\beta_{st}(\sigma)=(st)\sigma,$ where $(st)$ is the transposition that exchanges $s$ and $t$ .

Now the distance between $\sigma$ and $\beta_{rs}\sigma$ is at most 1 (and exactly 1 if $r\ne s$ ). Therefore, $|\phi(\sigma)-\phi(\beta_{rs}\sigma)|\leq 1$ for every $\sigma\in A_s.$ It follows that $\mathbb{E}[\phi(\sigma)|\sigma\in A_s]$ differs from $\mathbb{E}[\phi(\sigma)|\sigma\in A_t]$ by at most 1.

In other words, once we know what $\rho(k)$ is, the difference it can make to our expectation of $\phi(\rho)$ is at most 1. Also, the average of $\phi(\sigma)$ given that $\sigma(i)=\rho(i)$ for every $i<k$ is the average over $s$ of the averages of $\phi(\sigma)$ over $A_s$ (since all the sets $A_s$ have the same size), so the expectation of $Z_k$ given $Z_{k-1}$ is $Z_{k-1}$ , as we wanted. (Equivalently, the expectation of $X_k$ given $X_1,\dots,X_{k-1}$ is $0.$ )

We have now checked the facts we wanted to check: we have a martingale and the random variables $X_i$ in the martingale difference sequence take values in $[-1,1]$ . It follows from Azuma's inequality that the probability that $|\phi(\rho)-\mathbb{E}\phi|\geq t$ is at most $2\exp(-t^2/4n).$ From this we see that even though the diameter of $S_n$ is $n-1$ , for the vast majority of pairs $(\rho,\sigma)$ the difference $|\phi(\rho)-\phi(\sigma)|$ is at most $100\sqrt{n}.$

General discussion

The proof of Azuma's inequality is a relatively straightforward generalization of the proof given in Example 3 of bounding probabilities by expectations, which deals with the special case where the martingale difference sequence consists of independent random variables that take values in $[-1,1].$

Comments

possible typo?

Fri, 29/05/2009 - 00:38 — alex

The article says

"Then the martingale condition is that the expectation of $X_k$ given the values of $X_1,\dots,X_{k-1}$ is always 0."

I see that the martingale condition implies this, but is it true that the martingale condition "is" this? This condition can be satisfied without the sequence $Z_i$ being a martingale.

Indeed, suppose A, B are independent random variables with mean zero. Define
$Z_0 = A, Z_1 = A + B, Z_2 = 2A+B$ . Clearly, $E[X_1]=0, E[X_2| X_1]=E[A | B]=0$ , so the above condition holds. On the other this is not a martingale: $E[Z_2|Z_1, Z_0] = Z_1 + Z_0$ which may not equal $Z_1$ .

Thanks – I've corrected

Fri, 29/05/2009 - 21:55 — gowers

Thanks – I've corrected it. I've tended to deal with martingales, such as the one in this permutations example, where $Z_0$ is constant, which is why I wrote what I wrote.

Trivia

Mon, 22/06/2009 - 19:56 — Didier Piau (not verified)

There is a typo in the second paragraph of section Example 1, continued: one should cancel the symbol $\mathbb{E}$ in the two occurrences of

the expected value of $\mathbb{E}\phi(\sigma)$ given that

And, reading the last paragraph of this section, I wondered why you picked $100\sqrt{n}$ as an example of extremely unlikely deviation. First, $10\sqrt{n}$ would already yield a rather respectable (i.e., small) bound, namely, something like $10^{-11}$ . Second, there is nothing specific about $100\sqrt{n}$ here, although the formulation seems to indicate otherwise: you might want to add a for example somewhere.

Thank you for this article.

Post new comment

(Note: commenting is not possible on this snapshot.)