Bounding probabilities by expectations

a repository of mathematical know-how

Bounding probabilities by expectations

Title: *

Area of mathematics: *

A comma-separated list of areas of mathematics to which this article applies. Use ">" to tag in a subcategory. Example: Analysis > Harmonic analysis, Combinatorics

Keywords:

A comma-separated list of keywords associated with this article. Example: free group

Used in:

A comma-separated list of examples of where this technique is used. Example: Cauchy-Schwarz inequality

Parent articles:		Order

Body:

[QUICK DESCRIPTION]

If $X$ is a random variable that takes non-negative values, then the probability that $X\geq t$ is at most $\mathbb{E}X/t.$ This simple result is called Markov's inequality. Clever uses of Markov's inequality can have surprisingly powerful consequences. The main trick is to choose an increasing function $F$ and apply Markov's inequality to the random variable $F(X)$ instead. If $F$ is rapidly growing but $\mathbb{E}F(X)$ is not too large, then one can prove that the probability that $X\geq t$, which equals the probability that $F(X)\geq F(t)$, is very small.

[PREREQUISITES]

A familiarity with basic concepts in probability such as [[w:Random_variable|random variables]] and [[w:Expected_value|expectation]].

[EXAMPLE]

Let us begin with a proof of Markov's inequality itself. By [[Don't forget to use the law of total probability|the law of total probability]] we know that [maths]\mathbb{E}X=\mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t]+\mathbb{E}[X|X<t]\mathbb{P}[X<t].[/maths] Since $X\geq 0$ (by hypothesis), this is at least $\mathbb{E}[X|X\geq t]\mathbb{P}[X\geq t],$ which is obviously at least $t\mathbb{P}[X\geq t],$ and from this it follows that $\mathbb{P}[X\geq t]\leq\mathbb{E}X/t$, as we wanted.

[EXAMPLE]

It is frequently useful in probability theory (and also in [[Probabilistic combinatorics front page|probabilistic combinatorics]]) to prove that a random variable is close to its mean most of the time. Of course, this may not always be true, which is why we invent concepts such as variance to give us some measure of how concentrated a random variable is about its mean. A precise relationship between the variance of a random variable $X$ and the concentration of $X$ about its mean $\mu$ is given by Chebyshev's inequality, which states that if $t>0$, then $\mathbb{P}[|X-\mu|\geq t]\leq\mathrm{Var}X/t^2.$

To prove this, we first recall the definition of the variance of $X$, which is $\mathbb{E}(X-\mu)^2$. This instantly suggests a function that we can plug into Markov's inequality: instead of looking at the random variable $X$ we shall look at $(X-\mu)^2.$ (The function $x\mapsto(x-\mu)^2$ is not increasing, but this is all right because we are actually interested in the random variable $|X-\mu|.$)

The random variable $(X-\mu)^2$ is non-negative, so by Markov's inequality we find that [maths]\mathbb{P}[(X-\mu)^2\geq t^2]\leq\mathbb{E}(X-\mu)^2/t^2=\mathrm{Var}X/t^2,[/maths] and we are done, since $\mathbb{P}[(X-\mu)^2\geq t^2]=\mathbb{P}[|X-\mu|\geq t].$

[GENERAL DISCUSSION]

Suppose that we have a random variable $X$ with mean $\mu$ such that $\mathbb{E}|X-\mu|=r$ and $\mathrm{Var}X=\mathbb{E}(X-\mu)^2=s^2.$ Then Markov's inequality implies that $\mathbb{P}[|X-\mu|\geq t]\leq r/t,$ while Chebyshev's inequality implies that $\mathbb{P}[|X-\mu|\geq t]\leq s^2/t^2.$ Since all probabilities are at most 1, Markov's inequality tells us nothing unless $t\geq r$, and Chebyshev's inequality tells us nothing unless $t\geq s.$ Since $\mathbb{E}(X-\mu)^2\geq(\mathbb{E}|X-\mu|)^2,$ this tells us that Markov's inequality will usually give us some information for smaller values than Chebyshev's inequality. However, if $\mathbb{E}(X-\mu)^2$ and $(\mathbb{E}|X-\mu|)^2$ are reasonably comparable, then Chebyshev's inequality is much stronger for larger values of $t$. This illustrates a general principle: a good bound on $\mathbb{E}F(X)$ for a quickly growing positive function $F$ gives us a good upper bound for the probability that $X$ is large. In the next example, we shall see a very strong result of this kind.

[EXAMPLE]

Let $X_1,X_2,\dots,X_n$ be independent random variables, each of mean 0 and each taking values in the interval $[-1,1].$ What is the probability that $|X_1+\dots+X_n|\geq t$?

To get a preliminary feel for this question, let us [[use the fact that variances of independent events add together]]. Each $X_i$ obviously has variance at most 1, since it cannot differ from its mean by more than 1. Therefore, the variance of $|X_1+X_2+\dots+X_n|$ is at most $n$, so by Chebyshev's inequality the probability that $|X_1+\dots+X_n|\geq t$ is at most $n/t^2.$ This tells us that once $t$ gets past $\sqrt{n}$, the probability decays fairly rapidly. However, we shall now see that it decays ''far'' more rapidly than Chebyshev's inequality implies.

To prove this, we shall estimate the probability that $X_1+\dots+X_n\geq t$ and then, by symmetry, we will also have an estimate for the probability that $X_1+\dots+X_n\leq -t.$ To obtain the first estimate we shall think about the expectation of $\exp(\lambda(X_1+\dots+X_n)).$

The exponential function has three properties that make it a particularly good choice. The first is that it is non-negative. The second is that it turns sums into products, which is very useful to us because the expectation of a product of independent random variables is equal to the product of the expectations. The third is that it is a rapidly growing function.

Using these properties, we find that [maths] \mathbb{E}\exp(\lambda(X_1+\dots+X_n))=\mathbb{E}\prod_{i=1}^n\exp(\lambda X_i)=\prod_{i=1}^n\mathbb{E}(\exp(\lambda X_i)).[/maths] We would like to show that this is not too large, and our problem is now reduced to showing that each $\mathbb{E}(\exp(\lambda X_i))$ is not too large. It is at this point that we exploit the fact that each $X_i$ has mean $0$ and takes values in $[-1,1].$ As long as $\lambda\geq 0$, this implies that 
[maths] \mathbb{E}(\exp(\lambda(X_i))=1+\lambda\mathbb{E}X_i+\frac{\lambda^2}2\mathbb{E}X_i^2+\dots\leq 1+\frac{\lambda^2}2+\frac{\lambda^3}{3!}+\dots [/math]
[add] And now a small calculation shows that if $\lambda\leq 1$ then this is at most $\exp(\lambda^2).$ {{Indeed, we can prove this coefficient by coefficient: $\frac{\lambda^2}2+\frac{\lambda^3}{3!}\leq\lambda^2,$ $\frac{\lambda^4}{4!}+\frac{\lambda^5}{5!}\leq\frac{\lambda^4}{2}$, and so on.}}[/add] Therefore, $\mathbb{E}\exp(\lambda(X_1+\dots+X_n))\leq\exp(n\lambda^2).$

Now we are ready to apply Markov's inequality. The probability that $X_1+\dots+X_n\geq t$ is equal to the probability that $\exp(\lambda(X_1+\dots+X_n))\geq\exp(\lambda t),$ which by Markov's inequality is at most $\exp(n\lambda^2-\lambda t).$ We are now free to choose $\lambda$ so as to minimize this estimate, provided that $0\leq\lambda\leq 1.$ The minimum occurs when $\lambda=t/2n,$ when it gives $\exp(-t^2/4n).$ This estimate is valid if $t\leq 2n$, and therefore for all $t$, since the probability that $X_1+\dots+X_n\geq 2n$ is zero.

After applying the same argument to $-X_1-\dots-X_n,$ we find that the probability that $|X_1+\dots+X_n|\geq t$ is at most $2\exp(-t^2/4n),$ so once $t$ gets beyond $\sqrt{n}$ the probability decays very quickly indeed.

This conclusion is far stronger than that of Chebyshev's inequality, but it should be stressed that we have also used a stronger hypothesis: here we assumed that the random variables were bounded, but we could have obtained an upper bound of $n/t^2$ from the weaker assumption that each $X_i$ had variance at most 1.

This is a stub

A stub is an article that is not sufficiently complete to be interesting.

Notifications

File attachments

Changes made to the attachments are not permanent until you save this post. The first "listed" file will be included in RSS feeds.

Attach new file:

Images are larger than 640x480 will be resized. The maximum upload size is 1 MB. Only files with the following extensions may be uploaded: jpg jpeg gif png svg.

Revision information

Log message:

An explanation of the additions or updates being made to help other authors understand your motivations.

Bounding probabilities by expectations

Recent articles

Active forum topics

Recent comments