Full Lecture 6


Multiple Random Variables (cont.)

Marginal CDF, PMF/PDF

If [math]\left(X,Y\right)'[/math] is a bivariate random vector, then the cdf of [math]X[/math] alone is called the marginal cdf of [math]X[/math], and likewise the cdf of [math]Y[/math] alone is called the marginal cdf of [math]Y[/math].

For example, the marginal cdf of [math]X[/math] can be obtained via:

[math]F_{X}\left(x\right)=\lim_{y\rightarrow\infty}F_{X,Y}\left(x,y\right),\,\forall x\in\mathbb{R}[/math].

Notice that knowledge of [math]F_{X,Y}\left(x,y\right)[/math] implies knowledge of the marginal distributions. The converse does not hold in general; it does hold, for example, when [math]X[/math] and [math]Y[/math] are independent, since then the joint cdf is the product of the marginal cdfs.

We can also obtain the marginal pmf/pdf in the following way.

  • If [math]\left(X,Y\right)'[/math] is discrete, then

[math]f_{X}\left(x\right)=\sum_{y\in\mathbb{R}}f_{X,Y}\left(x,y\right),\,x\in\mathbb{R}[/math].

  • If [math]\left(X,Y\right)'[/math] is continuous, then

[math]f_{X}\left(x\right)=\int_{-\infty}^{\infty}f_{X,Y}\left(x,y\right)dy,\,x\in\mathbb{R}[/math].
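
As a quick numerical illustration (the joint pmf below is made up for this purpose, not taken from the lecture), the sketch recovers the marginal pmfs of a discrete pair by summing the joint table over the other variable.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint pmf of (X, Y)': rows index x in {0, 1, 2}, columns index y in {0, 1}.
f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.15, 0.10]])
assert np.isclose(f_xy.sum(), 1.0)   # a valid joint pmf sums to one

f_x = f_xy.sum(axis=1)   # f_X(x) = sum over y of f_{X,Y}(x, y)
f_y = f_xy.sum(axis=0)   # f_Y(y) = sum over x of f_{X,Y}(x, y)

print(f_x)   # [0.3  0.45 0.25]
print(f_y)   # [0.55 0.45]
</syntaxhighlight>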

Independence

Two random variables [math]X[/math] and [math]Y[/math] are independent if

[math]F_{X,Y}\left(x,y\right)=F_{X}\left(x\right)F_{Y}\left(y\right),\forall\left(x,y\right)'\in\mathbb{R}^{2}.[/math]

Equivalently, two random variables [math]X[/math] and [math]Y[/math] are independent if

[math]f_{X,Y}\left(x,y\right)=f_{X}\left(x\right).f_{Y}\left(y\right),\forall\left(x,y\right)'\in\mathbb{R}^{2}.[/math]
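
On a discrete table, independence can be checked by comparing the joint pmf with the outer product of its marginals. The sketch below uses two made-up tables, one that factors and one that does not.

<syntaxhighlight lang="python">
import numpy as np

# Check independence on a discrete table: the joint pmf must equal the
# outer product of its marginals at every point.
def is_independent(f_xy):
    f_x = f_xy.sum(axis=1)
    f_y = f_xy.sum(axis=0)
    return np.allclose(f_xy, np.outer(f_x, f_y))

# Hypothetical independent case: joint built as a product of marginals.
indep = np.outer([0.2, 0.5, 0.3], [0.6, 0.4])
# Hypothetical dependent case: mass shifted between two cells, breaking the product structure.
dep = indep.copy()
dep[0, 0] += 0.05
dep[0, 1] -= 0.05

print(is_independent(indep))   # True
print(is_independent(dep))     # False
</syntaxhighlight>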


Conditional PMF/PDF

The conditional pmf/pdf of [math]X[/math] given [math]Y=y[/math], [math]f_{\left.X\right|Y}\left(x,y\right)[/math], is given by

[math]f_{\left.X\right|Y}\left(x,y\right)=\frac{f_{X,Y}\left(x,y\right)}{f_{Y}\left(y\right)}.[/math]

If [math]f_{Y}\left(y\right)\gt 0[/math], then all of the properties of pmfs/pdfs apply to the conditional pmf/pdf.

The conditional function is a pmf/pdf in its own right: it is nonnegative (bounded above by 1 in the case of the pmf) and it sums (or integrates) to one.

The interpretation is intuitive. If [math]\left(X,Y\right)'[/math] is discrete and [math]f_{Y}\left(y\right)\gt 0[/math], then

[math]f_{\left.X\right|Y}\left(x,y\right)=\frac{f_{X,Y}\left(x,y\right)}{f_{Y}\left(y\right)}=\frac{P\left(X=x,Y=y\right)}{P\left(Y=y\right)}=P\left(\left.X=x\right|Y=y\right)[/math] i.e., [math]f_{\left.X\right|Y}\left(x,y\right)[/math] is the conditional probability of [math]X=x[/math] given that [math]Y=y[/math].
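
A minimal sketch of this computation on a made-up joint pmf: the conditional pmf of [math]X[/math] given [math]Y=y[/math] is the corresponding column of the joint table divided by [math]P\left(Y=y\right)[/math].

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint pmf: rows index x in {0, 1, 2}, columns index y in {0, 1}.
f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.15, 0.10]])

y = 1                                # condition on the event Y = 1
p_y = f_xy[:, y].sum()               # P(Y = 1) = f_Y(1) = 0.45
f_x_given_y = f_xy[:, y] / p_y       # f_{X|Y}(x, 1) = f_{X,Y}(x, 1) / f_Y(1)

print(f_x_given_y)                   # [0.444..., 0.333..., 0.222...]
print(f_x_given_y.sum())             # sums to one: a pmf in its own right
</syntaxhighlight>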

Because the value of a pdf does not correspond to a probability, the interpretation is trickier in the continuous case. If we would like to describe the density of [math]X[/math] when [math]Y=5[/math], for example, then [math]f_{\left.X\right|Y}\left(x,5\right)[/math] gives us the intended object. However, a problem may arise. How can we ask about the pdf [math]f_{\left.X\right|Y}\left(x,5\right)[/math], for example, given that [math]P\left(Y=5\right)=0[/math] when [math]Y[/math] is continuous? A related issue is that one can actually obtain different conditional pdfs [math]f_{\left.X\right|Y}\left(x,y\right)[/math], depending on how they are calculated! In the next section we briefly mention the Borel paradox.

Conditioning on Sets

From the discrete case, above, it follows that in the continuous case we can condition on an interval over [math]y[/math]. For example,

[math]f_{\left.X\right|Y}\left(x,y\in\left[\underline{y},\overline{y}\right]\right)[/math]=[math]\frac{\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dy}{P\left(y\in\left[\underline{y},\overline{y}\right]\right)}=\frac{\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dy}{\int_{-\infty}^{\infty}\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dydx}.[/math]
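
To make this concrete, suppose (purely for illustration) that the joint pdf is [math]f_{X,Y}\left(x,y\right)=x+y[/math] on the unit square. The sketch below computes the pdf of [math]X[/math] conditional on [math]Y\in\left[0.4,0.6\right][/math] by numerical integration; in this example the result is [math]x+0.5[/math].

<syntaxhighlight lang="python">
import numpy as np
from scipy import integrate

# Hypothetical joint pdf on the unit square: f_{X,Y}(x, y) = x + y for 0 <= x, y <= 1.
def f_xy(x, y):
    return x + y

y_lo, y_hi = 0.4, 0.6   # condition on the event Y in [0.4, 0.6]

# Numerator: the joint pdf integrated over the conditioning interval in y.
def numerator(x):
    return integrate.quad(lambda y: f_xy(x, y), y_lo, y_hi)[0]

# Denominator: P(Y in [0.4, 0.6]), i.e. the joint pdf integrated over x and over the interval in y.
marginal_y = lambda y: integrate.quad(lambda x: f_xy(x, y), 0.0, 1.0)[0]
denominator = integrate.quad(marginal_y, y_lo, y_hi)[0]

x_grid = np.linspace(0.0, 1.0, 5)
cond_pdf = np.array([numerator(x) / denominator for x in x_grid])
print(cond_pdf)   # equals x + 0.5 on the grid; it integrates to one over [0, 1]
</syntaxhighlight>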

Borel Paradox

The Borel paradox is that it is possible to construct two distinct but equally valid conditional pdfs, when we condition on a measure zero event (i.e., an event with probability zero, such as a point in the continuous case).

A resolution of this ambiguity is instead to take the limit of pdfs conditional on a sequence of sets with positive probability that converges to the measure zero event as [math]n[/math] increases.

For example, let [math]\overline{y}_{n}^{1}\gt y\gt \underline{y}_{n}^{1}[/math] and [math]\overline{y}_{n}^{2}\gt y\gt \underline{y}_{n}^{2}[/math] be different sequences of upper and lower bounds, such that

[math]\lim_{n\rightarrow\infty}\overline{y}_{n}^{1}=\lim_{n\rightarrow\infty}\underline{y}_{n}^{1}=\lim_{n\rightarrow\infty}\overline{y}_{n}^{2}=\lim_{n\rightarrow\infty}\underline{y}_{n}^{2}=y[/math]

and

[math]P\left(y\in\left[\underline{y}_{n}^{1},\overline{y}_{n}^{1}\right]\right)\gt 0[/math] and [math]P\left(y\in\left[\underline{y}_{n}^{2},\overline{y}_{n}^{2}\right]\right)\gt 0\,\forall n\in\mathbb{N}.[/math]

Then, one could calculate

[math]\lim_{n\rightarrow \infty}f_{\left.X\right|Y}\left(x,y\in\left[\underline{y}_{n}^{1},\overline{y}_{n}^{1}\right]\right)[/math]

and

[math]\lim_{n\rightarrow \infty}f_{\left.X\right|Y}\left(x,y\in\left[\underline{y}_{n}^{2},\overline{y}_{n}^{2}\right]\right).[/math]

Although different choices of conditioning sequences can, in principle, yield different results, each sequence pins down a single conditional pdf. In practice, this issue is often ignored or assumed away.
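
A minimal sketch of the limiting procedure, for a well-behaved (made-up) joint pdf [math]f_{X,Y}\left(x,y\right)=\frac{3}{2}\left(x^{2}+y^{2}\right)[/math] on the unit square: as the conditioning interval around [math]y=0.5[/math] shrinks, the interval-conditional pdf converges to the usual conditional pdf [math]f_{\left.X\right|Y}\left(x,0.5\right)[/math].

<syntaxhighlight lang="python">
# Hypothetical joint pdf on the unit square: f_{X,Y}(x, y) = (3/2)(x^2 + y^2).
# Closed forms: f_Y(y) = 1/2 + (3/2) y^2, so f_{X|Y}(x, y) = (x^2 + y^2) / (1/3 + y^2).
def f_x_given_interval(x, lo, hi):
    # Exact integrals over y in [lo, hi] of the joint pdf (numerator) and of f_Y (denominator).
    num = 1.5 * (x**2 * (hi - lo) + (hi**3 - lo**3) / 3)
    den = 0.5 * (hi - lo) + 0.5 * (hi**3 - lo**3)
    return num / den

x, y = 0.3, 0.5
for eps in [0.2, 0.02, 0.002]:
    print(eps, f_x_given_interval(x, y - eps, y + eps))
# The values approach f_{X|Y}(0.3, 0.5) = (0.09 + 0.25) / (1/3 + 0.25), about 0.583.
</syntaxhighlight>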

For more information on the Borel paradox, see Proschan and Presnell (1998), this Wikipedia page, and this blog post for an intuitive explanation.


Some Conditional Moments

  • The expected value of [math]X[/math] given [math]Y[/math] is defined as

[math]E_{\left.X\right|Y}\left(\left.X\right|Y=y\right)=\begin{cases} \sum_{s\in\mathbb{R}}s.f_{X|Y}\left(s,y\right), & \text{if}\,\left(X,Y\right)'\,\text{is discrete }\\ \int_{-\infty}^{\infty}s.f_{X|Y}\left(s,y\right)ds & \text{if}\,\left(X,Y\right)'\,\text{is continuous} \end{cases}[/math]

We often write [math]E_{\left.X\right|Y}\left(\left.X\right|Y\right)[/math] instead of [math]E_{\left.X\right|Y}\left(\left.X\right|Y=y\right)[/math] when we have not yet settled on a value of [math]y[/math].

Perhaps a more popular notation, which means the same, is

[math]E\left(\left.X\right|Y\right).[/math]

Notice that [math]E\left(\left.X\right|Y\right)[/math] is a function of [math]Y.[/math] The [math]X[/math] has been integrated out!

  • The conditional variance of [math]X[/math] given [math]Y[/math] is

[math]Var\left(\left.X\right|Y\right)=E\left[\left.\left(X-E\left(\left.X\right|Y\right)\right)^{2}\right|Y\right]=E\left(\left.X^{2}\right|Y\right)-\left[E\left(\left.X\right|Y\right)\right]^{2}[/math].
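
A small numerical sketch (the joint pmf is made up): compute [math]E\left(\left.X\right|Y=y\right)[/math] and [math]Var\left(\left.X\right|Y=y\right)[/math] directly from a discrete joint table.

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical joint pmf: rows index the support of X, columns the support of Y.
x_vals = np.array([0.0, 1.0, 2.0])
f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.15],
                 [0.15, 0.10]])

y_col = 0                                          # condition on the first support point of Y
f_x_given_y = f_xy[:, y_col] / f_xy[:, y_col].sum()

e_x_given_y = np.sum(x_vals * f_x_given_y)         # E(X | Y = y)
e_x2_given_y = np.sum(x_vals**2 * f_x_given_y)     # E(X^2 | Y = y)
var_x_given_y = e_x2_given_y - e_x_given_y**2      # Var(X | Y = y)
print(e_x_given_y, var_x_given_y)
</syntaxhighlight>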


Law of Iterated Expectations

If [math]\left(X,Y\right)'[/math] is a random vector, then

[math]E\left(X\right)=E\left[E\left(\left.X\right|Y\right)\right],[/math]

provided the expectation of [math]X[/math] exists. Notice, above, that the outer expectation is w.r.t. [math]Y.[/math]

The intuition is that, in order to calculate the expectation of [math]X[/math], we can first calculate the expectation of [math]X[/math] at each value of [math]Y[/math], and then average those conditional expectations over the distribution of [math]Y[/math].

Proof of the Law of Iterated Expectations

Here we prove the law of iterated expectations for the continuous case:

[math]\begin{aligned} E\left(X\right) & =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}s.f_{X,Y}\left(s,t\right)dtds\\ & =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}s.\underset{=f_{X,Y}\left(s,t\right)}{\underbrace{f_{X|Y}\left(s,t\right)f_{Y}\left(t\right)}}dtds\\ & =\int_{-\infty}^{\infty}\underset{E_{\left.X\right|Y}\left(\left.X\right|Y\right)}{\underbrace{\int_{-\infty}^{\infty}s.f_{X|Y}\left(s,t\right)ds}}f_{Y}\left(t\right)dt\\ & =E_{Y}\left[E_{\left.X\right|Y}\left(\left.X\right|Y\right)\right].\end{aligned}[/math]
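
As a sanity check (a simulation sketch, with a made-up model rather than anything from the lecture), one can verify the law of iterated expectations by Monte Carlo: draw [math]Y[/math], draw [math]X[/math] given [math]Y[/math], and compare the average of [math]X[/math] with the average of [math]E\left(\left.X\right|Y\right)[/math].

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical model: Y ~ N(0, 1) and X | Y ~ N(2 + 3Y, 1), so E(X | Y) = 2 + 3Y.
y = rng.standard_normal(n)
x = 2 + 3 * y + rng.standard_normal(n)

print(x.mean())              # approximately E(X) = 2
print((2 + 3 * y).mean())    # approximately E[E(X | Y)] = 2
</syntaxhighlight>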


Conditional Variance Identity

The conditional variance identity (CVI) is a useful identity, especially later, when we discuss linear regression. It is the decomposition

[math]Var\left(X\right)=E\left[Var\left(X|Y\right)\right]+Var\left[E\left(X|Y\right)\right].[/math]

The interpretation for this equality will be clear when we discuss linear regression.
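
The identity can also be checked by simulation under a made-up model: if [math]Y\sim N\left(0,1\right)[/math] and [math]\left.X\right|Y\sim N\left(2+3Y,1\right)[/math], then [math]E\left[Var\left(X|Y\right)\right]=1[/math] and [math]Var\left[E\left(X|Y\right)\right]=9[/math], so [math]Var\left(X\right)=10[/math].

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical model: Y ~ N(0, 1), X | Y ~ N(2 + 3Y, 1).
y = rng.standard_normal(n)
x = 2 + 3 * y + rng.standard_normal(n)

var_x = x.var()                       # total variance, approximately 10
e_cond_var = 1.0                      # E[Var(X | Y)]: the conditional variance is constant at 1
var_cond_mean = (2 + 3 * y).var()     # Var[E(X | Y)] = 9 Var(Y), approximately 9
print(var_x, e_cond_var + var_cond_mean)
</syntaxhighlight>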


Covariance

The covariance of [math]X[/math] and [math]Y[/math] is

[math]Cov\left(X,Y\right)=\sigma_{XY}=E\left[\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right].[/math]

Some properties:

  • [math]Cov\left(X,Y\right)=...=E\left(XY\right)-E\left(X\right)E\left(Y\right).[/math]
  • [math]Cov\left(X,X\right)=Var\left(X\right).[/math]
  • [math]Cov\left(X,Y\right)=0[/math] if [math]X[/math] and [math]Y[/math] are independent.

Correlation

The correlation of [math]X[/math] and [math]Y[/math] is

[math]Corr\left(X,Y\right)=\rho_{XY}=\frac{Cov\left(X,Y\right)}{\sigma_{X}\sigma_{Y}}=\frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}},[/math]

i.e., the correlation is equal to the covariance, standardized by the product of the standard deviations of the variables.

Some properties:

  • If [math]X[/math] and [math]Y[/math] are independent, then [math]\rho_{XY}=0[/math] .
  • [math]\left|\rho_{XY}\right|\leq1[/math], by the Cauchy-Schwarz inequality (explained next).
  • [math]\left|\rho_{XY}\right|=1[/math] if [math]P\left(Y=aX+b\right)=1[/math] for some [math]a\neq0,b\in\mathbb{R}[/math].
  • [math]\rho_{XY}[/math] is a measure of linear dependence.
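
The last property is worth illustrating with a (made-up) example in which [math]Y=X^{2}[/math] and [math]X[/math] is symmetric around zero: the two variables are perfectly dependent, yet their correlation is approximately zero, because the dependence is not linear.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = x**2                                   # Y is a deterministic (nonlinear) function of X

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())
print(cov_xy, corr_xy)                     # both approximately 0: no linear dependence
</syntaxhighlight>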


Cauchy-Schwarz Inequality

If [math]\left(X,Y\right)'[/math] is a bivariate random vector, then

[math]\left|E\left(XY\right)\right|\leq E\left(\left|XY\right|\right)\leq\sqrt{E\left(X^{2}\right)}\sqrt{E\left(Y^{2}\right)}[/math].

The inequality bounds the cross moment [math]E\left(XY\right)[/math] in terms of the second moments of [math]X[/math] and [math]Y[/math] separately.

Generalization: Hölder Inequality

If [math]\left(X,Y\right)'[/math] is a bivariate random vector, then

[math]\left|E\left(XY\right)\right|\leq E\left(\left|XY\right|\right)\leq\sqrt[p]{E\left(\left|X\right|^{p}\right)}\sqrt[q]{E\left(\left|Y\right|^{q}\right)}[/math] for [math]p,q\gt 1[/math] with [math]p^{-1}+q^{-1}=1[/math].
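
A quick numerical check of the inequality on simulated data (the Cauchy-Schwarz inequality is the special case [math]p=q=2[/math]); the data-generating process below is made up.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(100_000)
y = 0.5 * x + rng.standard_normal(100_000)       # a made-up correlated pair

for p in [2.0, 3.0, 1.5]:
    q = p / (p - 1)                              # conjugate exponent: 1/p + 1/q = 1
    lhs = np.mean(np.abs(x * y))                 # E|XY|
    rhs = np.mean(np.abs(x)**p)**(1/p) * np.mean(np.abs(y)**q)**(1/q)
    print(p, q, lhs <= rhs)                      # always True
</syntaxhighlight>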

Jensen’s Inequality

If [math]X[/math] is an r.v. and [math]g:\mathbb{R}\rightarrow\mathbb{R}[/math] is a convex function, then

[math]E\left[g\left(X\right)\right]\geq g\left[E\left(X\right)\right].[/math]

This also implies that if [math]g\left(\cdot\right)[/math] is concave, then

[math]E\left[g\left(X\right)\right]\leq g\left[E\left(X\right)\right].[/math]

Finally, if [math]g\left(\cdot\right)[/math] is linear, then

[math]E\left[g\left(X\right)\right]=g\left[E\left(X\right)\right],[/math]

since a linear function is both convex and concave.

For an example, consider an r.v. [math]X[/math] that can equal [math]0[/math] or [math]8[/math] with equal probability.

In this case,

[math]E\left(X\right)^{2}=\left(0.5\times0+0.5\times8\right)^{2}=16[/math]

and

[math]E\left(X^{2}\right)=\left(0.5\times0^{2}+0.5\times8^{2}\right)=32,[/math]

as predicted by Jensen's inequality, since [math]y=x^{2}[/math] is convex.

A graphical depiction is below:

[Figure: curves of [math]E\left(X\right)^{2}[/math] and [math]E\left(X^{2}\right)[/math] as the probability weights on 0 and 8 vary from zero to one.]

The curves above are obtained by varying the probability weights associated with 0 and 8 from zero to one. Jensen's inequality applies throughout the graph.
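
The figure can be reproduced with a few lines: vary the probability weight placed on the outcome 8, compute [math]E\left(X\right)^{2}[/math] and [math]E\left(X^{2}\right)[/math], and check that the former never exceeds the latter.

<syntaxhighlight lang="python">
import numpy as np

p = np.linspace(0.0, 1.0, 11)      # probability weight on the outcome 8 (the rest on 0)
e_x = 8 * p                        # E(X)   = 0*(1 - p) + 8*p
e_x2 = 64 * p                      # E(X^2) = 0^2*(1 - p) + 8^2*p
print(np.all(e_x**2 <= e_x2))      # True: [E(X)]^2 <= E(X^2), since x^2 is convex
</syntaxhighlight>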