# Multiple Random Variables (cont.)

## Marginal CDF, PMF/PDF

If $\left(X,Y\right)'$ is a bivariate random vector, then the cdf of $X$ (and of $Y$) is called the marginal cdf of $X$ $\left(Y\right)$.

For example, the marginal cdf of $X$ can be obtained via:

$F_{X}\left(x\right)=\lim_{y\rightarrow\infty}F_{X,Y}\left(x,y\right),\,\forall x\in\mathbb{R}$.

Notice that knowledge of $F_{X,Y}\left(x,y\right)$ implies knowledge of both marginal distributions. The converse does not hold in general: the marginals determine the joint distribution only when $X$ and $Y$ are independent.

We can also obtain the marginal pmf/pdf in the following way.

• If $\left(X,Y\right)'$ is discrete, then

$f_{X}\left(x\right)=\sum_{y\in\mathbb{R}}f_{X,Y}\left(x,y\right),\,x\in\mathbb{R}$.

• If $\left(X,Y\right)'$ is continuous, then

$f_{X}\left(x\right)=\int_{-\infty}^{\infty}f_{X,Y}\left(x,y\right)dy,\,x\in\mathbb{R}$.
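As a sanity check, the discrete marginalization formula can be coded directly. The joint pmf below is made up purely for illustration:

```python
# A small hypothetical joint pmf of a discrete pair (X, Y),
# stored as a dictionary mapping (x, y) to its probability.
joint = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.15, (1, 1): 0.25, (1, 2): 0.20,
}

def marginal_x(x):
    # Marginal pmf of X: sum the joint pmf over all values of y,
    # the discrete analogue of integrating y out.
    return sum(p for (xv, _), p in joint.items() if xv == x)
```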

## Independence

Two random variables $X$ and $Y$ are independent if

$F_{X,Y}\left(x,y\right)=F_{X}\left(x\right)F_{Y}\left(y\right),\forall\left(x,y\right)'\in\mathbb{R}^{2}.$

Equivalently, two random variables $X$ and $Y$ are independent if

$f_{X,Y}\left(x,y\right)=f_{X}\left(x\right)f_{Y}\left(y\right),\forall\left(x,y\right)'\in\mathbb{R}^{2}.$
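The factorization characterizing independence can be verified pointwise on a toy example. The marginals `f_x` and `f_y` below are hypothetical, and the joint pmf is built as their product, so independence holds by construction:

```python
# Hypothetical marginal pmfs; the joint is their product, so X and Y
# are independent by construction.
f_x = {0: 0.4, 1: 0.6}
f_y = {0: 0.3, 1: 0.7}
joint = {(x, y): f_x[x] * f_y[y] for x in f_x for y in f_y}

def marginal(joint, axis, value):
    # Recover a marginal pmf from the joint by summing out the other variable.
    return sum(p for key, p in joint.items() if key[axis] == value)

# Independence: the joint factorizes into the recovered marginals
# at every point, f_{X,Y}(x, y) = f_X(x) f_Y(y).
factorizes = all(
    abs(p - marginal(joint, 0, x) * marginal(joint, 1, y)) < 1e-9
    for (x, y), p in joint.items()
)
```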

# Conditional PMF/PDF

The conditional pmf/pdf of $X$ given $Y=y$, $f_{\left.X\right|Y}\left(x,y\right)$, is given by

$f_{\left.X\right|Y}\left(x,y\right)=\frac{f_{X,Y}\left(x,y\right)}{f_{Y}\left(y\right)}.$

If $f_{Y}\left(y\right)\gt 0$, then all of the properties of pmfs/pdfs apply to the conditional pmf/pdf.

That is, the conditional function is a pmf/pdf in its own right: it is nonnegative, it sums (respectively, integrates) to one over $x$, and, in the pmf case, it is bounded above by $1$.

The interpretation is intuitive. If $\left(X,Y\right)'$ is discrete and $f_{Y}\left(y\right)\gt 0$, then

$f_{\left.X\right|Y}\left(x,y\right)=\frac{f_{X,Y}\left(x,y\right)}{f_{Y}\left(y\right)}=\frac{P\left(X=x,Y=y\right)}{P\left(Y=y\right)}=P\left(\left.X=x\right|Y=y\right)$ i.e., $f_{\left.X\right|Y}\left(x,y\right)$ is the conditional probability of $X=x$ given that $Y=y$.
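A minimal sketch of this conditional pmf, using a made-up joint pmf:

```python
# The conditional pmf computed from a small hypothetical joint pmf.
joint = {
    (0, 0): 0.10, (1, 0): 0.20,
    (0, 1): 0.30, (1, 1): 0.40,
}

def f_y(y):
    # Marginal pmf of Y.
    return sum(p for (_, yv), p in joint.items() if yv == y)

def f_x_given_y(x, y):
    # f_{X|Y}(x, y) = f_{X,Y}(x, y) / f_Y(y), defined only when f_Y(y) > 0.
    return joint.get((x, y), 0.0) / f_y(y)
```

As expected of a pmf, `f_x_given_y(0, 1) + f_x_given_y(1, 1)` equals one.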

Because the value of a pdf does not correspond to a probability, the interpretation is trickier in the continuous case. If we would like to describe the density of $X$ when $Y=5$, for example, then $f_{\left.X\right|Y}\left(x,5\right)$ gives us the intended object. However, a problem may arise. How can we ask about the pdf $f_{\left.X\right|Y}\left(x,5\right)$, for example, given that $P\left(Y=5\right)=0$ when $Y$ is continuous? A related issue is that one can actually obtain different conditional pdfs $f_{\left.X\right|Y}\left(x,y\right)$, depending on how they are calculated! In the next section we briefly mention the Borel paradox.

# Conditioning on Sets

By analogy with the discrete case above, in the continuous case we can condition on an interval of values of $y$. For example,

$f_{\left.X\right|Y}\left(x,Y\in\left[\underline{y},\overline{y}\right]\right)=\frac{\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dy}{P\left(Y\in\left[\underline{y},\overline{y}\right]\right)}=\frac{\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dy}{\int_{-\infty}^{\infty}\int_{\underline{y}}^{\overline{y}}f_{X,Y}\left(x,y\right)dydx}.$
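A numerical sketch of this interval-conditioning formula, assuming the illustrative joint pdf $f_{X,Y}\left(x,y\right)=x+y$ on the unit square (chosen for this example, not from the notes) and a simple midpoint-rule integrator:

```python
# Conditioning X's pdf on Y falling in an interval, for the assumed
# illustrative joint pdf f(x, y) = x + y on [0,1]^2.

def f_joint(x, y):
    return x + y  # integrates to 1 over the unit square, so a valid pdf there

def integrate(g, a, b, n=400):
    # Midpoint rule; exact here because every integrand below is linear.
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

def cond_pdf_x(y_lo, y_hi):
    # Denominator: P(Y in [y_lo, y_hi]), a double integral of the joint pdf.
    den = integrate(lambda x: integrate(lambda y: f_joint(x, y), y_lo, y_hi),
                    0.0, 1.0)
    # Numerator at each x: the joint pdf integrated over the y-interval.
    return lambda x: integrate(lambda y: f_joint(x, y), y_lo, y_hi) / den

pdf = cond_pdf_x(0.0, 0.5)
```

For this joint pdf the exact answer is $\left(0.5x+0.125\right)/0.375$, and the conditional pdf integrates to one over $x$, as it should.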

The Borel paradox is that it is possible to construct two distinct but equally valid conditional pdfs, when we condition on a measure zero event (i.e., an event with probability zero, such as a point in the continuous case).

A resolution for this ambiguity is to instead condition on a sequence of sets that each have positive probability and that shrink to the measure zero event, and then take the limit of the resulting conditional pdfs.

For example, let $\overline{y}_{n}^{1}\gt y\gt \underline{y}_{n}^{1}$ and $\overline{y}_{n}^{2}\gt y\gt \underline{y}_{n}^{2}$ be different sequences of upper and lower bounds, such that

$\lim_{n\rightarrow\infty}\overline{y}_{n}^{1}=\lim_{n\rightarrow\infty}\underline{y}_{n}^{1}=\lim_{n\rightarrow\infty}\overline{y}_{n}^{2}=\lim_{n\rightarrow\infty}\underline{y}_{n}^{2}=y$

and

$P\left(Y\in\left[\underline{y}_{n}^{1},\overline{y}_{n}^{1}\right]\right)\gt 0$ and $P\left(Y\in\left[\underline{y}_{n}^{2},\overline{y}_{n}^{2}\right]\right)\gt 0\,\forall n\in\mathbb{N}.$

Then, one could calculate

$\lim_{n\rightarrow \infty}f_{\left.X\right|Y}\left(x,Y\in\left[\underline{y}_{n}^{1},\overline{y}_{n}^{1}\right]\right)$

and

$\lim_{n\rightarrow \infty}f_{\left.X\right|Y}\left(x,Y\in\left[\underline{y}_{n}^{2},\overline{y}_{n}^{2}\right]\right).$

Although the two sequences can yield two different results, each limiting procedure pins down a unique conditional pdf. In practice, this issue is often ignored or assumed away.

# Some Conditional Moments

• The expected value of $X$ given $Y$ is defined as

$E_{\left.X\right|Y}\left(\left.X\right|Y=y\right)=\begin{cases} \sum_{s\in\mathbb{R}}s\,f_{X|Y}\left(s,y\right), & \text{if }\left(X,Y\right)'\text{ is discrete,}\\ \int_{-\infty}^{\infty}s\,f_{X|Y}\left(s,y\right)ds, & \text{if }\left(X,Y\right)'\text{ is continuous.} \end{cases}$

We often write $E_{\left.X\right|Y}\left(\left.X\right|Y\right)$ instead of $E_{\left.X\right|Y}\left(\left.X\right|Y=y\right)$ when we haven't yet settled on a value of $y$.

Perhaps a more popular notation, which means the same, is

$E\left(\left.X\right|Y\right).$

Notice that $E\left(\left.X\right|Y\right)$ is a function of $Y.$ The $X$ has been integrated out!

• The conditional variance of $X$ given $Y$ is

$Var\left(\left.X\right|Y\right)=E\left[\left.\left(X-E\left(\left.X\right|Y\right)\right)^{2}\right|Y\right]=E\left(\left.X^{2}\right|Y\right)-\left[E\left(\left.X\right|Y\right)\right]^{2}$.

# Law of Iterated Expectations

If $\left(X,Y\right)'$ is a random vector, then

$E\left(X\right)=E\left[E\left(\left.X\right|Y\right)\right],$

provided the expectation of $X$ exists. Notice, above, that the outer expectation is taken w.r.t. $Y.$

The intuition is that, in order to calculate the expectation of $X$, we can first calculate the expectation of $X$ at each value of $Y$, and then average those conditional expectations over the distribution of $Y$.

## Proof of the Law of Iterated Expectations

Here we prove the law of iterated expectations for the continuous case:

\begin{aligned} E\left(X\right) & =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}s\,f_{X,Y}\left(s,t\right)dtds\\ & =\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}s\,\underset{=f_{X,Y}\left(s,t\right)}{\underbrace{f_{X|Y}\left(s,t\right)f_{Y}\left(t\right)}}dtds\\ & =\int_{-\infty}^{\infty}\underset{E_{\left.X\right|Y}\left(\left.X\right|Y\right)}{\underbrace{\int_{-\infty}^{\infty}s\,f_{X|Y}\left(s,t\right)ds}}f_{Y}\left(t\right)dt\\ & =E_{Y}\left[E_{\left.X\right|Y}\left(\left.X\right|Y\right)\right].\end{aligned}
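The identity can also be checked numerically on a small, hypothetical discrete joint pmf:

```python
# Numerical check of the law of iterated expectations, E(X) = E[E(X|Y)],
# on a hypothetical discrete joint pmf.
joint = {
    (1, 0): 0.2, (2, 0): 0.1,
    (1, 1): 0.3, (2, 1): 0.4,
}
ys = {y for _, y in joint}

# Direct computation of E(X).
e_x = sum(x * p for (x, _), p in joint.items())

# E[E(X|Y)]: the conditional expectation at each y, averaged with
# weights f_Y(y); the outer expectation is with respect to Y.
e_iterated = 0.0
for y in ys:
    f_y = sum(p for (_, yv), p in joint.items() if yv == y)
    e_x_given_y = sum(x * p for (x, yv), p in joint.items() if yv == y) / f_y
    e_iterated += f_y * e_x_given_y
```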

# Conditional Variance Identity

The conditional variance identity (CVI) is a useful identity (especially later, when we discuss linear regression). It is the decomposition

$Var\left(X\right)=E\left[Var\left(X|Y\right)\right]+Var\left[E\left(X|Y\right)\right].$

The interpretation for this equality will be clear when we discuss linear regression.
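A quick numerical check of the identity on a hypothetical discrete joint pmf:

```python
# Numerical check of Var(X) = E[Var(X|Y)] + Var[E(X|Y)]
# on a hypothetical discrete joint pmf.
joint = {
    (0, 0): 0.1, (1, 0): 0.2,
    (0, 1): 0.3, (1, 1): 0.4,
}
ys = {y for _, y in joint}

def f_y(y):
    # Marginal pmf of Y.
    return sum(p for (_, yv), p in joint.items() if yv == y)

def cond_moment(y, k):
    # E(X^k | Y = y)
    return sum(x**k * p for (x, yv), p in joint.items() if yv == y) / f_y(y)

# Left-hand side: Var(X) = E(X^2) - [E(X)]^2.
var_x = (sum(x**2 * p for (x, _), p in joint.items())
         - sum(x * p for (x, _), p in joint.items())**2)

# E[Var(X|Y)]: conditional variances averaged over Y.
e_var = sum(f_y(y) * (cond_moment(y, 2) - cond_moment(y, 1)**2) for y in ys)

# Var[E(X|Y)]: variance of the conditional mean across Y.
mean_ce = sum(f_y(y) * cond_moment(y, 1) for y in ys)
var_e = sum(f_y(y) * cond_moment(y, 1)**2 for y in ys) - mean_ce**2
```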

# Covariance

The covariance of $X$ and $Y$ is

$Cov\left(X,Y\right)=\sigma_{XY}=E\left[\left(X-\mu_{X}\right)\left(Y-\mu_{Y}\right)\right].$

Some properties:

• $Cov\left(X,Y\right)=E\left(XY\right)-\mu_{X}\mu_{Y}=E\left(XY\right)-E\left(X\right)E\left(Y\right).$
• $Cov\left(X,X\right)=Var\left(X\right).$
• $Cov\left(X,Y\right)=0$ if $X$ and $Y$ are independent.
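The shortcut formula and $Cov\left(X,X\right)=Var\left(X\right)$ can be confirmed on simulated data (the simulation design below is arbitrary):

```python
import random

# Empirical check of the covariance properties on simulated data.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(5000)]
y = [xi + random.gauss(0, 1) for xi in x]  # y depends on x by construction

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    # Sample analogue of E[(X - mu_X)(Y - mu_Y)] (with 1/n weights).
    mu, mv = mean(u), mean(v)
    return mean([(ui - mu) * (vi - mv) for ui, vi in zip(u, v)])

# Shortcut formula: Cov(X, Y) = E(XY) - E(X)E(Y), an algebraic identity
# that also holds exactly for sample moments.
shortcut = mean([ui * vi for ui, vi in zip(x, y)]) - mean(x) * mean(y)
```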

# Correlation

The correlation of $X$ and $Y$ is

$Corr\left(X,Y\right)=\rho_{XY}=\frac{Cov\left(X,Y\right)}{\sigma_{X}\sigma_{Y}}=\frac{\sigma_{XY}}{\sigma_{X}\sigma_{Y}},$

i.e., the correlation is equal to the covariance, standardized by the product of the standard deviations of the variables.

Some properties:

• If $X$ and $Y$ are independent, then $\rho_{XY}=0$ (the converse does not hold in general).
• $\left|\rho_{XY}\right|\leq1$, by the Cauchy-Schwarz inequality (explained next).
• $\left|\rho_{XY}\right|=1$ if $P\left(Y=aX+b\right)=1$ for some $a\neq0,b\in\mathbb{R}$.
• $\rho_{XY}$ is a measure of linear dependence.
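These properties can be illustrated on simulated data; the variables `y_linear` and `y_noisy` below are made-up constructions:

```python
import random

# Empirical illustration of the correlation properties.
random.seed(1)
x = [random.gauss(0, 1) for _ in range(5000)]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(ui - mu) * (vi - mv) for ui, vi in zip(u, v)])

def corr(u, v):
    # rho = Cov(U, V) / (sd(U) sd(V))
    return cov(u, v) / (cov(u, u)**0.5 * cov(v, v)**0.5)

y_linear = [2 * xi + 3 for xi in x]            # exact linear function of x
y_noisy = [xi + random.gauss(0, 1) for xi in x]  # linear signal plus noise
```

The sample correlation with `y_linear` equals one (perfect linear dependence), while the correlation with `y_noisy` lies strictly inside $\left[-1,1\right]$.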

# Cauchy-Schwarz Inequality

If $\left(X,Y\right)'$ is a bivariate random vector, then

$\left|E\left(XY\right)\right|\leq E\left(\left|XY\right|\right)\leq\sqrt{E\left(X^{2}\right)}\sqrt{E\left(Y^{2}\right)}$.

The inequality bounds the cross moment $E\left(XY\right)$ in terms of the marginal second moments of $X$ and $Y$.
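Because the Cauchy-Schwarz inequality holds for the empirical distribution of any sample, the chain of inequalities can be checked on simulated data (the simulation is arbitrary):

```python
import random

# Empirical check of |E(XY)| <= E(|XY|) <= sqrt(E(X^2)) sqrt(E(Y^2)).
random.seed(2)
x = [random.gauss(0, 1) for _ in range(2000)]
y = [random.gauss(0, 1) for _ in range(2000)]

def mean(v):
    return sum(v) / len(v)

e_xy = mean([a * b for a, b in zip(x, y)])           # sample E(XY)
e_abs_xy = mean([abs(a * b) for a, b in zip(x, y)])  # sample E(|XY|)
bound = mean([a * a for a in x])**0.5 * mean([b * b for b in y])**0.5
```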

# Generalization: Hölder Inequality

If $\left(X,Y\right)'$ is a bivariate random vector, then

$\left|E\left(XY\right)\right|\leq E\left(\left|XY\right|\right)\leq\sqrt[p]{E\left(\left|X\right|^{p}\right)}\sqrt[q]{E\left(\left|Y\right|^{q}\right)},\,\text{for }p,q\gt 1,\,p^{-1}+q^{-1}=1.$

# Jensen’s Inequality

If $X$ is an r.v. and $g:\mathbb{R}\rightarrow\mathbb{R}$ is a convex function, then

$E\left[g\left(X\right)\right]\geq g\left[E\left(X\right)\right].$

This also implies that if $g\left(\cdot\right)$ is concave, then

$E\left[g\left(X\right)\right]\leq g\left[E\left(X\right)\right].$

Finally, if $g\left(\cdot\right)$ is linear, then

$E\left[g\left(X\right)\right]=g\left[E\left(X\right)\right],$

since a linear function is both convex and concave.

For an example, consider an r.v. $X$ that can equal $0$ or $8$ with equal probability.

In this case,

$\left[E\left(X\right)\right]^{2}=\left(0.5\times0+0.5\times8\right)^{2}=16$

and

$E\left(X^{2}\right)=\left(0.5\times0^{2}+0.5\times8^{2}\right)=32,$

as predicted by Jensen's inequality, since $y=x^{2}$ is convex.
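The arithmetic above, checked directly:

```python
# The two-point example: g(x) = x^2 is convex, so E[g(X)] >= g(E[X]).
values, probs = [0, 8], [0.5, 0.5]

e_x = sum(v * p for v, p in zip(values, probs))           # E(X) = 4
g_of_mean = e_x**2                                        # [E(X)]^2
mean_of_g = sum(v**2 * p for v, p in zip(values, probs))  # E(X^2)
```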

A graphical depiction is below:

The curves above are obtained by varying the probability weights associated with 0 and 8 from zero to one. Jensen's inequality applies throughout the graph.