# Parametric Families of Distributions

When working in statistics, it is often useful to draw conclusions that apply to multiple distributions. We now define classes of distributions, often referred to as families.

## Exponential Family

The set of pmfs/pdfs $\left\{ f\left(\left.\cdot\right|\theta\right):\theta\in\Theta\right\}$ is an exponential family if $f\left(\left.x\right|\theta\right)=h\left(x\right)c\left(\theta\right)\exp\left[\sum_{i=1}^{K}\omega_{i}\left(\theta\right)t_{i}\left(x\right)\right],\,x\in\mathbb{R},\theta\in\Theta$

where

$h:\mathbb{R}\rightarrow\mathbb{R}_{+},c:\Theta\rightarrow\mathbb{R}_{++},\omega_{i}:\Theta\rightarrow\mathbb{R}\,\forall i,t_{i}:\mathbb{R}\rightarrow\mathbb{R}\,\forall i$ and some $K\geq1$.

### Normal Distribution

The normal distribution is part of the exponential family, as we now show:

\begin{aligned} f\left(\left.x\right|\mu,\sigma^{2}\right) & =\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{\left(x-\mu\right)^{2}}{2\sigma^{2}}\right)\\ & =\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{1}{2\sigma^{2}}\left(x^{2}+\mu^{2}-2\mu x\right)\right)\\ & =\underset{h\left(x\right)}{\underbrace{1}}\cdot\underset{c\left(\mu,\sigma^{2}\right)}{\underbrace{\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left(-\frac{\mu^{2}}{2\sigma^{2}}\right)}}\underset{\exp\left[\sum_{i=1}^{K}\omega_{i}\left(\theta\right)t_{i}\left(x\right)\right]}{\underbrace{\exp\left(-\frac{x^{2}}{2\sigma^{2}}+\frac{\mu}{\sigma^{2}}x\right)}}\end{aligned}

where

$\omega_{1}\left(\mu,\sigma^{2}\right)=-\frac{1}{2\sigma^{2}};t_{1}\left(x\right)=x^{2};\omega_{2}\left(\mu,\sigma^{2}\right)=\frac{\mu}{\sigma^{2}};t_{2}\left(x\right)=x$.
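As a quick numerical sanity check of this factorization, we can compare the factored form $h\left(x\right)c\left(\mu,\sigma^{2}\right)\exp\left(\omega_{1}t_{1}+\omega_{2}t_{2}\right)$ against the normal pdf at a few points (a sketch; the parameter and evaluation values are arbitrary):

```python
import math

def normal_pdf(x, mu, sigma2):
    # Standard form of the N(mu, sigma^2) density.
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def normal_exp_family(x, mu, sigma2):
    # Exponential-family factorization from the derivation above.
    h = 1.0                                                                   # h(x) = 1
    c = math.exp(-mu ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)   # c(mu, sigma^2)
    w1, t1 = -1 / (2 * sigma2), x ** 2                                        # omega_1, t_1
    w2, t2 = mu / sigma2, x                                                   # omega_2, t_2
    return h * c * math.exp(w1 * t1 + w2 * t2)

for x in (-1.5, 0.0, 2.3):
    assert abs(normal_pdf(x, 1.0, 4.0) - normal_exp_family(x, 1.0, 4.0)) < 1e-12
```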

### Bernoulli Distribution

The Bernoulli distribution also belongs to the exponential family:

\begin{aligned} f\left(\left.x\right|p\right) & =\begin{cases} p, & x=1\\ 1-p, & x=0\\ 0, & \text{otherwise} \end{cases}\\ & =\begin{cases} p^{x}\left(1-p\right)^{1-x}, & x\in\left\{ 0,1\right\} \\ 0, & \text{otherwise} \end{cases}\\ & =1\left(x\in\left\{ 0,1\right\} \right)p^{x}\left(1-p\right)^{1-x}\end{aligned} where $1\left(\cdot\right)$ is the indicator function.

Factorization yields:

\begin{aligned} f\left(\left.x\right|p\right) & =1\left(x\in\left\{ 0,1\right\} \right)p^{x}\left(1-p\right)^{1-x}\\ & =1\left(x\in\left\{ 0,1\right\} \right)p^{x}\left(1-p\right)\left(1-p\right)^{-x}\\ & =1\left(x\in\left\{ 0,1\right\} \right)\left(1-p\right)\left(\frac{p}{1-p}\right)^{x}\\ & =\underset{h\left(x\right)}{\underbrace{1\left(x\in\left\{ 0,1\right\} \right)}}\underset{c\left(p\right)}{\underbrace{\left(1-p\right)}}\exp\left(\underset{\omega_{1}}{\underbrace{\log\left(\frac{p}{1-p}\right)}}\underset{t_{1}}{\underbrace{x}}\right)\end{aligned}
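The same numerical check works here: the factored form $1\left(x\in\left\{ 0,1\right\} \right)\left(1-p\right)\exp\left(\omega_{1}x\right)$ should reproduce the Bernoulli pmf (a sketch; the value $p=0.3$ is arbitrary):

```python
import math

def bern_pmf(x, p):
    # Bernoulli pmf in its case-by-case form.
    return p if x == 1 else (1 - p if x == 0 else 0.0)

def bern_exp_family(x, p):
    # Exponential-family factorization from the derivation above.
    h = 1.0 if x in (0, 1) else 0.0   # h(x) = indicator of {0, 1}
    c = 1 - p                         # c(p)
    w1 = math.log(p / (1 - p))        # omega_1(p): the log-odds
    return h * c * math.exp(w1 * x)   # t_1(x) = x

for x in (0, 1, 2):
    assert abs(bern_pmf(x, 0.3) - bern_exp_family(x, 0.3)) < 1e-12
```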

## Remarks

• Factor $c\left(\theta\right)$ is the normalizing constant of the pmf/pdf. It can always be recovered from the other factors, since its role is to make the pmf/pdf sum (or integrate) to 1.
• The support of pmfs/pdfs of members of the exponential family does not depend on $\theta$, i.e., $S_{X}=\left\{ x\in\mathbb{R}:f\left(\left.x\right|\theta\right)\gt 0\right\} =\left\{ x\in\mathbb{R}:h\left(x\right)\gt 0\right\}$, since $c\left(\theta\right)$ and the exponential factor are strictly positive. If the support did depend on $\theta$, the factorization into $h\left(x\right)c\left(\theta\right)\exp\left[\cdot\right]$ would be impossible. For example, the uniform distribution with pdf $f_{X}\left(\left.x\right|a,b\right)=\frac{1}{b-a}1\left(a\leq x\leq b\right)$ does not belong to the exponential family, since $a$ and $b$ cannot be separated from $x$ in the indicator function.

## Location-Scale Family

A (parametric) family $\mathcal{F}$ of pdfs is a location-scale family if it is given by

$\mathcal{F}=\left\{ \frac{1}{\sigma}f\left(\frac{\cdot-\mu}{\sigma}\right):\mu\in\mathbb{R},\sigma\gt 0\right\}$

where $f\left(\cdot\right)$ is the standard pdf of the family, $\mu$ is the location parameter and $\sigma$ is the scale parameter. The idea is that $\frac{1}{\sigma}f\left(\frac{\cdot-\mu}{\sigma}\right)$ is the pdf of $\mu+\sigma\widetilde{X}$ where $\widetilde{X}$ has pdf $f\left(\cdot\right)$.

Clearly, r.v.s distributed $N\left(\mu,\sigma^{2}\right)$ belong to the location-scale family generated by $N\left(0,1\right)$. Similarly, r.v.s distributed $U\left(a,b\right)$ belong to the location-scale family generated by $U\left(0,1\right)$, with location $\mu=a$ and scale $\sigma=b-a$.
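A small simulation illustrates the defining property: drawing $\widetilde{X}$ from the standard pdf and forming $\mu+\sigma\widetilde{X}$ produces a sample whose mean and standard deviation match $\mu$ and $\sigma$ (a sketch with the standard normal as the standard pdf; the parameter values and sample size are arbitrary):

```python
import random
import statistics

random.seed(0)
mu, sigma = 2.0, 3.0  # illustrative location and scale parameters

# Draws from the standard pdf (here N(0,1)), then the location-scale transform.
std_draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]
draws = [mu + sigma * x for x in std_draws]

# The transformed sample should have mean ~ mu and standard deviation ~ sigma.
assert abs(statistics.mean(draws) - mu) < 0.05
assert abs(statistics.stdev(draws) - sigma) < 0.05
```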

Varying only the location parameter (with $\sigma=1$) yields the standard pdf's location family; varying only the scale parameter (with $\mu=0$) yields its scale family.

# Chebychev's Inequality

This inequality bounds the probability that a function of an r.v. takes large values; in particular, it bounds how far an r.v. is likely to be from its mean. Suppose $X$ is an r.v. Then, for any $r\gt 0$ and any $g:\mathbb{R}\rightarrow\mathbb{R}_{++}$,

$P\left(g\left(X\right)\geq r\right)\leq\frac{E\left(g\left(X\right)\right)}{r}$.

## Proof

The proof is relatively simple. Note that

\begin{aligned} & \forall x\in\mathbb{R},\,r1\left(g\left(x\right)\geq r\right)\leq g\left(x\right)\\ \Leftrightarrow & \forall x\in\mathbb{R},1\left(g\left(x\right)\geq r\right)\leq\frac{g\left(x\right)}{r}\\ \Rightarrow & \underset{=P\left(g\left(X\right)\geq r\right)}{\underbrace{E\left[1\left(g\left(X\right)\geq r\right)\right]}}\leq E\left(\frac{g\left(X\right)}{r}\right)\end{aligned}
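The inequality can also be checked by Monte Carlo for a particular choice of $g$; below, $g\left(x\right)=\left|x\right|$ with standard normal draws (a sketch; the distribution, threshold, and sample size are illustrative assumptions). Note that the bound holds exactly for the empirical distribution of the sample, since that is itself a probability distribution:

```python
import random
import statistics

random.seed(1)
# g(X) = |X| with X standard normal; r is an arbitrary threshold.
draws = [abs(random.gauss(0.0, 1.0)) for _ in range(100_000)]
r = 1.5

empirical = sum(x >= r for x in draws) / len(draws)  # P(g(X) >= r), estimated
bound = statistics.mean(draws) / r                   # E(g(X)) / r, estimated

# The inequality holds exactly for the empirical distribution of the draws.
assert empirical <= bound
```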

## Implications

From this result, we can derive some popular implications:

• $P\left(\left|X\right|\geq r\right)\leq\frac{E\left(\left|X\right|\right)}{r},\,\forall r\gt 0$ - here, $g\left(x\right)=\left|x\right|$. This is often referred to as Markov’s inequality.
• $P\left(\left|X-\mu\right|\geq r\right)\leq\frac{E\left(\left|X-\mu\right|\right)}{r},\,\forall r\gt 0$ - here, $g\left(x\right)=\left|x-\mu\right|$.
• From the previous inequality, we can obtain an expression that relates deviations from the mean to the variance:

\begin{aligned} & P\left(\left|X-\mu\right|\geq\varepsilon\right)\\ = & P\left(\left(X-\mu\right)^{2}\geq\varepsilon^{2}\right)\leq\frac{E\left(\left(X-\mu\right)^{2}\right)}{\varepsilon^{2}}=\frac{Var\left(X\right)}{\varepsilon^{2}}\end{aligned} which implies that $P\left(\left|X-\mu\right|\geq\varepsilon\right)\leq\frac{Var\left(X\right)}{\varepsilon^{2}}$.

We have established a bound on how much $X$ can vary around its mean, as a function of its variance.
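The variance form of the bound can likewise be verified empirically; applied to the empirical distribution of the draws, with its own mean and variance, the bound holds exactly. The Exp(1) distribution and $\varepsilon=2$ below are illustrative choices:

```python
import random
import statistics

random.seed(2)
# Illustrative choice: Exp(1) draws (mean 1, variance 1) and epsilon = 2.
draws = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.mean(draws)
var = statistics.pvariance(draws)  # variance of the empirical distribution

eps = 2.0
empirical = sum(abs(x - mu) >= eps for x in draws) / len(draws)
bound = var / eps ** 2

# Chebychev's inequality holds exactly for the empirical distribution.
assert empirical <= bound
```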

# Multiple Random Variables

An $n$-dimensional vector $X=\left(X_{1},\ldots,X_{n}\right)'$ is a random vector if $X_{1},\ldots,X_{n}$ are random variables (defined on the same probability space).

We will mostly discuss bivariate distributions, but the results generally extend to the $n$-dimensional case.

## Joint CDF

The joint cdf of random vector $\left(X,Y\right)$ is the function $F_{X,Y}:\mathbb{R}^{2}\rightarrow\left[0,1\right]$, given by

$F_{X,Y}\left(x,y\right)=P\left(X\leq x,Y\leq y\right),\,\forall\left(x,y\right)'\in\mathbb{R}^{2}.$

## Joint PMF/PDF

• $\left(X,Y\right)'$ is discrete if $\exists f_{X,Y}:\mathbb{R}^{2}\rightarrow[0,1]$ s.t. $F_{X,Y}\left(x,y\right)=\sum_{s\leq x}\sum_{t\leq y}f_{X,Y}\left(s,t\right),\,\forall\left(x,y\right)'\in\mathbb{R}^{2}.$
• $\left(X,Y\right)'$ is continuous if $\exists f_{X,Y}:\mathbb{R}^{2}\rightarrow\mathbb{R}_{+}$ s.t. $F_{X,Y}\left(x,y\right)=\int_{-\infty}^{x}\int_{-\infty}^{y}f_{X,Y}\left(s,t\right)dtds,\,\forall\left(x,y\right)'\in\mathbb{R}^{2}.$

The remaining properties from the univariate case extend: any function with the appropriate domain and codomain that sums (or integrates) to one is a pmf/pdf. Expectations also extend intuitively.

Let $g:\mathbb{R}^{2}\rightarrow\mathbb{R}$. Its expected value is equal to

$E\left(g\left(X,Y\right)\right)=\begin{cases} \sum_{\left(s,t\right)\in\mathbb{R}^{2}}g\left(s,t\right)f_{X,Y}\left(s,t\right), & \text{if}\,\left(X,Y\right)'\,\text{is discrete}\\ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g\left(s,t\right)f_{X,Y}\left(s,t\right)dtds, & \text{if}\,\left(X,Y\right)'\,\text{is continuous} \end{cases}$
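As a small worked example of the discrete case, we can compute $E\left(g\left(X,Y\right)\right)$ for $g\left(x,y\right)=xy$ under a made-up joint pmf (the probabilities below are purely illustrative):

```python
# A made-up discrete joint pmf on {0,1}^2; probabilities are illustrative.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # valid pmf: sums to one

# E[g(X, Y)] for g(x, y) = x * y, i.e., E(XY).
def g(s, t):
    return s * t

exy = sum(g(s, t) * p for (s, t), p in pmf.items())
assert abs(exy - 0.4) < 1e-12  # only (1, 1) contributes: 1 * 1 * 0.4
```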