Full Lecture 8


Point Estimation

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with cdf [math]F\left(\left.\cdot\right|\theta\right)[/math] where [math]\theta\in\Theta[/math] is unknown.

A point estimator is any function [math]\omega\left(X_{1}..X_{n}\right)[/math].

Notice that a point estimator is a statistic, which means that it, too, is a random variable: for different random samples, we will obtain different realized values of the estimator.

We call the realized value of an estimator (i.e., the value of the statistic evaluated at the realized values of a random sample) an estimate.

Clearly, a good estimator will be close to [math]\theta[/math] in some probabilistic sense. Finally, an estimator cannot use the true value of [math]\theta[/math] itself.

We consider two methods for point estimation: The method of moments and maximum likelihood.


Method of Moments

Example: Bernoulli

We start with an example. Suppose [math]X_{i}\sim Ber\left(p\right)[/math], where [math]p\in\left[0,1\right][/math] is unknown.

The method of moments operates by equating sample moments to population moments. To estimate [math]p[/math], we might equate the sample moment [math]\overline{x}[/math] with the population moment [math]E\left(X\right)=p[/math]. In this case, we obtain the method of moments estimator [math]\widehat{\theta}_{MM}=\widehat{p}_{MM}[/math] (we use [math]\widehat{}[/math]s to refer to estimators) by taking the moment equation [math]E\left(X\right)=p[/math], replacing the moment [math]E\left(X\right)[/math] by its sample analogue [math]\overline{x}[/math], and replacing the parameter [math]p[/math] by the estimator [math]\widehat{p}_{MM}[/math], which yields [math]\overline{x}=\widehat{p}_{MM}[/math].

In other words, the method of moments estimate for [math]p[/math] is the proportion of observed 1s in the sample, [math]\overline{x}[/math]. Sometimes the parameter of interest does not equal [math]E\left(X\right)[/math] directly, in which case we would have to solve the resulting equation for [math]\widehat{p}_{MM}[/math].
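As a minimal numerical sketch (assuming numpy is available; the sample values are invented purely for illustration), the estimator is simply the sample mean:

  import numpy as np

  # Hypothetical Bernoulli sample (values invented for illustration)
  x = np.array([1, 0, 0, 1, 1, 0, 1, 1])

  # Method of moments: equate the population moment E(X) = p to the sample mean
  p_hat_mm = x.mean()
  print(p_hat_mm)  # proportion of observed 1s, here 5/8 = 0.625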

Method of Moments

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta\subseteq\mathbb{R}[/math] is unknown.

And let [math]\mu\left(\cdot\right):\Theta\rightarrow\mathbb{R}[/math] (aka the moment equation) be given by [math]\mu\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}xf\left(\left.x\right|\theta\right), & x\,discrete\\ \int_{-\infty}^{\infty}xf\left(\left.x\right|\theta\right)dx, & x\,continuous \end{cases}[/math]

The method of moments estimator [math]\widehat{\theta}_{MM}[/math] is the solution to equation

[math]\mu\left(\widehat{\theta}_{MM}\right)=\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}.[/math]

The method of moments estimator can be summarized as equating the population mean (which depends on the parameter [math]\theta[/math]) to the sample mean [math]\overline{x}[/math], and solving for the unknown parameter [math]\theta[/math]; the solution is [math]\widehat{\theta}_{MM}[/math].

Up to now, the method could as well be called “method of moment” estimator, since it only uses the first moment of [math]X[/math]. Additional moments will be used depending on the number of parameters we would like to estimate.

Example: Uniform

Let [math]X_{i}\overset{iid}{\sim}U\left(0,\theta\right)[/math], where [math]\theta\gt 0[/math] is unknown. In this case, our moment equation is [math]\mu\left(\theta\right)=E\left(X\right)=\frac{\theta}{2}[/math].

For estimation, we replace [math]\theta[/math] by [math]\widehat{\theta}_{MM}[/math] and equate [math]E\left(X\right)[/math] to [math]\overline{x}[/math], to obtain

[math]\begin{aligned} \mu\left(\widehat{\theta}_{MM}\right) & =\frac{\widehat{\theta}_{MM}}{2}=\overline{x}\\ & \Leftrightarrow\widehat{\theta}_{MM}=2\overline{x}.\end{aligned}[/math]

So, the method of moments estimator for the upper bound of the uniform distribution is simply twice the observed mean.

It turns out this is not a very good estimator of [math]\theta[/math], because the estimate [math]\widehat{\theta}_{MM}[/math] can fall below the maximum value of [math]x_{i}[/math] observed in the sample. For example, suppose [math]\overline{x}[/math] is 0.4, making our estimate [math]\widehat{\theta}_{MM}=0.8[/math]. This is not reasonable if the highest draw was, say, 1.2: from that draw alone, we immediately know that [math]\theta\geq1.2[/math]! However, the method of moments estimator ignores the bounds of the uniform distribution, and so fails to account for the fact that [math]\widehat{\theta}=1.2[/math] is clearly superior to [math]\widehat{\theta}_{MM}=0.8[/math] in this case.
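The following sketch (assuming numpy; the sample is chosen to reproduce the numbers above) shows this failure mode in code: the estimate [math]2\overline{x}[/math] lands below the largest observation.

  import numpy as np

  # Illustrative sample constructed so that x-bar = 0.4, matching the numbers above
  x = np.array([0.1, 0.2, 0.1, 1.2, 0.4])

  theta_hat_mm = 2 * x.mean()  # method of moments: theta-hat = 2 * x-bar
  print(theta_hat_mm)          # 0.8
  print(x.max())               # 1.2, so theta >= 1.2 is certain, yet the estimate is only 0.8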

Multiple Parameters

When we want to estimate multiple parameters, we simply solve more equations, by including additional moments (typically in ascending order):

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta\subseteq\mathbb{R}^{k}[/math] is unknown.

For [math]j=1..k[/math], let [math]\mu_{j}:\Theta\rightarrow\mathbb{R}[/math] be given by

[math]\mu_{j}\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}x^{j}f\left(\left.x\right|\theta\right), & x\,discrete\\ \int_{-\infty}^{\infty}x^{j}f\left(\left.x\right|\theta\right)dx, & x\,continuous \end{cases}[/math]

The method of moments estimator [math]\widehat{\theta}_{MM}[/math] solves the system of equations

[math]\mu_{j}\left(\widehat{\theta}_{MM}\right)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{j},\,j=1..k[/math]

Consider the following example. Suppose [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] where [math]\mu\in\mathbb{R}[/math] and [math]\sigma^{2}\gt 0[/math] are unknown.

Then the moment equations are:

[math]\begin{aligned} \mu_{1}\left(\mu,\sigma^{2}\right) & =E\left(X\right)=\mu\\ \mu_{2}\left(\mu,\sigma^{2}\right) & =E\left(X^{2}\right)=\sigma^{2}+\mu^{2}\end{aligned}[/math]

Plugging in [math]\widehat{\mu}_{MM}[/math] and [math]\widehat{\sigma^{2}}_{MM}[/math] and equating the population moments to their sample analogues yields the system

[math]\left\{ \begin{array}{l} \mu_{1}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\mu}_{MM}=\overline{x}\\ \mu_{2}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\sigma^{2}}_{MM}+\widehat{\mu}_{MM}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\Leftrightarrow\left\{ \begin{array}{l} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}+\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\Leftrightarrow\left\{ \begin{array}{l} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2} \end{array}\right.[/math]

The last expression can be further simplified:

[math]\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}^{2}-\overline{x}^{2}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(\left(x_{i}-\overline{x}\right)^{2}+2x_{i}\overline{x}-2\overline{x}^{2}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}+\frac{2}{n}\underset{=0}{\underbrace{\left[\sum_{i=1}^{n}x_{i}\overline{x}-n\overline{x}^{2}\right]}}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}.[/math]

In this case, we had to solve a system of equations, using the first and second moments of [math]X[/math].

While the estimator [math]\widehat{\mu}_{MM}[/math] is intuitive, the estimator [math]\widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}[/math] is a bit surprising, since we know that [math]E\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)\neq\sigma^{2}[/math]. Rather, we know that [math]E\left(s^{2}\right)=E\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)=\sigma^{2}[/math], so it is possible that the method of moments estimator for [math]\sigma^{2}[/math] can be improved: on average, it produces a biased estimate of [math]\sigma^{2}[/math]. We will return to this point with a formal analysis.
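A short simulation (a sketch assuming numpy; the sample size and parameter values are arbitrary) illustrates the downward bias by averaging both variance estimators over many samples:

  import numpy as np

  rng = np.random.default_rng(0)
  n, mu, sigma2 = 10, 2.0, 4.0                # arbitrary illustrative values
  samples = rng.normal(mu, np.sqrt(sigma2), size=(100_000, n))

  var_mm = samples.var(axis=1, ddof=0)        # (1/n) * sum (x_i - x-bar)^2, the MoM estimator
  var_s2 = samples.var(axis=1, ddof=1)        # s^2, with the 1/(n-1) correction

  print(var_mm.mean())  # roughly (n-1)/n * sigma^2 = 3.6: biased downward
  print(var_s2.mean())  # roughly sigma^2 = 4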

For cases with complicated estimators, involving non-linear equations for example, one may solve the system of equations with the help of a computer.
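As a sketch of this computational route (assuming scipy is available), we can hand the moment equations to a numerical root finder; here we reuse the normal moment equations from above, so the closed-form answers [math]\widehat{\mu}_{MM}=\overline{x}[/math] and [math]\widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}[/math] serve as a check.

  import numpy as np
  from scipy.optimize import fsolve

  rng = np.random.default_rng(1)
  x = rng.normal(2.0, 1.5, size=200)          # illustrative normal sample

  def moment_equations(theta):
      mu, sigma2 = theta
      # population moments minus their sample analogues (normal case)
      return [mu - x.mean(),
              (sigma2 + mu**2) - np.mean(x**2)]

  mu_hat, sigma2_hat = fsolve(moment_equations, x0=[0.0, 1.0])
  print(mu_hat, sigma2_hat)                   # agree with x.mean() and x.var(ddof=0)

The starting value x0 is arbitrary; for well-behaved moment equations like these the root finder converges to the same solution as the algebraic derivation.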


Maximum Likelihood

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta[/math] is unknown.

The maximum likelihood estimator (MLE) [math]\widehat{\theta}_{ML}[/math] is the value of [math]\theta\in\Theta[/math] that maximizes [math]L\left(\left.\theta\right|x_{1}..x_{n}\right)[/math], where [math]L\left(\cdot\left|x_{1}..x_{n}\right.\right):\Theta\rightarrow\mathbb{R}_{+}[/math] (the codomain is [math]\left[0,1\right][/math] in the case of a pmf) is given by

[math]L\left(\theta\left|x_{1}..x_{n}\right.\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right),\,\theta\in\Theta.[/math]

Function [math]L\left(\theta\left|x_{1}..x_{n}\right.\right)[/math] is called the likelihood function.

Because [math]X_{1}..X_{n}[/math] is a random sample, in the discrete case the likelihood equals the probability of the observed sample having occurred, given parameter [math]\theta[/math]. So the intuition for the maximum likelihood estimator is that we look for the value of [math]\theta[/math] that maximizes the probability of the observed sample having occurred.

The maximum likelihood estimator has some incredibly useful properties, which we will discuss later.

log-Likelihood

Sometimes, a convenient object to work with is the log-likelihood function, given by

[math]l\left(\theta\left|x_{1}..x_{n}\right.\right)=\log\,L\left(\theta\left|x_{1}..x_{n}\right.\right)=\sum_{i=1}^{n}\log f\left(\left.x_{i}\right|\theta\right)[/math]

The last identity follows from the fact that the log of a product equals the sum of the logs. Notice that because [math]\log\left(\cdot\right)[/math] is a strictly increasing function, the log-likelihood is maximized by the same [math]\widehat{\theta}_{ML}[/math].

In order to compute the maximum likelihood estimator, we simply write down the likelihood (or log-likelihood) function and maximize it w.r.t. the parameters of interest.
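As a quick numerical illustration (a sketch assuming numpy, using the Bernoulli likelihood and log-likelihood derived in the example that follows), a grid search shows that [math]L[/math] and [math]l[/math] are maximized at the same value of [math]p[/math]:

  import numpy as np

  x = np.array([1, 0, 1, 1, 0, 1])            # illustrative Bernoulli sample
  s, n = x.sum(), x.size

  p = np.linspace(0.001, 0.999, 999)          # grid over the interior of [0, 1]
  L = p**s * (1 - p)**(n - s)                 # likelihood
  l = s * np.log(p) + (n - s) * np.log(1 - p) # log-likelihood

  print(p[L.argmax()], p[l.argmax()])         # same maximizer, approximately s/n = 4/6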

Example: Bernoulli

Suppose [math]X_{i}\overset{iid}{\sim}Ber\left(p\right)[/math], where [math]p\in\left[0,1\right][/math] is unknown.

The marginal pmf equals [math]f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)[/math]. We can ignore [math]1\left(x\in\left\{ 0,1\right\} \right)[/math], since by assumption it is satisfied in the sample.

The likelihood function equals [math]L\left(\left.p\right|x_{1}..x_{n}\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|p\right)=\Pi_{i=1}^{n}\left(p^{x_{i}}\left(1-p\right)^{1-x_{i}}\right)=p^{\sum_{i=1}^{n}x_{i}}\left(1-p\right)^{n-\sum_{i=1}^{n}x_{i}},\,p\in\left[0,1\right][/math]

The log-likelihood function equals [math]l\left(\left.p\right|x_{1}..x_{n}\right)=\sum_{i=1}^{n}x_{i}\log\left(p\right)+\left(n-\sum_{i=1}^{n}x_{i}\right)\log\left(1-p\right),\,p\in\left(0,1\right)[/math]

(Because [math]\log\left(0\right)=-\infty[/math], we will inspect [math]p=0[/math] and [math]p=1[/math] separately.)

We look for an interior solution:

[math]\begin{aligned} foc\left(p\right): & \,\frac{\partial}{\partial p}l\left(\left.p\right|x_{1}..x_{n}\right)=0\\ \Leftrightarrow & \,\frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}=0\\ \Leftrightarrow & \,\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}\end{aligned}[/math]

Verifying the second-order condition, [math]\frac{\partial^{2}}{\partial p^{2}}l\left(\left.p\right|x_{1}..x_{n}\right)=-\frac{\sum_{i=1}^{n}x_{i}}{p^{2}}-\frac{n-\sum_{i=1}^{n}x_{i}}{\left(1-p\right)^{2}}\lt 0[/math], reveals that our estimator is indeed a maximum.

Finally, consider the possible exceptions. When [math]\frac{\sum_{i=1}^{n}x_{i}}{n}=0[/math], the candidate value [math]\widehat{p}_{ML}=0[/math] lies outside the domain [math]\left(0,1\right)[/math] on which we defined the log-likelihood.

Similarly, when [math]\frac{\sum_{i=1}^{n}x_{i}}{n}=1[/math], the candidate value falls outside the domain of our log-likelihood. We can use the likelihood function itself to determine the estimator in these two cases:

  • [math]\sum_{i=1}^{n}x_{i}=0[/math], [math]\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,\left(1-p\right)^{n}\Rightarrow\widehat{p}_{ML}=0[/math]
  • [math]\sum_{i=1}^{n}x_{i}=n[/math], [math]\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,p^{n}\Rightarrow\widehat{p}_{ML}=1[/math]

Interestingly, our original formula [math]\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}[/math] also agrees with these two cases, and so we have found that for any admissible sample, [math]\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}[/math].
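As a brief check (a sketch assuming numpy; the samples are illustrative), the single formula [math]\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}[/math] covers the two boundary samples as well as the interior case:

  import numpy as np

  # Three illustrative samples: all zeros, all ones, and a mixed sample
  for x in (np.array([0, 0, 0, 0]),
            np.array([1, 1, 1, 1]),
            np.array([1, 0, 1, 1, 0])):
      print(x.mean())  # the MLE sum(x_i)/n: 0.0, 1.0 and 0.6 respectively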

Notice also that the method of moments estimator coincides with the maximum likelihood estimator (this is not guaranteed by any means).