Full Lecture 8


Point Estimation

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with cdf [math]F\left(\left.\cdot\right|\theta\right)[/math] where [math]\theta\in\Theta[/math] is unknown.

A point estimator is any function [math]\omega\left(X_{1}..X_{n}\right)[/math] of the random sample.

Notice that a point estimator is a statistic. This means that it, too, is a random variable: different random samples will generally produce different estimates.

We refer to the realized value of an estimator (i.e., the value of the statistic applied to the realized values of a random sample) as an estimate.

Clearly, a good estimator should be close to [math]\theta[/math] in some probabilistic sense. Note also that an estimator cannot depend on the true value of [math]\theta[/math] itself, which is unknown.

We consider two methods for point estimation: The method of moments and maximum likelihood.


Method of Moments

Example: Bernoulli

We start with an example. Suppose [math]X_{i}\sim Ber\left(p\right)[/math], where [math]p\in\left[0,1\right][/math] is unknown.

The method of moments operates by equating sample moments to population moments. To estimate [math]p[/math], we equate the sample moment [math]\overline{x}[/math] with the population moment [math]E\left(X\right)=p[/math]. We obtain the method of moments estimator, [math]\widehat{\theta}_{MM}=\widehat{p}_{MM}[/math] (we use [math]\widehat{}[/math] to denote estimators), by taking the moment equation [math]E\left(X\right)=p[/math], replacing the moment [math]E\left(X\right)[/math] with its sample analogue [math]\overline{x}[/math], and replacing the parameter [math]p[/math] with the estimator [math]\widehat{p}_{MM}[/math], to obtain [math]\overline{x}=\widehat{p}_{MM}[/math].

In other words, the method of moments estimate for [math]p[/math] is the proportion of observed 1s ([math]\overline{x}[/math]) in the sample. When the parameter of interest does not equal [math]E\left(X\right)[/math] directly, we have to solve the resulting moment equation for the estimator.
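
For concreteness, here is a minimal numerical sketch of this estimate, assuming NumPy and a hypothetical 0/1 sample (the data and values are purely illustrative):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical sample of Bernoulli (0/1) outcomes; in practice these are observed data.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])

# Method of moments: equate the population moment E(X) = p to the sample mean x-bar.
p_hat_mm = x.mean()
print(p_hat_mm)  # proportion of observed 1s; here 0.6
</syntaxhighlight>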

Method of Moments

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta\subseteq\mathbb{R}[/math] is unknown.

And let [math]\mu\left(\cdot\right):\Theta\rightarrow\mathbb{R}[/math] (aka the moment equation) be given by [math]\mu\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}xf\left(\left.x\right|\theta\right), & x\,discrete\\ \int_{-\infty}^{\infty}xf\left(\left.x\right|\theta\right)dx, & x\,continuous \end{cases}[/math]

The method of moments estimator [math]\widehat{\theta}_{MM}[/math] is the solution to equation

[math]\mu\left(\widehat{\theta}_{MM}\right)=\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}.[/math]

The method of moments estimator can be summarized as equating the population mean (which depends on the parameter [math]\theta[/math]) to the sample mean [math]\overline{x}[/math], and solving for the unknown parameter [math]\theta[/math]; we call the solution [math]\widehat{\theta}_{MM}[/math].

Up to now, the method could as well be called “method of moment” estimator, since it only uses the first moment of [math]X[/math]. Additional moments will be used depending on the number of parameters we would like to estimate.

Example: Uniform

Let [math]X_{i}\overset{iid}{\sim}U\left(0,\theta\right)[/math], where [math]\theta\gt 0[/math] is unknown. In this case, our moment equation is [math]\mu\left(\theta\right)=E\left(X\right)=\frac{\theta}{2}[/math].

For estimation, we replace [math]\theta[/math] by [math]\widehat{\theta}_{MM}[/math] and equate [math]E\left(X\right)[/math] to [math]\overline{x}[/math], to obtain

[math]\begin{aligned} \mu\left(\widehat{\theta}_{MM}\right) & =\frac{\widehat{\theta}_{MM}}{2}=\overline{x}\\ & \Leftrightarrow\widehat{\theta}_{MM}=2\overline{x}.\end{aligned}[/math]

So, the method of moments estimator for the upper bound of the uniform distribution is simply twice the observed mean.

It turns out this is not a very good estimator of [math]\theta[/math], because we can obtain an estimate [math]\widehat{\theta}_{MM}[/math] that falls below the maximum value of [math]x_{i}[/math] observed in the sample. For example, suppose [math]\overline{x}[/math] is 0.4, making our estimate [math]\widehat{\theta}_{MM}=0.8[/math]. This is not reasonable if the highest draw was, say, 1.2: from that observation alone we immediately know that [math]\theta\geq1.2[/math]. The method of moments estimator ignores the support of the uniform distribution, and so fails to rule out estimates that fall below the largest value observed in the sample.
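
This pitfall is easy to reproduce by simulation. The following sketch assumes NumPy; the "true" [math]\theta[/math], the sample size, and the seed are arbitrary choices made purely for illustration:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative choices: a "true" theta and a small sample size.
theta_true, n = 1.5, 20
x = rng.uniform(0.0, theta_true, size=n)

theta_mm = 2 * x.mean()   # method of moments estimate: twice the sample mean
print(theta_mm, x.max())

# The estimate can fall below the largest observation, which is logically
# impossible for the true upper bound theta.
if theta_mm < x.max():
    print("theta_mm falls below max(x_i)")
</syntaxhighlight>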

Multiple Parameters

When we want to estimate multiple parameters, we simply solve more equations, by including additional moments (typically in ascending order):

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta\subseteq\mathbb{R}^{k}[/math] is unknown.

For [math]j=1..k[/math], let [math]\mu_{j}:\Theta\rightarrow\mathbb{R}[/math] be given by

[math]\mu_{j}\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}x^{j}f\left(\left.x\right|\theta\right), & x\,discrete\\ \int_{-\infty}^{\infty}x^{j}f\left(\left.x\right|\theta\right)dx, & x\,continuous \end{cases}[/math]

The method of moments estimator [math]\widehat{\theta}_{MM}[/math] solves the system of equations

[math]\mu_{j}\left(\widehat{\theta}_{MM}\right)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{j},\,j=1..k[/math]

Consider the following example. Suppose [math]X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)[/math] where [math]\mu\in\mathbb{R}[/math] and [math]\sigma^{2}\gt 0[/math] are unknown.

Then, the moment equations are:

[math]\begin{aligned} \mu_{1}\left(\mu,\sigma^{2}\right) & =E\left(X\right)=\mu\\ \mu_{2}\left(\mu,\sigma^{2}\right) & =E\left(X^{2}\right)=\sigma^{2}+\mu^{2}\end{aligned}[/math]

Plugging in [math]\widehat{\mu}_{MM}[/math], [math]\widehat{\sigma^{2}}_{MM}[/math] and equating the population moments to their sample analogues yields the system

[math]\left\{ \begin{array}{c} \mu_{1}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\mu}_{MM}=\overline{x}\\ \mu_{2}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\sigma^{2}}_{MM}+\widehat{\mu}_{MM}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\Leftrightarrow\left\{ \begin{array}{c} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}+\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\Leftrightarrow\left\{ \begin{array}{c} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2} \end{array}\right.[/math]

The last expression can be simplified further:

[math]\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}^{2}-2x_{i}\overline{x}+\overline{x}^{2}\right)+\frac{2}{n}\underset{=0}{\underbrace{\left[\sum_{i=1}^{n}x_{i}\overline{x}-n\overline{x}^{2}\right]}}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2},[/math]

where the bracketed term vanishes because [math]\sum_{i=1}^{n}x_{i}\overline{x}=\overline{x}\sum_{i=1}^{n}x_{i}=n\overline{x}^{2}[/math]. Hence [math]\widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}[/math].

In this case, we had to solve a system of equations, and used the first and second moments of [math]X[/math].

While the estimator [math]\widehat{\mu}_{MM}[/math] is intuitive, the estimator [math]\widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}[/math] is a bit surprising, since we know that [math]E\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)\neq\sigma^{2}[/math]. Rather, we know that [math]E\left(s^{2}\right)=E\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)=\sigma^{2}[/math], so it is possible that the method of moments estimator for [math]\sigma^{2}[/math] can be improved: as it stands, it is a biased estimator of [math]\sigma^{2}[/math]. We will return to this point with a formal analysis.
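
The bias is easy to see in a small simulation. The sketch below assumes NumPy; the true parameter values, the sample size, and the number of replications are arbitrary illustrative choices:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative choices of the true parameters and the simulation size.
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

# reps samples of size n from N(mu, sigma2), one sample per row.
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# Method of moments variance estimate, (1/n) * sum (x_i - xbar)^2, for each sample.
est = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)

# The average estimate falls short of sigma2 (it is close to sigma2 * (n - 1) / n),
# illustrating the bias discussed above.
print(est.mean(), sigma2)
</syntaxhighlight>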

For more complicated cases, for example when the moment equations are non-linear, one may solve the system of equations numerically with the help of a computer, as the sketch below illustrates.
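
The following sketch assumes NumPy and SciPy and uses a hypothetical data vector. It solves the two normal moment equations from the previous example with a generic root finder, even though a closed-form solution exists, simply to show the general recipe:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import fsolve

# Hypothetical data vector.
x = np.array([1.2, -0.4, 0.3, 2.1, 0.8, -1.0, 0.5])

m1 = x.mean()          # sample analogue of E(X)
m2 = np.mean(x ** 2)   # sample analogue of E(X^2)

def moment_equations(params):
    mu, sigma2 = params
    # population moments minus their sample analogues
    return [mu - m1, sigma2 + mu ** 2 - m2]

mu_mm, sigma2_mm = fsolve(moment_equations, x0=[0.0, 1.0])
print(mu_mm, sigma2_mm)  # matches xbar and (1/n) * sum (x_i - xbar)^2
</syntaxhighlight>

The same recipe applies when the moment equations admit no closed-form solution.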


Maximum Likelihood

Let [math]X_{1}..X_{n}[/math] be a random sample from a distribution with pmf/pdf [math]f\left(\left.\cdot\right|\theta\right)[/math], where [math]\theta\in\Theta[/math] is unknown.

The maximum likelihood estimator (MLE) [math]\widehat{\theta}_{ML}[/math] maximizes [math]L\left(\widehat{\theta}_{ML}\left|x_{1}..x_{n}\right.\right)[/math] where [math]L\left(\cdot\left|x_{1}..x_{n}\right.\right):\Theta\rightarrow\mathbb{R}_{+}[/math] (codomain is [math]\left[0,1\right][/math] in the case of the pmf) is given by

[math]L\left(\theta\left|x_{1}..x_{n}\right.\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right),\,\theta\in\Theta.[/math]

Function [math]L\left(\theta\left|x_{1}..x_{n}\right.\right)[/math] is called the likelihood function.

In the discrete case, the joint pmf equals the probability of the sample having occurred, given parameter [math]\theta[/math]. So, the intuition for the maximum likelihood estimator is that we look for the value of [math]\theta[/math] that maximizes the probability of the observed sample having occurred.
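
Concretely, in the discrete case independence of the sample gives

[math]L\left(\theta\left|x_{1}..x_{n}\right.\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right)=P\left(\left.X_{1}=x_{1},\,\ldots,\,X_{n}=x_{n}\right|\theta\right),[/math]

so maximizing the likelihood over [math]\theta[/math] picks the parameter value under which the observed sample was most probable.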

The maximum likelihood estimator has some incredibly useful properties, which we will discuss later.

log-Likelihood

Sometimes, a convenient object to work with is the log-likelihood function, given by

[math]l\left(\theta\left|x_{1}..x_{n}\right.\right)=\log\,L\left(\theta\left|x_{1}..x_{n}\right.\right)=\sum_{i=1}^{n}\log f\left(\left.x_{i}\right|\theta\right)[/math]

The last identity follows from the fact that the log of a product equals the sum of the logs. Notice that because [math]\log\left(\cdot\right)[/math] is a strictly increasing function, the log-likelihood is maximized by the same [math]\widehat{\theta}_{ML}[/math] that maximizes the likelihood.

In order to compute the maximum likelihood estimator, we simply need to write down the likelihood (or log-likelihood) function and maximize it w.r.t. the parameters of interest.

Example: Bernoulli

Suppose [math]X_{i}\overset{iid}{\sim}Ber\left(p\right)[/math], where [math]p\in\left[0,1\right][/math] is unknown.

The marginal pmf equals [math]f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)[/math]. We can ignore [math]1\left(x\in\left\{ 0,1\right\} \right)[/math], since by assumption it is satisfied in the sample.

The likelihood function equals [math]L\left(\left.p\right|x_{1}..x_{n}\right)=\Pi_{i=1}^{n}f\left(\left.x_{i}\right|p\right)=\Pi_{i=1}^{n}\left(p^{x_{i}}\left(1-p\right)^{1-x_{i}}\right)=p^{\sum_{i=1}^{n}x_{i}}\left(1-p\right)^{n-\sum_{i=1}^{n}x_{i}},\,p\in\left[0,1\right][/math]

The log-likelihood function equals [math]l\left(\left.p\right|x_{1}..x_{n}\right)=\sum_{i=1}^{n}x_{i}\log\left(p\right)+\left(n-\sum_{i=1}^{n}x_{i}\right)\log\left(1-p\right),\,p\in\left(0,1\right)[/math]

(Because [math]\log\left(0\right)=-\infty[/math], we will inspect [math]p=0[/math] and [math]p=1[/math] separately.)

We look for an interior solution:

[math]\begin{aligned} foc\left(p\right): & =\frac{\partial}{\partial p}l\left(\left.p\right|x_{1}..x_{n}\right)=0\\ \Leftrightarrow & \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}=0\\ \Leftrightarrow & \widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}\end{aligned}[/math]

Verifying the second-order condition reveals that our estimator is indeed a maximum.
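
Concretely, differentiating the first-order condition once more gives

[math]\frac{\partial^{2}}{\partial p^{2}}l\left(\left.p\right|x_{1}..x_{n}\right)=-\frac{\sum_{i=1}^{n}x_{i}}{p^{2}}-\frac{n-\sum_{i=1}^{n}x_{i}}{\left(1-p\right)^{2}},[/math]

which is strictly negative for every [math]p\in\left(0,1\right)[/math], so the log-likelihood is strictly concave and the interior critical point is a global maximum.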

Finally, consider the possible boundary cases. When [math]\frac{\sum_{i=1}^{n}x_{i}}{n}=0[/math], the formula gives [math]\widehat{p}_{ML}=0[/math], but this value lies outside the interval [math]\left(0,1\right)[/math] on which we defined the log-likelihood.

Similarly, when [math]\frac{\sum_{i=1}^{n}x_{i}}{n}=1[/math], the formula gives a value outside the domain of our log-likelihood. We can use the likelihood function itself to determine the estimator in these two cases:

  • [math]\sum_{i=1}^{n}x_{i}=0[/math], [math]\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,\left(1-p\right)^{n}\Rightarrow\widehat{p}_{ML}=0[/math]
  • [math]\sum_{i=1}^{n}x_{i}=n[/math], [math]\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,p^{n}\Rightarrow\widehat{p}_{ML}=1[/math]

Interestingly, our original formula [math]\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}[/math] also agrees with these two cases, and so we have found that for any admissible sample, [math]\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}[/math].
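
As a quick numerical sanity check, and as an illustration of the purely numerical route to maximizing a log-likelihood, the sketch below assumes NumPy and SciPy and uses a hypothetical 0/1 sample:

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical 0/1 sample.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
n, s = len(x), x.sum()

def neg_log_likelihood(p):
    # negative Bernoulli log-likelihood, valid for interior p in (0, 1)
    return -(s * np.log(p) + (n - s) * np.log(1 - p))

# Maximize the log-likelihood by minimizing its negative over (0, 1).
res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x, x.mean())  # the numerical maximizer agrees with the sample mean
</syntaxhighlight>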

Notice also that the method of moments estimator coincides with the maximum likelihood estimator (this is not guaranteed by any means).