# Point Estimation

Let $X_{1}..X_{n}$ be a random sample from a distribution with cdf $F\left(\left.\cdot\right|\theta\right)$ where $\theta\in\Theta$ is unknown.

A point estimator is any function $\omega\left(X_{1}..X_{n}\right)$.

Notice that a point estimator is a statistic. This means that it, too, is a random variable: different random samples yield different realized values of the estimator.

We call the realized value of an estimator (i.e., the value of the statistic applied to the realized values of a random sample) an estimate.

Clearly, a good estimator will be close to $\theta$ in some probabilistic sense. Finally, an estimator cannot use the true value of $\theta$ itself.

We consider two methods for point estimation: The method of moments and maximum likelihood.

# Method of Moments

## Example: Bernoulli

We start with an example. Suppose $X_{i}\sim Ber\left(p\right)$, where $p\in\left[0,1\right]$ is unknown.

The method of moments operates by equating sample moments to population moments. To estimate $p$, we might equate the sample moment $\overline{x}$ with the population moment $E\left(X\right)=p$. In this case, we obtain the method of moments estimator, $\widehat{\theta}_{MM}=\widehat{p}_{MM}$ (we use $\widehat{}$s to refer to estimators), by taking the moment equation $E\left(X\right)=p$, replacing the moment $E\left(X\right)$ by its sample analogue $\overline{x}$, and replacing the parameter $p$ by the estimator $\widehat{p}_{MM}$, to obtain $\overline{x}=\widehat{p}_{MM}$.

In other words, the method of moments estimate for $p$ is the proportion of observed 1s in the sample, $\overline{x}$. Sometimes, the parameter of interest does not equal $E\left(X\right)$, in which case we would have to solve the resulting equation for $\widehat{p}_{MM}$.
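As a minimal sketch (the sample and the helper name `mm_bernoulli` below are our own, hypothetical choices), the estimate is simply the proportion of 1s:

```python
# Sketch: method of moments for Bernoulli(p).
# The MM estimate is the sample mean, i.e., the proportion of observed 1s.

def mm_bernoulli(sample):
    """Method of moments estimate of p: the sample mean."""
    return sum(sample) / len(sample)

sample = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical realized sample
p_hat = mm_bernoulli(sample)
print(p_hat)  # 0.625, the proportion of 1s among the 8 observations
```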

## Method of Moments

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta\subseteq\mathbb{R}$ is unknown.

And let $\mu\left(\cdot\right):\Theta\rightarrow\mathbb{R}$ (the first population moment, as a function of $\theta$) be given by $\mu\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}xf\left(\left.x\right|\theta\right), & X\,\text{discrete}\\ \int_{-\infty}^{\infty}xf\left(\left.x\right|\theta\right)dx, & X\,\text{continuous} \end{cases}$

The method of moments estimator $\widehat{\theta}_{MM}$ is the solution to equation

$\mu\left(\widehat{\theta}_{MM}\right)=\overline{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}.$

The method of moments estimator can be summarized as equating the population mean (which depends on the parameter $\theta$) to the sample mean $\overline{x}$, and solving for the unknown parameter $\theta$; the solution is $\widehat{\theta}_{MM}$.

Up to now, the method could as well be called “method of moment” estimator, since it only uses the first moment of $X$. Additional moments will be used depending on the number of parameters we would like to estimate.

## Example: Uniform

Let $X_{i}\overset{iid}{\sim}U\left(0,\theta\right)$, where $\theta\gt 0$ is unknown. In this case, our moment equation is $\mu\left(\theta\right)=E\left(X\right)=\frac{\theta}{2}$.

For estimation, we replace $\theta$ by $\widehat{\theta}_{MM}$ and equate $E\left(X\right)$ to $\overline{x}$, to obtain

\begin{aligned} \mu\left(\widehat{\theta}_{MM}\right) & =\frac{\widehat{\theta}_{MM}}{2}=\overline{x}\\ & \Leftrightarrow\widehat{\theta}_{MM}=2\overline{x}.\end{aligned}

So, the method of moments estimator for the upper bound of the uniform distribution is simply twice the observed mean.

It turns out this is not a very good estimator of $\theta$, because we can obtain an estimate $\widehat{\theta}_{MM}$ that falls below the maximum value of $x_{i}$ observed in the sample. For example, suppose $\overline{x}$ is 0.4, making our estimate $\widehat{\theta}_{MM}=0.8$. This is not reasonable if the highest draw was 1.2, for example: from that draw alone, we immediately know that $\theta\geq1.2$! However, the method of moments estimator ignores the bounds of the uniform distribution, and so fails to account for the fact that $\widehat{\theta}=1.2$ is clearly superior to $\widehat{\theta}_{MM}=0.8$ in this case.
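This failure mode is easy to see in a small simulation. The sketch below (with hypothetical choices of $\theta$, sample size, and seed) counts how often $2\overline{x}$ falls below the sample maximum:

```python
import random

# Sketch: the MM estimate 2*mean for U(0, theta) can fall below max(x_i),
# even though any draw above the estimate proves theta must be larger.
random.seed(0)          # hypothetical seed, for reproducibility
theta, n = 1.0, 10      # hypothetical true parameter and sample size

def mm_uniform(sample):
    """Method of moments estimate of theta: twice the sample mean."""
    return 2 * sum(sample) / len(sample)

trials = 10_000
infeasible = 0
for _ in range(trials):
    sample = [random.uniform(0, theta) for _ in range(n)]
    if mm_uniform(sample) < max(sample):
        infeasible += 1
print(infeasible / trials)  # fraction of samples with an infeasible estimate
```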

## Multiple Parameters

When we want to estimate multiple parameters, we simply solve more equations, by including additional moments (typically in ascending order):

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta\subseteq\mathbb{R}^{k}$ is unknown.

For $j=1..k$, let $\mu_{j}:\Theta\rightarrow\mathbb{R}$ be given by

$\mu_{j}\left(\theta\right)=\begin{cases} \sum_{x\in\mathbb{R}}x^{j}f\left(\left.x\right|\theta\right), & X\,\text{discrete}\\ \int_{-\infty}^{\infty}x^{j}f\left(\left.x\right|\theta\right)dx, & X\,\text{continuous} \end{cases}$

The method of moments estimator $\widehat{\theta}_{MM}$ solves the system of equations

$\mu_{j}\left(\widehat{\theta}_{MM}\right)=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{j},\,j=1..k$

Consider the following example. Suppose $X_{i}\overset{iid}{\sim}N\left(\mu,\sigma^{2}\right)$ where $\mu\in\mathbb{R}$ and $\sigma^{2}\gt 0$ are unknown.

Then, the moment equations are:

\begin{aligned} \mu_{1}\left(\mu,\sigma^{2}\right) & =E\left(X\right)=\mu\\ \mu_{2}\left(\mu,\sigma^{2}\right) & =E\left(X^{2}\right)=\sigma^{2}+\mu^{2}\end{aligned}

Plugging in $\widehat{\mu}_{MM}$, $\widehat{\sigma^{2}}_{MM}$ and equating the population moments to their sample analogues yields the system

\begin{aligned} & \left\{ \begin{array}{c} \mu_{1}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\mu}_{MM}=\overline{x}\\ \mu_{2}\left(\widehat{\mu}_{MM},\widehat{\sigma^{2}}_{MM}\right)=\widehat{\sigma^{2}}_{MM}+\widehat{\mu}_{MM}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\\ \Leftrightarrow & \left\{ \begin{array}{c} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}+\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} \end{array}\right.\Leftrightarrow\left\{ \begin{array}{c} \widehat{\mu}_{MM}=\overline{x}\\ \widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2} \end{array}\right.\end{aligned}

The last expression can be further simplified: $\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}-\overline{x}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}^{2}-2x_{i}\overline{x}+\overline{x}^{2}\right)=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}$, where the first equality uses $\sum_{i=1}^{n}x_{i}\overline{x}=n\overline{x}^{2}=\sum_{i=1}^{n}\overline{x}^{2}$.

In this case, we had to solve a system of equations, using the first and second moments of $X$.

While the estimator $\widehat{\mu}_{MM}$ is intuitive, the estimator $\widehat{\sigma^{2}}_{MM}=\frac{1}{n}\sum_{i=1}^{n}\left(x_{i}-\overline{x}\right)^{2}$ is a bit surprising, since we know that $E\left(\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)\neq\sigma^{2}$. Rather, we know that $E\left(s^{2}\right)=E\left(\frac{1}{n-1}\sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2}\right)=\sigma^{2}$, so it is possible that the method of moments estimator for $\sigma^{2}$ can be improved: on average, it produces a biased estimate of $\sigma^{2}$. We will return to this point with a formal analysis.
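The bias is visible in a small Monte Carlo check. The sketch below uses hypothetical values of $\mu$, $\sigma^{2}$, $n$, and the seed, and `mm_normal` is our own helper implementing the two moment equations:

```python
import random

# Sketch: MM estimators for N(mu, sigma^2), plus a Monte Carlo check that the
# 1/n variance estimator is biased: E[sigma2_hat] = (n-1)/n * sigma^2.
random.seed(1)                           # hypothetical seed
mu_true, sigma2_true, n = 2.0, 4.0, 5    # hypothetical parameters and sample size

def mm_normal(sample):
    """Method of moments estimates (mu_hat, sigma2_hat)."""
    m = len(sample)
    mu_hat = sum(sample) / m
    sigma2_hat = sum(x * x for x in sample) / m - mu_hat ** 2
    return mu_hat, sigma2_hat

trials = 20_000
avg_sigma2 = 0.0
for _ in range(trials):
    sample = [random.gauss(mu_true, sigma2_true ** 0.5) for _ in range(n)]
    avg_sigma2 += mm_normal(sample)[1] / trials

print(avg_sigma2)                # close to (n-1)/n * 4 = 3.2, not to 4
print(avg_sigma2 * n / (n - 1))  # rescaling by n/(n-1) recovers roughly 4
```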

For cases with complicated estimators, involving non-linear equations for example, one may solve the system of equations with the help of a computer.

# Maximum Likelihood

Let $X_{1}..X_{n}$ be a random sample from a distribution with pmf/pdf $f\left(\left.\cdot\right|\theta\right)$, where $\theta\in\Theta$ is unknown.

The maximum likelihood estimator (MLE) $\widehat{\theta}_{ML}$ maximizes $L\left(\theta\left|x_{1}..x_{n}\right.\right)$ over $\theta\in\Theta$, where $L\left(\cdot\left|x_{1}..x_{n}\right.\right):\Theta\rightarrow\mathbb{R}_{+}$ (the codomain is $\left[0,1\right]$ in the case of a pmf) is given by

$L\left(\theta\left|x_{1}..x_{n}\right.\right)=\prod_{i=1}^{n}f\left(\left.x_{i}\right|\theta\right),\,\theta\in\Theta.$

Function $L\left(\theta\left|x_{1}..x_{n}\right.\right)$ is called the likelihood function.

Because $X_{1}..X_{n}$ is a random sample, the likelihood equals the joint pmf/pdf of the sample evaluated at the observed values; in the discrete case, this is the probability of the sample having occurred, given parameter $\theta$. So, the intuition for the maximum likelihood estimator is that we look for the value of $\theta$ that maximizes the probability of the observed sample having occurred.

The maximum likelihood estimator has some incredibly useful properties, which we will discuss later.

## log-Likelihood

Sometimes, a convenient object to work with is the log-likelihood function, given by

$l\left(\theta\left|x_{1}..x_{n}\right.\right)=\log\,L\left(\theta\left|x_{1}..x_{n}\right.\right)=\sum_{i=1}^{n}\log f\left(\left.x_{i}\right|\theta\right)$

The last identity follows from the fact that the log of a product equals the sum of the logs. Notice that because $\log\left(\cdot\right)$ is a strictly increasing function, the log-likelihood is also maximized by $\widehat{\theta}_{ML}$.

In order to compute the maximum likelihood estimator, we simply need to obtain the (log-)likelihood function and maximize it w.r.t. the parameters of interest.
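Beyond algebraic convenience, there is a practical reason to prefer the log-likelihood: a product of many small densities underflows in floating point, while the sum of their logs stays finite. A minimal sketch (the density values below are hypothetical):

```python
import math

# Sketch: the product of many small density values f(x_i | theta) underflows
# to 0.0 in floating point, but the sum of their logs is perfectly usable.
densities = [1e-4] * 100  # hypothetical per-observation density values

likelihood = 1.0
for f in densities:
    likelihood *= f
print(likelihood)  # 0.0 due to underflow (the true value is 1e-400)

log_likelihood = sum(math.log(f) for f in densities)
print(log_likelihood)  # about -921.03, finite and easy to maximize
```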

## Example: Bernoulli

Suppose $X_{i}\overset{iid}{\sim}Ber\left(p\right)$, where $p\in\left[0,1\right]$ is unknown.

The marginal pmf equals $f\left(\left.x\right|p\right)=p^{x}\left(1-p\right)^{1-x}1\left(x\in\left\{ 0,1\right\} \right)$. We can ignore $1\left(x\in\left\{ 0,1\right\} \right)$, since by assumption it is satisfied in the sample.

The likelihood function equals $L\left(\left.p\right|x_{1}..x_{n}\right)=\prod_{i=1}^{n}f\left(\left.x_{i}\right|p\right)=\prod_{i=1}^{n}\left(p^{x_{i}}\left(1-p\right)^{1-x_{i}}\right)=p^{\sum_{i=1}^{n}x_{i}}\left(1-p\right)^{n-\sum_{i=1}^{n}x_{i}},\,p\in\left[0,1\right]$

The log-likelihood function equals $l\left(\left.p\right|x_{1}..x_{n}\right)=\sum_{i=1}^{n}x_{i}\log\left(p\right)+\left(n-\sum_{i=1}^{n}x_{i}\right)\log\left(1-p\right),\,p\in\left(0,1\right)$

(Because $\log\left(0\right)=-\infty$, we will inspect $p=0$ and $p=1$ separately.)

We look for an interior solution:

\begin{aligned} foc\left(p\right): & \,\frac{\partial}{\partial p}l\left(\left.p\right|x_{1}..x_{n}\right)=0\\ \Leftrightarrow & \frac{\sum_{i=1}^{n}x_{i}}{p}-\frac{n-\sum_{i=1}^{n}x_{i}}{1-p}=0\\ \Leftrightarrow & \widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}\end{aligned}

Verifying the second-order condition reveals that our estimator is indeed a maximum.

Finally, consider the possible exceptions. When $\frac{\sum_{i=1}^{n}x_{i}}{n}=0$, the candidate value $\widehat{p}_{ML}=0$ lies outside the domain $\left(0,1\right)$ on which we defined the log-likelihood.

Similarly, when $\frac{\sum_{i=1}^{n}x_{i}}{n}=1$, our estimator’s value falls outside the defined region of our log-likelihood. We can use the likelihood function to determine what the estimator is in these two cases:

• $\sum_{i=1}^{n}x_{i}=0$, $\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,\left(1-p\right)^{n}\Rightarrow\widehat{p}_{ML}=0$
• $\sum_{i=1}^{n}x_{i}=n$, $\max_{p}\,L\left(\left.p\right|x_{1}..x_{n}\right)=\max_{p}\,p^{n}\Rightarrow\widehat{p}_{ML}=1$

Interestingly, our original formula $\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}$ also agrees with these two cases, and so we have found that for any admissible sample, $\widehat{p}_{ML}=\frac{\sum_{i=1}^{n}x_{i}}{n}$.
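The closed form can also be checked numerically. The sketch below (with a hypothetical sample and grid resolution) confirms that the sample mean maximizes the Bernoulli likelihood over a fine grid of candidate values of $p$:

```python
# Sketch: verify numerically that p_hat_ML = sum(x_i)/n maximizes the
# Bernoulli likelihood L(p | x_1..x_n) = p^s * (1-p)^(n-s), where s = sum(x_i).

def likelihood(p, sample):
    s, n = sum(sample), len(sample)
    return p ** s * (1 - p) ** (n - s)

sample = [1, 1, 0, 1, 0]                # hypothetical sample: s = 3, n = 5
grid = [i / 1000 for i in range(1001)]  # candidate values 0.000 .. 1.000
p_best = max(grid, key=lambda p: likelihood(p, sample))
print(p_best)  # 0.6, matching sum(sample) / len(sample)
```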

Notice also that the method of moments estimator coincides with the maximum likelihood estimator (this is not guaranteed by any means).