# Ordinary Least Squares

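Suppose we have some data $\left\{ x_{i},y_{i}\right\} _{i=1}^{n}$. We would like to relate $y_{i}$ to $x_{i}$ through a line, i.e., to find $\beta_{0}$ and $\beta_{1}$ such that $y_{i}\approx\beta_{0}+\beta_{1}x_{i}$.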

An intuitive estimator minimizes the distance between $y_{i}$ and $\beta_{0}+\beta_{1}x_{i}$, for example,

$\min_{\beta_{0},\beta_{1}}\,\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\beta_{1}x_{i}\right)^{2}$

The quadratic distance is especially tractable, hence its use. Calculating the first order conditions,

\begin{aligned} & \left\{ \begin{array}{c} foc\left(\beta_{0}\right):\,\sum_{i=1}^{n}-2\left(y_{i}-\beta_{0}-\beta_{1}x_{i}\right)=0\\ foc\left(\beta_{1}\right):\,\sum_{i=1}^{n}-2x_{i}\left(y_{i}-\beta_{0}-\beta_{1}x_{i}\right)=0 \end{array}\right.\\ \Leftrightarrow & \left\{ \begin{array}{c} \beta_{0}=\frac{\sum y_{i}}{n}-\beta_{1}\frac{\sum x_{i}}{n}=\overline{y}-\beta_{1}\overline{x}\\ \sum x_{i}y_{i}-\beta_{1}\sum x_{i}^{2}-n\overline{x}\beta_{0}=0 \end{array}\right.\\ \Leftrightarrow & \left\{ \begin{array}{c} \\ \sum x_{i}y_{i}-\beta_{1}\sum x_{i}^{2}-n\overline{x}\left(\overline{y}-\beta_{1}\overline{x}\right)=0 \end{array}\right.\\ \Leftrightarrow & \left\{ \begin{array}{c} \\ \beta_{1}=\frac{\sum x_{i}y_{i}-n\overline{x}\overline{y}}{\sum x_{i}^{2}-n\overline{x}^{2}}=\frac{\sum\left(x_{i}-\overline{x}\right)\left(y_{i}-\overline{y}\right)}{\sum\left(x_{i}-\overline{x}\right)^{2}} \end{array}.\right.\end{aligned}

So, we have learned that

$\widehat{\beta_{0}}^{OLS}=\overline{y}-\widehat{\beta_{1}}^{OLS}\overline{x}.$

$\widehat{\beta_{1}}^{OLS}=\frac{Cov\left(x_{i},y_{i}\right)}{Var\left(x_{i}\right)}.$

The expression for the slope parameter is interesting: it is the sample covariance between $x_{i}$ and $y_{i}$ divided by the sample variance of $x_{i}$, i.e., how much $y_{i}$ co-moves with $x_{i}$ per unit of variation in $x_{i}$.
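As a quick numerical check of the closed-form expressions above, the following sketch computes the intercept and slope on a small made-up sample and compares them with `np.polyfit`, NumPy's generic least-squares polynomial fit. The numbers in `x` and `y` are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical toy data; any paired sample would do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over the sum of squared deviations of x.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: forces the fitted line through the point of means.
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check: np.polyfit returns coefficients from highest degree to lowest.
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(beta0_hat, beta1_hat)
```

The same slope is obtained from `np.cov(x, y)[0, 1] / np.var(x, ddof=1)`, because the normalizing constants of the sample covariance and the sample variance cancel.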

## Some Remarks

• After estimating $\beta$, we can predict $y_{i}$ via

$\widehat{y_{i}}=\widehat{\beta_{0}}^{OLS}+\widehat{\beta_{1}}^{OLS}x_{i}$

• We can also define the prediction errors (residuals) as $\widehat{\varepsilon_{i}}=y_{i}-\widehat{y_{i}}$, which we can compute as

\begin{aligned} \widehat{\varepsilon}_{i} & =y_{i}-\widehat{y_{i}}\\ & =y_{i}-\left(\widehat{\beta_{0}}^{OLS}+\widehat{\beta_{1}}^{OLS}x_{i}\right)\end{aligned}

These estimated errors provide the vertical distance between our estimation line and the height of each data point, indexed by $i$.

• Notice also that we could estimate the sample variance of the errors, $Var\left(\widehat{\varepsilon_{i}}\right)$. This statistic provides a notion of how far the estimated line is from each data point.
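The remarks above can be sketched in a few lines. The data are the same hypothetical toy numbers as one might use for the closed-form fit; the variable names (`y_hat`, `resid`, `resid_var`) are illustrative choices, not notation from the text.

```python
import numpy as np

# Hypothetical toy data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for the line y = beta0 + beta1 * x.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x   # predicted values
resid = y - y_hat                   # prediction errors (vertical distances)
resid_var = resid.var()             # sample variance of the errors (divides by n)

print(resid, resid_var)
```

A useful sanity check is that the residuals sum to zero and are orthogonal to `x`; both properties are just the two first-order conditions evaluated at the estimates.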

# Normal Linear Model

In the previous example, the notion of random variable wasn’t mentioned. We simply wanted to draw a predictive line along some points. In this example, we introduce a few features:

• We will include multiple regressors/independent variables, $x_{i1},x_{i2},...,x_{iK}$.
• These regressors are constants. When we do hypothesis testing, for example, we will assume they will remain constant, no matter the number of experiments we run. For example, suppose we would like to regress $y_{i}$ on the months of a year, i.e., $1..12$. These numbers won’t change once we collect information for different years.
• We will assume that the errors are normally distributed. This is a big difference from the previous example: We are stating that $\varepsilon_{i}$ are themselves random variables. In a sense, they are a primitive of this model.
• We will denote matrices by uppercase letters (e.g., $X$).
• We represent vectors by lowercase letters (e.g., $y$, $x_{1}$).
• We define

$x_{i}=\underset{\left(K\times1\right)}{\left[\begin{array}{c} x_{i1}\\ x_{i2}\\ \vdots\\ x_{iK} \end{array}\right]}$

such that each vector $x_{i}$ contains the regressors for observation $i$. For example, if $i$ is an individual, then $x_{i}$ could contain his/her age, gender, income, etc.

We will also define, for each observation, $y_{i}$ and $\varepsilon_{i}\sim N\left(0,\sigma^{2}\right)$, s.t.

$y_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+...+\beta_{K}x_{iK}+\varepsilon_{i}$

For each observation $i$, there exists a random variable $\varepsilon_{i}$. Once this variable is added to a weighted sum of parameters $\left(\beta\right)$ and regressors $\left(x_{i}\right)$, it yields the variable $y_{i}$.

We can rewrite the equation above in a more compact form:

$\underset{\left(1\times1\right)}{y_{i}}=\underset{\left(1\times K\right)}{x_{i}^{'}}\underset{\left(K\times1\right)}{\beta}+\underset{\left(1\times1\right)}{\varepsilon_{i}}$

## Matrix Notation

It is possible to stack the equation above across observations. Let

$y=\left[\begin{array}{c} y_{1}\\ y_{2}\\ \vdots\\ y_{N} \end{array}\right];\,\varepsilon=\left[\begin{array}{c} \varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{N} \end{array}\right];\,X=\left[\begin{array}{c} x_{1}^{'}\\ x_{2}^{'}\\ \vdots\\ x_{N}^{'} \end{array}\right]=\left[\begin{array}{ccc} x_{11} & \cdots & x_{1K}\\ x_{21} & \cdots & x_{2K}\\ \vdots & & \vdots\\ x_{N1} & \cdots & x_{NK} \end{array}\right]$

In this case, we can rewrite the linear model for the whole sample as

$\underset{\left(N\times1\right)}{y}=\underset{\left(N\times K\right)}{X}\underset{\left(K\times1\right)}{\beta}+\underset{\left(N\times1\right)}{\varepsilon}$

We will make a few additional assumptions:

• $\left\{ x_{i},y_{i}\right\} _{i=1}^{N}$ are i.i.d., with finite first moments.
• $\left.\varepsilon_{i}\right|X\sim N\left(0,\sigma^{2}\right)$

We will model the conditional distribution of $y$ as if $X$ were fixed. In other words, matrix $X$ contains constants that never change. For example, if we drew a different random sample, we would observe different $y$’s, but the same $X$. (We would observe different $y$’s because of the new draws of the $\varepsilon$’s.)
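To make the "fixed $X$, random $y$" idea concrete, here is a minimal sketch based on the months example mentioned earlier. The parameter values `beta` and `sigma` are hypothetical, and the design includes a column of ones as an assumed intercept regressor.

```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed design: an intercept and the month index 1..12 (N = 12, K = 2).
months = np.arange(1, 13)
X = np.column_stack([np.ones(12), months])

beta = np.array([10.0, 0.5])   # hypothetical true parameters
sigma = 2.0                    # hypothetical error standard deviation

def draw_sample(rng):
    """Draw one sample y = X beta + eps, keeping X unchanged."""
    eps = rng.normal(0.0, sigma, size=X.shape[0])
    return X @ beta + eps

y_year1 = draw_sample(rng)   # e.g., data collected in one year
y_year2 = draw_sample(rng)   # another year: same X, new errors

# The stacked form X @ beta agrees with the row-by-row form x_i' beta.
row_by_row = np.array([x_i @ beta for x_i in X])
```

The two draws share the same systematic part `X @ beta` and differ only through the errors.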

## Log-likelihood of $Y$ conditional on $X$

Notice that because $y$ equals a constant times parameters plus a normal random variable, $y$ is itself normally distributed:

$\left.y_{i}\right|x_{i}\sim N\left(x_{i}^{'}\beta,\sigma^{2}\right)$

The log-likelihood of $y$ equals

$l\left(\beta,\sigma^{2}\right)=\sum_{i=1}^{N}\left\{ -\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}\log\left(\sigma^{2}\right)-\frac{1}{2\sigma^{2}}\left(y_{i}-x_{i}^{'}\beta\right)^{2}\right\}$

Note that $\widehat{\beta}_{OLS}=\widehat{\beta}_{ML}=\text{argmax}_{\beta}\,l\left(\beta,\sigma^{2}\right)$, i.e., the solution for $\widehat{\beta}$ in the normal linear model is the same as the OLS solution: for any fixed $\sigma^{2}$, maximizing $l$ over $\beta$ is equivalent to minimizing the sum of squared residuals.
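The equivalence can be illustrated numerically. The sketch below codes the log-likelihood and checks that, for a fixed $\sigma^{2}$, it is higher at the least-squares solution than at a few nearby parameter vectors. The simulated data, the helper name `log_likelihood`, and the perturbations are all assumptions made for illustration; the least-squares solution is obtained with `np.linalg.lstsq`.

```python
import numpy as np

def log_likelihood(beta, sigma2, y, X):
    """Normal linear model log-likelihood l(beta, sigma^2)."""
    resid = y - X @ beta
    n = y.shape[0]
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([1.0, 2.0])               # hypothetical parameters
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least-squares solution.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

sigma2 = 0.25                                  # held fixed while varying beta
ll_at_ols = log_likelihood(beta_ols, sigma2, y, X)

# Any move away from beta_ols lowers the log-likelihood.
perturbations = [np.array([0.05, 0.0]), np.array([0.0, -0.05]), np.array([0.03, 0.03])]
ll_nearby = [log_likelihood(beta_ols + d, sigma2, y, X) for d in perturbations]
```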

## Matrix Derivation

We will find vector $\widehat{\beta}_{ML}$ by minimizing the sum of squares,

\begin{aligned} SSR & =\varepsilon^{'}\varepsilon\\ & =\left(y-X\beta\right)^{'}\left(y-X\beta\right)\\ & =y^{'}y-\underset{\left(1\times1\right)}{\beta^{'}X^{'}y}-\underset{\left(1\times1\right)}{y^{'}X\beta}+\beta^{'}X^{'}X\beta\\ & =y^{'}y-2y^{'}X\beta+\beta^{'}X^{'}X\beta\end{aligned}

where we have used the fact that $\left(X\beta\right)^{'}=\beta^{'}X^{'}$.

Notice that $y^{'}y$ does not depend on $\beta$, so we are left with the problem $\widehat{\beta}_{ML}=\text{argmin}_{\beta}\left(-2y^{'}X\beta+\beta^{'}X^{'}X\beta\right)$

Taking the first-order condition,

\begin{aligned} foc\left(\beta\right):\, & \left(-2y^{'}X\right)+2\beta^{'}X^{'}X=0\\ \Leftrightarrow & \beta^{'}X^{'}X=y^{'}X\\ \Leftrightarrow & X^{'}X\beta=X^{'}y\\ & \widehat{\beta}_{ML}=\left(X^{'}X\right)^{-1}X^{'}y\end{aligned}

where we have used the fact that $\frac{d}{dv}Av=A$ and, for symmetric $A$, $\frac{d}{dv}v^{'}Av=2v^{'}A$ (here $A=X^{'}X$, which is symmetric). (You can also use the analogous transposed convention; it works as long as you remain consistent with whatever convention on vector derivatives you adopt.) Above, we have assumed that $X^{'}X$ is invertible.

We have found the ML estimator, which is consistent with our previous OLS example.
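A minimal sketch of the matrix formula on simulated data follows. The data-generating values are hypothetical. Solving the normal equations $X^{'}X\beta=X^{'}y$ with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a common numerical preference rather than something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 3

# Simulated design with an intercept column (hypothetical data).
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(size=N)

# Solve the normal equations X'X beta = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# First-order condition: -2 X'y + 2 X'X beta should vanish at beta_hat.
gradient = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```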

## Distribution of $\widehat{\beta_{OLS}}$

Let’s write the expression for $\widehat{\beta}$, opening up $y$, which is really just a function of $\beta$, $\varepsilon$ and $X$:

\begin{aligned} \widehat{\beta}_{ML} & =\left(X^{'}X\right)^{-1}X^{'}y\\ & =\left(X^{'}X\right)^{-1}X^{'}\left(X\beta+\varepsilon\right)\\ & =\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon\end{aligned}

The result above implies that $\widehat{\beta}_{ML}$ is a linear combination of normal random variables, given $X$. Moreover, the estimator has a mean and variance (remember that the only random variable is $\varepsilon$):

$E_{\beta}\left(\widehat{\beta}\right)=\beta+\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\right)=\beta$

and

\begin{aligned} Var_{\beta}\left(\widehat{\beta}\right) & =Var\left(\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon\right)\\ & =Var\left(\left(X^{'}X\right)^{-1}X^{'}\varepsilon\right)\\ & =\left(X^{'}X\right)^{-1}X^{'}Var\left(\varepsilon\right)X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\varepsilon^{'}\right)X\left(X^{'}X\right)^{-1}\end{aligned}

where $E\left(\varepsilon\varepsilon^{'}\right)$ is the covariance matrix of $\varepsilon$, given that $E\left(\varepsilon\right)=0$:

$Var\left(\varepsilon\right)=E\left(\varepsilon\varepsilon^{'}\right)=\left[\begin{array}{cccc} \sigma^{2} & & & 0\\ & \sigma^{2}\\ & & \ddots\\ 0 & & & \sigma^{2} \end{array}\right]=\sigma^{2}I_{N},$

where $I_{N}$ is the identity matrix with size $\left(N\times N\right)$.

Continuing,

\begin{aligned} Var_{\beta}\left(\widehat{\beta}\right) & =\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\varepsilon^{'}\right)X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}\sigma^{2}I_{N}X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}X\left(X^{'}X\right)^{-1}\sigma^{2}\\ & =\left(X^{'}X\right)^{-1}\sigma^{2}.\end{aligned}

So, we have learned that

$\left.\widehat{\beta}_{ML}\right|X\sim N\left(\beta,\left(X^{'}X\right)^{-1}\sigma^{2}\right)$

To reiterate,

• We require the rank of $X$ to be $K$, s.t. $\left(X^{'}X\right)^{-1}$ exists.
• $X$ is considered as fixed, so that all calculations take $X$ as constant.
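The finite-sample result can be checked by simulation: hold $X$ fixed, redraw $\varepsilon$ many times, and compare the empirical mean and covariance of $\widehat{\beta}$ with $\beta$ and $\sigma^{2}\left(X^{'}X\right)^{-1}$. The design, parameter values, and number of replications `R` below are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma = 50, 1.5

# Fixed design of full column rank (K = 2).
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
beta = np.array([1.0, -2.0])                  # hypothetical true parameters

XtX_inv = np.linalg.inv(X.T @ X)
theory_cov = sigma ** 2 * XtX_inv             # sigma^2 (X'X)^{-1}

R = 20_000                                    # number of simulated samples
eps = rng.normal(scale=sigma, size=(R, N))    # one row of errors per sample
Y = X @ beta + eps                            # each row is one draw of y

# Row r equals ((X'X)^{-1} X' y_r)'; this uses the symmetry of (X'X)^{-1}.
betas = Y @ X @ XtX_inv

emp_mean = betas.mean(axis=0)
emp_cov = np.cov(betas, rowvar=False)
```

With this many replications, `emp_mean` should be close to `beta` and `emp_cov` close to `theory_cov`, up to simulation noise.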

# Asymptotic Properties of OLS

We now allow,

• $X$ to be random variables
• $\varepsilon$ to not necessarily be normally distributed.

In this case, we will need additional assumptions to be able to characterize the behavior of $\widehat{\beta}$:

• $\left\{ y_{i},x_{i}\right\}$ is a random sample.
• Strict Exogeneity: $E\left(\left.\varepsilon_{i}\right|X\right)=0,\,i=1..N$.
• Homoskedasticity: $E\left(\left.\varepsilon_{i}^{2}\right|X\right)=\sigma^{2},\,i=1..N$ and $E\left(\left.\varepsilon_{i}\varepsilon_{j}\right|X\right)=0,\,\forall i,j=1..N,i\neq j.$

## Implications of Strict Exogeneity

First, notice that if $E\left(\left.\varepsilon_{i}\right|X\right)=0$, then $E\left(\varepsilon\right)=0$:

$E\left(\left.\varepsilon_{i}\right|X\right)=0\Rightarrow E\left(E\left(\left.\varepsilon_{i}\right|X\right)\right)=E\left(0\right)\Leftrightarrow E\left(\varepsilon_{i}\right)=0.$

In other words, if the conditional expectation of $\varepsilon_{i}$ given any $X$ is zero, then the expectation of $\varepsilon_{i}$ also needs to be zero. This assumption implies that $\varepsilon_{i}$ is uncorrelated with $x_{i1}$, $x_{i2}$, ..., and also $x_{1k}$, $x_{2k}$, etc.

Second, the strict exogeneity assumption implies the orthogonality condition $E\left(x_{jk}\varepsilon_{i}\right)=0,\,\forall\,j,k$. (i.e., no matter how you pick $x$’s by selecting $j$ and $k$, the result is uncorrelated with $\varepsilon_{i}$).

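To see this, stack the $K$ conditions for a given pair $\left(i,j\right)$ into a vector. The claim is that

$E\left(x_{j}\varepsilon_{i}\right)=\left[\begin{array}{c} E\left(x_{j1}\varepsilon_{i}\right)\\ E\left(x_{j2}\varepsilon_{i}\right)\\ \vdots\\ E\left(x_{jK}\varepsilon_{i}\right) \end{array}\right]=\left[\begin{array}{c} 0\\ 0\\ \vdots\\ 0 \end{array}\right],\,\forall i,j=1..N$

Each entry is zero by the law of iterated expectations: since $x_{jk}$ is an element of $X$, strict exogeneity implies $E\left(\left.\varepsilon_{i}\right|x_{jk}\right)=E\left[\left.E\left(\left.\varepsilon_{i}\right|X\right)\right|x_{jk}\right]=0$, and therefore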

$E\left(x_{jk}\varepsilon_{i}\right)=E\left[E\left(\left.x_{jk}\varepsilon_{i}\right|x_{jk}\right)\right]=E\left[x_{jk}\underset{=0}{\underbrace{E\left(\left.\varepsilon_{i}\right|x_{jk}\right)}}\right]=0.$
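A small simulation can illustrate the orthogonality condition. In the sketch below, the error is constructed so that its conditional mean given the regressor is zero even though the two are not independent (the conditional variance depends on `x`). The single-regressor setup and the construction `eps = x * u` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200_000

x = rng.normal(size=N)
u = rng.normal(size=N)          # drawn independently of x
eps = x * u                     # E(eps | x) = x * E(u) = 0, Var(eps | x) = x^2

mean_eps = eps.mean()           # sample analogue of E(eps)
mean_x_eps = (x * eps).mean()   # sample analogue of E(x * eps)

print(mean_eps, mean_x_eps)
```

Both sample averages should be close to zero, which is what the unconditional mean and orthogonality conditions predict.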

## Asymptotic Distribution

First, notice that

• $X^{'}X=\sum_{i=1}^{N}x_{i}x_{i}^{'}$.
• $X^{'}\varepsilon=\sum_{i=1}^{N}x_{i}\varepsilon_{i}$.

It is possible to prove that under the assumptions above,

$\sqrt{N}\left(\widehat{\beta}_{OLS}-\beta\right)\overset{d}{\rightarrow}N\left(0,Q^{-1}\sigma^{2}\right)$

where $Q=\text{plim}\,\frac{X^{'}X}{N}$.

This is relatively intuitive given our previous example. Yet, it is extremely useful: As long as we satisfy the assumptions laid out before, we can conduct hypothesis tests for OLS even if the distribution of $\varepsilon$ is unknown (up to some moments).

## Proof

Note that

\begin{aligned} \sqrt{N}\left(\widehat{\beta}-\beta\right) & =\sqrt{N}\left(\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon-\beta\right)\\ & =\sqrt{N}\left(X^{'}X\right)^{-1}X^{'}\varepsilon\frac{N}{N}\\ & =\left(\frac{X^{'}X}{N}\right)^{-1}\frac{1}{\sqrt{N}}X^{'}\varepsilon\end{aligned}

While we will not show this here, we assume that $\frac{X^{'}X}{N}\overset{p}{\rightarrow}Q$, where $Q$ is a finite, invertible $\left(K\times K\right)$ matrix. Notice that it is not implausible that $Q$ is well defined: as $N$ increases, the size of $X^{'}X$ remains $\left(K\times K\right)$, and dividing by $N$ turns each entry into a sample average.

By the continuous mapping theorem (matrix inversion is continuous at an invertible $Q$), which is often bundled with Slutsky’s theorem, it follows that

$\left(\frac{X^{'}X}{N}\right)^{-1}\overset{p}{\rightarrow}Q^{-1}$

As for the second factor, it depends on $\varepsilon$ and is scaled by $\frac{1}{\sqrt{N}}$ rather than $\frac{1}{N}$, so we expect it to converge in distribution rather than in probability. Write

$\frac{1}{\sqrt{N}}X^{'}\varepsilon=\sqrt{N}\frac{1}{N}\sum x_{i}\varepsilon_{i}=\sqrt{N}\overline{w}$

where $w_{i}=x_{i}\varepsilon_{i}$. Then,

$E\left(\overline{w}\right)=E\left(\frac{1}{N}\sum x_{i}\varepsilon_{i}\right)=0$

\begin{aligned} Var\left(\overline{w}\right) & =Var\left(\frac{1}{N}\sum x_{i}\varepsilon_{i}\right)=\frac{1}{N^{2}}\sum_{i=1}^{N}E\left(x_{i}E\left[\left.\varepsilon_{i}\varepsilon_{i}^{'}\right|x_{i}\right]x_{i}^{'}\right)\\ & =\frac{1}{N^{2}}\sigma^{2}\sum_{i=1}^{N}E\left(x_{i}x_{i}^{'}\right)\\ & =\frac{\sigma^{2}}{N}E\left(\frac{X^{'}X}{N}\right)\\ & =\frac{\sigma^{2}}{N}Q.\end{aligned}

By the CLT,

$\sqrt{N}\left(\overline{w}-E\left(\overline{w}\right)\right)\overset{d}{\rightarrow}N\left(0,\sigma^{2}Q\right)$

and by Slutsky’s theorem,

\begin{aligned} \sqrt{N}\left(\widehat{\beta}-\beta\right) & =\underset{\overset{p}{\rightarrow}Q^{-1}}{\underbrace{\left(\frac{X^{'}X}{N}\right)^{-1}}}\,\underset{\overset{d}{\rightarrow}N\left(0,\sigma^{2}Q\right)}{\underbrace{\sqrt{N}\overline{w}}}\\ & \overset{d}{\rightarrow}N\left(0,Q^{-1}\sigma^{2}QQ^{-1}\right)\\ & =N\left(0,Q^{-1}\sigma^{2}\right)\end{aligned}
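The asymptotic result can be illustrated with a simulation in which the regressors are random and the errors are clearly non-normal. In the sketch below, the regressor is uniform on $\left[0,2\right]$ and the errors are centered exponential draws (mean zero, variance one, skewed). These distributions, the sample size, and the number of replications are hypothetical choices; $Q=E\left(x_{i}x_{i}^{'}\right)$ is computed analytically for this particular design.

```python
import numpy as np

rng = np.random.default_rng(3)
N, R = 500, 4000
beta = np.array([1.0, 0.5])     # hypothetical true parameters
sigma2 = 1.0                    # variance of a centered Exponential(1) error

draws = np.empty((R, 2))
for r in range(R):
    x = rng.uniform(0.0, 2.0, size=N)            # random regressor
    X = np.column_stack([np.ones(N), x])
    eps = rng.exponential(1.0, size=N) - 1.0     # skewed, mean 0, variance 1
    y = X @ beta + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)
    draws[r] = np.sqrt(N) * (b - beta)           # sqrt(N) (beta_hat - beta)

# Q = E[x_i x_i'] with x_i = (1, x)', x ~ U(0, 2): E[x] = 1, E[x^2] = 4/3.
Q = np.array([[1.0, 1.0], [1.0, 4.0 / 3.0]])
asym_cov = sigma2 * np.linalg.inv(Q)

emp_cov = np.cov(draws, rowvar=False)
```

Up to simulation noise, `emp_cov` should be close to `asym_cov`, even though the errors are far from normal.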

## Some Remarks

In practice, $\sigma^{2}$ and $Q$ are unknown, so we estimate them with

$\widehat{\sigma^{2}}_{unbiased}=\frac{1}{N-K}\sum_{i=1}^{N}\left(y_{i}-x_{i}^{'}\widehat{\beta}\right)^{2}$

or

$\widehat{\sigma^{2}}_{MLE}=\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-x_{i}^{'}\widehat{\beta}\right)^{2}$

and

$\widehat{Q^{-1}}=\left(\frac{X^{'}X}{N}\right)^{-1}$.
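These plug-in estimates are straightforward to compute. The following sketch uses simulated data with hypothetical parameter values and derives standard errors from the approximation $Var\left(\widehat{\beta}\right)\approx\widehat{\sigma^{2}}\,\widehat{Q^{-1}}/N$, which is algebraically the same as $\widehat{\sigma^{2}}\left(X^{'}X\right)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 120, 3

# Simulated data (hypothetical parameters).
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 0.5, -0.5])
y = X @ beta_true + rng.normal(scale=2.0, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
ssr = resid @ resid                      # sum of squared residuals

sigma2_unbiased = ssr / (N - K)          # degrees-of-freedom corrected
sigma2_mle = ssr / N                     # maximum likelihood version

Q_inv_hat = np.linalg.inv(X.T @ X / N)   # estimate of Q^{-1}

# Estimated covariance of beta_hat and the implied standard errors.
cov_beta_hat = sigma2_unbiased * Q_inv_hat / N
std_err = np.sqrt(np.diag(cov_beta_hat))
```

The two variance estimates differ only by the factor $\left(N-K\right)/N$, so the distinction matters mainly in small samples.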

## A note on the variance of $\varepsilon_{i}$

In the proof above, we have assumed that $Var\left(\varepsilon_{i}\right)=\sigma^{2}$ for every $i$ (homoskedasticity). However, it is possible that the variance differs across observations, $Var\left(\varepsilon_{i}\right)=\sigma_{i}^{2}$ (heteroskedasticity). In this case, the step

$\frac{1}{N^{2}}\sum_{i=1}^{N}E\left(x_{i}E\left[\left.\varepsilon_{i}\varepsilon_{i}^{'}\right|x_{i}\right]x_{i}^{'}\right)=\frac{1}{N^{2}}\sigma^{2}\sum_{i=1}^{N}E\left(x_{i}x_{i}^{'}\right)$

does not hold. Instead, it is possible to show that

$\sqrt{N}\left(\widehat{\beta}-\beta\right)\overset{d}{\rightarrow}N\left(0,\Omega\right)$

where

$\Omega=\left[E\left(x_{i}x_{i}^{'}\right)\right]^{-1}Var\left(x_{i}\varepsilon_{i}\right)\left[E\left(x_{i}x_{i}^{'}\right)\right]^{-1},$

and

$\widehat{\Omega}=\left(\frac{X^{'}X}{N}\right)^{-1}\left(\frac{1}{N}\sum x_{i}x_{i}^{'}\widehat{\varepsilon}_{i}^{2}\right)\left(\frac{X^{'}X}{N}\right)^{-1}.$

The estimator above is called the Huber-Eicker-White estimator (or a variation using one or two of these names); it is also known as a heteroskedasticity-robust or sandwich estimator.

An issue remains: How do we obtain $\widehat{\varepsilon}_{i}^{2}$?

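It turns out that $\widehat{\beta}_{OLS}$ remains consistent even if $\varepsilon_{i}$ is heteroskedastic: the decomposition $\widehat{\beta}=\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon$ does not rely on homoskedasticity, and strict exogeneity guarantees $E\left[\left.\left(X^{'}X\right)^{-1}X^{'}\varepsilon\right|X\right]=0$. So, one can use the OLS residuals $\widehat{\varepsilon}_{i}=y_{i}-x_{i}^{'}\widehat{\beta}_{OLS}$ to compute the estimator $\widehat{\Omega}$, and then perform valid asymptotic hypothesis tests.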
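A minimal sketch of the robust variance estimator follows, using simulated data in which the error variance grows with the regressor. The data-generating process is hypothetical, and the homoskedastic standard errors are computed only for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400

# Hypothetical heteroskedastic design: Var(eps_i | x_i) = x_i^2.
x = rng.uniform(1.0, 3.0, size=N)
X = np.column_stack([np.ones(N), x])
beta = np.array([1.0, 2.0])
eps = rng.normal(size=N) * x
y = X @ beta + eps

# Step 1: OLS estimates and residuals.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Step 2: sandwich estimator.
Q_inv_hat = np.linalg.inv(X.T @ X / N)
meat = (X * resid[:, None] ** 2).T @ X / N     # (1/N) sum x_i x_i' e_i^2
Omega_hat = Q_inv_hat @ meat @ Q_inv_hat

# Standard errors of beta_hat: Omega_hat refers to sqrt(N)(beta_hat - beta).
robust_se = np.sqrt(np.diag(Omega_hat) / N)

# For comparison: standard errors that assume homoskedasticity.
sigma2_hat = resid @ resid / (N - X.shape[1])
classic_se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
```

The vectorized `meat` term is the same quantity as the explicit sum of outer products `np.outer(X[i], X[i]) * resid[i] ** 2` over all observations, divided by `N`.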

# Bootstrapping

The origin of the term “bootstrapping” may relate to someone pulling themselves up by their own boot straps/laces. In a sense, it means making do with little or nothing. Here is the idea: Suppose you would like to conduct a hypothesis test, but do not know the distribution of the test statistic. Even if it converges to a normal, who knows what its asymptotic variance may be (i.e., the test statistic’s variance when $N$ tends to infinity)?

Consider the following approach: If one has enough data, then the distribution in the sample is approximately representative of the distribution of the population. So, one may pretend that the sample itself is the population, and draw from that sample as if one were drawing from the population.

The bootstrap technique can be applied to MLE in the following way. Given a sample of size $N$, $\left\{ y_{i},x_{i}\right\} _{i=1}^{N}$:

• Estimate

$\widehat{\theta}_{ML}=\text{argmax}_{\theta}\,f\left(\left.y\right|X,\theta\right)$

as usual.

• Calculate the test statistic of interest, $T$. (We could use the LRT, for example; notice that we do not know its distribution, nor the desirable critical value).
• Then, resample (with replacement) $N$ pairs $\left\{ y_{i},x_{i}\right\}$ to get a new (bootstrap) sample $\left\{ y_{j}^{b},x_{j}^{b}\right\} _{j=1}^{N}$. Do this $B$ times, such that each sample can be indexed by $b\in\left\{ 1,..,B\right\} .$
• For each bootstrap sample, estimate

$\widehat{\theta}_{b}=\text{argmax}_{\theta}\,f\left(\left.y^{b}\right|X^{b},\theta\right)$

• Calculate the test statistic of interest for each estimation, $T^{b}$.

While we do not know the distribution of the test statistic, we can approximate it with the empirical distribution of the $B$ values $T^{b}$ obtained by resampling from our own sample. We can then build confidence intervals for the test statistic (we just need to pick $\underline{t},\overline{t}$ s.t. 95% of the test statistics $T^{b}$ fall in the interval), and in the case of the LRT, we can reject the null hypothesis if $T$ is higher than at least 95% of the $T^{b}$ values we drew. Such a test commits a type 1 error with approximately 5% probability, which is the conventional significance level.
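The resampling steps above can be sketched as follows. To keep the example short, the statistic of interest is taken to be the OLS slope rather than the LRT. That choice, the simulated data, and the helper name `ols_slope` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
N, B = 200, 2000

# Hypothetical sample with heavy-tailed (Student t) errors.
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.standard_t(df=5, size=N)

def ols_slope(x, y):
    """Slope from regressing y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

# Statistic on the original sample.
T = ols_slope(x, y)

# Bootstrap: resample N pairs (y_i, x_i) with replacement, B times.
T_boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)     # indices drawn with replacement
    T_boot[b] = ols_slope(x[idx], y[idx])

# Pick (t_lo, t_hi) so that 95% of the bootstrap statistics fall inside.
t_lo, t_hi = np.percentile(T_boot, [2.5, 97.5])
print(T, t_lo, t_hi)
```

Resampling pairs, rather than residuals, keeps each $x_{i}$ attached to its own $y_{i}$, which matches the scheme described in the text.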