# Normal Linear Model

In the previous example, the notion of a random variable wasn’t mentioned. We simply wanted to fit a predictive line through some points. In this example, we introduce a few new features:

• We will include multiple regressors/independent variables, $x_{i1},x_{i2},...,x_{iK}$.
• These regressors are constants. When we do hypothesis testing, for example, we will assume that they remain the same no matter how many experiments we run. For example, suppose we would like to regress $y_{i}$ on the months of a year, i.e., $1,\ldots,12$. These numbers won’t change when we collect information for different years.
• We will assume that the errors are normally distributed. This is a big difference from the previous example: We are stating that $\varepsilon_{i}$ are themselves random variables. In a sense, they are a primitive of this model.
• We will denote matrices by uppercase letters (e.g., $X$).
• We represent vectors by lowercase letters (e.g., $y$, $x_{1}$).
• We define

$x_{i}=\underset{\left(K\times1\right)}{\left[\begin{array}{c} x_{i1}\\ x_{i2}\\ \vdots\\ x_{iK} \end{array}\right]}$

such that each vector $x_{i}$ contains the regressors for observation $i$. For example, if $i$ is an individual, then $x_{i}$ could contain his/her age, gender, income, etc.

For each observation $i$, we also define the outcome $y_{i}$ and the error $\varepsilon_{i}\sim N\left(0,\sigma^{2}\right)$, which are related by the model

$y_{i}=\beta_{1}x_{i1}+\beta_{2}x_{i2}+...+\beta_{K}x_{iK}+\varepsilon_{i}$

For each observation $i$, there exists a random variable $\varepsilon_{i}$. Once this variable is added to a weighted sum of parameters $\left(\beta\right)$ and regressors $\left(x_{i}\right)$, it yields the variable $y_{i}$.

We can rewrite the equation above in a more compact form:

$\underset{\left(1\times1\right)}{y_{i}}=\underset{\left(1\times K\right)}{x_{i}^{'}}\underset{\left(K\times1\right)}{\beta}+\underset{\left(1\times1\right)}{\varepsilon_{i}}$

## Matrix Notation

It is possible to stack the equation above across observations. Let

$y=\left[\begin{array}{c} y_{1}\\ y_{2}\\ \vdots\\ y_{N} \end{array}\right];\,\varepsilon=\left[\begin{array}{c} \varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{N} \end{array}\right];\,X=\left[\begin{array}{c} x_{1}^{'}\\ x_{2}^{'}\\ \vdots\\ x_{N}^{'} \end{array}\right]=\left[\begin{array}{ccc} x_{11} & \cdots & x_{1K}\\ x_{21} & \cdots & x_{2K}\\ \vdots & & \vdots\\ x_{N1} & \cdots & x_{NK} \end{array}\right]$

In this case, we can rewrite the linear model for the whole sample as

$\underset{\left(N\times1\right)}{y}=\underset{\left(N\times K\right)}{X}\underset{\left(K\times1\right)}{\beta}+\underset{\left(N\times1\right)}{\varepsilon}$

We will make a few additional assumptions:

• $\left\{ x_{i},y_{i}\right\} _{i=1}^{N}$ are i.i.d., with finite first moments.
• $\left.\varepsilon_{i}\right|X\sim N\left(0,\sigma^{2}\right)$

We will model the conditional distribution of $y$ as if $X$ were fixed. In other words, matrix $X$ contains constants that never change. For example, if we drew a different random sample, we would observe different $y$’s, but the same $X$. (We would observe different $y$’s because of the new draws of the $\varepsilon$’s.)
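To make the "fixed $X$" idea concrete, here is a minimal simulation sketch. The sample size, parameter values, error standard deviation, seed, and the helper name `draw_sample` are all assumptions chosen for illustration; only the structure $y=X\beta+\varepsilon$ comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

N = 12                        # hypothetical sample size (one row per month)
beta = np.array([1.0, 0.5])   # hypothetical "true" parameters
sigma = 2.0                   # hypothetical error standard deviation

# Fixed design matrix: a column of ones and the months 1..12.
X = np.column_stack([np.ones(N), np.arange(1, N + 1)])

def draw_sample(X, beta, sigma, rng):
    """Draw y = X beta + eps with eps_i ~ N(0, sigma^2), holding X fixed."""
    eps = rng.normal(loc=0.0, scale=sigma, size=X.shape[0])
    return X @ beta + eps

# Two "experiments": the same X, but different draws of eps, hence different y.
y_a = draw_sample(X, beta, sigma, rng)
y_b = draw_sample(X, beta, sigma, rng)
```

Both calls reuse the same `X`; the two vectors `y_a` and `y_b` differ only because the errors are redrawn.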

## Log-likelihood of $Y$ conditional on $X$

Notice that because $y$ equals a constant times parameters plus a normal random variable, $y$ is itself normally distributed:

$\left.y_{i}\right|x_{i}\sim N\left(x_{i}^{'}\beta,\sigma^{2}\right)$

The log-likelihood of $y$ equals

$l\left(\beta,\sigma^{2}\right)=\sum_{i=1}^{N}\left\{ -\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}\log\left(\sigma^{2}\right)-\frac{1}{2\sigma^{2}}\left(y_{i}-x_{i}^{'}\beta\right)^{2}\right\}$

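Note that $\widehat{\beta}_{OLS}=\widehat{\beta}_{ML}=\text{argmax}_{\beta}\,l\left(\beta,\sigma^{2}\right)$, i.e., the solution for $\widehat{\beta}$ in the normal linear model is the same as the one for the OLS model. The reason is that only the last term of the log-likelihood depends on $\beta$, so maximizing $l$ over $\beta$ is equivalent to minimizing the sum of squared residuals $\sum_{i}\left(y_{i}-x_{i}^{'}\beta\right)^{2}$.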
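The log-likelihood above can be evaluated numerically as a quick sanity check. The sketch below uses simulated data with made-up parameter values, and `log_likelihood` is a name introduced here for illustration. It compares the log-likelihood at the least-squares solution (obtained with `np.linalg.lstsq`) against a nearby value of $\beta$.

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y):
    """Normal linear model log-likelihood l(beta, sigma^2), summed over observations."""
    resid = y - X @ beta
    n = y.shape[0]
    return (-0.5 * n * np.log(2.0 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - 0.5 * (resid @ resid) / sigma2)

# Simulated data; all numbers below are assumptions for illustration.
rng = np.random.default_rng(1)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # drawn once, then held fixed
beta_true = np.array([2.0, -1.0])
sigma2 = 1.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Least-squares solution.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# For any fixed sigma^2, the log-likelihood is largest at the OLS solution.
ll_at_ols = log_likelihood(beta_ols, sigma2, X, y)
ll_perturbed = log_likelihood(beta_ols + np.array([0.1, -0.1]), sigma2, X, y)
```

Moving $\beta$ away from `beta_ols` in any direction increases the sum of squared residuals and therefore lowers the log-likelihood, whatever value of $\sigma^{2}$ is plugged in.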

## Matrix Derivation

We will find vector $\widehat{\beta}_{ML}$ by minimizing the sum of squares,

\begin{aligned} SSR & =\varepsilon^{'}\varepsilon\\ & =\left(y-X\beta\right)^{'}\left(y-X\beta\right)\\ & =y^{'}y-\underset{\left(1\times1\right)}{\beta^{'}X^{'}y}-\underset{\left(1\times1\right)}{y^{'}X\beta}+\beta^{'}X^{'}X\beta\\ & =y^{'}y-2y^{'}X\beta+\beta^{'}X^{'}X\beta\end{aligned}

where we have used the fact that $\left(X\beta\right)^{'}=\beta^{'}X^{'}$, and that the scalar $\beta^{'}X^{'}y$ equals its own transpose, $y^{'}X\beta$.

Notice that $y^{'}y$ does not depend on $\beta$, so we are left with the problem $\widehat{\beta}_{ML}=\text{argmin}_{\beta}\left\{ -2y^{'}X\beta+\beta^{'}X^{'}X\beta\right\}$

Taking the first-order condition,

\begin{aligned} foc\left(\beta\right):\, & \left(-2y^{'}X\right)+2\beta^{'}X^{'}X=0\\ \Leftrightarrow & \beta^{'}X^{'}X=y^{'}X\\ \Leftrightarrow & X^{'}X\beta=X^{'}y\\ & \widehat{\beta}_{ML}=\left(X^{'}X\right)^{-1}X^{'}y\end{aligned}

where we have used the facts that $\frac{d}{dv}Av=A$ and that, for a symmetric matrix $A$, $\frac{d}{dv}v^{'}Av=2v^{'}A$. The second rule applies here because $X^{'}X$ is symmetric. (You can also use the analogous transposed convention; it works as long as you remain consistent with whatever convention on vector derivatives you adopt.) Above, we have assumed that $X^{'}X$ is invertible.

We have found the ML estimator, which coincides with the estimator from our previous OLS example.
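As an illustration of the closed form, the following sketch solves the normal equations $X^{'}X\beta=X^{'}y$ on simulated data and compares the result with NumPy's least-squares routine. The data-generating values are assumptions made up for the example. Solving the linear system with `np.linalg.solve` is used here instead of forming the inverse explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

# Simulated data; sizes and parameter values are assumptions for illustration.
rng = np.random.default_rng(2)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=N)

# X must have full column rank K for X'X to be invertible.
assert np.linalg.matrix_rank(X) == K

# Normal equations: (X'X) beta = X'y.
XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# First-order condition: X'(y - X beta_hat) should be numerically zero.
foc = X.T @ (y - X @ beta_hat)
```

The vector `foc` is the first-order condition evaluated at the solution (up to a factor of $-2$ and a transpose), so it should vanish up to floating-point error.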

## Distribution of $\widehat{\beta}_{OLS}$

Let’s write the expression for $\widehat{\beta}$, opening up $y$, which is really just a function of $\beta$, $\varepsilon$ and $X$:

\begin{aligned} \widehat{\beta}_{ML} & =\left(X^{'}X\right)^{-1}X^{'}y\\ & =\left(X^{'}X\right)^{-1}X^{'}\left(X\beta+\varepsilon\right)\\ & =\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon\end{aligned}

The result above implies that, given $X$, $\widehat{\beta}_{ML}$ is a linear combination of normal random variables, and is therefore itself normally distributed. Its mean and variance are as follows (remember that the only random variable is $\varepsilon$):

$E_{\beta}\left(\widehat{\beta}\right)=\beta+\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\right)=\beta$

and

\begin{aligned} Var_{\beta}\left(\widehat{\beta}\right) & =Var\left(\beta+\left(X^{'}X\right)^{-1}X^{'}\varepsilon\right)\\ & =Var\left(\left(X^{'}X\right)^{-1}X^{'}\varepsilon\right)\\ & =\left(X^{'}X\right)^{-1}X^{'}Var\left(\varepsilon\right)X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\varepsilon^{'}\right)X\left(X^{'}X\right)^{-1}\end{aligned}

where $E\left(\varepsilon\varepsilon^{'}\right)$ is the covariance matrix of $\varepsilon$, given that $E\left(\varepsilon\right)=0$:

$Var\left(\varepsilon\right)=E\left(\varepsilon\varepsilon^{'}\right)=\left[\begin{array}{cccc} \sigma^{2} & & & 0\\ & \sigma^{2}\\ & & \ddots\\ 0 & & & \sigma^{2} \end{array}\right]=\sigma^{2}I_{N},$

where $I_{N}$ is the identity matrix with size $\left(N\times N\right)$.

Continuing,

\begin{aligned} Var_{\beta}\left(\widehat{\beta}\right) & =\left(X^{'}X\right)^{-1}X^{'}E\left(\varepsilon\varepsilon^{'}\right)X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}\left(\sigma^{2}I_{N}\right)X\left(X^{'}X\right)^{-1}\\ & =\left(X^{'}X\right)^{-1}X^{'}X\left(X^{'}X\right)^{-1}\sigma^{2}\\ & =\left(X^{'}X\right)^{-1}\sigma^{2}.\end{aligned}

So, we have learned that

$\left.\widehat{\beta}_{ML}\right|X\sim N\left(\beta,\left(X^{'}X\right)^{-1}\sigma^{2}\right)$

To reiterate,

• We require the rank of $X$ to be $K$ (full column rank), so that $\left(X^{'}X\right)^{-1}$ exists.
• $X$ is considered as fixed, so that all calculations take $X$ as constant.
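
The sampling distribution derived above can be checked with a small Monte Carlo experiment: hold $X$ fixed, redraw $\varepsilon$ many times, re-estimate $\widehat{\beta}$ each time, and compare the empirical mean and covariance of the estimates with $\beta$ and $\sigma^{2}\left(X^{'}X\right)^{-1}$. This is a sketch under assumed values; the design matrix, parameters, number of replications, and seed are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed

N, R = 50, 20000               # sample size and number of replications (assumed)
beta = np.array([1.0, 0.5])    # hypothetical "true" parameters
sigma = 2.0                    # hypothetical error standard deviation

# Fixed design matrix: a constant and an evenly spaced regressor.
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])

XtX_inv = np.linalg.inv(X.T @ X)
theoretical_cov = sigma**2 * XtX_inv   # sigma^2 (X'X)^{-1}

# Each row of eps is one replication of the error vector; X never changes.
eps = rng.normal(scale=sigma, size=(R, N))
Y = X @ beta + eps                     # shape (R, N), one sample of y per row

# Row r holds the estimate for replication r: y_r' X (X'X)^{-1},
# which is the transpose of (X'X)^{-1} X' y_r.
beta_hats = Y @ X @ XtX_inv            # shape (R, K)

empirical_mean = beta_hats.mean(axis=0)
empirical_cov = np.cov(beta_hats, rowvar=False)
```

With this many replications, `empirical_mean` should be close to `beta` and `empirical_cov` close to `theoretical_cov`, up to simulation noise.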