# Lecture 1

## Sample Space

Traditionally, in probability theory, tossing a coin can yield only Heads or Tails. The set of all possible outcomes, Heads or Tails in this case, is called the sample space. (It’s the space from which one can sample, for example, when one runs an experiment, or essentially when one performs a measurement, like observing the outcome of tossing a coin in the air.)

Let's introduce some more notation:

• $\emptyset$ is the empty set.


• $\bigcup_{i=1}^{\infty}B_{i}$ is the union of sets $B_{i}$; formally, $\bigcup_{i=1}^{\infty}B_{i}=\left\{ s\in S:s\in B_{i}\text{ for some }i\right\}$


• $B\subseteq S$ means $B$ is a subset of the sample space. For example, $B=\left\{ Heads\right\}$ is a subset of $S=\left\{ Heads,Tails\right\}$.


• From the line above, notice that $Heads$ is an element of set $B$ (and $S$). Notice the absence of curly braces.


• $B^{C}=S\backslash B$ is the complement of set $B$. When $B=\left\{ Heads\right\}$, $B^{C}=\left\{ Tails\right\}$. Formally, $B^{C}=\left\{ s\in S:s\notin B\right\}$.
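The notation above maps neatly onto Python's built-in sets, which makes it easy to experiment with. A minimal sketch for the coin-tossing example (the string labels are just one way to encode the outcomes):

```python
# The sample space and an event, encoded as Python sets.
S = {"Heads", "Tails"}   # sample space S
B = {"Heads"}            # an event B, with B ⊆ S

print(B.issubset(S))     # B is a subset of S
print(S - B)             # the complement B^C = S \ B
print(B | (S - B))       # B ∪ B^C recovers S
print("Heads" in B)      # Heads is an element of B (no curly braces)
```

Note the distinction the text draws: `"Heads" in B` asks about an element, while `B.issubset(S)` asks about a set.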

## Probability Function

A probability function is a function $P:\mathcal{B}\rightarrow\left[0,1\right]$, where:

• $P\left(S\right)=1$


• $P\left(\bigcup_{i=1}^{\infty}B_{i}\right)=\sum_{i=1}^{\infty}P\left(B_{i}\right)$, whenever $B_{1},B_{2},\ldots$ are pairwise disjoint.


• Notice that we haven't defined the domain $\mathcal{B}$ yet.

The properties of the probability function are intuitive. Since some outcome in $S$ must occur, the probability assigned to $S$ should equal 1, so we interpret $P\left(S\right)$ as the probability of some event in $S$ taking place. As for the second property: it states that the probability of a union of sets is the sum of the probabilities of each set, when the sets do not intersect. Clearly, we are interpreting $P\left(B\right)$ as the probability that some event in $B$ occurs.

## Remark

By now, you’ve probably noticed that, rather than talking about probability, we’re talking about sets and set theory. This is not too complicated. It’s simply the approach that mathematicians have taken to formalize probability, from ‘first principles’, aka, set theory. Essentially, we would like the probability function to map any set into a probability, i.e., a number between zero and one.

Let us consider the domain of the probability function now. One possibility is to make $\mathcal{B}=S$. If we did this, however, we would then have difficulty expressing probabilities like $P\left(Heads\,or\,Tails\right)$, which clearly should equal one (notice that $\{Heads, Tails\}$ does not belong to $S$). So, an intuitive idea is to make $\mathcal{B}=2^{S}$, i.e., $\mathcal{B}$ is the power set of $S$:
$2^{S}=\left\{ \emptyset,\left\{ Heads\right\} ,\left\{ Tails\right\} ,\left\{ Heads,Tails\right\} \right\}$. The power set is a ‘set of sets.’ The power set $2^{X}$ contains all the subsets of $X$. Clearly, $\left\{ Heads\right\}$ is a subset of $S$. So is $\left\{ Heads,Tails\right\}$, because a set is a subset of itself. Finally, the power set always contains the empty set, because the empty set is a subset of all sets. Let’s now check whether $2^{S}$ seems like a good candidate for the domain of the probability function, $\mathcal{B}$. With $\mathcal{B}=2^{S}$, we can obtain the following probability function for coin tossing:

$P\left(B\right)=\begin{cases} 1, & B=\left\{ Heads,Tails\right\} \\ \frac{1}{2}, & B=\left\{ Heads\right\} \\ \frac{1}{2}, & B=\left\{ Tails\right\} \\ 0, & B=\emptyset \end{cases}$

We’re making progress: we can interpret $B=\left\{ Heads,Tails\right\}$ as the event of obtaining “heads or tails”, and we can also interpret $B=\emptyset$ as an impossible event, for example observing “heads and tails” simultaneously, or observing that the toss yielded ‘Blue’ rather than Heads or Tails. When we consider sample spaces with countably many elements, the power set is indeed the default choice for the domain of the probability function.
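The power set and the probability function above can be built in a few lines of Python. This is a sketch assuming a fair coin, so that $P\left(B\right)=\left|B\right|/\left|S\right|$ reproduces the case-by-case table; `power_set` is our own helper name:

```python
from itertools import combinations

S = frozenset({"Heads", "Tails"})

def power_set(X):
    """All subsets of X, as frozensets (so they can be dict keys)."""
    items = list(X)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# For a fair coin, P(B) = |B| / |S| matches the table: each singleton
# gets 1/2, the whole space gets 1, and the empty set gets 0.
P = {B: len(B) / len(S) for B in power_set(S)}

print(len(P))                     # 4 sets in the domain 2^S
print(P[S])                       # P({Heads, Tails}) = 1.0
print(P[frozenset({"Heads"})])    # 0.5
print(P[frozenset()])             # P(∅) = 0.0
```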

## Domain of the Probability Function

Our choice of domain for the probability function seems to work well. However, mathematicians encountered a difficulty when they considered the power set of uncountable sets, for example, when determining the probability of a number falling in an interval of the real line. For starters, the power set of the reals is very complex. There is another, more serious issue: it is possible to prove that if the power set of the reals is used as the domain of $P\left(\cdot\right)$, one can then partition an interval of the real line into a countably infinite number of sets, each of which must have equal probability. This is no good: if each set has the same positive probability, the probabilities sum to infinity; if each has zero probability, they sum to zero. Either way, the total cannot equal one. (If you'd like to know how, see the example in the math blog InfinityPlusOneMath.)

### $\sigma$-algebra

In trying to find a convenient domain for the probability function, mathematicians isolated the properties of a set of sets that does not suffer from the paradox mentioned above. This is called a $\sigma$-algebra, and it’s the typical domain for probability functions. In case you haven’t heard the term ‘algebra’ in this context before: an algebra is a set together with the operations that can be applied to it. A $\sigma$-algebra $\mathcal{B}$ w.r.t. $S$ is a collection of events with the following properties:

• $\emptyset\in\mathcal{B}$
• $B\in\mathcal{B}\Rightarrow B^{C}\in\mathcal{B}$
• $B_{1},B_{2},...\in\mathcal{B}\Rightarrow\bigcup_{i=1}^{\infty}B_{i}\in\mathcal{B}$

So, $\mathcal{B}$ is a $\sigma$-algebra if the empty set belongs to it, the complement of each of its elements also belongs to it, and the union of any sets in $\mathcal{B}$ is also in $\mathcal{B}$. We won’t be getting into how this definition solves the continuity issues explained above. For the discrete case, it is easy to write a few $\sigma$-algebras down. The discrete $\sigma$-algebra is given by the power set of $S$: $\mathcal{B}=2^{S}=\left\{ \emptyset,\left\{ Heads\right\} ,\left\{ Tails\right\} ,\left\{ Heads,Tails\right\} \right\}$

The trivial $\sigma$-algebra is given by $\mathcal{B}=\left\{ \emptyset,S\right\} =\left\{ \emptyset,\left\{ Heads,Tails\right\} \right\}$. (You can verify that these are indeed $\sigma$-algebras.)

### Borel $\sigma$-algebra

When $S$ is uncountable, the standard choice for the domain of the probability function is the Borel $\sigma$-algebra (we will denote it as $\mathcal{B}\left(\mathcal{R}\right)$, but will not define it here). The Borel $\sigma$-algebra contains all open intervals in $\mathcal{R}$, closed intervals, half-open intervals, singletons, unions of intervals, etc. So, the Borel $\sigma$-algebra allows one to ask questions about the probability of most ‘reasonable’ sets. ‘Unreasonable’ sets are hard to come up with (see Borel Set for more information).

## Probability Space

We call the ingredients needed to talk about probabilities the probability space.

A probability space is a triple $(S,\mathcal{B},P)$, where $S$ is a sample space, $\mathcal{B}$ is a $\sigma$-algebra of events in $S$, and $P$ is a probability function.
The interpretation: $S$ is the set of possible singleton events; $\mathcal{B}$ is the set of questions we can ask the probability function (like, what is the probability that this and that happens, but not that other thing), and $P$ maps sets into probabilities.

## More on Probability Functions

Finally, let us define some properties of $P\left(\cdot\right)$:

• $P\left(B\right)=1-P\left(B^{C}\right)$, which implies that
• $P\left(\emptyset\right)=0$ since $P\left(\emptyset\right)=1-\underset{=1}{\underbrace{P\left(S\right)}}$.

Also, $P\left(A\cup B\right)=P\left(A\right)+P\left(B\right)-P\left(A\cap B\right)$, which implies that

• $P\left(A\cup B\right)\leq P\left(A\right)+P\left(B\right)$ and
• $P\left(A\cap B\right)\geq P\left(A\right)+P\left(B\right)-1$.
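These identities are easy to verify numerically. A minimal sketch using exact fractions and a uniform probability on a six-sided die (an assumed example, not from the text):

```python
from fractions import Fraction

# Uniform probability on a die: P(B) = |B| / |S|, computed exactly.
S = frozenset(range(1, 7))

def P(B):
    return Fraction(len(B), len(S))

A = frozenset({1, 2, 3})
B = frozenset({3, 4})

assert P(S - A) == 1 - P(A)                # complement rule
assert P(A | B) == P(A) + P(B) - P(A & B)  # inclusion-exclusion
assert P(A | B) <= P(A) + P(B)             # first implied inequality
assert P(A & B) >= P(A) + P(B) - 1         # second implied inequality
print("all identities hold")
```

Exact fractions avoid floating-point noise, so the equalities can be tested with `==` rather than approximate comparisons.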

## Conditional Probability

If A and B are events (remember, $A$ and/or $B$ could be $\emptyset$ or $\left\{ Heads,Tails\right\}$ for example, in the coin tossing example), and $P\left(B\right)\gt 0$, then the conditional probability of $A$ given $B$, denoted as $P\left(A|B\right)$, is $P\left(A|B\right)=\frac{P\left(A\cap B\right)}{P\left(B\right)}$.

This definition is intuitive. Suppose that, for a certain region, we obtain the probability of a given farmland having vineyards and/or cork trees, as in the table below:

Probabilities of observing vineyards and cork trees in a fictitious region:

i. Joint Probabilities

|               | Cork Trees: Yes | Cork Trees: No |
| ------------- | --------------- | -------------- |
| Vineyard: Yes | 20%             | 5%             |
| Vineyard: No  | 15%             | 60%            |

For example, the table above states that most fields (60%) have neither vineyards nor cork trees, while 20% of them have both. Suppose we would like to ask “what is the probability of observing cork trees in a field that has vineyards?” Looking at the first row (corresponding to fields that have vineyards), it might appear that the answer is $20\%$. However, the row sums to 25%, not 100%. The conditional probability of observing cork trees in a field that has vineyards is $20\%/25\%=0.8$.
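The computation can be sketched directly from the joint table (the tuple encoding of the cells is our own):

```python
# Joint probabilities from Table i., keyed as (vineyard, cork_trees).
joint = {
    (True, True): 0.20, (True, False): 0.05,
    (False, True): 0.15, (False, False): 0.60,
}

# P(cork trees | vineyard) = P(vineyard and cork trees) / P(vineyard)
p_vineyard = joint[(True, True)] + joint[(True, False)]   # marginal: 0.25
p_both = joint[(True, True)]                              # joint: 0.20
print(p_both / p_vineyard)                                # 0.8
```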

## Independence

To wrap up, one more definition: Two events, $A$ and $B$, are independent if $P\left(A\cap B\right)=P\left(A\right).P\left(B\right)$.

Clearly, the probabilities in Table i. are not independent. Consider a new set of probabilities:

ii. Independent Probabilities

|                 | Vineyard: Yes | Vineyard: No |
| --------------- | ------------- | ------------ |
| Cork Trees: Yes | 25%           | 25%          |
| Cork Trees: No  | 25%           | 25%          |

First, notice that the probability of a vineyard is constant, regardless of whether cork trees exist or not, and the converse is also true. For example, $P\left(V\cap C\right)=25\%=P\left(V\right).P\left(C\right)=50\%.50\%$.
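The independence definition can be checked against Table ii. by computing the marginals and comparing with the joint cell (again, the tuple encoding is our own):

```python
from itertools import product

# Joint probabilities from Table ii., keyed as (vineyard, cork_trees).
joint = {(v, c): 0.25 for v, c in product([True, False], repeat=2)}

p_v = sum(p for (v, c), p in joint.items() if v)   # P(Vineyard) = 0.5
p_c = sum(p for (v, c), p in joint.items() if c)   # P(Cork Trees) = 0.5

# Independence: P(V ∩ C) should equal P(V) · P(C).
print(abs(joint[(True, True)] - p_v * p_c) < 1e-12)   # True
```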

## Random Variables

To finish, we connect probability spaces to random variables. In practice, we'll often use random variables rather than probability spaces. A random variable is a (Borel measurable) function $X:S\rightarrow\mathcal{R}$. In other words, a random variable maps an element $s$ of $S$ to a real number. In coin tossing,

$X:\left\{ Heads,Tails\right\} \rightarrow\mathcal{R}$, given by

$X\left(s\right)=\begin{cases} 1, & \text{if }s=Heads\\ 0, & \text{if }s=Tails \end{cases}$.

Basically, a random variable maps experimental outcomes to numbers. It allows us to step away from the sample space, which would be cumbersome to work with in many applications. Random variables are related to probability spaces. In fact, we can associate with a random variable (r.v.) $X$ the probability space $\left(S_{X},\mathcal{B}_{X},P_{X}\right)$, where

• $S_{X} = \mathcal{R}$
• $\mathcal{B}_{X} = \mathcal{B\left(R\right)}$
• $P_{X} = P\circ X^{-1}$

The first two statements are definitional. The third identity states that the probability function of $X$ can be written as a function of the original probability function $P\left(\cdot\right)$ (whose argument is a set of elements from the sample space). To see that we can indeed make this identity work in the coin-tossing example, apply the third equality to the value $X\left(Heads\right)=1$: $P_{X}\left(1\right)=P_{X}\left(X=1\right)=P\left(X^{-1}\left(1\right)\right)=P\left(\left\{ s\in S:X\left(s\right)=1\right\} \right)$

This is tricky the first time, but in the end, all we have done is make $P_{X}\left(\cdot\right)$ a function of real numbers rather than of elements from the sample space. We now ask about the probability of a random variable equalling a specific number. The way to compute it is to check, in the original probability function, which elements of the sample space correspond to that number via the function $X$. In continuous cases, say where $X$ refers to height, we can write $P_{X}$ as $P_{X}\left(A\right)=P_{X}\left(X\in A\right)=P\left(X^{-1}\left(A\right)\right)=P\left(\left\{ s\in S:X\left(s\right)\in A\right\} \right)$ where $A$ is a set belonging to $\mathcal{B}\left(\mathcal{R}\right)$. (Again, only some sets are acceptable.) Random variables thus connect probabilities of elements of the sample space with probabilities of sets of real numbers. For simplicity, we often write $P\left(X=1\right)$ and $P\left(X\in A\right)$ rather than $P_{X}\left(X=1\right)$ and $P_{X}\left(X\in A\right)$. Finally, rather than ‘digging down’ into the probability function of a set, we usually define probabilities for random variables with cumulative distribution functions.
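The preimage computation $P_{X}\left(x\right)=P\left(\left\{ s\in S:X\left(s\right)=x\right\} \right)$ can be sketched for the coin toss as follows (the names are ours; a fair coin is assumed):

```python
# The pushforward P_X = P ∘ X⁻¹, computed by collecting the preimage
# {s in S : X(s) = x} and summing P over its elements.
S = ["Heads", "Tails"]
P = {"Heads": 0.5, "Tails": 0.5}           # P on singletons (fair coin)
X = lambda s: 1 if s == "Heads" else 0     # the random variable

def P_X(x):
    preimage = [s for s in S if X(s) == x]
    return sum(P[s] for s in preimage)

print(P_X(1))   # 0.5  (preimage is {Heads})
print(P_X(0))   # 0.5  (preimage is {Tails})
print(P_X(2))   # 0    (empty preimage: an impossible value)
```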

## Cumulative Distribution Function

A cumulative distribution function (c.d.f.) of a random variable $X$ is the function $F_{X}:\mathcal{R}\rightarrow\left[0,1\right]$, given by $F_{X}\left(x\right)=P_{X}\left(\left(-\infty,x\right]\right)=P\left(X\leq x\right),\,x\in\mathcal{R}$

So, the c.d.f. provides the probability of the value of a random variable falling at or below a scalar $x$. It is relatively easy to write down the c.d.f. for the coin-tossing example. Remember that we use the random variable $X\left(s\right)=\begin{cases} 1, & \text{if }s=Heads\\ 0, & \text{if }s=Tails \end{cases}$ so that $F_{X}\left(x\right)=\begin{cases} 0, & \text{if }x\lt 0\\ \frac{1}{2}, & \text{if }0\leq x\lt 1\\ 1, & \text{if }x\geq1 \end{cases}$
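The piecewise formula above can be reproduced by summing $P_{X}$ over all values at or below $x$. A sketch, again assuming a fair coin:

```python
# F_X(x) = P(X <= x), computed from the probabilities of X's values.
P_X = {0: 0.5, 1: 0.5}   # X(Tails) = 0, X(Heads) = 1

def F_X(x):
    return sum(p for value, p in P_X.items() if value <= x)

print(F_X(-0.5))   # 0    (x < 0)
print(F_X(0.3))    # 0.5  (0 <= x < 1)
print(F_X(1))      # 1.0  (x >= 1)
```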

As a final, quick side note: $F_{X}$ is simply the function’s name. We could have called it $F$, $G$, or “Banana.” We usually subscript the function with the variable it refers to, because we often end up using multiple c.d.f.s.