# Full Lecture 1

# Sample Space

Traditionally, in probability theory, tossing a coin can yield only Heads or Tails. The set of all possible outcomes, Heads or Tails in this case, is called the **sample space**. It is the space from which one can sample or, put differently, the space from which an experiment produces a realization. We denote the sample space by [math]S.[/math] In the coin tossing example,
[math]S=\left\{Heads,Tails\right\}.[/math]

Let's introduce some more notation:

- [math]\emptyset[/math] is the **empty set**. It can be denoted as [math]\emptyset=\left\{\right\}.[/math]
- [math]\bigcup_{i=1}^{\infty}B_{i}[/math] is the union of sets [math]B_{i}.[/math] Formally, [math]\bigcup_{i=1}^{\infty}B_{i}=\left\{ s\in S:s\in B_{i}\,for\,some\,i\right\}.[/math]
- [math]B\subseteq S[/math] means [math]B[/math] is a **subset** of the sample space. For example, [math]B=\left\{ Heads\right\}[/math] is a subset of [math]S=\left\{ Heads,Tails\right\}.[/math]
- [math]Heads[/math], without curly braces, is an **element** of set [math]B.[/math]
- [math]B^{C}=S\backslash B[/math] is the complement of set [math]B[/math]. When [math]B=\left\{ Heads\right\}[/math], [math]B^{C}=\left\{ Tails\right\}.[/math] Formally, [math]B^{C}=\left\{ s\in S:s\notin B\right\}.[/math]
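The notation above maps directly onto set operations in most programming languages. The following is a minimal Python sketch for the coin tossing example; the variable names (`family`, `complement`, and so on) are illustrative choices, not part of the lecture.

```python
# The coin-tossing sample space and one event, as Python sets.
S = frozenset({"Heads", "Tails"})
B = frozenset({"Heads"})

empty = frozenset()          # the empty set {}
is_subset = B <= S           # B is a subset of S
is_element = "Heads" in B    # Heads (no braces) is an element of B
complement = S - B           # B^C = S \ B

# Union of a family of sets B_1, B_2, ... (finite here, for illustration).
family = [frozenset({"Heads"}), frozenset({"Tails"}), empty]
union = frozenset().union(*family)

print(is_subset, is_element, complement == frozenset({"Tails"}), union == S)
```

Using `frozenset` rather than `set` keeps the events hashable, so that they can later be collected into a set of sets.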

# Probability Function

A probability function is a function [math]P:\mathcal{B}\rightarrow\left[0,1\right][/math], where:

- [math]P\left(S\right)=1.[/math]
- [math]P\left(\bigcup_{i=1}^{\infty}B_{i}\right)=\sum_{i=1}^{\infty}P\left(B_{i}\right)[/math], whenever [math]B_{1},B_{2},...[/math] are pairwise disjoint.

Notice that we haven't defined the domain [math]\mathcal{B}[/math] yet.

The properties of the probability function are intuitive. Some outcome in [math]S[/math] must occur, so [math]P\left(S\right)[/math], which we interpret as the probability of any outcome in [math]S[/math] taking place, should equal 1. The second property states that the probability of a union of sets is the sum of the probabilities of each set, when the sets do not intersect. Clearly, we are interpreting [math]P\left(B\right)[/math] as the probability that any of the outcomes in [math]B[/math] occurs.

## Remark

By now, you’ve probably noticed that, rather than talking about probability, we’re talking about sets and set theory. This is not too complicated. It’s simply the approach that mathematicians have taken to formalize probability, from ‘first principles.’ Essentially, we would like the probability function to map any set into a probability, i.e., a number between zero and one.

Let us consider the domain of the probability function now.

An intuitive solution is to make [math]\mathcal{B}=2^{S}[/math], i.e., [math]\mathcal{B}[/math] is the power set of [math]S[/math]:

[math]2^{S}=\left\{ \emptyset,\left\{ Heads\right\} ,\left\{ Tails\right\} ,\left\{ Heads,Tails\right\} \right\}.[/math]

The power set is a ‘set of sets.’ The power set [math]2^{X}[/math] contains all the subsets of [math]X[/math]. Clearly, [math]\left\{ Heads\right\}[/math] is a subset of [math]S[/math]. So is [math]\left\{ Heads,Tails\right\}[/math], because a set is a subset of itself. Finally, the power set always contains the empty set, because the empty set is a subset of all sets.

Let’s now check whether [math]2^{S}[/math] seems like a good candidate for the domain of the probability function, [math]\mathcal{B}[/math]. With [math]\mathcal{B}=2^{S}[/math], we obtain the following probability function for coin tossing:

[math]P\left(B\right)=\begin{cases} 1, & B=\left\{ Heads,Tails\right\} \\ \frac{1}{2}, & B=\left\{ Heads\right\} \\ \frac{1}{2}, & B=\left\{ Tails\right\} \\ 0, & B=\emptyset \end{cases}[/math]

We’re making progress: we can interpret [math]B=\left\{ Heads,Tails\right\}[/math] as the event of obtaining “heads or tails”, and we can also interpret [math]B=\emptyset[/math] as observing an 'impossible' event, such as the coin settling down vertically, or landing heads **and** tails. When we consider sets with countably many elements, the power set is indeed the default choice for the domain of the probability function.
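As a sanity check, the power set and the two properties of the probability function can be verified mechanically for the coin. This is a small Python sketch, assuming a fair coin where each outcome has weight 1/2; the helper name `power_set` is an illustrative choice.

```python
from fractions import Fraction
from itertools import chain, combinations

def power_set(s):
    """Return all subsets of s as a set of frozensets."""
    items = list(s)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(c) for c in subsets}

S = frozenset({"Heads", "Tails"})
domain = power_set(S)  # 2^S: four events for the coin

def P(B):
    """Fair-coin probability of event B (assumption: equal weights)."""
    return Fraction(len(B), len(S))

# First property: P(S) = 1.
assert P(S) == 1

# Second property, checked on every pair of disjoint events in 2^S.
for A in domain:
    for B in domain:
        if not (A & B):  # A and B are disjoint
            assert P(A | B) == P(A) + P(B)
```

With a finite sample space, checking pairs of disjoint events is enough, since any union of pairwise disjoint events involves only finitely many non-empty sets.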

# Domain of the Probability Function

Our choice of domain for the probability function seems to work well. However, mathematicians encountered a difficulty when they considered the power set of sets with uncountably many elements, for example, when determining the probability of a number falling in an interval of the real line. For starters, the power set of the reals seems very complex. There is another, more serious issue: It is possible to prove that if the power set of the reals is used as the domain of [math]P\left(\cdot\right)[/math], one can then partition the real line into a countably infinite number of sets, each of which can be assigned equal probability. This is not good; now, either each set has strictly positive probability, so that the sum of probabilities is infinity, or each set has zero probability, so that the sum of probabilities equals zero. Either way, the sum cannot equal [math]P\left(S\right)=1[/math]. (If you'd like to know how to create such a problematic partition of sets, see the example in the Math blog InfinityPlusOneMath.)

## [math]\sigma[/math]-algebra

In trying to find a convenient domain for the probability function, mathematicians isolated the properties of a set of sets that does not suffer from the paradox mentioned above. This is called a [math]\sigma[/math]-algebra, and it’s the typical domain for probability functions. In case you haven’t heard the term ‘algebra’ in this context before, an algebra is a definition of a set and of the operations that can be applied to it. A [math]\sigma[/math]-algebra [math]\mathcal{B}[/math] w.r.t. [math]S[/math] is a collection of events with the following properties:

- [math]\emptyset\in \mathcal{B}.[/math]
- [math]B\in\mathcal{B}\Rightarrow B^{C}\in\mathcal{B}.[/math]
- [math]B_{1},B_{2},...\in\mathcal{B}\Rightarrow\bigcup_{i=1}^{\infty}B_{i}\in\mathcal{B}.[/math]

Above, notice that even though [math]\emptyset[/math] and [math]B[/math] are sets, we use notation [math]\in[/math] rather than [math]\subseteq[/math], because [math]\mathcal{B}[/math] is a set of sets (i.e., each of its elements is a set).

So, [math]\mathcal{B}[/math] is a [math]\sigma[/math]-algebra if the empty set belongs to it, the complement of each of its elements is also in [math]\mathcal{B}[/math], and whenever countably many sets belong to [math]\mathcal{B}[/math], so does their union. We won’t be getting into how this definition solves the issues with uncountable sets explained above.

For the discrete case, it is easy to write a few [math]\sigma[/math]-algebras down:

- The **discrete** [math]\sigma[/math]-algebra is given by the power set of [math]S[/math]:

[math]\mathcal{B}=2^{S}=\left\{ \emptyset,\left\{ Heads\right\} ,\left\{ Tails\right\} ,\left\{ Heads,Tails\right\} \right\}.[/math]

- The **trivial** [math]\sigma[/math]-algebra is given by

[math]\mathcal{B}=\left\{ \emptyset,S\right\} =\left\{ \emptyset,\left\{ Heads,Tails\right\} \right\}[/math].

You can verify that these are indeed [math]\sigma[/math]-algebras.
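For a finite sample space, this verification can also be done by brute force. The sketch below is one way to do it in Python; the function name `is_sigma_algebra` and the counterexample `not_closed` are illustrative assumptions. On a finite collection, closure under pairwise unions is enough, because any countable union then reduces to a union of finitely many distinct sets.

```python
from itertools import combinations

def is_sigma_algebra(S, collection):
    """Check the three defining properties for a finite collection of subsets of S."""
    S = frozenset(S)
    events = {frozenset(b) for b in collection}
    # Property 1: the empty set belongs to the collection.
    if frozenset() not in events:
        return False
    # Property 2: closed under complements.
    if any((S - b) not in events for b in events):
        return False
    # Property 3: closed under unions (pairwise suffices for a finite collection).
    return all((a | b) in events for a, b in combinations(events, 2))

S = {"Heads", "Tails"}
discrete = [set(), {"Heads"}, {"Tails"}, {"Heads", "Tails"}]
trivial = [set(), {"Heads", "Tails"}]
not_closed = [set(), {"Heads"}, {"Heads", "Tails"}]  # lacks the complement {Tails}

print(is_sigma_algebra(S, discrete))
print(is_sigma_algebra(S, trivial))
print(is_sigma_algebra(S, not_closed))
```

The third collection fails because it contains [math]\left\{ Heads\right\}[/math] but not its complement.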

## Remark

You may be a little confused now. Notice that when we defined the sample space, we said that [math]Heads[/math] and [math]Tails[/math] were its elements in the typical coin tossing case. However, we did not list the empty set. The reason is that the empty set is not an element of [math]S[/math] in that case. This wouldn't make sense, since [math]S[/math] is not a set of sets, and [math]\emptyset[/math] is a set. In sum, [math]\emptyset[/math] is a subset of [math]S[/math], and it is an element of [math]\mathcal{B}[/math].

## Borel [math]\sigma[/math]-algebra

When [math]S[/math] is uncountable, for example when [math]S[/math] is the real line, the standard choice for the domain of the probability function is the **Borel [math]\sigma[/math]-algebra** (we will denote it as [math]\mathcal{B}\left(\mathcal{R}\right)[/math], but will not define it here). The Borel [math]\sigma[/math]-algebra contains all open intervals in [math]\mathcal{R}[/math], closed intervals, half-open intervals, singletons, unions of intervals, etc. So, the Borel [math]\sigma[/math]-algebra allows one to ask questions about the probability of most ‘reasonable’ sets. ‘Unreasonable’ sets are hard to come up with (see Borel Set for more information).

# Probability Space

We call the ingredients needed to talk about probabilities the **probability space.**

A **probability space** is a triple [math](S,\mathcal{B},P)[/math], where
[math]S[/math] is a sample space, [math]\mathcal{B}[/math] is a [math]\sigma[/math]-algebra of events in [math]S[/math], and [math]P[/math] is a probability function.

The interpretation:

- [math]S[/math] is the set of possible singleton events.
- [math]\mathcal{B}[/math] is the set of questions we can ask the probability function (like, what is the probability that this and that happens, but not that other thing).
- [math]P[/math] maps sets into probabilities.

# More on Probability Functions

Finally, let us state some properties of [math]P\left(\cdot\right)[/math] that follow from its definition:

- [math]P\left(B\right)=1-P\left(B^{C}\right)[/math],

which implies that

- [math]P\left(\emptyset\right)=0[/math], since [math]P\left(\emptyset\right)=1-\underset{=1}{\underbrace{P\left(S\right)}}.[/math]

Also,

[math]P\left(A\cup B\right)=P\left(A\right)+P\left(B\right)-P\left(A\cap B\right)[/math], which implies that

- [math]P\left(A\cup B\right)\leq P\left(A\right)+P\left(B\right)[/math] and
- [math]P\left(A\cap B\right)\geq P\left(A\right)+P\left(B\right)-1.[/math]
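These properties can be checked numerically on a small example. The sketch below uses a fair six-sided die, which is a hypothetical example not taken from the lecture, with two overlapping events chosen for illustration.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die (not from the lecture).
S = frozenset(range(1, 7))

def P(B):
    """Probability of event B under equal weights on the six faces."""
    return Fraction(len(B & S), len(S))

A = frozenset({1, 2, 3, 4})   # "at most four"
B = frozenset({2, 4, 6})      # "even"

complement_rule = P(A) == 1 - P(S - A)
empty_set_rule = P(frozenset()) == 0
inclusion_exclusion = P(A | B) == P(A) + P(B) - P(A & B)
union_bound = P(A | B) <= P(A) + P(B)
intersection_bound = P(A & B) >= P(A) + P(B) - 1

print(complement_rule, empty_set_rule, inclusion_exclusion,
      union_bound, intersection_bound)
```

The two inequalities follow from the identity for [math]P\left(A\cup B\right)[/math]: the first because [math]P\left(A\cap B\right)\geq0[/math], the second because [math]P\left(A\cup B\right)\leq1[/math].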

## Conditional probability

If [math]A[/math] and [math]B[/math] are events (remember, in the coin tossing example, [math]A[/math] and/or [math]B[/math] could be [math]\emptyset[/math] or [math]\left\{ Heads,Tails\right\}[/math], for example), and [math]P\left(B\right)\gt 0[/math], then the **conditional probability** of [math]A[/math] given [math]B[/math], denoted as [math]P\left(A|B\right)[/math], is

[math]P\left(A|B\right)=\frac{P\left(A\cap B\right)}{P\left(B\right)}[/math].

This definition is intuitive. Suppose that, for a certain region, we obtain the probability of a given farmland having vineyards and/or cork trees, as in the table below:

| | Cork Trees: Yes | Cork Trees: No |
|---|---|---|
| Vineyard: Yes | 20% | 5% |
| Vineyard: No | 15% | 60% |

For example, the table above states that most fields have no vineyards and no cork trees, while 20% of them have both. Suppose we would like to ask “what is the probability of observing cork trees in a field that has vineyards?” Looking at the first row (corresponding to fields that have vineyards), [math]20\%[/math] of all fields have both vineyards and cork trees. However, the row sums to only 25%, which is the probability of a field having vineyards. The conditional probability of observing cork trees in a field that has vineyards is therefore [math]20\%/25\%=0.8[/math].
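The same calculation can be written as a short Python sketch. The joint probabilities are those of the table; the function names `prob` and `conditional`, and the representation of events as predicates, are illustrative choices.

```python
from fractions import Fraction

# Joint probabilities from the table, keyed by (vineyard, cork_trees).
joint = {
    (True, True): Fraction(20, 100),
    (True, False): Fraction(5, 100),
    (False, True): Fraction(15, 100),
    (False, False): Fraction(60, 100),
}

def prob(event):
    """Probability of an event given as a predicate on (vineyard, cork_trees)."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

def conditional(a, b):
    """P(A|B) = P(A and B) / P(B), defined only when P(B) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(lambda v, c: a(v, c) and b(v, c)) / p_b

def has_vineyard(v, c):
    return v

def has_cork(v, c):
    return c

p_cork_given_vineyard = conditional(has_cork, has_vineyard)
print(p_cork_given_vineyard)  # 4/5
```

Exact fractions are used instead of floats so that the result, 4/5, matches the hand calculation without rounding.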

## Independence

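To wrap up, one more definition: Two events, [math]A[/math] and [math]B[/math], are **independent** if
[math]P\left(A\cap B\right)=P\left(A\right)\cdot P\left(B\right)[/math].

Clearly, the events in Table i. (the first table above) are not independent. Writing [math]V[/math] for vineyards and [math]C[/math] for cork trees, [math]P\left(V\cap C\right)=20\%[/math], whereas [math]P\left(V\right)\cdot P\left(C\right)=25\%\cdot35\%=8.75\%[/math]. Consider a new set of probabilities: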

| | Vineyard: Yes | Vineyard: No |
|---|---|---|
| Cork Trees: Yes | 25% | 25% |
| Cork Trees: No | 25% | 25% |

Notice that the probability of a vineyard is the same whether or not cork trees exist, and the converse is also true. For example,
[math]P\left(V\cap C\right)=25\%=50\%\cdot50\%=P\left(V\right)\cdot P\left(C\right)[/math].
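The definition of independence can be checked for both tables with a few lines of Python. This is a sketch under the assumption that each table is stored as a dictionary keyed by (vineyard, cork trees); the function name `is_independent` is illustrative.

```python
from fractions import Fraction

def is_independent(joint):
    """Test P(V and C) == P(V) * P(C) for a 2x2 table keyed by (vineyard, cork)."""
    p_v = sum(p for (v, c), p in joint.items() if v)   # marginal P(V)
    p_c = sum(p for (v, c), p in joint.items() if c)   # marginal P(C)
    return joint[(True, True)] == p_v * p_c

first_table = {
    (True, True): Fraction(20, 100),
    (True, False): Fraction(5, 100),
    (False, True): Fraction(15, 100),
    (False, False): Fraction(60, 100),
}
second_table = {
    (True, True): Fraction(25, 100),
    (True, False): Fraction(25, 100),
    (False, True): Fraction(25, 100),
    (False, False): Fraction(25, 100),
}

print(is_independent(first_table))
print(is_independent(second_table))
```

For a two-by-two table, checking the single cell "vineyard and cork trees" is sufficient: if it factorizes, the other three cells do as well.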

# Random variables

To finish, we connect probability spaces to random variables. In practice, we'll often use random variables rather than probability spaces. A random variable is a (Borel measurable) function [math]X:S\rightarrow\mathcal{R}[/math]. In other words, a random variable maps an element [math]s[/math] from [math]S[/math] to a real number.

In coin tossing,

[math]X:\left\{ Heads,Tails\right\}\rightarrow\mathcal{R}[/math], given by

[math]X\left(s\right)=\begin{cases} 1, & if\,s=Heads\\ 0, & if\,s=Tails \end{cases}.[/math]

Basically, a random variable maps experimental outcomes to numbers. It allows us to step away from the sample space, which would be cumbersome to work with in many applications. Random variables are related to probability spaces. In fact, we can summarize random variable (r.v.) [math]X[/math] as the probability space [math]\left(S_{X},\mathcal{B}_{X},P_{X}\right)[/math], where

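- [math]S_{X} = \mathcal{R}[/math]
- [math]\mathcal{B}_{X} = \mathcal{B}\left(\mathcal{R}\right)[/math]
- [math]P_{X} = P\left(X^{-1}\right)[/math], i.e., [math]P_{X}\left(A\right)=P\left(X^{-1}\left(A\right)\right)[/math] for each set [math]A[/math] in [math]\mathcal{B}\left(\mathcal{R}\right)[/math]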

The first two statements are definitional. The third identity states that the probability function of [math]X[/math] can be written as a function of the original probability function [math]P\left(\cdot\right)[/math]. To see that we can indeed make this identity work, in the coin tossing example, apply the third equality to the value 1, which corresponds to the element ‘Heads’:

[math]P_{X}\left(1\right)=P\left(X^{-1}\left(1\right)\right)=P\left(\left\{ s\in S:X\left(s\right)=1\right\} \right)=P\left(\left\{ Heads\right\} \right)=\frac{1}{2}.[/math]

This is a bit tricky the first time, but in the end, all we have done is make [math]P_{X}()[/math] be a function of real numbers rather than elements from the sample space.

Usually, we ask about the probability of a random variable equalling a specific number, or falling in a given set. The way we formally address this question is to map back to the original probability function, and find which elements of the sample space correspond to the number or set in question, via function [math]X[/math]. In continuous cases, say where [math]X[/math] refers to height, we can write [math]P_{X}[/math] as

[math]P_{X}\left(A\right)=P_{X}\left(X\in A\right)=P\left(X^{-1}\left(A\right)\right)=P\left(\left\{ s\in S:X\left(s\right)\in A\right\} \right)[/math] where [math]A[/math] is a set belonging to [math]\mathcal{B}\left(\mathcal{R}\right).[/math] Notice also that [math]X^{-1}[/math] need not be an inverse function in the usual sense. It can be a correspondence: [math]X^{-1}\left(A\right)[/math] is the set of all elements of [math]S[/math] that [math]X[/math] maps into [math]A[/math], and there may be several of them.

Random variables thus connect probabilities of elements of the sample space with probabilities of sets in the real numbers. For simplicity, we often write [math]P\left(X=1\right)[/math] and [math]P\left(X\in A\right)[/math] rather than [math]P_{X}\left(X=1\right)[/math] and [math]P_{X}\left(X\in A\right)[/math]. Finally, rather than expressing probabilities as functions of sets, which would be tedious, we usually define probabilities for random variables with cumulative distribution functions.
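The construction of [math]P_{X}[/math] from [math]P[/math] can be sketched in Python for the coin tossing example. The code below assumes a fair coin and represents a set [math]A[/math] of real numbers by a membership test; the names `preimage`, `P_outcome` and the sample sets are illustrative.

```python
from fractions import Fraction

S = ("Heads", "Tails")
P_outcome = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 2)}  # fair coin (assumption)

def X(s):
    """The random variable: 1 for Heads, 0 for Tails."""
    return 1 if s == "Heads" else 0

def preimage(A):
    """X^{-1}(A) = {s in S : X(s) in A}; A is given as a membership test."""
    return frozenset(s for s in S if A(X(s)))

def P(B):
    """Original probability function on events B, i.e. subsets of S."""
    return sum((P_outcome[s] for s in B), Fraction(0))

def P_X(A):
    """Probability function of X: P_X(A) = P(X^{-1}(A))."""
    return P(preimage(A))

p_one = P_X(lambda x: x == 1)                 # P(X = 1)
p_unit_interval = P_X(lambda x: 0 <= x <= 1)  # P(X in [0, 1])
p_negative = P_X(lambda x: x < 0)             # P(X < 0)

print(p_one, p_unit_interval, p_negative)
```

The preimage of [math]\left[0,1\right][/math] contains both outcomes, which illustrates that [math]X^{-1}[/math] returns a set of elements of [math]S[/math] rather than a single element.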

## Cumulative Distribution Function

A cumulative distribution function (c.d.f.) of a random variable [math]X[/math] is the function [math]F_{X}:\mathcal{R}\rightarrow\left[0,1\right][/math], given by

[math]F_{X}\left(x\right)=P_{X}\left(\left(-\infty,x\right]\right)=P\left(X\leq x\right),\,x\in\mathcal{R}.[/math]

So, the c.d.f. provides the probability of the value of a random variable being less than or equal to a scalar [math]x[/math]. It is relatively easy to write the coin tossing example in terms of a c.d.f.

Remember that we use random variable

[math]X\left(s\right)=\begin{cases} 1, & if\,s=Heads\\ 0, & if\,s=Tails \end{cases}[/math],

so that

[math]F_{X}\left(x\right)=\begin{cases} 0, & if\,x\lt 0\\ \frac{1}{2}, & if\,0\leq x\lt 1\\ 1, & if\,x\geq1 \end{cases}[/math]
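The step function above can be reproduced with a short Python sketch, assuming a fair coin. The dictionary `pmf`, which lists the probability of each value of [math]X[/math], is an illustrative name.

```python
from fractions import Fraction

# Probability of each value taken by X in the fair-coin example (assumption).
pmf = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def F_X(x):
    """F_X(x) = P(X <= x): add up the probabilities of all values at or below x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

for x in (-0.5, 0, 0.5, 1, 2):
    print(x, F_X(x))
```

The function jumps by 1/2 at [math]x=0[/math] and again at [math]x=1[/math], and is flat elsewhere, as in the piecewise definition.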

As a final, quick side-note, [math]F_{X}[/math] is simply a function’s name. We could have called it simply [math]F[/math], [math]G[/math] or something else. We sometimes subscript the function with the variable it refers to, because often we end up using multiple c.d.f.s.