Link back to Home page

Link to typed notes


1. Intro to Stats

Mean

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

Median: the middle value in an ordered list of observations
Mode: the most frequent value in the data

SSE = sum of squares of errors

$$\mathrm{SSE}(a)=\sum_{i=1}^{n}(x_i-a)^2$$

SAE = sum of absolute errors

$$\mathrm{SAE}(a)=\sum_{i=1}^{n}|x_i-a|$$

Theorem: $\bar{x}$ minimizes the SSE


Theorem: median minimizes the SAE

The main idea is to look at slopes: find the value where the gradient of the SAE changes from negative to positive

The proof is long and convoluted; link to it here. At some point it will make sense.
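A brute-force numeric check of both theorems (a sketch, not a proof): scan a grid of candidate centres and confirm the SSE is smallest at the mean and the SAE is smallest at the median. The data values are made up for illustration.

```python
# Brute-force check (not a proof): scan a grid of candidate centres and
# confirm SSE is smallest at the mean, SAE at the median.
data = [2, 3, 5, 7, 11]  # odd length, so the median is the middle value
mean = sum(data) / len(data)           # 5.6
median = sorted(data)[len(data) // 2]  # 5

def sse(a):
    # sum of squared errors around candidate centre a
    return sum((x - a) ** 2 for x in data)

def sae(a):
    # sum of absolute errors around candidate centre a
    return sum(abs(x - a) for x in data)

candidates = [i / 100 for i in range(0, 1501)]  # 0.00, 0.01, ..., 15.00
best_sse = min(candidates, key=sse)
best_sae = min(candidates, key=sae)

print(best_sse, mean)    # SSE minimiser lands on the mean
print(best_sae, median)  # SAE minimiser lands on the median
```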

Measures of spread

IQR = interquartile range, the difference between Q3 and Q1

The variance is:

$$\mathrm{Var}(x)=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2=\frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right)$$

Notice how $$\mathrm{SSE}(a)=\sum_{i=1}^{n}(x_i-a)^2$$

And we found that SSE is minimized at $a=\bar{x}$

So

$$\sum_{i=1}^{n}(x_i-\bar{x})^2=\mathrm{SSE}(\bar{x})$$

To get the variance, we divide this by $n-1$.
There is a reason we divide by $n-1$ rather than $n$: it makes the sample variance an unbiased estimate of the population variance. This is covered later in the course
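Python's standard library makes the $n-1$ vs $n$ distinction explicit, which gives a quick way to check the formula (example data made up):

```python
# statistics.variance() divides the SSE by n-1 (sample variance),
# statistics.pvariance() divides by n (population variance).
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
xbar = statistics.mean(data)
sse = sum((x - xbar) ** 2 for x in data)  # SSE(xbar)

print(sse / (n - 1), statistics.variance(data))   # both 32/7, about 4.571
print(sse / n, statistics.pvariance(data))        # both 4.0
```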

Proof of variance identity (not examined but still good for understanding)

However, there is a more useful form of the variance, the proof of which is given below:

$$\sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}(x_i^2-2x_i\bar{x}+\bar{x}^2)$$

Then we split the sum:

$$=\sum_{i=1}^{n}x_i^2-\sum_{i=1}^{n}2x_i\bar{x}+\sum_{i=1}^{n}\bar{x}^2$$

As $2\bar{x}$ is a constant, we can factor it out of the middle sum, and the last sum is just $n\bar{x}^2$

$$=\sum_{i=1}^{n}x_i^2-2\bar{x}\sum_{i=1}^{n}x_i+n\bar{x}^2$$

Now using the definition of the mean:

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i$$

So, rearranging,

$$\sum_{i=1}^{n}x_i=n\bar{x}$$

Substituting this in:

$$\sum_{i=1}^{n}(x_i-\bar{x})^2=\sum_{i=1}^{n}x_i^2-2\bar{x}\cdot n\bar{x}+n\bar{x}^2=\sum_{i=1}^{n}x_i^2-2n\bar{x}^2+n\bar{x}^2=\sum_{i=1}^{n}x_i^2-n\bar{x}^2$$

Therefore:

$$\mathrm{Var}(x)=\frac{1}{n-1}\left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right)$$

The standard deviation is the square root of the variance

$$s=\mathrm{sd}(x)=\sqrt{\mathrm{Var}(x)}$$

2. Intro to Probability

Definitions

Definition Random experiment: one in which we do not know exactly what the outcome of the experiment will be.
Definition Sample space: the set of all possible outcomes. Denoted by a capital S
Definition Event: a particular result of the random experiment, a subset of the sample space. Denoted by capital letters

Everything that is random is denoted by capital letters. Lowercase denotes values we have observed

Definition Union - $A\cup B$ - either A or B occurs, or both occur
Definition Intersection - $A\cap B$ - both A and B occur

Definition Mutually exclusive - $A\cap D=\varnothing$ - A and D have no outcomes in common, i.e. they cannot happen at the same time

Definition Complement - $A'$ - all of the outcomes not in A. Note the identities below:

$$A\cup A'=S$$

The statement above basically means that everything in A together with everything not in A is the whole sample space

$$A\cap A'=\varnothing$$

Axioms of probability

$$P\{S\}=1,\qquad 0\le P\{A\}\le 1$$

$$P\{A\cup B\}=P\{A\}+P\{B\}\ \text{ provided that }A\text{ and }B\text{ are mutually exclusive}$$

Combinatorics

The entire point of this section is fairly simple: being able to count outcomes without listing them all

There are two main questions that will determine which formula we have to use.

  1. Does the order matter? If it does, we use permutations, otherwise combinations
  2. Do we replace items after choosing? If yes, we use the with-replacement cases

Permutations

The scenario here is that we have n distinct items and we want to arrange k of them in a specific order.

Think of it like filling k positions on a shelf: the first position has n choices, the second has $n-1$, and so on down to $n-k+1$ choices for the k-th position.

So the total number of ordered arrangements (aka permutations) is:

$$P(n,k)=n\times(n-1)\times\cdots\times(n-k+1)$$

Multiplying numerator and denominator by $(n-k)!$ gives the simplified definition below

$$P(n,k)=\frac{n!}{(n-k)!}$$

Combinations

Now suppose we want to choose k items from n distinct ones, but the order doesn't matter.

In other words, when we go from permutations to combinations, we don't care what order the items are chosen in, only which items were chosen. Since permutations count the same group once for every possible ordering, we need to divide by how many times we've 'overcounted', which is $k!$

$$\binom{n}{k}=\frac{P(n,k)}{k!}=\frac{n!}{(n-k)!\,k!}$$
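Python's `math` module has both counts built in, which makes the overcounting argument easy to check (n = 5, k = 3 chosen arbitrarily):

```python
import math

# Counting sketch: arrange/choose k = 3 out of n = 5 distinct items
n, k = 5, 3

perms = math.perm(n, k)  # ordered arrangements: n!/(n-k)!
combs = math.comb(n, k)  # unordered choices: n!/((n-k)! k!)

print(perms)                       # 60
print(combs)                       # 10
print(perms // math.factorial(k))  # dividing out the k! overcount gives combs
```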

Conditional probability and Bayes’ Theorem

At its simplest, probability is just chance of an event, so it's:

$$P(A)=\frac{\text{number of ways }A\text{ can happen}}{\text{total possible outcomes}}$$

Conditional probability is the probability of event A happening given that another event B has already happened.
The formal definition is given below:

$$P(B|A)=\frac{P(A\cap B)}{P(A)},\qquad P(A|B)=\frac{P(A\cap B)}{P(B)}$$

When we rearrange, we get:

$$P(A\cap B)=P(B|A)\,P(A)=P(A|B)\,P(B)$$

Bayes’ Theorem

Now, when we know something about one conditional event, but want to know about the reverse, we can use Bayes' Theorem:

$$P(B|A)=\frac{P(B)\,P(A|B)}{P(A)}$$

We can derive the numerator by just substituting for $P(A\cap B)$ from the rearrangement above.

Bayes' Theorem is used widely in medical-test scenarios, as it allows people to update their belief about an event after new evidence appears.
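A worked medical-test example makes the update concrete. All the numbers here (1% prevalence, 95% sensitivity, 90% specificity) are made up for illustration:

```python
# Hypothetical medical-test numbers, chosen only for illustration
p_d = 0.01          # P(D): prior probability of having the disease
p_pos_d = 0.95      # P(+|D): sensitivity
p_pos_not_d = 0.10  # P(+|D'): false-positive rate = 1 - specificity

# Total probability of testing positive: P(+) = P(+|D)P(D) + P(+|D')P(D')
p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)

# Bayes' Theorem: P(D|+) = P(D) P(+|D) / P(+)
p_d_pos = p_d * p_pos_d / p_pos
print(round(p_d_pos, 4))  # under 10%, despite the "95% accurate" test
```

The low posterior comes from the prior: positives from the huge healthy group outnumber the true positives.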

Independence

Intuitively, events A and B are independent if the occurrence of one event does not affect the probability that the other event occurs.

Knowing that one event happened gives you no new information about the other.

Notice how this is different to mutually exclusive, which means that both events cannot happen at the same time, so $P(A\cap B)=0$

Formally,
If

$$P(B|A)=P(B)\quad\text{or}\quad P(A|B)=P(A)$$

Then

$$P(B|A)=\frac{P(A\cap B)}{P(A)}=P(B)\ \Rightarrow\ P(A\cap B)=P(A)P(B)$$

Proof of Independence of complementary events

If A and B are independent, so are $A'$ and $B'$

We want to show that $P(A'\cap B')=P(A')P(B')$. By De Morgan's law, $A'\cap B'=(A\cup B)'$, so:

$$P(A'\cap B')=1-P(A\cup B)$$

Using the general addition rule, we can rewrite the above as

$$=1-[P(A)+P(B)-P(A\cap B)]$$

Given that A and B are independent, we can substitute $P(A\cap B)=P(A)P(B)$:

$$=1-P(A)-P(B)+P(A)P(B)$$

This can be factorised as:

$$=[1-P(A)][1-P(B)]$$

Which is exactly $P(A')P(B')$, as required
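A quick numeric check of the result, assuming arbitrary values $P(A)=0.3$ and $P(B)=0.6$:

```python
# Pick independent A, B and confirm the complements multiply too
p_a, p_b = 0.3, 0.6
p_a_and_b = p_a * p_b  # independence assumption

# De Morgan + addition rule: P(A' n B') = 1 - P(A u B)
p_not_a_and_not_b = 1 - (p_a + p_b - p_a_and_b)

print(p_not_a_and_not_b)             # 0.28
print((1 - p_a) * (1 - p_b))         # also 0.28 = P(A')P(B')
```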


3. Probability Distributions

Random variables are denoted by uppercase letters, like X,Y,Z

A random variable is just a mapping from the outcome space to the real line, and the combined probability of all values of a random variable is 1

Discrete Random Variable (DRV)

Definition
DRV - discrete random variable: a random variable that takes a finite (or countably infinite) set of values

A DRV takes specific values, like the number of heads in coin flips, and we can count how many outcomes there are. We assign probabilities to each value individually:

$$P(X=x_i)$$

And so to find the total probability up to some value k, we just sum the individual probabilities:

$$P(X\le k)=\sum_{x_i\le k}P(X=x_i)$$

For a DRV, we define a function $$ f(x) = P(X=x) $$
The function f(x) is called the probability mass function (PMF), or just the probability function. As the total probability must be 1 and all probabilities must be non-negative, we need $f(x)\ge 0$ for all $x$ and $\sum_x f(x)=1$

Continuous Random Variable (CRV)

Definition
CRV - continuous random variable: a variable that can take any value on the real line

There are infinitely many possible values, so the probability of hitting any exact number is zero: $P(X=x)=0$

So instead, we can talk about densities, the probability per unit of x

Probability Density Function (PDF)

This function essentially tells us how densely probability is packed around each value.
This is defined as:

$$P(a\le X\le b)=\int_{a}^{b}f(x)\,dx$$

That's why for CRVs, integration replaces summation: it's the continuous version of adding up infinitely many infinitesimally small pieces.

For the PDF, we need $f(x)\ge 0$ for all $x$ and $\int_{-\infty}^{\infty}f(x)\,dx=1$

Cumulative distribution function (CDF)

This function calculates the probability of the random variables up to its arguments:

$$F(x)=P(X\le x)$$

For a DRV, it's defined as:

$$F(x)=P(X\le x)=\sum_{x_i\le x}P(X=x_i)$$

And for a CRV, the CDF is defined as:

$$F(x)=P(X\le x)=\int_{-\infty}^{x}f(t)\,dt$$

So the CDF is essentially the area under the PDF curve up to x

Relationship between PDF and CDF

For a CRV, the PDF is the derivative of the CDF

If

$$F(x)=\int_{-\infty}^{x}f(t)\,dt$$

And we differentiate both sides with respect to x we get:

$$f(x)=\frac{d\,F(x)}{dx}$$

What this tells us is that the PDF is just the slope of the CDF. Where the CDF rises steeply, the PDF is large: lots of probability density there. And where the CDF is flat, the PDF is small.

Expectation

Think of expectation (or the expected value) as the long-run average value of a random variable if you repeated the experiment infinitely many times.

For example, if you roll a fair die, the values are in the set $\{1,2,\dots,6\}$
Each roll has probability 1/6.
Then

$$E[X]=1\cdot\frac{1}{6}+2\cdot\frac{1}{6}+\cdots+6\cdot\frac{1}{6}=3.5$$

You'll never roll a 3.5, but if you roll hundreds of times, your average approaches 3.5
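The long-run-average idea is easy to simulate (a sketch; the seed and sample size are arbitrary):

```python
import random

# Simulate many fair die rolls; the running average approaches E[X] = 3.5
random.seed(0)
rolls = [random.randint(1, 6) for _ in range(200_000)]
avg = sum(rolls) / len(rolls)

exact = sum(x * (1 / 6) for x in range(1, 7))  # the definition of E[X]
print(exact)          # 3.5
print(round(avg, 2))  # close to 3.5
```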

Formal definition

For a DRV with possible values $x_1,x_2,\dots,x_n$ and probabilities $p_i=P(X=x_i)$:

$$E[X]=\sum_i x_i p_i$$

And for CRV with PDF f(x):

$$E[X]=\int_{-\infty}^{\infty}x f(x)\,dx$$

Same idea, we're just replacing the sum with an integral because now the variable takes infinitely many values

Expectation of a uniform distribution function

Say we have $X\sim U(a,b)$, a uniform distribution where the probability density is the same everywhere between a and b

The PDF is:

$$f(x)=\begin{cases}\frac{1}{b-a}, & a\le x\le b\\ 0, & \text{otherwise}\end{cases}$$

Then the expectation is:

$$E[X]=\int_a^b x\cdot\frac{1}{b-a}\,dx=\frac{1}{b-a}\left[\frac{x^2}{2}\right]_a^b=\frac{b^2-a^2}{2(b-a)}=\frac{a+b}{2}$$

Which is just the midpoint of the interval

Expectation of a function

If you have some function of the random variable, say $Y=g(X)$, then its expectation is

$$E[g(X)]=\sum_i g(x_i)p_i\qquad\text{or}\qquad E[g(X)]=\int_{-\infty}^{\infty}g(x)f(x)\,dx$$

This is extremely useful as it lets us find things like $E[X^2]$, which we need for the variance

Linearity of expectation

This is probably one of the most useful properties in probability.
For any random variable X and constants a, b:

$$Y=aX+b\ \Rightarrow\ E(Y)=aE(X)+b$$

The reason we can do this is because expectation acts like a weighted average. If we stretch all our data by a factor of a, then the average stretches by a, and then if we shift all the data by b, the average shifts by b.

Proof of linearity

For a DRV:

$$E[Y]=\sum_y y\,P(Y=y)$$

But since Y=aX+b, we can rewrite the above expression as:

$$E[Y]=\sum_x (ax+b)\,P(X=x)$$

And now we can distribute the sum:

$$E[Y]=a\sum_x x\,P(X=x)+b\sum_x P(X=x)$$

The first part of the sum is just aE[X] by definition of expectation, and the second part is b(1) since the probabilities sum to 1.
So we have:

$$E[Y]=aE[X]+b$$

For a CRV:

$$E[Y]=\int_{-\infty}^{\infty}y\,f(y)\,dy$$

And again, as Y=aX+b, we can write it as:

$$E[Y]=\int_{-\infty}^{\infty}(ax+b)f(x)\,dx=a\int_{-\infty}^{\infty}xf(x)\,dx+b\int_{-\infty}^{\infty}f(x)\,dx$$

The first part is again just $aE[X]$ by definition, and the second integral equals 1 since the total probability is 1, leaving $b$.

Note on notation here: capital X represents the random variable, i.e. the function that maps outcomes to numbers, and lowercase x represents a value that X might take. So when we're computing $E[Y]$, we're summing or integrating over all the possible values it can take, which is why we write x inside the sum or integral.
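Linearity is easy to verify on a small discrete distribution (values, probabilities, and the constants a, b are arbitrary):

```python
# Check E(aX + b) = aE(X) + b on a small discrete distribution
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]
a, b = 5, -2

e_x = sum(x * p for x, p in zip(values, probs))
e_y = sum((a * x + b) * p for x, p in zip(values, probs))  # E of Y = aX + b

print(round(e_x, 6))         # 2.0
print(round(e_y, 6))         # 8.0
print(round(a * e_x + b, 6)) # matches E(Y)
```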

Expectation of symmetric random variable

A random variable X is called symmetric about a point c if its PDF (or PMF) satisfies:

$$f(c+x)=f(c-x)\quad\forall x>0$$

If X is symmetric about c, then $E(X)=c$

Intuitively, think of this as the centre of mass of the function. If the PDF or PMF is symmetric about a point c, then for every value to the left of c, there's an equal value to the right of c with the same probability. This property saves us from having to do nasty integrals when finding expectations.

Proof for symmetric random variable

First, let $Y=X-c$. This means that Y is symmetric about 0, so its PDF satisfies the property of an even function:

$$f(-y)=f(y)\quad\forall y>0$$

Then we have:

$$E(Y)=\int_{-\infty}^{\infty}yf(y)\,dy=\int_{-\infty}^{0}yf(y)\,dy+\int_{0}^{\infty}yf(y)\,dy$$

We can substitute $z=-y$ in the first integral and flip the limits to get:

$$=-\int_{0}^{\infty}zf(-z)\,dz+\int_{0}^{\infty}yf(y)\,dy$$

And as $f(-z)=f(z)$:

$$=-\int_{0}^{\infty}zf(z)\,dz+\int_{0}^{\infty}yf(y)\,dy=0$$

So $E(Y)=0$, and therefore $E(X)=E(Y+c)=c$

Variance

We already saw that the expectation E(X) is the long-run average of X. Now, if we want to know how spread out the values are from the mean, we use the variance.

If we wanted to measure how far each possible value of X deviates from the mean, that's $X-E(X)$. If we took lots of samples from the population, we could average those deviations, but the positive and negative ones cancel, so we instead square them to get the magnitude:

$$\mathrm{Var}(X)=E\left[(X-E(X))^2\right]$$

For very large n, we know that $\bar{x}\approx E(X)$: for a large enough sample, the sample mean gets close to the actual mean. So if we write $\mu=E(X)$, the average squared deviation in a sample is:

$$\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2$$

So when n is large, this will be close to :

$$E\left[(X-\mu)^2\right]$$

The variance is therefore defined by:

$$\mathrm{Var}(X)=E\left[(X-\mu)^2\right]=\begin{cases}\sum_x (x-\mu)^2 f(x) & \text{if }X\text{ is discrete}\\ \int_{-\infty}^{\infty}(x-\mu)^2 f(x)\,dx & \text{if }X\text{ is continuous}\end{cases}$$

Proving variance

There is a more useful form of variance, which we can prove from the definition below:

$$\mathrm{Var}(X)=E\left[(X-E(X))^2\right]$$

After expanding the bracket

$$(X-E(X))^2=X^2-2XE(X)+(E(X))^2$$

Take the expectation of both sides:

$$E\left[(X-E(X))^2\right]=E\left[X^2-2XE(X)+(E(X))^2\right]$$

Using the linearity of expectation, we can split and pull constants out to get:

$$=E(X^2)-2E(X)E(X)+(E(X))^2$$

After simplifying we have:

$$\mathrm{Var}(X)=E(X^2)-(E(X))^2=E(X^2)-\mu^2,\quad\text{where }\mu=E(X)$$

Linearity of variance

If Y=aX+b, then

$$E(Y)=aE(X)+b,\qquad \mathrm{Var}(Y)=a^2\mathrm{Var}(X)$$

To prove this, we can use the definition of variance and some algebra

$$\mathrm{Var}(Y)=E\left[(Y-E(Y))^2\right]$$

After substituting in for E(Y) using the linearity of expectation we have:

$$=E\left[(Y-(aE(X)+b))^2\right]=E\left[((aX+b)-aE(X)-b)^2\right]$$

We can cancel the b terms to get:

$$=E\left[a^2(X-E(X))^2\right]$$

As $E\left[(X-E(X))^2\right]=\mathrm{Var}(X)$ by definition, we have:

$$\mathrm{Var}(Y)=a^2\mathrm{Var}(X)$$
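The shift-drops-out, scale-squares behaviour can be checked numerically (same arbitrary toy distribution idea as before):

```python
# Check Var(aX + b) = a^2 Var(X): the shift b drops out, the scale a squares
values = [0, 1, 2, 3]
probs = [0.1, 0.2, 0.3, 0.4]
a, b = 5, -2

def expect(g):
    # E[g(X)] for this discrete distribution
    return sum(g(x) * p for x, p in zip(values, probs))

mu_x = expect(lambda x: x)
var_x = expect(lambda x: (x - mu_x) ** 2)

mu_y = expect(lambda x: a * x + b)
var_y = expect(lambda x: (a * x + b - mu_y) ** 2)

print(round(var_x, 6))  # 1.0
print(round(var_y, 6))  # 25.0 = a^2 * Var(X)
```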

Sample median of random variables

Quantiles

For any $0<p<1$, the p-quantile q solves $F(q)=p$. So if F is invertible, then $q=F^{-1}(p)$

Standard Discrete Distributions

Bernoulli Distributions

Bernoulli trials is the name for a set of independent trials, where each trial has only two possible outcomes: success and failure.

For a Bernoulli trial:

$$X=\begin{cases}1 & \text{if success}\\ 0 & \text{if failure}\end{cases}$$

$$X\sim\mathrm{Bernoulli}(p)$$

The Bernoulli distribution has PMF:

$$f(x)=P(X=x)=p^x(1-p)^{1-x},\quad x=0,1$$

For this distribution, the expectation is:

$$E(X)=\sum_x xf(x)=p,\qquad E(X^2)=\sum_x x^2f(x)=p$$

And the variance is:

$$\mathrm{Var}(X)=E(X^2)-\mu^2=p-p^2=p(1-p),\quad \mu=E(X)$$

Binomial distribution

Defined as:

$$X\sim B(n,p)$$

If X is the number of successes (S) out of n Bernoulli trials

PMF for the Binomial

To find the PMF, we want $f(x)=P(X=x)$, i.e. the probability of x successes and $(n-x)$ failures

The probability of any one particular ordering of outcomes is given by:

$$P(SSFSS\cdots F)=p\cdot p\cdot(1-p)\cdot p\cdot p\cdots(1-p)=p^x(1-p)^{n-x}$$

And there are $\binom{n}{x}$ such orderings.

So to generalize,
The pmf is:

$$f(x)=P(X=x)=\binom{n}{x}p^x(1-p)^{n-x},\quad x=0,1,\dots,n$$

Now one might wonder how we know that $$ \sum_{x=0}^{n}f(x) = 1$$
… by using the binomial theorem

$$(a+b)^n=b^n+\binom{n}{1}ab^{n-1}+\binom{n}{2}a^2b^{n-2}+\cdots+a^n=\sum_{x=0}^{n}\binom{n}{x}a^x b^{n-x}$$

Now if we choose $a=p$ and $b=(1-p)$, then

$$\sum_{x=0}^{n}\binom{n}{x}p^x(1-p)^{n-x}=(p+(1-p))^n=1$$

Expectation and Variance of a Binomial distribution

Intuitively, think of X (the total number of successes) as being made up of n smaller random variables, one for each trial

Let

$$X=X_1+X_2+X_3+\cdots+X_n$$

Where Xi is 1 if the trial is a success and 0 if the trial is a failure. So each Xi is a Bernoulli random variable with parameter p

Now, we know that for a single trial, the expectation of a Bernoulli variable is

$$E(X_i)=1\cdot p+0\cdot(1-p)=p$$

And since expectation is linear, the expectation of the sum is

$$E(X)=E(X_1+X_2+\cdots+X_n)=E(X_1)+E(X_2)+\cdots+E(X_n)$$

Where each E(Xi)=p

So we get that the expectation of a binomial is

$$E(X)=np$$

The variance is

$$\mathrm{Var}(X)=np(1-p)$$

The proof of this is left as an exercise to the reader
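Both closed forms can at least be verified numerically from the PMF (n and p chosen arbitrarily):

```python
import math

# Compute E(X) and Var(X) directly from the binomial PMF and compare with
# the closed forms np and np(1-p)
n, p = 10, 0.3

def pmf(x):
    # binomial PMF: C(n, x) p^x (1-p)^(n-x)
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

e_x = sum(x * pmf(x) for x in range(n + 1))
e_x2 = sum(x * x * pmf(x) for x in range(n + 1))
var_x = e_x2 - e_x**2

print(round(e_x, 6))    # 3.0 = np
print(round(var_x, 6))  # 2.1 = np(1-p)
```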

Geometric Distribution

Intuitively, the geometric distribution is the first one that captures the "keep trying until you succeed" idea. So imagine we're running an experiment where each trial is independent and succeeds with probability p.

The random variable X represents the trial number on which the first success occurs.

PMF for a geometric distribution

To get a success on the x-th trial, we must have (x1) failures first, followed by 1 success. Each success has a probability of p and each failure has a probability of (1p)

$$P(X=x)=(1-p)^{x-1}p,\quad x=1,2,3,\dots$$

CDF

$$P(X\le x)=1-(1-p)^x,\qquad P(X>x)=(1-p)^x$$

Note - R assumes the number of failures before the first success, instead of the number of trials until the first success. So if $X\sim\mathrm{Geo}(p)$, then $X-1$ is the number of failures

Memoryless distribution

The geometric distribution is memoryless, i.e.

$$P(X>s+k\mid X>k)=P(X>s)$$

This just means that the probability we still have to wait s more trials doesn't depend on how long we've already waited. No other discrete distribution has this property!

Expectation of a geometric distribution

For the intuition, imagine playing a game where each round succeeds with probability p.

How many tries do we expect to need before the first success?

If p was large, then we'd succeed quickly, resulting in a small expected X
If p was small, then we would have to wait much longer, so a large expected X

So the average should be:

$$E(X)=\frac{1}{p}$$

The proof involves the negative binomial series, something which will not be covered in lectures, but it's in the notes. A much nicer proof is covered in the 2nd year Statistical Inference module

Variance of a geometric distribution

Going back to our previous analogy, the variance here is how spread out the times between our successes are. So if p were big, we would succeed quickly, resulting in a smaller variance. And if p were small, we could have to wait a long time, so a big variance

$$\mathrm{Var}(X)=\frac{1-p}{p^2}$$
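A quick check of both formulas, summing the PMF far into the tail (p and the truncation point are arbitrary; the truncation error is negligible since $(1-p)^{2000}$ is astronomically small):

```python
# Sum the geometric PMF (truncated deep in the tail) to check
# E(X) = 1/p and Var(X) = (1-p)/p^2
p = 0.25
tail = 2000
xs = range(1, tail + 1)

pmf = [(1 - p) ** (x - 1) * p for x in xs]
e_x = sum(x * f for x, f in zip(xs, pmf))
e_x2 = sum(x * x * f for x, f in zip(xs, pmf))

print(round(e_x, 6))            # 4.0  = 1/p
print(round(e_x2 - e_x**2, 6))  # 12.0 = (1-p)/p^2
```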

Hypergeometric Distribution

This distribution is all about sampling without replacement, i.e. the realistic binomial

Suppose we have a population of N total individuals, where a proportion p are of type S (success) and the rest are of type F (failure).

So the total number of type S individuals is $Np$ and type F is $N(1-p)$

Now, we want to sample n individuals without replacement. Let X = the number of type S in our sample; then X follows a hypergeometric distribution:

$$X\sim\mathrm{HypGeo}(N,n,p)$$

Then its PMF is given by:

$$f(x)=P(X=x)=\frac{\binom{Np}{x}\binom{N(1-p)}{n-x}}{\binom{N}{n}},\quad x=0,1,2,\dots,n$$

The numerator counts how many ways to get exactly x successes and $n-x$ failures, and the denominator normalises so that the probabilities sum to 1

Expectation

$$E(X)=np$$

Notice how the expectation is the same as for the binomial distribution, even though we are drawing without replacement: each individual draw is still type S with probability p

Variance

$$\mathrm{Var}(X)=np(1-p)\frac{N-n}{N-1}$$

Negative Binomial Distribution

The negative binomial distribution models the number of trials needed to achieve a fixed number of successes, r, in independent Bernoulli trials. In other words, it's how long until we get r successes.

If we just want 1 success, then the negative binomial is the same as the geometric distribution, which stops when we get our first success.

$$\mathrm{NegBin}(r=1,p)\equiv\mathrm{Geo}(p)$$

PMF

Imagine flipping a coin with probability $p=0.3$ of getting heads, and we want to know how many flips it will take to get 3 heads. That random variable X follows a negative binomial distribution, the PMF for which is given by:

$$X\sim\mathrm{NegBin}(r,p),\qquad f(x)=P(X=x)=\binom{x-1}{r-1}(1-p)^{x-r}p^r,\quad x=r,r+1,\dots$$

Expectation

Each success takes $1/p$ trials on average, and we need r of them

$$E(X)=\frac{r}{p}$$

Variance

$$\mathrm{Var}(X)=\frac{r(1-p)}{p^2}$$

Poisson Distribution

The Poisson distribution models the number of times an event happens in a fixed interval. It covers things like the number of emails you might get per hour: we can't predict exactly when each event happens, but over many intervals, the average rate stays the same.

So formally, it's the limit of $\mathrm{Bin}(n,p)$ as $n\to\infty$ and $p\to 0$ with $\lambda=np$ fixed. And we can often use the Poisson as an approximation to the binomial when n is large and p is small

PMF

The PMF of this distribution can be derived by taking the limit of the binomial PMF after substituting $p=\lambda/n$

$$X\sim\mathrm{Po}(\lambda),\qquad P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!},\quad x=0,1,2,3,\dots$$

Here, $\lambda$ is the average number of events per interval (the mean rate), and $e^{-\lambda}$ ensures that the probabilities all add up to 1

Expectation

On average, we would expect $\lambda$ events to occur per interval, so that's our expectation

$$E(X)=\lambda$$

Variance

If events occur randomly and independently, then the variability of counts around the means grows proportionally to the mean. So, the mean = variance in this case

$$\mathrm{Var}(X)=\lambda$$

Therefore, the expectation and variance are the same for the Poisson distribution
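The "limit of the binomial" claim can be seen numerically: for large n and small p, the two PMFs nearly coincide (n, p chosen arbitrarily so that $\lambda = np = 3$):

```python
import math

# Poisson(lam = np) vs Binomial(n, p) for large n, small p
n, p = 1000, 0.003
lam = n * p  # 3.0

def binom_pmf(x):
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def pois_pmf(x):
    return math.exp(-lam) * lam**x / math.factorial(x)

for x in range(6):
    print(x, round(binom_pmf(x), 5), round(pois_pmf(x), 5))

max_gap = max(abs(binom_pmf(x) - pois_pmf(x)) for x in range(20))
print(max_gap < 1e-3)  # the approximation is already very close
```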

Uniform Distribution

CDF - as this is a continuous distribution

$$F(x)=P(X\le x)=\frac{x-a}{b-a},\quad a<x<b$$

Expectation

$$E(X)=\frac{a+b}{2}$$

Variance

$$\mathrm{Var}(X)=\frac{(b-a)^2}{12}$$

Exponential Distribution

This distribution is a bit like a continuous version of the geometric distribution, and it connects really nicely with the Poisson process. The exponential models the waiting time until the next event in a Poisson process with rate $\theta$:

$$X\sim\mathrm{Exp}(\theta)$$

So the Poisson counts how many events occur in a fixed time interval, whereas the exponential measures how long between events

PDF

$$f(x)=\theta e^{-\theta x},\quad x\ge 0$$

The full proof involves the gamma function, which is given in the official lecture notes

Expectation

$$E(X)=\frac{1}{\theta}$$

If events occur at rate $\theta$ per unit time, the average wait, intuitively, is $\frac{1}{\theta}$ time units

Variance

$$\mathrm{Var}(X)=\frac{1}{\theta^2}$$

This one isn't immediately obvious, but the spread of waiting times grows as $\theta$ shrinks: rare events mean long, highly variable waits

CDF

The cdf for x>0 is defined as:

$$F(x)=P(X\le x)=\int_{0}^{x}\theta e^{-\theta u}\,du=1-e^{-\theta x}$$

Quantiles

As discussed earlier, quantiles tell us how far along the probability distribution we have to go to capture a certain proportion of the data. So for a CRV X with CDF F(x), the p-th quantile is the value q s.t.:

$$P(X\le q)=F(q)=p$$

So we're solving for the x value that corresponds to the cumulative probability p. And to find that value, we need the inverse of the CDF, i.e. $q=F^{-1}(p)$

For the exponential distribution, recall that the CDF is $F(x)=1-e^{-\theta x}$, so to solve for q, we have:

$$F(q)=1-e^{-\theta q}=p\ \Rightarrow\ 1-p=e^{-\theta q}\ \Rightarrow\ \ln(1-p)=-\theta q\ \Rightarrow\ q=-\frac{1}{\theta}\ln(1-p)$$

And that's our quantile function
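The quantile function should invert the CDF exactly, which is easy to check ($\theta = 0.5$ chosen arbitrarily):

```python
import math

# Exponential CDF F(x) = 1 - e^{-theta x} and its inverse (the quantile function)
theta = 0.5

def cdf(x):
    return 1 - math.exp(-theta * x)

def quantile(p):
    return -math.log(1 - p) / theta

for p in (0.25, 0.5, 0.9):
    q = quantile(p)
    print(p, round(q, 4), round(cdf(q), 4))  # cdf(q) recovers p

print(round(quantile(0.5), 4))  # the median is ln(2)/theta
```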

Memorylessness (Again!)

The exponential distribution is also memoryless, as it’s just a continuous version of the geometric.

$$P(X>s+t\mid X>s)=P(X>t)$$

So the probability that we have to wait at least t more time units, given that we've already waited s time units without an event, is the same as if we had just started waiting now, i.e. past waiting doesn't change future expectation

Normal Distribution

$$X\sim N(\mu,\sigma^2)$$

PDF

$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The CDF of the standard normal is just

$$\Phi(z)=P(Z\le z)=\int_{-\infty}^{z}\frac{1}{\sqrt{2\pi}}e^{-\frac{u^2}{2}}\,du$$

3b1b has some good videos on this

To show that the total probability is 1, we use the property of the Gamma function:

$$\Gamma\left(\frac{1}{2}\right)=\sqrt{\pi}$$

And the standard normal $Z=\frac{X-\mu}{\sigma}$. The full proof can be found in the notes

Expectation

$$E(X)=\mu$$

Variance

$$\mathrm{Var}(X)=\sigma^2$$

Linear Transformation

$$X\sim N(\mu,\sigma^2)\ \Rightarrow\ Y=aX+b\sim N(a\mu+b,\ a^2\sigma^2)$$

We can also standardise any normal to get

$$Z=\frac{X-\mu}{\sigma},\qquad Z\sim N(0,1)$$

Probabilities using tables

Given the standard normal Z, we can use one universal table for all normal distributions, as every normal curve is just a shift and stretch of the standard normal

So if we want the probability $P(a\le X\le b)$, we can standardise to get:

$$P(a\le X\le b)=P\left(\frac{a-\mu}{\sigma}\le Z\le\frac{b-\mu}{\sigma}\right)$$

And then we can use the standard normal CDF $\Phi(z)=P(Z\le z)$ to get the probability
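Instead of a printed table, $\Phi(z)$ can be computed from the error function via $\Phi(z)=\frac{1}{2}\left(1+\mathrm{erf}\left(\frac{z}{\sqrt{2}}\right)\right)$. The example numbers (IQ-style $\mu=100,\ \sigma=15$) are arbitrary:

```python
import math

# Standard normal CDF via the error function: the code equivalent of a z-table
def phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 100, 15
a, b = 85, 130  # want P(85 <= X <= 130) for X ~ N(100, 15^2)

za, zb = (a - mu) / sigma, (b - mu) / sigma  # standardise
prob = phi(zb) - phi(za)

print(za, zb)          # -1.0 2.0
print(round(prob, 4))  # Phi(2) - Phi(-1), about 0.8186
```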

Log-Normal Distributions

If $X\sim N(\mu,\sigma^2)$, then the random variable $Y=\exp(X)$ is a log-normal random variable

Expectation

Variance

#finishexplaination

Joint Distributions

Up to now, we've mostly dealt with one random variable at a time, eg: X = height of a person. But in the real world, we often want to study two random variables together, eg:
X = height
Y = weight
And we want to know how these two relate, and that's essentially what joint distributions are about

So when we have two random variables, X and Y, instead of probabilities of single values of X, we have probabilities of pairs of values $(x,y)$

Discrete Case

If X and Y are discrete (like dice rolls, counts), then the joint PMF is:

$$f(x,y)=P(X=x\ \text{and}\ Y=y)$$

This would look like a table of probabilities, one for each combination of x and y, and all probabilities must be non-negative and sum to 1

Continuous case

If X and Y are continuous, then the joint PDF is $f(x,y)$ s.t.

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(x,y)\,dx\,dy=1$$

The condition above uses bivariate integration, which is just finding the volume under a surface, if we think of single-variable integration as finding the area under a curve.

Marginal Distributions

Each random variable has its own individual distribution, even when considered together. And these are called marginal distributions. This is because we get them by summing or integrating the joint distributions along the margins.

Discrete

$$f(x)=\sum_y f(x,y),\qquad f(y)=\sum_x f(x,y)$$

This basically means: to get the probability of X, add up the joint probabilities over every possible y

Continuous

$$f(x)=\int_{-\infty}^{\infty}f(x,y)\,dy,\qquad f(y)=\int_{-\infty}^{\infty}f(x,y)\,dx$$

Expectation of g(X,Y)

Now suppose we want to find the average (expected) value of something that depends on both X and Y; we have a generalised formula for expectation

$$E[g(X,Y)]=\begin{cases}\sum_x\sum_y g(x,y)f(x,y), & \text{discrete}\\ \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(x,y)f(x,y)\,dx\,dy, & \text{continuous}\end{cases}$$

This is the same as the single-variable version, but now in two dimensions

Covariance

Covariance tells us how two random variables move together

$$\mathrm{Cov}(X,Y)=E[(X-E(X))(Y-E(Y))]=E(XY)-E(X)E(Y)=E(XY)-\mu_X\mu_Y$$

The intuition behind this is essentially: if X and Y tend to be above (or below) their means at the same time, the product of deviations is usually positive, so the covariance is positive; if one tends to be above while the other is below, the covariance is negative

Correlation - normalised covariance

Correlation, then, is just a scaled version of covariance, so it's unit-free and lies between $-1$ and $1$, and it's given by:

$$\mathrm{Cor}(X,Y)=\frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}$$

I can’t be bothered enough to do a proof for this, so yeah.

So correlation just shows how strong and in which direction X and Y move

In functional terms, two random variables are independent if knowing one tells us nothing about the other:

$$f(x,y)=f(x)\times f(y)\quad\forall x,y$$

So if X and Y are independent, they have 0 correlation, but the converse is not true
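Covariance and correlation can be computed directly from a joint PMF table. The 2x2 joint distribution below is hypothetical, chosen so the two variables tend to move together:

```python
import math

# A small made-up joint PMF: mass concentrated on (0,0) and (1,1),
# so X and Y tend to match
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

e_x = sum(x * p for (x, y), p in joint.items())
e_y = sum(y * p for (x, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y  # E(XY) - E(X)E(Y)

var_x = sum((x - e_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - e_y) ** 2 * p for (x, y), p in joint.items())
corr = cov / math.sqrt(var_x * var_y)

print(round(cov, 4))   # 0.15: positive, so X and Y move together
print(round(corr, 4))  # 0.6
```

Note also that $f(0,0)=0.4$ while the marginals give $f(0)f(0)=0.25$, so these variables are not independent, consistent with the nonzero correlation.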

Sums of Random Variables

Independent Binomial Random Variables

If $X\sim B(m,p)$ and $Y\sim B(n,p)$ are independent, then

$$X+Y\sim B(n+m,p)$$

The proof is in notes, but otherwise this makes intuitive sense.

If $Y=X_1+X_2+\cdots+X_n$ with the $X_i$ independent, $E(X_i)=\mu_i$ and $\mathrm{Var}(X_i)=\sigma_i^2$

Then:

$$E(Y)=\mu_1+\mu_2+\cdots+\mu_n,\qquad \mathrm{Var}(Y)=\sigma_1^2+\sigma_2^2+\cdots+\sigma_n^2$$

This is actually a pretty neat way to derive the expectation and variance for the Binomial and the negative binomial distribution:

$$E(Y)=E\left(\sum_{i=1}^{n}X_i\right)=p+p+\cdots+p=np,\qquad \mathrm{Var}(Y)=\mathrm{Var}(X_1)+\cdots+\mathrm{Var}(X_n)=p(1-p)+\cdots+p(1-p)=np(1-p)$$

Central Limit Theorem