Probability, Statistics and Information Theory

Small Probability, Statistics & Information Theory cheatsheet.
Author

Theo POMIES

Published

September 2, 2025

Modified

October 25, 2025

Definitions & Formulas

Outcomes and Events

When studying probability, we perform experiments: random trials or observations. The set of all possible outcomes of such an experiment is \(\mathcal{\Omega}\) (or \(\mathcal{S}\)), e.g. when rolling a die, \(\mathcal{\Omega} = \{1,2,3,4,5,6\}\).

We can group these outcomes into events \(\mathcal{E} \subseteq \mathcal{\Omega}\), e.g. the event \(\mathcal{E} = \{\)die shows an even number\(\} = \{2, 4, 6\}\). Whenever the outcome \(z\) of the random experiment satisfies \(z \in \mathcal{E}\), the event \(\mathcal{E}\) has occurred. Multiple events can occur from the same outcome: say we have \(\mathcal{A} = \{3, 6\}\) (“the result is divisible by 3”) and \(\mathcal{B} = \{2, 4, 6\}\) (“the result is even”); the outcome \(z = 6\) satisfies both \(\mathcal{A}\) and \(\mathcal{B}\).

Probability function

The probability function \(\operatorname{P}\) maps events \(\mathcal{E} \subseteq \mathcal{\Omega}\) onto a real value in \([0, 1]\).

\(\operatorname{P}(\mathcal{E})\) is the probability associated with event \(\mathcal{E}\).

Properties

  • \(\operatorname{P}(\mathcal{E}) \geq 0\)
  • \(\operatorname{P}(\mathcal{\Omega}) = 1, \operatorname{P}(\emptyset) = 0\)
  • \(\operatorname{P}(\mathcal{A} \cup \mathcal{B}) = \operatorname{P}(\mathcal{A}) + \operatorname{P}(\mathcal{B}) - \operatorname{P}(\mathcal{A} \cap \mathcal{B})\)
  • \(\operatorname{P}(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} \operatorname{P}(\mathcal{A}_i), \quad \mathcal{A}_i \cap \mathcal{A}_j = \emptyset\: \text{for all}\: i \neq j\) (i.e. if all events \(\mathcal{A}_i\) are mutually exclusive)
  • \(\operatorname{P}(\mathcal{A} \cap \mathcal{B}) = \operatorname{P}(\mathcal{A} \mid \mathcal{B})\operatorname{P}(\mathcal{B})\)
  • \(\operatorname{P}(\mathcal{A} \cap \mathcal{B}) = \operatorname{P}(\mathcal{A})\operatorname{P}(\mathcal{B}) \iff \mathcal{A} \perp \mathcal{B}\) (e.g. the outcomes of two separate fair die rolls)
  • \(\mathcal{A} \perp \mathcal{B} \iff \operatorname{P}(\mathcal{A} \mid \mathcal{B}) = \operatorname{P}(\mathcal{A})\)
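A minimal Python sketch of these properties, reusing the fair-die events from above (the helper `prob` simply counts favourable outcomes, assuming a uniform probability on \(\mathcal{\Omega}\)):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}    # sample space of a fair die

def prob(event):
    """Uniform probability of an event (a subset of omega), kept as an exact fraction."""
    return Fraction(len(event & omega), len(omega))

A = {3, 6}       # "the result is divisible by 3"
B = {2, 4, 6}    # "the result is even"

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Conditional probability: P(A | B) = P(A ∩ B) / P(B)
p_A_given_B = prob(A & B) / prob(B)

# Independence: here P(A ∩ B) = 1/6 = P(A) P(B), so A ⊥ B for a fair die
assert prob(A & B) == prob(A) * prob(B)
assert p_A_given_B == prob(A)
```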

Random Variables

A random variable \(X\) is a measurable function (mapping) \(X \colon \mathcal{\Omega} \to \mathcal{E}\) from a sample space \(\mathcal{\Omega}\) (the set of possible outcomes) to a measurable space \(\mathcal{E}\).

The probability that \(X\) takes on a value in a measurable set \(\mathcal{S} \subseteq \mathcal{E}\) is written as \[ \operatorname{P}(X \in \mathcal{S}) = \operatorname{P}(\{\omega \in \mathcal{\Omega} \mid X(\omega) \in \mathcal{S}\}) \]

The probability that \(X\) takes a discrete value \(v\), denoted \(X = v\), is \(\operatorname{P}(X=v)\).

Expressions like \(X = v\) or \(X \geq v\) define events, i.e., subsets of \(\Omega\) whose probability can be measured.

Random variables allow us to go from outcomes to values, like \(X(\omega) = \omega\), the random variable that assigns to each die roll its face value (the identity function). This is also an example of a discrete random variable.

When \(X\) is continuous, events like \(X = v\) carry no probability mass (\(\operatorname{P}(X = v) = 0\)); instead we work with intervals \(v \leq X \leq w\) and probability densities. An example would be the height of a population. Probabilities are described via a probability density function \(p_X(x)\), with \[ \operatorname{P}(v \le X \le w) = \int_v^w p_X(x)\,dx \]
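As an illustration of working with densities (assuming `scipy` is available; the mean of 170 cm and standard deviation of 10 cm are made-up values for a height distribution):

```python
from scipy import stats
from scipy.integrate import quad

# Hypothetical height distribution: mean 170 cm, standard deviation 10 cm
X = stats.norm(loc=170, scale=10)

# P(160 <= X <= 180) via numerical integration of the density...
p_integral, _ = quad(X.pdf, 160, 180)

# ...equals the difference of CDF values
p_cdf = X.cdf(180) - X.cdf(160)

print(p_integral, p_cdf)   # both ≈ 0.6827
# P(X = 170) is zero for a continuous variable: the density value is not a probability
```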

We denote the probability distribution of \(X\) as \(\operatorname{P}(X)\) (strictly speaking \(P_X\), but we often write \(P(X)\) for convenience).

Note

When the measurable space \(\mathcal{E}\) is multi-dimensional, like \(\mathbb{R}^m\), we call the random variable \(\mathbf{X} \in \mathbb{R}^m\) a random vector.

Multiple Random Variables

\(\operatorname{P}(A = a, B = b)\) is the joint probability of \(A = a\) and \(B = b\) (the intersection of the events \(A = a\) and \(B = b\)); equivalently, it is \(\operatorname{P}(\{A = a\} \cap \{B = b\})\). With an overloaded notation, the joint probability distribution is written \(\operatorname{P}(A, B)\).

Obviously \[ \operatorname{P}(A = a, B = b) \leq \operatorname{P}(A=a) \quad \text{and} \quad \operatorname{P}(A = a, B = b) \leq \operatorname{P}(B=b) \]

Also, we can marginalize over \(B\): \[ \operatorname{P}(A = a) = \sum_v \operatorname{P}(A = a, B = v) \]

Because \(A = a\) and \(B = b\) are events, \[ \begin{aligned} \operatorname{P}(A = a, B = b) & = \operatorname{P}(A = a \mid B = b)\operatorname{P}(B = b) \\ \iff \operatorname{P}(A = a \mid B = b) & = \operatorname{P}(A = a, B = b)/\operatorname{P}(B = b) \end{aligned} \]
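A small sketch with a made-up joint table over two binary variables, showing marginalization and conditioning:

```python
# Hypothetical joint distribution P(A, B) over A ∈ {0, 1}, B ∈ {0, 1}
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}

# Marginalization: P(A = a) = sum over b of P(A = a, B = b)
p_A = {a: sum(p for (a_, b), p in joint.items() if a_ == a) for a in (0, 1)}
p_B = {b: sum(p for (a, b_), p in joint.items() if b_ == b) for b in (0, 1)}

# Conditioning: P(A = a | B = 1) = P(A = a, B = 1) / P(B = 1)
p_A_given_B1 = {a: joint[(a, 1)] / p_B[1] for a in (0, 1)}

print(p_A)            # {0: 0.4, 1: 0.6}
print(p_A_given_B1)   # {0: ≈0.4286, 1: ≈0.5714}
```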

Bayes’ Theorem

From the properties and definitions above, we can derive the following formula

\[ \overbrace{\operatorname{P}(A \mid B)}^{\text{posterior probability}} = \dfrac{\overbrace{\operatorname{P}(B \mid A)}^{\text{likelihood}}\overbrace{\operatorname{P}(A)}^{\text{prior}}}{\underbrace{\operatorname{P}(B)}_{\text{observation}}} \]

  • prior/hypothesis: our estimate or current belief about the probability of \(A\)
  • observation/marginal likelihood/evidence: the evidence or observations we’ve made regarding \(B\)
  • likelihood: a measure of how compatible our hypothesis is with our observation

A simplified version is \(\operatorname{P}(A \mid B) \propto \operatorname{P}(B \mid A)\operatorname{P}(A)\), since the denominator \(\operatorname{P}(B)\) does not depend on \(A\).
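A classic illustrative use of Bayes' theorem (all the numbers below are hypothetical): the probability of having a disease given a positive test result.

```python
# Hypothetical numbers: 1% prevalence, 95% sensitivity, 10% false-positive rate
p_D = 0.01                  # prior P(D)
p_pos_given_D = 0.95        # likelihood P(+ | D)
p_pos_given_not_D = 0.10    # P(+ | not D)

# Evidence by marginalization: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos = p_pos_given_D * p_D + p_pos_given_not_D * (1 - p_D)

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
p_D_given_pos = p_pos_given_D * p_D / p_pos
print(p_D_given_pos)   # ≈ 0.088: even a positive test leaves the posterior below 10%
```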

Expectation

The expectation (or expected value) is the weighted average of the values of \(X\).

Discrete case:

\[ \operatorname{E}[X] = \operatorname{E}_{X \sim P}[X] = \sum_x x\operatorname{P}(X=x) \]

Continuous case:

\[ \operatorname{E}[X] = \int_{-\infty}^{\infty} x p(x) \;dx \]

Following standard mathematical notation, we sometimes use \(\mu\) to denote this average.
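For example, for a fair six-sided die, \(\operatorname{E}[X] = \sum_{x=1}^{6} x \cdot \tfrac{1}{6} = 3.5\), a value the die never actually shows.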

Properties

  • Linearity: \(\operatorname{E}[\alpha A + B] = \alpha \operatorname{E}[A] + \operatorname{E}[B]\)
  • Equality: \(X = Y \; \text{a.s.} \implies \operatorname{E}[X] = \operatorname{E}[Y]\)
  • Constants: \(X = c \implies \operatorname{E}[X] = c\)
  • Tower property: \(\operatorname{E}[\operatorname{E}[X \mid Y]] = \operatorname{E}[X]\) (and in particular \(\operatorname{E}[\operatorname{E}[X]] = \operatorname{E}[X]\), since \(\operatorname{E}[X]\) is a constant)

Expectation of a Random Vector

For a vector-valued random variable, i.e. the random vector \(\mathbf{X} \in \mathbb{R}^n\), we have \(\mathbf{\mu} = \operatorname{E}_{\mathbf{X} \sim P}[\mathbf{X}]\) with \(\mu_i = \operatorname{E}[X_i]\): the expectation of \(\mathbf{X}\) is the vector of the expectations of its elements \(X_i\).

Variance

The variance is a measure of dispersion: it quantifies how much the values of \(X\) deviate from their expectation, on average. It is the expectation of the squared difference between the values and the expected value.

\[ \operatorname{Var}(X) = \operatorname{E}[(X - \operatorname{E}[X])^2] = \operatorname{E}[X^2] - (\operatorname{E}[X])^2 \]

Because, expanding the square and using the linearity of expectation,

\[ \operatorname{E}[(X - \operatorname{E}[X])^2] = \operatorname{E}[X^2 - 2X\operatorname{E}[X] + \operatorname{E}[X]^2] = \operatorname{E}[X^2] - 2(\operatorname{E}[X])^2 + (\operatorname{E}[X])^2 = \operatorname{E}[X^2] - (\operatorname{E}[X])^2 \]
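A quick numerical sanity check of the two equivalent forms, using numpy on simulated samples (the mean and scale below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)   # arbitrary samples

var_centered = np.mean((x - x.mean()) ** 2)        # E[(X - E[X])^2]
var_shortcut = np.mean(x ** 2) - x.mean() ** 2     # E[X^2] - (E[X])^2

print(var_centered, var_shortcut)   # both ≈ 9 (= 3^2), equal up to floating-point error
```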

Variance of a Random Vector

For a random vector \(\mathbf{X}\), we store the pairwise variances and covariances of its elements in a covariance matrix (aka. auto-covariance matrix or variance matrix), noted \(\mathbf{\Sigma}\), \(K_{\mathbf{X}\mathbf{X}}\) or \(\operatorname{Cov}_{\mathbf{X} \sim P}\), defined as

\[ \begin{aligned} \mathbf{\Sigma} = K_{\mathbf{X}\mathbf{X}} & = \operatorname{E}_{\mathbf{X} \sim P}[(\mathbf{X} - \mathbf{\mu})(\mathbf{X} - \mathbf{\mu})^\top] \\ & = \operatorname{E}[\mathbf{X}\mathbf{X}^\top] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}]^\top \end{aligned} \]

Note

Each entry \(\Sigma_{i, j} = \operatorname{Cov}(X_i, X_j)\) (see covariance below), and by definition the diagonal entries are \(\Sigma_{i, i} = \operatorname{Cov}(X_i, X_i) = \operatorname{Var}(X_i)\)

Note

We have the following property when applying a linear transformation represented by the appropriately dimensioned matrix \(\mathbf{A}\)

\[ \begin{aligned} \operatorname{Cov}(\mathbf{AX}, \mathbf{AX}) & = \operatorname{E}[(\mathbf{AX} - \operatorname{E}[\mathbf{AX}])(\mathbf{AX} - \operatorname{E}[\mathbf{AX}])^\top] \\ & = \operatorname{E}[\mathbf{AX}(\mathbf{AX})^\top] - \operatorname{E}[\mathbf{AX}]\operatorname{E}[(\mathbf{AX})^\top] \\ & = \operatorname{E}[\mathbf{AX}\mathbf{X}^\top\mathbf{A}^\top] - \operatorname{E}[\mathbf{AX}]\operatorname{E}[\mathbf{X}^\top\mathbf{A}^\top] \\ & = \mathbf{A}\operatorname{E}[\mathbf{X}\mathbf{X}^\top]\mathbf{A}^\top - \mathbf{A}\operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}^\top]\mathbf{A}^\top \\ & = \mathbf{A}(\operatorname{E}[\mathbf{X}\mathbf{X}^\top] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}^\top])\mathbf{A}^\top \\ & = \mathbf{A}\mathbf{\Sigma}\mathbf{A}^\top \end{aligned} \]

Each step follows from the linearity of expectation.
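A small numpy check of this identity, estimating \(\mathbf{\Sigma}\) empirically from simulated samples (the dimensions and the matrix \(\mathbf{A}\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 3))          # samples of a 3-dimensional random vector
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])           # arbitrary 2x3 linear map

sigma = np.cov(X, rowvar=False)            # empirical covariance of X (3x3)
cov_AX = np.cov(X @ A.T, rowvar=False)     # empirical covariance of AX (2x2)

print(np.allclose(cov_AX, A @ sigma @ A.T))   # True: the identity holds exactly for the empirical covariance too
```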

Standard deviation

Because the variance is expressed in squared units, we can take its square root to get the standard deviation, which has the benefit of being in the same unit as our random variable.

\[ \operatorname{Var}(X) = \sigma^2_X \iff \sigma_X = \sqrt{\operatorname{Var}(X)} \]

Covariance

Covariance is a measure of the joint variability of two random variables.

\[ \operatorname{Cov}(X, Y) = \operatorname{E}[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])] \]

Note

\(\operatorname{Cov}(X, X) = \operatorname{E}[(X - \operatorname{E}[X])^2] = \operatorname{Var}(X)\)

Covariance Matrix of two Random Vectors

For random vectors \(\mathbf{X} \in \mathbb{R}^m\), \(\mathbf{Y} \in \mathbb{R}^n\), the covariance matrix is the \(m \times n\) matrix \(K_{\mathbf{X}\mathbf{Y}}\) defined as

\[ \begin{aligned} K_{\mathbf{X}\mathbf{Y}} & = \operatorname{E}[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{Y} - \operatorname{E}[\mathbf{Y}])^\top] \\ & = \operatorname{E}[\mathbf{X}\mathbf{Y}^\top] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^\top \end{aligned} \]

We have \(\operatorname{Cov}(X_i, Y_j) = K_{X_iY_j} = \operatorname{E}[(X_i - \operatorname{E}[X_i])(Y_j - \operatorname{E}[Y_j])]\), found at index \((i, j)\) of \(K_{\mathbf{X}\mathbf{Y}}\).

If \(\mathbf{X} = \mathbf{Y}\), this is the auto-covariance matrix (or variance matrix) of the random vector \(\mathbf{X}\).

If \(\mathbf{X} \neq \mathbf{Y}\), this is the cross-covariance matrix of \(\mathbf{X}\) and \(\mathbf{Y}\).
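A sketch estimating \(K_{\mathbf{X}\mathbf{Y}}\) from simulated samples with the two equivalent forms above (the dimensions and the dependence between \(\mathbf{X}\) and \(\mathbf{Y}\) are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50_000, 3))                  # samples of X in R^3
Y = X[:, :2] + rng.normal(size=(50_000, 2))       # samples of Y in R^2, correlated with X

# K_XY = E[(X - E[X])(Y - E[Y])^T], estimated from samples (a 3x2 matrix)
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
K_centered = Xc.T @ Yc / len(X)

# Equivalent form: E[X Y^T] - E[X] E[Y]^T
K_shortcut = X.T @ Y / len(X) - np.outer(X.mean(axis=0), Y.mean(axis=0))

print(np.allclose(K_centered, K_shortcut))   # True
```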

Maximum Likelihood (and Negative Log-Likelihood)

Considering model parameters \(\boldsymbol{\theta}\) and data examples \(X\), the goal of Machine Learning is to find \[ \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}}\, P(\boldsymbol{\theta}\mid X) \]

By Bayes’ theorem

\[ \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}}\, P(\boldsymbol{\theta}\mid X) = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}}\, \dfrac{P(X \mid \boldsymbol{\theta})P(\boldsymbol{\theta})}{P(X)} \]

However, \(P(X)\) and \(P(\boldsymbol{\theta})\) can be dropped: \(P(X)\) does not depend on \(\boldsymbol{\theta}\), and since we have no prior information or “belief” about the best parameters \(\boldsymbol{\theta}\), \(P(\boldsymbol{\theta})\) is uninformative.

Hence, our best parameter estimation is the argument \(\boldsymbol{\theta}\) maximizing the likelihood (probability of seeing data \(X\) knowing parameters \(\boldsymbol{\theta}\)):

\[ \hat{\boldsymbol{\theta}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}}\, P(\boldsymbol{\theta}\mid X) = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta}}\, P(X \mid \boldsymbol{\theta}) \]

Note: Negative Log-Likelihood

In practice, we often use the Negative Log-Likelihood in ML. The log comes from numerical stability and from transforming products into sums: computing the likelihood of billions of data points means multiplying billions of small probabilities, which exceeds the precision of fp32 and underflows, whereas the log-likelihood is a sum that fits easily and precisely.

The negative part comes from loss/cost functions, which we want to minimize, not maximize. Because log is continuous and increasing, maximizing the likelihood is equivalent to maximizing the log-likelihood, and we turn it into a minimization problem by minimizing \(-\log P(X \mid \boldsymbol{\theta})\).
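A tiny illustration of why the log matters, using a made-up Bernoulli model and synthetic data: the raw likelihood of a million points underflows to zero, while the negative log-likelihood is a well-behaved sum.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.binomial(n=1, p=0.3, size=1_000_000)     # hypothetical binary observations
theta = 0.3                                       # candidate Bernoulli parameter

per_point = np.where(x == 1, theta, 1 - theta)    # P(x_i | theta) for each data point

likelihood = np.prod(per_point)                   # underflows to exactly 0.0
nll = -np.sum(np.log(per_point))                  # finite and numerically well-behaved (≈ 6.1e5)

print(likelihood, nll)
```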

Information

Information is approximately the amount of “surprise” of an event or statement. Information tells you something you didn’t already know. The more surprising it is, the more information it carries.

Self-Information

For an event \(X\), with probability \(p\), its self-information is defined as

\[ \operatorname{I}(X) = \log \dfrac{1}{p} = -\log p \]

Note

We use \(\log = \log_2\) when measuring information in bits (typical in computer science and digital systems)
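For example, a fair coin landing heads (\(p = \tfrac{1}{2}\)) carries \(-\log_2 \tfrac{1}{2} = 1\) bit of information, while an event with probability \(\tfrac{1}{8}\) carries \(-\log_2 \tfrac{1}{8} = 3\) bits: rarer events are more informative.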

Entropy

The entropy of a random variable \(X\) with probability mass function (p.m.f., discrete case) or density function (p.d.f., continuous case) \(p(x)\) is the expected amount of information contained in its outcomes:

\[ \operatorname{H}[X] = \operatorname{E}[\operatorname{I}(X)] = -\operatorname{E}[\log p(X)]. \]

In explicit form for the discrete case:

\[ \operatorname{H}(X) = -\sum_x p(x) \log p(x) \]

and for the continuous case (differential entropy):

\[ \operatorname{H}(X) = -\int p(x) \log p(x) \, dx. \]
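A quick sketch computing entropy in bits for a few simple distributions (a fair coin, a biased coin, and a fair die):

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit     (fair coin: maximal uncertainty)
print(entropy([0.9, 0.1]))    # ≈ 0.469 bits (biased coin: more predictable)
print(entropy([1/6] * 6))     # ≈ 2.585 bits (fair die = log2(6))
```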

Cross-Entropy

Cross-entropy is the expected information content of samples drawn from a true distribution \(P\), when they are evaluated under an assumed (model) distribution \(Q\):

\[ \operatorname{H}(P, Q) = -\sum_x p(x) \log q(x) \quad \text{or} \quad -\int p(x) \log q(x) \, dx. \]

Note

It measures how well \(Q\) approximates \(P\): lower cross-entropy indicates a better model fit.

If the model is perfect, \(Q = P\), then

\[ \operatorname{H}(P, Q) = \operatorname{H}(P, P) = \operatorname{H}(P), \]

and in general,

\[ \operatorname{H}(P, Q) \ge \operatorname{H}(P), \]

meaning that incorrect assumptions about \(P\) can only increase the expected information (i.e., the coding cost).

Note

\(\operatorname{H}(P, Q) \neq \operatorname{H}(Q, P)\)

Kullback-Leibler Divergence (KL-Divergence)

Cross-Entropy measures how well \(Q\) approximates \(P\), including the irreducible uncertainty (the entropy of \(P\)). KL-Divergence isolates the extra information incurred by using \(Q\) instead of \(P\), defined as

\[ \operatorname{D_{KL}}(P \| Q) = \operatorname{H}(P, Q) - \operatorname{H}(P), \]

making clear that KL-divergence is the cross-entropy minus the true entropy — the “extra surprise” from assuming \(Q\).

It is always non-negative and zero only when \(P = Q\).

\[ \operatorname{D_{KL}}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \quad \text{or} \quad \int p(x) \log \frac{p(x)}{q(x)} \, dx. \]

Note

Step by step, for each term of \(\operatorname{H}(P, Q) - \operatorname{H}(P)\): \[ \begin{aligned} p(x) \log \frac{1}{q(x)} - p(x) \log \frac{1}{p(x)} & = p(x) \left[\log \frac{1}{q(x)} - \log \frac{1}{p(x)}\right] \\ & = p(x) [\log p(x) - \log q(x)] \\ & = p(x) \log \frac{p(x)}{q(x)} \\ \end{aligned} \]
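A short sketch verifying this decomposition and the non-negativity of KL-divergence on two made-up discrete distributions:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]       # "true" distribution (made up)
Q = [0.4, 0.4, 0.2]         # model distribution (made up)

print(math.isclose(kl(P, Q), cross_entropy(P, Q) - entropy(P)))   # True: D_KL = H(P, Q) - H(P)
print(kl(P, Q) >= 0, math.isclose(kl(P, P), 0.0))                 # True True
```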

Proofs

Later!

Notation

  • \(\mathcal{X}\): a set
  • \(\{a, b, c\}\): a set, with its elements
  • \(\emptyset\): the empty set
  • \(\mathcal{A} \subset \mathcal{B}\), \(\mathcal{A} \subsetneq \mathcal{B}\): \(\mathcal{A}\) is a proper/strict subset of \(\mathcal{B}\)
  • \(\mathcal{A} \subseteq \mathcal{B}\): \(\mathcal{A}\) is a subset of \(\mathcal{B}\)
  • \(\mathcal{A} \cap \mathcal{B}\): the intersection of sets \(\mathcal{A}\) and \(\mathcal{B}\) — “\(\mathcal{A}\) and \(\mathcal{B}\)”
  • \(\mathcal{A} \cup \mathcal{B}\): the union of sets \(\mathcal{A}\) and \(\mathcal{B}\) — “\(\mathcal{A}\) or \(\mathcal{B}\)”
  • \(\mathcal{A} \setminus \mathcal{B}\): set subtraction of \(\mathcal{B}\) from \(\mathcal{A}\), elements from \(\mathcal{A}\) but not in \(\mathcal{B}\)
  • \(\mathcal{S}\), \(\mathcal{\Omega}\): the sample space / universe (the set of all possible outcomes)
  • \(|\mathcal{X}|\): the cardinality of set \(\mathcal{X}\) (its number of elements)
  • \(X\): a random variable
  • \(\mathbf{X}\): a random vector
  • \(P\): a probability distribution
  • \(X \sim P\): the random variable \(X\) follows the probability distribution \(P\)
  • \(a \propto b\): \(a\) is proportional to \(b\), e.g. \(a = kb\) for some constant \(k\)
  • \(\operatorname{P}(\cdot)\): the probability function, maps events to their probability and random variables to their probability distributions
  • \(\operatorname{P}(X)\): depending on the context, a probability distribution or the probability of any \(X=x\), meaning the formula is true for any value
  • \(\operatorname{P}(X=x)\): the probability assigned to the event where random variable \(X\) takes value \(x\)
  • \(\operatorname{P}(X \mid Y)\): the conditional probability distribution of \(X\) given \(Y\)
  • \(\operatorname{p}(\cdot)\): a probability density function (PDF) associated with distribution \(P\)
  • \(\operatorname{E}[X]\): expectation of a random variable \(X\)
  • \(X \perp Y\): random variables \(X\) and \(Y\) are independent
  • \(X \perp Y \mid Z\): random variables \(X\) and \(Y\) are conditionally independent given \(Z\)
  • \(\sigma_X\): standard deviation of random variable \(X\)
  • \(\operatorname{Var}(X)\): variance of random variable \(X\), equal to \(\sigma^2_X\)
  • \(\operatorname{Cov}(X, Y)\): covariance of random variables \(X\) and \(Y\)
  • \(\operatorname{\rho}(X, Y)\): the Pearson correlation coefficient between \(X\) and \(Y\), equals \(\frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}\)
  • \(\operatorname{I}(X)\): self-information of event \(X\)
  • \(\operatorname{H}(X)\): entropy of random variable \(X\)
  • \(D_{\operatorname{KL}}(P\|Q)\): the KL-divergence (or relative entropy) from distribution \(Q\) to distribution \(P\)