```python
import torch

u, v = torch.arange(3), torch.arange(3)  # both equal torch.tensor([0, 1, 2])
torch.dot(u, v)  # equivalently, u @ v
```
```
tensor(5)
```
Theo POMIES
August 29, 2025
September 12, 2025
A vector space (over a field, like \(\mathbb{R}\), which provides the scalars) is a set of vectors equipped with vector addition and scalar multiplication satisfying the usual axioms: addition is associative and commutative, there is a zero vector and additive inverses, scalar multiplication has an identity and is compatible with field multiplication, and scalar multiplication distributes over both vector addition and scalar addition.
This might be useful later, but for all intents and purposes in these notes, the field will be \(\mathbb{R}\), vectors will be in \(\mathbb{R}^n\), and addition and scalar multiplication will be element-wise.
\[ (\mathbf{A}^\top)_{i,j} = \mathbf{A}_{j,i} \]
\[ \mathbf{u} \cdot \mathbf{v} = \sum_{i} u_i v_i \]
The dot product has a geometric meaning.
\[\mathbf{v}\cdot\mathbf{w} = \|\mathbf{v}\|\|\mathbf{w}\|\cos{\theta}\] \[\iff \theta = \arccos\left(\frac{\mathbf{v}\cdot\mathbf{w}}{\|\mathbf{v}\|\|\mathbf{w}\|}\right).\]
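As a quick numeric check (the two vectors below are arbitrary picks, not taken from anything above), the angle can be recovered directly from this identity:

```python
import torch

v = torch.tensor([1.0, 0.0])
w = torch.tensor([1.0, 1.0])

# cos(theta) = (v . w) / (||v|| ||w||)
cos_theta = torch.dot(v, w) / (torch.linalg.norm(v) * torch.linalg.norm(w))
theta = torch.arccos(cos_theta)
print(theta, torch.rad2deg(theta))  # ~0.7854 rad, i.e. ~45 degrees
```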
In linear algebra, vectors are column vectors by default, so \[ \mathbf{u} \in \mathbb{R}^{n \times 1}, \quad \mathbf{u} = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix} \]
\[ (\mathbf{A}\mathbf{u})_{i} = \mathbf{a}_{i,:} \cdot \mathbf{u} = \sum_{j} a_{i,j} u_j \]
When we say “we apply a linear transformation \(\mathbf{A}\) to \(\mathbf{v}\)”, we mean \(\mathbf{A}\mathbf{v}\).
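A small sketch of this row-wise view (the matrix and vector are arbitrary illustrative values): each entry of \(\mathbf{A}\mathbf{u}\) is the dot product of a row of \(\mathbf{A}\) with \(\mathbf{u}\).

```python
import torch

A = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
u = torch.tensor([1, 0, 2])

Au = A @ u                          # torch.mv(A, u) also works
# (A u)_i = a_{i,:} . u, computed row by row
rows = torch.stack([torch.dot(A[i], u) for i in range(A.shape[0])])
print(Au, torch.equal(Au, rows))    # tensor([ 4, 13]) True
```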
\[ (\mathbf{A}\mathbf{B})_{i,j} = \mathbf{a}_{i,:} \cdot \mathbf{b}_{:, j} = \sum_{k} a_{i,k} b_{k,j} \]
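A snippet along these lines reproduces the output shown below (assuming the inputs were built with torch.arange, which matches the printed values):

```python
import torch

A = torch.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]
B = torch.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
A @ B                               # torch.matmul(A, B)
```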
```
tensor([[ 3,  4,  5],
        [ 9, 14, 19],
        [15, 24, 33]])
```
\[ \mathbf{u} \otimes \mathbf{v} = \begin{bmatrix} u_1v_1 & u_1v_2 & \dots & u_1v_n \\ u_2v_1 & u_2v_2 & \dots & u_2v_n \\ \vdots & \vdots & \ddots & \vdots \\ u_mv_1 & u_mv_2 & \dots & u_mv_n \\ \end{bmatrix} = \mathbf{u}\mathbf{v}^\top\]
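The result below matches torch.outer for u = v = torch.arange(3), compared against the explicit \(\mathbf{u}\mathbf{v}^\top\) product; a snippet along these lines reproduces it:

```python
import torch

u, v = torch.arange(3), torch.arange(3)
outer = torch.outer(u, v)
# compare with u v^T, written as an (n, 1) @ (1, n) product
outer, (outer == u.unsqueeze(1) @ v.unsqueeze(0)).all()
```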
```
(tensor([[0, 0, 0],
         [0, 1, 2],
         [0, 2, 4]]),
 tensor(True))
```
The outer product of two column vectors is the matrix whose entries are all products of an element in the first vector with an element in the second vector.
\[ \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I},\quad \mathbf{A} \in \mathbb{R}^{n \times n} \]
The inverse might not exist: if \(\det(\mathbf{A}) = 0\), then \(\mathbf{A}\) is not invertible (it is singular).
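A small numeric illustration (the matrices are arbitrary picks): torch.linalg.inv returns the inverse when it exists, and a singular matrix (zero determinant) has none.

```python
import torch

A = torch.tensor([[2., 1.],
                  [1., 3.]])
print(torch.linalg.det(A))      # tensor(5.)
print(torch.linalg.inv(A) @ A)  # ~ identity, up to float error

B = torch.tensor([[1., 2.],
                  [2., 4.]])    # second row = 2 * first row
print(torch.linalg.det(B))      # tensor(0.) -> torch.linalg.inv(B) would raise an error
```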
I think of the determinant of \(\mathbf{A}\) as the (signed) volume-scaling factor of the linear transformation represented by \(\mathbf{A}\).
This explains why a matrix whose determinant is 0 is not invertible: it “collapses” space onto a lower dimension, so two different inputs \(\mathbf{x}\) can be mapped to the same image, and the transformation cannot be undone.
\[ \det(\mathbf{A}^\top) = \det(\mathbf{A}) \] \[ \det(\mathbf{A}\mathbf{B}) = \det(\mathbf{A})\det(\mathbf{B}) \] \[ \det(\mathbf{A}^{-1}) = \dfrac{1}{\det(\mathbf{A})}, \quad \text{when } \det(\mathbf{A}) \neq 0 \]
\[ \mathbf{A} = \operatorname{diag}(a_1, \dots, a_n), \quad \det(\mathbf{A}) = \prod_{i=1}^n a_i \]
\[ \det(\mathbf{T}) = \prod_{i} t_{i,i}, \quad \text{where $\mathbf{T}$ is triangular} \]
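Quick check of the triangular case, with an arbitrary upper triangular matrix:

```python
import torch

T = torch.tensor([[2., 5., 1.],
                  [0., 3., 4.],
                  [0., 0., .5]])   # upper triangular
print(torch.linalg.det(T))         # tensor(3.)
print(torch.diagonal(T).prod())    # 2 * 3 * 0.5 = 3, same value
```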
\[ \mathbf{A} \, \textnormal{is orthogonal} \iff \mathbf{A}^\top\mathbf{A} = \mathbf{I}\: (= \mathbf{A}\mathbf{A}^\top) \]
Interestingly,
\[ \mathbf{A} \, \textnormal{is orthogonal} \iff \mathbf{A}^{-1} = \mathbf{A}^\top \]
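For instance, a 2D rotation matrix is orthogonal; a quick check (the angle is an arbitrary value):

```python
import math
import torch

theta = 0.3
Q = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])   # 2D rotation matrix
print(Q.T @ Q)                                            # ~ identity
print(torch.allclose(torch.linalg.inv(Q), Q.T))           # True: Q^-1 == Q^T
```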
An elementary matrix is a matrix obtained from the application of a single elementary row operation to the identity matrix \(\mathbf{I}\).
Left-multiplying another matrix by an elementary matrix applies the same row operation to that matrix: \(\mathbf{E}\mathbf{M}\) performs on the rows of \(\mathbf{M}\) the operation that was performed on \(\mathbf{I}\) to build \(\mathbf{E}\).
Multiplying row \(i\) by \(m\)
\[\mathbf{D}_i(m) = \operatorname{diag}(1, \dots, 1, d_i = m, 1, \dots, 1)\]
Adding \(m\) times row \(j\) to row \(i\)
\[\mathbf{L}_{ij}(m) = \begin{bmatrix} 1 &&&&&& \\ & \ddots &&&&& \\ && 1 &&&& \\ &&& \ddots &&& \\ && l_{i,j} = m && 1 && \\ &&&&& \ddots & \\ &&&&&& 1 \end{bmatrix}\]
Swapping rows \(i\) and \(j\)
\[\mathbf{T}_{ij} = \begin{bmatrix} 1 &&&&&& \\ & \ddots &&&&& \\ && 0 && t_{j,i} = 1 && \\ &&& \ddots &&& \\ && t_{i,j} = 1 && 0 && \\ &&&&& \ddots & \\ &&&&&& 1 \end{bmatrix}\]
\(^{(*)}\) Consider that swapping two rows can be expressed as a sequence of row additions/subtractions and one row negation: \[\begin{aligned} \mathbf{T}_{ij} & = \mathbf{D}_{i}(-1)\mathbf{L}_{ij}(-1)\mathbf{L}_{ji}(1)\mathbf{L}_{ij}(-1) \\ \implies \operatorname{det}(\mathbf{T}_{ij}) & = \operatorname{det}(\mathbf{D}_{i}(-1)) \cdot \operatorname{det}(\mathbf{L}_{ij}(-1)) \cdot \operatorname{det}(\mathbf{L}_{ji}(1)) \cdot \operatorname{det}(\mathbf{L}_{ij}(-1)) \\ & = (-1) \cdot 1 \cdot 1 \cdot 1 = -1 \end{aligned}\]
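A numeric verification of this decomposition (the helpers `D` and `L` below are my own small constructors mirroring \(\mathbf{D}_i(m)\) and \(\mathbf{L}_{ij}(m)\); the size and indices are arbitrary):

```python
import torch

n, i, j = 4, 1, 3                      # swap rows 1 and 3 of a 4x4 matrix (0-indexed)

def D(i, m):
    """Elementary matrix multiplying row i by m."""
    E = torch.eye(n)
    E[i, i] = m
    return E

def L(i, j, m):
    """Elementary matrix adding m times row j to row i."""
    E = torch.eye(n)
    E[i, j] = m
    return E

T = torch.eye(n)
T[[i, j]] = T[[j, i]]                  # the swap (permutation) matrix T_ij

rhs = D(i, -1) @ L(i, j, -1) @ L(j, i, 1) @ L(i, j, -1)
print(torch.equal(T, rhs))             # True
print(torch.linalg.det(T))             # -1.0
```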
We say that \(n\) vectors \(\mathbf{v}_1, \dots, \mathbf{v}_n\) are linearly dependent if there exist scalars \(a_1, \dots, a_n\), not all equal to 0, satisfying \[\sum_{i=1}^n a_i\mathbf{v}_i = \mathbf{0}\]
If the only solution is \(a_i = 0\) for all \(i\) in \(1, \dots, n\), they are linearly independent.
Linear dependence means that some vectors can be expressed as a weighted sum (linear combination) of the others, and therefore carry redundant information.
The rank of a matrix is the maximal number of linearly independent columns. Equivalently, it is the dimension of the column space.
This reflects that if some columns are linearly dependent, they are redundant: they do not contribute a new direction. Thus, the matrix can be expressed with fewer independent columns, and the image (output space) of the matrix has correspondingly fewer dimensions.
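For example (an arbitrary matrix), a redundant column drops the rank below the number of columns:

```python
import torch

A = torch.tensor([[1., 2., 3.],
                  [0., 1., 1.],
                  [1., 0., 1.]])        # third column = first + second -> redundant
print(torch.linalg.matrix_rank(A))      # tensor(2)
```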
A basis (not “the” basis, since a vector space can have many different bases) of a vector space is a linearly independent subset that spans the space.
All bases of the same vector space have the same size, and that common size is the dimension of the space.
That means: a basis is a set of vectors that you can combine (with weighted sums) to form any vector in the space — and none of them are redundant.
The dimension of a vector space is the number of vectors in a basis of that space.
Since every basis of a vector space has the same size, the dimension is well-defined. It tells you “how many independent directions” the space has.
An eigenvector is a vector whose direction (i.e., the line it spans) is unchanged by a linear transformation (represented by a matrix). That means that when applying the linear transformation \(\mathbf{A}\) to an eigenvector \(\mathbf{v}\), it simply gets scaled by a constant scalar factor, the eigenvalue \(\lambda\). This is captured by the eigenvalue equation. \[\begin{aligned} \mathbf{A}\mathbf{v} & = \lambda\mathbf{v} \\ \iff (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} & = \mathbf{0} \label{eq:1} \end{aligned} \tag{1}\]
Equation 1 has a non-zero solution \(\mathbf{v}\) only if \(\operatorname{det}(\mathbf{A} - \lambda\mathbf{I}) = 0\), called the characteristic equation of \(\mathbf{A}\). The values of \(\lambda\) that satisfy this equation are the eigenvalues of \(\mathbf{A}\).
The eigenvectors \(\mathbf{v}_{\lambda}\) corresponding to each eigenvalue can be found by plugging the values of \(\lambda\) into equation 1.
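As a concrete illustration (a small symmetric matrix picked by hand), torch.linalg.eig returns eigenvalues and eigenvectors, and we can check the eigenvalue equation directly:

```python
import torch

A = torch.tensor([[2., 1.],
                  [1., 2.]])
eigvals, eigvecs = torch.linalg.eig(A)      # complex dtype, even when the results are real
print(eigvals)                              # eigenvalues 3 and 1 (order may vary)

v, lam = eigvecs[:, 0], eigvals[0]
print(torch.allclose(A.to(v.dtype) @ v, lam * v))   # True: A v = lambda v
```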
By definition of eigenvectors, if a square matrix \(\mathbf{A}\) has a full set of linearly independent eigenvectors (i.e., it is diagonalizable), we have
\[ \mathbf{A}\mathbf{V} = \mathbf{V}\mathbf{\Lambda} \]
where \(\mathbf{V}\) is the matrix whose columns are the eigenvectors of \(\mathbf{A}\), and \(\mathbf{\Lambda}\) is the diagonal matrix of the associated eigenvalues.
Because \(\mathbf{V}\) is invertible, we can right-multiply both sides by \(\mathbf{V}^{-1}\):
\[ \mathbf{A} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^{-1} \]
\[ \mathbf{A}^n = \overbrace{\mathbf{A}\dots\mathbf{A}}^{\text{n times}} = \overbrace{(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{-1})\dots(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{-1})}^{\text{n times}} = \mathbf{V}\overbrace{\mathbf{\Lambda}\dots\mathbf{\Lambda}}^{\text{n times}}\mathbf{V}^{-1} = \mathbf{V}\mathbf{\Lambda}^n\mathbf{V}^{-1}\]
For negative powers: \[\mathbf{A}^{-1} = \mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{V}^{-1}\]
as long as all diagonal entries of \(\mathbf{\Lambda}\) (i.e., the eigenvalues \(\lambda_i\)) are non-zero. Since \[\begin{aligned} \det(\mathbf{A}) & = \det(\mathbf{V})\cdot\det(\mathbf{\Lambda})\cdot\det(\mathbf{V}^{-1}) \\ & = \det(\mathbf{V})\cdot\det(\mathbf{\Lambda})\cdot\dfrac{1}{\det(\mathbf{V})} \\ & = \det(\mathbf{\Lambda}) = \prod_{i=1}^n \lambda_i \end{aligned}\]
(see determinant identities and determinant of diagonal matrix)
This reflects that a matrix is not invertible when its determinant is 0, which happens here exactly when some eigenvalue is 0.
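A check of the eigendecomposition and its consequences (reusing the arbitrary symmetric matrix from the eigenvalue example above, which is diagonalizable):

```python
import torch

A = torch.tensor([[2., 1.],
                  [1., 2.]])
lam, V = torch.linalg.eig(A)
Lam = torch.diag(lam)
A_c = A.to(lam.dtype)                    # complex view of A, to compare like with like

# A = V Lam V^-1
print(torch.allclose(A_c, V @ Lam @ torch.linalg.inv(V), atol=1e-5))                 # True
# A^3 = V Lam^3 V^-1
print(torch.allclose(torch.linalg.matrix_power(A_c, 3),
                     V @ torch.linalg.matrix_power(Lam, 3) @ torch.linalg.inv(V),
                     atol=1e-4))                                                     # True
# det(A) = product of eigenvalues
print(torch.allclose(torch.linalg.det(A), lam.prod().real))                          # True
```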
Later!
Gaussian elimination, or row reduction, is an algorithm for solving systems of linear equations, based on the following elementary row operations:
todo
Computing the determinant of a matrix is not trivial at first glance.
We know how to easily compute the determinants of diagonal, triangular, and elementary matrices (see above).
Knowing that, we want a representation of our original matrix \(\mathbf{A}\) that involves an upper triangular matrix \(\mathbf{U}\) and one or more other matrices whose determinants are known or trivial to compute, as \(\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}\).
To go from \(\mathbf{A}\) to \(\mathbf{U}\) we use Gaussian elimination: \(\mathbf{P}\) is a permutation matrix tracking our row swaps, and \(\mathbf{L}\) is a lower triangular matrix (with ones on the diagonal) tracking our row additions.
Now, because \(\mathbf{P}\) is a permutation matrix, and permutation matrices are orthogonal, we have \[ \mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U} \implies \mathbf{A} = \mathbf{P}^{-1}\mathbf{L}\mathbf{U} = \mathbf{P}^\top\mathbf{L}\mathbf{U} \]
Finally, this means that \[ \det(\mathbf{A}) = \det(\mathbf{P}^\top) \cdot \det(\mathbf{L}) \cdot \det(\mathbf{U}) = \det(\mathbf{P}) \cdot \det(\mathbf{L}) \cdot \det(\mathbf{U}) \]
Now, if we just keep track of the row swaps, we can easily compute \(\det(\mathbf{A})\): \(\det(\mathbf{P}) = \pm 1\) depending on the parity of the number of swaps, \(\det(\mathbf{L}) = 1\) (ones on the diagonal), and \(\det(\mathbf{U})\) is the product of its diagonal entries.
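A sketch with torch.linalg.lu (note that torch returns the factorization as \(\mathbf{A} = \mathbf{P}\mathbf{L}\mathbf{U}\), so its \(\mathbf{P}\) plays the role of \(\mathbf{P}^\top\) above; the matrix is an arbitrary example):

```python
import torch

A = torch.tensor([[0., 2., 1.],
                  [1., 1., 0.],
                  [3., 0., 2.]])

P, L, U = torch.linalg.lu(A)                    # A = P L U
# det(P) = +-1 (parity of the row swaps), det(L) = 1 (unit diagonal),
# det(U) = product of its diagonal entries
det_A = torch.linalg.det(P) * torch.diagonal(U).prod()
print(det_A, torch.linalg.det(A))               # both ~ -7.0
```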