Linear Algebra for AI: Path 3 – Theory Connector

Rigorous linear algebra for researchers with strong math backgrounds transitioning into machine learning.

Meet Carmen: You have a postgrad degree in physics, economics, neuroscience, chemistry, or another quantitative field. You're fluent in multivariable calculus, differential equations, probability, and statistics. You've seen linear algebra, but now you need the ML-specific perspective. You want rigor—understanding assumptions, proofs, and when methods fail. You're planning to publish papers using ML or transition into ML research.

Core Philosophy: Fast-track through basics, deep dive into ML theory, heavy emphasis on connections to your field, focus on mathematical rigor and research applications.


Week 1: Rapid Comprehensive Review

You need a quick refresh, not a ground-up build.

🎥 3Blue1Brown: Essence of Linear Algebra (Selected Videos)

Watch the core videos for geometric intuition; you already know the algebra.

Why: Refresh the geometric perspective you might have missed in your formal training.


🎓 Strang 18.06: Selective High-Level Review

Rapid watch (2x speed, pause only if unclear):

Lecture 10: The Four Fundamental Subspaces

  • The big picture you need for ML

Lectures 14-16: Orthogonality and Projections

  • Critical for ML: orthogonal projection is the foundation of least-squares regression

Lecture 21: Eigenvalues and Eigenvectors

  • You know this, but watch for Strang's perspective

Lecture 25: Symmetric Matrices and Positive Definiteness

  • Positive definite matrices: Hessians, covariance matrices

Lecture 29: Singular Value Decomposition

  • SVD: The most important decomposition in ML

📖 Textbook Speed-Read

Linear Algebra and Learning from Data - Chapter 1

Read Chapter 1 in one sitting (~1 hour). Focus on:

  • Section I.1: Multiplication Ax using columns of A
  • Section I.2: Matrix-matrix multiplication AB
  • Section I.3: The four fundamental subspaces
  • Section I.6: Eigenvalues and eigenvectors
  • Section I.7: Singular value decomposition

Approach: Skim proofs (you can reconstruct them), focus on ML applications and insights.
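
As a quick sanity check on the Section I.1 picture, here is a minimal NumPy sketch (the matrix and vector are arbitrary examples) confirming that Ax is the linear combination of A's columns weighted by the entries of x:

```python
import numpy as np

# Arbitrary example matrix and vector (illustrative only)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, -1.0])

# Ax = x[0] * (column 1 of A) + x[1] * (column 2 of A)
column_combination = x[0] * A[:, 0] + x[1] * A[:, 1]
assert np.allclose(A @ x, column_combination)
```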


Week 2-3: ML-Specific Linear Algebra Theory

Now dive into the ML-focused course with theoretical rigor.

🎓 Strang 18.065: Theoretical Foundations (Lectures 1-7)

Lecture 1: The Column Space of A Contains All Vectors Ax

  • Model capacity and representational power
  • Connection: Universal approximation vs. sample complexity

Lecture 2: Multiplying and Factoring Matrices

  • Computational complexity of factorizations
  • When to use LU vs. QR vs. Cholesky


Lecture 3: Orthonormal Columns in Q Give Q'Q = I

  • Gram-Schmidt and its numerical instability
  • Modified Gram-Schmidt, Householder reflections
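
The instability in the bullets above is easy to demonstrate. Below is a minimal sketch comparing classical Gram-Schmidt against NumPy's Householder-based QR on a matrix with a nearly dependent column; classical_gram_schmidt is an illustrative helper, not a production routine:

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt: projects each original column off the previous q's."""
    Q = np.zeros_like(A, dtype=float)
    for j in range(A.shape[1]):
        v = A[:, j].astype(float)
        for i in range(j):
            v = v - (Q[:, i] @ A[:, j]) * Q[:, i]
        Q[:, j] = v / np.linalg.norm(v)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
A[:, -1] = A[:, 0] + 1e-9 * rng.standard_normal(50)   # nearly dependent column

Q_cgs = classical_gram_schmidt(A)
Q_house, _ = np.linalg.qr(A)                          # Householder-based, backward stable

# Loss of orthogonality: ||Q^T Q - I|| is typically orders of magnitude worse for CGS
print(np.linalg.norm(Q_cgs.T @ Q_cgs - np.eye(10)),
      np.linalg.norm(Q_house.T @ Q_house - np.eye(10)))
```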

Lecture 4: Eigenvalues and Eigenvectors

  • Spectral theorem for symmetric matrices
  • Diagonalization and its limitations

Lecture 5: Positive Definite and Semidefinite Matrices

  • Tests for positive definiteness
  • Connection to convex optimization

Lecture 6: Singular Value Decomposition (SVD)

  • Critical: Full mathematical development
  • Geometric interpretation: rotation-scaling-rotation
  • Computing SVD: Golub-Reinsch algorithm
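
A two-minute check of the rotation-scaling-rotation picture (arbitrary example matrix): U and V are orthogonal, so they act as rotations or reflections, and Σ is a pure axis scaling.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                    # arbitrary example
U, S, Vt = np.linalg.svd(A)

assert np.allclose(U @ U.T, np.eye(2))        # U is orthogonal (rotation/reflection)
assert np.allclose(Vt @ Vt.T, np.eye(2))      # V is orthogonal
assert np.allclose(A, U @ np.diag(S) @ Vt)    # A = (rotate) (scale) (rotate)
```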

Lecture 7: Eckart-Young: The Closest Rank k Matrix to A

  • Best low-rank approximation
  • Foundation for PCA and dimensionality reduction

📖 Textbook Deep Dive

Linear Algebra and Learning from Data - Chapters 2-3

Chapter 2: Large Matrices

  • Focus on computational complexity
  • Sparse matrices and iterative methods
  • Randomized algorithms

Chapter 3: Low Rank Approximation

  • Eckart-Young theorem (full proof)
  • Nuclear norm minimization
  • Matrix completion theory

Read actively: Work through all proofs. Pause and complete them before reading Strang's version.


💻 Theoretical Implementation Exercise

Project: Prove and Implement Eckart-Young Theorem

  1. Prove (on paper): Truncated SVD minimizes Frobenius norm error
  2. Implement: Numerical verification

```python
import numpy as np

def frobenius_norm(M):
    return np.linalg.norm(M, 'fro')

def random_rank_k_matrix(shape, k):
    m, n = shape
    return np.random.randn(m, k) @ np.random.randn(k, n)

def verify_eckart_young(A, k, trials=1000):
    # Optimal rank-k approximation via the truncated SVD
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

    # No random rank-k matrix should beat the truncated SVD in Frobenius norm
    for _ in range(trials):
        B_k = random_rank_k_matrix(A.shape, k)
        assert frobenius_norm(A - A_k) <= frobenius_norm(A - B_k)
    return A_k
```

  3. Analyze: How does error decay with k? Plot singular value spectrum.
  4. Application: Matrix completion on MovieLens data

Week 3-4: Optimization Theory for ML

You need rigorous understanding of how learning algorithms work.

🎓 Strang 18.065: Optimization Sequence (Lectures 21-25)

Lecture 21: Minimizing a Function Step by Step

  • Introduction to iterative optimization methods
  • Newton's method and its variants
  • Trade-offs between convergence speed and computation

Lecture 22: Gradient Descent: Downhill to a Minimum

  • Gradient descent on convex functions: convergence guarantees
  • Lipschitz continuity and strong convexity
  • Convergence rates: O(1/k) vs. O(e^(-k))
  • After watching:
    • Prove convergence for strongly convex functions (standard result, good exercise)
    • Connect to your field: If you're from physics, link to minimizing energy functionals; if econ, to optimization in markets
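
A minimal numerical companion to Lecture 22, assuming a hand-picked 2×2 positive definite quadratic and the classical step size 2/(μ + L); it shows the linear convergence rate that the condition number controls:

```python
import numpy as np

def gradient_descent(Q, b, lr, steps=200):
    """Minimize f(x) = 0.5 x^T Q x - b^T x for symmetric positive definite Q."""
    x = np.zeros_like(b)
    x_star = np.linalg.solve(Q, b)            # exact minimizer for reference
    errors = []
    for _ in range(steps):
        x = x - lr * (Q @ x - b)              # gradient step
        errors.append(np.linalg.norm(x - x_star))
    return x, np.array(errors)

Q = np.diag([1.0, 10.0])                      # eigenvalues mu = 1, L = 10, condition number 10
b = np.array([1.0, 1.0])
x, errors = gradient_descent(Q, b, lr=2.0 / (1.0 + 10.0))
# errors contract by roughly (L - mu)/(L + mu) = 9/11 per step: linear convergence
```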

Lecture 23: Accelerating Gradient Descent (Use Momentum)

  • Momentum: Heavy ball method from mechanics
  • Nesterov acceleration: Optimal first-order method
  • Your physics/applied math background is an advantage here

Lecture 25: Stochastic Gradient Descent

  • Mini-batch sampling and variance reduction
  • Convergence in expectation
  • Why SGD can escape poor local minima (gradient noise acts as exploration)
  • Learning rate schedules
  • Adaptive methods: AdaGrad, RMSprop, Adam
  • Research connection:
    • Statistical mechanics interpretation (Langevin dynamics)
    • Compare to Monte Carlo methods from your field
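
A minimal mini-batch SGD sketch on least squares (synthetic data, a 1/√t step-size decay; all choices here are illustrative) showing the unbiased mini-batch gradient estimate in action:

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.5, batch=32, epochs=50, seed=0):
    """Mini-batch SGD on the average squared error (1/2n) ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            t += 1
            # Mini-batch gradient: an unbiased estimate of the full gradient
            g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= (lr / np.sqrt(t)) * g        # decaying step size for convergence in expectation
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.standard_normal(1000)
w_hat = sgd_least_squares(X, y)               # should approach w_true
```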

📖 Reading: Convex Optimization

Supplementary Text: Boyd & Vandenberghe, "Convex Optimization" (free PDF available from the authors' website).

Read selected sections:

  • Chapter 2: Convex sets
  • Chapter 3: Convex functions
  • Chapter 9: Unconstrained minimization (focus on gradient descent)
  • Chapter 10: Equality constrained minimization (Lagrange multipliers)

Why: Many core ML objectives are convex (linear regression, logistic regression, SVMs). Understanding convex optimization deeply will make you a better researcher.


💻 Research-Level Implementation

Project: Optimization Algorithm Comparison

Implement from scratch and rigorously analyze:

  1. Gradient Descent (full batch)
  2. SGD with constant and decaying learning rates
  3. Momentum (heavy ball)
  4. Nesterov Accelerated Gradient
  5. Adam

Test on:

  • Quadratic functions (analyze eigenvalue spectrum effects)
  • Logistic regression (convex but not quadratic)
  • Simple neural network (non-convex)

Analysis Requirements:

  • Convergence plots (log-scale loss vs. iteration)
  • Theoretical vs. empirical convergence rates
  • Sensitivity to hyperparameters (learning rate, momentum)
  • Effect of condition number on convergence

Writing: Produce a technical report with:

  • Mathematical definitions of each optimizer
  • Convergence guarantees (with proofs or citations)
  • Empirical results
  • Discussion of when each method excels
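
To get started, here is a hedged sketch of two of the update rules (heavy-ball momentum and Adam) written as single-step functions; the function names and defaults are illustrative, and the other optimizers follow the same pattern:

```python
import numpy as np

def heavy_ball_step(w, v, grad, lr=0.01, beta=0.9):
    """Momentum (heavy ball): the velocity v accumulates past gradients."""
    v = beta * v - lr * grad
    return w + v, v

def adam_step(w, m, s, t, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected first (m) and second (s) moment estimates of the gradient."""
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                 # bias correction; t counts steps from 1
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```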

Week 5-6: Advanced Matrix Methods and Probabilistic Interpretations

🎓 Strang 18.065: Advanced Topics (Lectures 8-12)

Lecture 8: Norms of Vectors and Matrices

  • Vector norms: L1, L2, L∞
  • Matrix norms: Operator norm, Frobenius norm, nuclear norm
  • Condition number and numerical stability
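
All three matrix norms and the condition number fall out of the singular values; this short check (arbitrary example matrix) cross-verifies the formulas against NumPy's built-ins:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])                     # arbitrary example
s = np.linalg.svd(A, compute_uv=False)

op_norm = s[0]                                 # operator (spectral) norm = largest singular value
fro_norm = np.sqrt((s ** 2).sum())             # Frobenius norm = sqrt(sum of squared singular values)
nuc_norm = s.sum()                             # nuclear norm = sum of singular values
cond = s[0] / s[-1]                            # 2-norm condition number

assert np.isclose(op_norm, np.linalg.norm(A, 2))
assert np.isclose(fro_norm, np.linalg.norm(A, 'fro'))
assert np.isclose(nuc_norm, np.linalg.norm(A, 'nuc'))
assert np.isclose(cond, np.linalg.cond(A, 2))
```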

Lecture 9: Four Ways to Solve Least Squares Problems

  • Normal equations vs. QR vs. SVD vs. iterative
  • Numerical stability analysis
  • When each method is preferred
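
A compact sketch of three of the lecture's four routes (normal equations, QR, SVD) on a well-conditioned random problem, with NumPy's lstsq as a reference; the iterative route (e.g. LSQR or conjugate gradient) is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)

# 1. Normal equations (fast, but squares the condition number)
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
# 2. QR factorization (the standard stable direct method)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)
# 3. SVD via the pseudoinverse (handles rank deficiency gracefully)
x_svd = np.linalg.pinv(A) @ b
# Reference: library least-squares driver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_ne, x_qr) and np.allclose(x_qr, x_svd) and np.allclose(x_svd, x_ref)
```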

Lecture 10: Survey of Difficulties with Ax = b

  • Ill-conditioned systems
  • Regularization: Ridge, Lasso, Elastic Net
  • Statistical interpretation: prior distributions
  • Connection to your field:
    • Physics: Regularization as adding energy penalty
    • Economics: Ridge regression as Bayesian prior (Gaussian)
    • Neuroscience: Sparse coding (L1 regularization)
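
Ridge regularization is a one-line modification of the normal equations; the sketch below also records the Gaussian-prior reading mentioned above (λ is whatever penalty you choose):

```python
import numpy as np

def ridge(A, b, lam):
    """Tikhonov/ridge solution: argmin ||Ax - b||^2 + lam ||x||^2.
    Equivalently the MAP estimate under a zero-mean Gaussian prior on x."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```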

Lecture 11: Minimizing ‖x‖ Subject to Ax = b

  • Constrained optimization
  • Lagrange multipliers
  • Kernel methods and RKHS (Reproducing Kernel Hilbert Space)
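
For the minimum-norm problem, the pseudoinverse and the Lagrange-multiplier formula agree; here is a tiny underdetermined example (arbitrary numbers):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0]])               # one equation, three unknowns
b = np.array([6.0])

x_min = np.linalg.pinv(A) @ b                 # pseudoinverse gives the minimum-norm solution
x_lag = A.T @ np.linalg.solve(A @ A.T, b)     # Lagrange-multiplier form: x = A^T (A A^T)^{-1} b
assert np.allclose(x_min, x_lag)
```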

Lecture 12: Computing Eigenvalues and Singular Values

  • Power method and inverse iteration
  • QR algorithm for eigenvalues
  • Practical algorithms for SVD computation
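
A minimal power-iteration sketch for the dominant eigenpair (the stopping rule and defaults are illustrative); inverse iteration and the QR algorithm build on the same idea:

```python
import numpy as np

def power_method(A, iters=1000, tol=1e-12, seed=0):
    """Power iteration: converges to the eigenvector of the eigenvalue largest in magnitude,
    assuming that eigenvalue is well separated from the rest of the spectrum."""
    x = np.random.default_rng(seed).standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam = x @ A @ x
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)
        lam_new = x @ A @ x                   # Rayleigh quotient estimate
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam_new, x
```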

📖 Textbook Study

Linear Algebra and Learning from Data - Chapters 4-5

Chapter 4: Eigenvalues and Eigenvectors (Applications)

  • Google PageRank
  • Markov chains and steady states
  • Graph Laplacians

Chapter 5: Computations with Large Matrices

  • Randomized SVD
  • Sketching and sampling methods
  • When to use iterative vs. direct solvers
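
A minimal sketch of the randomized SVD idea from Chapter 5 (random range finder, then a small exact SVD); the oversampling and power-iteration counts below are illustrative defaults:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD: sample the range of A, then decompose a small matrix."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Y = A @ Omega                                      # sketch of the column space
    for _ in range(n_iter):                            # power iterations sharpen the spectrum
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                             # orthonormal basis for the sketch
    B = Q.T @ A                                        # small (k+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k, :]
```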

Deep read: Don't rush. Work through computational examples.


💻 Research Application: Your Field × Linear Algebra

Choose a project connecting ML to your domain:

Physics:

  • Dimensionality reduction in high-dimensional phase spaces
  • PCA for quantum state tomography
  • Matrix product states (tensor decomposition)

Economics:

  • Factor models in finance (PCA on returns)
  • Network analysis of trade flows (spectral methods)
  • Causal inference with high-dimensional controls (Lasso)

Neuroscience:

  • Dimensionality reduction of neural activity (PCA, demixed PCA)
  • Connectivity matrices (graph theory, spectral analysis)
  • Decoding neural activity (linear classifiers, regularization)

Chemistry:

  • Molecular dynamics: PCA on conformational space
  • Quantum chemistry: Matrix diagonalization
  • Spectroscopy: Signal decomposition (NMF, ICA)

Ecology/Biology:

  • Species distribution modeling (matrix factorization)
  • Gene expression analysis (SVD, NMF)
  • Phylogenetic trees (distance matrices, spectral methods)

Deliverable: Technical write-up (5-10 pages) with:

  • Problem formulation
  • Linear algebra methods applied
  • Results and interpretation
  • Connection to existing literature in your field

Week 7-8: Deep Learning Theory

Now you're ready for neural networks with theoretical rigor.

🎓 Strang 18.065: Neural Network Theory (Lectures 26-27, 32-33)

Lecture 26: Structure of Neural Nets for Deep Learning

  • Neural network architecture fundamentals
  • Activation functions and their properties
  • Layer composition and depth

Lecture 27: Backpropagation: Find Partial Derivatives

  • Chain rule for compositions
  • Computational graph
  • Automatic differentiation: forward vs. reverse mode
  • Mathematical exercise:
    • Derive backprop equations for a 3-layer network by hand
    • Understand Jacobian matrices for each layer
    • Connect to adjoint methods (if you know optimal control)
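
If you want to check your hand derivation, here is a minimal two-layer example (tanh hidden layer, squared-error loss; all dimensions and data are arbitrary) with a finite-difference check at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))                        # input
y = rng.standard_normal((2, 1))                        # target
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((2, 3)), np.zeros((2, 1))

# Forward pass
z1 = W1 @ x + b1
h = np.tanh(z1)
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (chain rule, layer by layer)
d_yhat = y_hat - y                                     # dL/dy_hat
dW2, db2 = d_yhat @ h.T, d_yhat
dh = W2.T @ d_yhat
dz1 = dh * (1 - h ** 2)                                # tanh'(z) = 1 - tanh(z)^2
dW1, db1 = dz1 @ x.T, dz1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x + b1) + b2 - y) ** 2)
assert np.isclose((loss_p - loss) / eps, dW1[0, 0], atol=1e-4)
```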

Lecture 32: ImageNet is a Convolutional Neural Network (CNN), The Convolution Rule

  • Convolution as Toeplitz matrix multiplication
  • Translation equivariance
  • Parameter sharing reduces model complexity

Theoretical question: Why do CNNs work so well on images? (Translation equivariance, local receptive fields, hierarchical features)
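
The convolution-as-matrix-multiplication point is easy to verify in one dimension; a minimal sketch (kernel and signal are arbitrary):

```python
import numpy as np

def conv_toeplitz(h, n):
    """Build the (len(h)+n-1) x n Toeplitz matrix T with T @ x == np.convolve(h, x)."""
    m = len(h)
    T = np.zeros((m + n - 1, n))
    for j in range(n):
        T[j:j + m, j] = h                     # each column is a shifted copy of the kernel
    return T

h = np.array([1.0, -2.0, 1.0])                # arbitrary kernel
x = np.random.default_rng(0).standard_normal(5)
T = conv_toeplitz(h, len(x))
assert np.allclose(T @ x, np.convolve(h, x))  # full convolution = Toeplitz matrix times signal
```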

Lecture 33: Neural Nets and the Learning Function

  • Universal approximation theorem
  • Depth vs. width: Expressivity results
  • Sample complexity: How much data do you need?

📖 Advanced Reading: Deep Learning Theory

Textbook: "Deep Learning" by Goodfellow, Bengio, Courville

Read:

  • Chapter 6: Deep Feedforward Networks
  • Chapter 8: Optimization for Training Deep Models
  • Chapter 7: Regularization for Deep Learning
  • Chapter 15: Representation Learning (if interested in theory)

Approach: Mathematical rigor. Work through derivations.


📄 Paper Reading

Required Papers (foundational deep learning theory):

  1. "Understanding Deep Learning Requires Rethinking Generalization" (Zhang et al., 2017)
    • Deep networks can memorize random labels
    • Challenges traditional statistical learning theory
  2. "The Loss Surfaces of Multilayer Networks" (Choromanska et al., 2015)
    • Why local minima aren't a problem in practice
    • Random matrix theory applied to neural networks
  3. "Visualizing the Loss Landscape of Neural Nets" (Li et al., 2018)
    • Empirical analysis of loss surface geometry

For each paper:

  • Read carefully (2-3 hours per paper)
  • Summarize main theoretical contributions
  • Implement key experiments if feasible
  • Critique: What assumptions are made? Are they justified?

Week 9-10: Specialization and Research Project

Choose Your Research Direction

Option A: Advanced Optimization Theory

Focus: Second-order methods, natural gradients, Fisher information

Additional Lectures:

  • Review 18.065 Lecture 21 (Minimizing a Function Step by Step), focusing on Newton's method, which uses second-order derivatives (the Hessian matrix)
  • The trade-off between Newton's method (faster convergence but higher computational cost) and gradient descent

Project:

  • Implement natural gradient descent
  • Compare with SGD and Adam on neural network training
  • Analyze Fisher Information Matrix structure
  • Connection: Information geometry, Riemannian optimization

Deliverable: Research-quality technical report


Option B: Probabilistic Methods and Bayesian Deep Learning

Focus: Uncertainty quantification, variational inference

Reading:

  • "Weight Uncertainty in Neural Networks" (Blundell et al., 2015)
  • "Dropout as a Bayesian Approximation" (Gal & Ghahramani, 2016)

Project:

  • Implement Bayesian neural network
  • Compare point estimates vs. posterior distributions
  • Analyze predictive uncertainty
  • Application: Active learning, OOD detection

Your advantage: Strong probability/statistics background


Option C: Graph Neural Networks and Spectral Methods

Focus: Graph Laplacians, spectral graph theory

Reading:

  • "Spectral Networks and Deep Locally Connected Networks on Graphs" (Bruna et al., 2014)
  • "Semi-Supervised Classification with Graph Convolutional Networks" (Kipf & Welling, 2017)

Project:

  • Implement graph convolutional network
  • Apply to citation network (Cora) or molecular data
  • Analyze: How does spectral filtering work?

Connection:

  • Physics: Many-body systems, quantum chemistry
  • Economics: Network effects, social networks
  • Neuroscience: Brain connectivity

Option D: Matrix Completion and Low-Rank Methods

Focus: Theoretical guarantees, nuclear norm minimization

Reading:

  • "Exact Matrix Completion via Convex Optimization" (Candès & Recht, 2009)
  • "A Singular Value Thresholding Algorithm for Matrix Completion" (Cai et al., 2010)

Project:

  • Implement matrix completion algorithms
  • Analyze incoherence conditions
  • Application: Recommender systems, missing data imputation

Theory:

  • When is exact recovery possible?
  • Sample complexity bounds
  • Connection to compressed sensing
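
A minimal singular value thresholding (SVT) sketch in the spirit of Cai et al. (2010); the threshold tau and step size delta need tuning as described in the paper, and the values you pass here are just starting points:

```python
import numpy as np

def svt_complete(M_obs, mask, tau, delta, iters=300):
    """Matrix completion by singular value thresholding.
    M_obs holds the observed entries (zeros elsewhere); mask is True where entries are observed."""
    Y = np.zeros_like(M_obs)
    X = np.zeros_like(M_obs)
    for _ in range(iters):
        U, S, Vt = np.linalg.svd(Y, full_matrices=False)
        X = U @ np.diag(np.maximum(S - tau, 0.0)) @ Vt   # soft-threshold the singular values
        Y = Y + delta * mask * (M_obs - X)               # correct only on observed entries
    return X
```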

💻 Capstone: Publish-Quality Research

Goal: Produce work suitable for submission to workshop or conference

Structure:

  1. Introduction: Motivation, problem formulation
  2. Background: Related work, theoretical foundations
  3. Methods: Your approach, algorithms, theoretical contributions
  4. Experiments: Rigorous empirical validation
  5. Results: Analysis, discussion, limitations
  6. Conclusion: Contributions, future work

Length: 6-8 pages (conference short paper format)

Formatting: Use LaTeX, follow NeurIPS/ICML/ICLR template

Peer Review: Share with colleagues or post on arXiv for feedback


📚 Complete Resource Compilation

Primary Courses

MIT 18.06 - Linear Algebra: Gilbert Strang's classic course (MIT OpenCourseWare video lectures)

MIT 18.065 - Matrix Methods in ML: Strang's Matrix Methods in Data Analysis, Signal Processing, and Machine Learning (MIT OpenCourseWare)

Textbooks

Gilbert Strang: Linear Algebra and Learning from Data

Optimization: Boyd & Vandenberghe, Convex Optimization (free PDF online)

Deep Learning: Goodfellow, Bengio, and Courville, Deep Learning

Supplementary

3Blue1Brown: Essence of Linear Algebra video series


🎯 Research Career Strategies

Publishing Path:

  1. Workshops: NeurIPS, ICML, ICLR workshops (less competitive, good for first papers)
  2. Short papers: 4-page format at some conferences
  3. Main conferences: NeurIPS, ICML, ICLR, AAAI (highly competitive)
  4. Journals: JMLR, Machine Learning, Pattern Recognition

Skill Development:

  • Code release: All papers should have public code (GitHub)
  • Reproducibility: Document experiments meticulously
  • Writing: Study well-written papers, practice technical writing
  • Presentation: Learn to give clear research talks

Networking:

  • arXiv: Post preprints, get early feedback
  • Twitter/X: ML community is active (follow researchers in your area)
  • Conferences: Attend virtually or in-person, network
  • Collaborations: Find collaborators with complementary skills

Transition Strategies:

From Academia to ML Research:

  • Postdoc in ML lab (if continuing in academia)
  • Research scientist at DeepMind, OpenAI, FAIR, Google Brain, Anthropic
  • Applied scientist at companies with research arms (Microsoft Research, Amazon Science)

From Industry to ML:

  • ML research engineer roles
  • Publish 1-2 strong papers first
  • Demonstrate both theory and implementation skills

Staying in Your Field but Adding ML:

  • Become the "ML person" in your department
  • Collaborate with CS/ML researchers
  • Apply ML to domain-specific problems (huge opportunity!)

💡 Advanced Study Tips

Proof Techniques:

  • Always attempt proofs before reading solutions
  • Understand proof strategies (contradiction, induction, construction)
  • Build intuition first, then formalize

Paper Reading:

  • Budget 3-4 hours per paper for deep reading
  • Take detailed notes
  • Implement key algorithms
  • Critique assumptions and limitations

Mathematical Maturity:

  • You have this! Use it.
  • Connect new concepts to things you already know
  • Don't accept hand-waving—demand rigor

Balancing Theory and Practice:

  • Theory without implementation is incomplete
  • Implementation without theory is fragile
  • Do both, always

Time Investment (6-8 hours/week):

  • 2 hours: Lectures (can go faster, 1.5-2x speed)
  • 2 hours: Reading papers/textbooks
  • 3 hours: Implementation and research
  • 1 hour: Writing and documentation

Your quantitative background is a massive advantage. Many ML practitioners lack deep mathematical training—you can bring rigor and theoretical insight that's sorely needed. Go publish! 🚀