
AI / Machine Learning / Deep Learning — PhD-Level Foundations

Taught by a researcher who built ML models for nuclear nonproliferation at DTRA. Theory + implementation + research methodology — from someone who uses these tools in the field.

PhD Nuclear Engineering, AFIT  ·  Research at DTRA  ·  ML for nuclear nonproliferation  ·  HuggingFace: Dr-P

01
Foundation

Linear Algebra for ML

Key Concepts

  • Vectors, matrices, and tensor representations
  • Dot products, projections, and norms
  • Matrix multiplication and composition
  • Eigendecomposition and diagonalization
  • Singular Value Decomposition (SVD)
  • Principal Component Analysis (PCA)
  • Rank, null space, and linear independence
Python — SVD & PCA example
import numpy as np

# SVD decomposition
A = np.array([[3, 1], [2, 4], [0, 5]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(f"Singular values: {S}")

# Reconstruct A from SVD
A_approx = U @ np.diag(S) @ Vt

# PCA via SVD (zero-center first)
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# First 2 principal components
X_pca = X_centered @ Vt[:2].T

Common Misconceptions

Matrix multiplication is not commutative: AB ≠ BA in general. Order matters for every neural network layer transformation.
Eigenvalues measure "stretch factor" in a direction — not the magnitude of the vector. An eigenvalue of zero means the matrix collapses that direction entirely.
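Both points are easy to verify numerically. A minimal NumPy sketch (the matrices here are chosen arbitrarily for illustration):

```python
import numpy as np

# Non-commutativity: AB != BA in general
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.allclose(A @ B, B @ A))   # False

# A zero eigenvalue collapses a direction: this matrix
# projects everything onto the x-axis
P = np.array([[1.0, 0.0], [0.0, 0.0]])
vals, vecs = np.linalg.eig(P)
print(np.sort(vals))               # [0. 1.]

v = np.array([0.0, 3.0])           # lies along the zero-eigenvalue direction
print(P @ v)                       # [0. 0.]
```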

Practice Problems

A weight matrix W is 512×256. An input batch has shape (32, 512). What is the output shape after the forward pass? Why does this order of multiplication matter?

Given a dataset matrix X (100 samples, 8 features), show how the SVD of the centered X relates to the eigendecomposition of the covariance matrix. Explain what the first singular value tells you about the data's variance.

02
Foundation

Probability & Statistics for ML

Key Concepts

  • Probability distributions (Gaussian, Bernoulli, Categorical)
  • Bayes' theorem and conditional probability
  • Maximum Likelihood Estimation (MLE)
  • Maximum A Posteriori (MAP) estimation
  • Expectation, variance, covariance
  • Central Limit Theorem and its ML implications
  • Hypothesis testing and p-values
Python — scipy.stats distributions
from scipy import stats
import numpy as np

# MLE: fit a normal distribution to data
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mu, sigma = stats.norm.fit(data)
print(f"MLE estimate: mu={mu:.3f}, sigma={sigma:.3f}")

# Log-likelihood (what MLE maximizes)
log_lik = np.sum(stats.norm.logpdf(data, mu, sigma))

# KL divergence between two Gaussians
p = stats.norm(0, 1)
q = stats.norm(0.5, 1.2)
x = np.linspace(-5, 5, 1000)
kl = np.trapezoid(p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), x)  # np.trapz in NumPy < 2.0
print(f"KL divergence: {kl:.4f}")

Common Misconceptions

Probability and likelihood are not the same. Probability is over outcomes given fixed parameters; likelihood is over parameters given fixed observed data.
MLE and MAP give the same answer only when the prior is uniform. MAP is MLE with regularization baked in — L2 regularization corresponds to a Gaussian prior.
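The shrinkage view of MAP can be made concrete with a conjugate-Gaussian sketch: estimating a mean under a N(0, τ²) prior. The noise variance and prior scale below are illustrative assumptions, not values from any particular dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)
sigma2, tau2 = 1.0, 0.5   # known noise variance; N(0, tau2) prior on mu

n = len(data)
mle = data.mean()
# MAP for a Gaussian mean with a Gaussian prior is a precision-weighted
# average of the data mean and the prior mean (0) -- exactly L2 shrinkage.
map_est = (n / sigma2 * mle) / (n / sigma2 + 1 / tau2)

print(f"MLE: {mle:.3f},  MAP: {map_est:.3f} (shrunk toward 0)")
# As tau2 -> infinity (flat prior), the 1/tau2 term vanishes and MAP == MLE
```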

Practice Problems

You flip a biased coin 20 times and observe 14 heads. Derive the MLE estimate for p (probability of heads) from scratch using calculus. Then compute the MAP estimate assuming a Beta(2,2) prior.

A classifier outputs probabilities [0.9, 0.05, 0.05] for three classes. Compute the cross-entropy loss against the true label (class 1). Explain why cross-entropy is the negative log-likelihood of a categorical distribution.

03
Core ML

Classical Machine Learning

Key Concepts

  • Linear and logistic regression
  • Decision trees and information gain
  • Random forests and bagging
  • Gradient boosting (XGBoost, LightGBM)
  • Support Vector Machines and kernels
  • K-means and EM clustering
  • Cross-validation and model selection
Python — scikit-learn pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42
    ))
])

# 5-fold cross-validation (X, y: your feature matrix and labels)
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importances (X_train, y_train: a held-out training split)
pipeline.fit(X_train, y_train)
importances = pipeline['clf'].feature_importances_

Common Misconceptions

Overfitting is not the same as high training accuracy. A model can memorize noise. Underfitting is also a failure — a model too simple to capture the true signal.
More data usually beats a better algorithm. A simple logistic regression on 1M examples often outperforms a tuned SVM on 10K examples. Data quality is the multiplier.
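The memorization gap is easy to demonstrate. A sketch on a synthetic noisy dataset (sample counts and noise level here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 20% label noise: an unconstrained tree memorizes it anyway
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)   # perfect on training data
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5).mean()
print(f"train={train_acc:.2f}, cv={cv_acc:.2f}")  # cross-validation exposes the gap
```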

Practice Problems

Explain in your own words why a decision tree without depth limits perfectly memorizes training data but generalizes poorly. What does random forest do differently to fix this?

You train a logistic regression and get 98% training accuracy but 62% validation accuracy. List three concrete things you would do next and explain the reasoning for each.

04
Deep Learning

Deep Learning Foundations

Key Concepts

  • Neural network architecture and layer types
  • Activation functions (ReLU, sigmoid, softmax, GELU)
  • Loss functions and their probabilistic interpretations
  • Gradient descent (SGD, Adam, AdamW)
  • Backpropagation via the chain rule
  • Batch normalization and layer normalization
  • Dropout, L1/L2 regularization
Python — PyTorch neural network
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, out_dim)
        )
    def forward(self, x):
        return self.net(x)

model = MLP(128, 256, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training step
optimizer.zero_grad()
logits = model(x_batch)
loss = criterion(logits, y_batch)
loss.backward()
optimizer.step()

Common Misconceptions

Deeper is not always better. Without skip connections (residual networks) or careful initialization, very deep networks suffer from vanishing or exploding gradients.
Vanishing gradients are not just a training problem — they indicate the model architecture is poorly suited to the task. Switching to ReLU or adding batch norm often fixes this structurally.
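The effect is measurable. A sketch comparing the gradient that reaches the first layer of a 10-layer sigmoid stack versus the same stack with ReLU (width and depth are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(act):
    # Build 10 Linear layers with the given activation, run one
    # backward pass, and measure the gradient at the first layer.
    torch.manual_seed(0)
    layers = []
    for _ in range(10):
        layers += [nn.Linear(64, 64), act()]
    net = nn.Sequential(*layers, nn.Linear(64, 1))
    net(torch.randn(32, 64)).sum().backward()
    return net[0].weight.grad.norm().item()

# Sigmoid's derivative is at most 0.25, so ten layers of chain rule
# shrink the signal by orders of magnitude relative to ReLU.
print(f"sigmoid: {first_layer_grad_norm(nn.Sigmoid):.2e}")
print(f"relu:    {first_layer_grad_norm(nn.ReLU):.2e}")
```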

Practice Problems

Derive the gradient of the cross-entropy loss with respect to the logits (before softmax). Show that it simplifies to (predicted probability - true label). Why does this make gradient descent intuitive?

You have a 10-layer network with sigmoid activations and find gradients near-zero in early layers. List two architectural changes you would make and explain why each helps numerically.

05
Deep Learning

Convolutional Neural Networks

Key Concepts

  • The convolution operation and discrete filters
  • Kernels, feature maps, and receptive fields
  • Pooling layers (max, average, adaptive)
  • Stride, padding, and output dimension formulas
  • Classic architectures: VGG, ResNet, EfficientNet
  • Transfer learning and fine-tuning strategies
  • 1D and 3D convolutions for non-image data
Python — PyTorch CNN
import torch.nn as nn
import torchvision.models as models

# Build a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Linear(64 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Transfer learning: freeze backbone, train head
n_classes = 5
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(512, n_classes)  # new head trains (requires_grad=True by default)

Common Misconceptions

CNNs are not just for images. 1D convolutions are standard in time-series and NLP; 3D convolutions are used in video understanding and volumetric medical imaging.
Stride and padding are not just aesthetic choices. Output size = floor((W - K + 2P) / S) + 1. Getting this wrong causes silent shape mismatches in deep networks.
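The formula is easy to sanity-check against PyTorch itself (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

def conv_out(W, K, P, S):
    # floor((W - K + 2P) / S) + 1, the output-size formula above
    return (W - K + 2 * P) // S + 1

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)           # torch.Size([1, 16, 16, 16])
print(conv_out(32, 3, 1, 2))   # 16
```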

Practice Problems

An input image is 224×224×3. You apply a Conv2d(3, 64, kernel_size=7, stride=2, padding=3). What is the output shape? How many learnable parameters does this layer have?

You are fine-tuning ResNet-18 on a 5-class medical imaging dataset with only 800 training samples. Describe your strategy: which layers do you freeze, which do you train, and what augmentations do you apply? Justify each choice.

06
Modern AI

Transformers & LLMs

Key Concepts

  • Self-attention mechanism and scaled dot-product attention
  • Multi-head attention and why it works
  • Positional encoding (sinusoidal and rotary)
  • Encoder-only, decoder-only, encoder-decoder architectures
  • Pre-training objectives (MLM, CLM, T5-style)
  • Fine-tuning vs. parameter-efficient methods (LoRA, adapters)
  • Prompt engineering and in-context learning
Python — HuggingFace transformers
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers changed NLP forever."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token embedding as sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")

# Attention weights (requires output_attentions=True)
outputs_attn = model(**inputs, output_attentions=True)
attn = outputs_attn.attentions  # tuple of (batch, heads, seq, seq)

Common Misconceptions

Attention is not literally "looking at" words the way humans read. It is a weighted average of value vectors — a differentiable, soft dictionary lookup. The analogy is useful but breaks down under scrutiny.
Tokens are not words. "Unforgettable" may be 3-4 tokens depending on the tokenizer. Token counts determine cost, context length, and model behavior — always check your tokenization.
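The "soft dictionary lookup" view fits in a few lines. A sketch of single-head scaled dot-product attention (shapes chosen for illustration):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Similarity scores -> softmax weights -> weighted average of values:
    # a differentiable, soft dictionary lookup
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Q = torch.randn(1, 5, 16)   # (batch, seq, d_k)
K = torch.randn(1, 5, 16)
V = torch.randn(1, 5, 16)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)            # torch.Size([1, 5, 16])
print(w.sum(-1))            # each row of attention weights sums to 1
```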

Practice Problems

Walk through the scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Explain what each matrix represents, why we scale by √d_k, and what happens if we skip the scaling.

You want to fine-tune GPT-2 for a domain-specific classification task with a small labeled dataset. Compare full fine-tuning vs. LoRA. What are the trade-offs in terms of parameters updated, memory, and generalization risk?

07
Research Methods

ML Research Methods

Key Concepts

  • Experimental design and controlling confounds
  • Ablation studies: isolating what actually matters
  • Standard benchmark datasets and their limitations
  • Reproducibility: seeds, environment pinning, logging
  • Reading ML papers effectively (title, abstract, results first)
  • Statistical significance in model comparisons
  • Compute budgets and scaling laws

Common Research Mistakes

HARKing (Hypothesizing After Results are Known): reporting post-hoc discoveries as pre-planned hypotheses. This inflates false positives and breaks the scientific method.
Not setting random seeds. A result that only holds for one seed is not a result. Always set torch.manual_seed(42), numpy.random.seed(42), and document your environment.
Evaluating on the test set more than once. Each peek leaks information — the test set is for final evaluation only. Use a held-out validation set during development.
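A minimal seeding helper along these lines (which RNGs matter depends on your stack; this sketch covers the usual three):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG a typical PyTorch experiment touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # also seeds CUDA generators if available
    # Optional, PyTorch >= 1.11: flag nondeterministic kernels
    torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
print(torch.equal(a, b))  # True: the run is repeatable
```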

Dr. Preston on HuggingFace

Deployed models and Spaces from real research work — including the AFOQT adaptive intelligence app built with ML.

Visit huggingface.co/Dr-P →

Practice Problems

You improve a baseline model's accuracy from 84.2% to 85.1% on a test set. What additional information do you need before concluding this is a meaningful improvement? Sketch the statistical test you would run.

Design an ablation study for a transformer model that uses both multi-head attention and a novel positional encoding scheme. What variants would you train, and what would each ablation tell you about which component drives performance?

Field-Tested Texts

Recommended Textbooks

These are the books Dr. Preston has personally worked through. Every recommendation is field-tested in real research and tutoring sessions.

Affiliate disclosure: links use tag fissionlab-20. You pay nothing extra; commissions help keep this resource free.

📚
Deep Learning
Goodfellow, Bengio & Courville

The definitive graduate-level reference. Covers probability theory, optimization, regularization, CNNs, RNNs, and generative models. Dense but indispensable.

View on Amazon →
🔬
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow
Aurélien Géron

The best practical introduction. Implementation-first approach — every concept is anchored in code. Excellent coverage of end-to-end ML projects.

View on Amazon →
📊
Pattern Recognition and Machine Learning
Christopher Bishop

The probabilistic ML bible. Bayesian methods, graphical models, kernel machines, and the mathematical foundations that most practitioners skip. Read this for depth.

View on Amazon →
The Hundred-Page Machine Learning Book
Andriy Burkov

Surprisingly complete for its length. Great for building intuition fast, reviewing before interviews, or getting a second perspective on algorithms you've already studied.

View on Amazon →

Join the AI/ML Study Group

Bring your implementation questions, paper discussions, and debugging headaches. The Discord community includes students working through these exact topics — plus office hours with Dr. Preston.

Join the Discord Community →

Free. No spam. No upsells.

1:1 Tutoring

Want personalized guidance?

Sessions are structured around your specific goals — whether that's passing a qual exam, understanding backprop from scratch, or getting a research project off the ground. Theory and implementation, not just intuition.

Book a Free Intro Session →

Free Weekly Newsletter

ML Explained by a Practitioner

Implementation tips, paper breakdowns, and research insights from Dr. Preston — every week, free.

No spam. Unsubscribe anytime.