
AI / Machine Learning / Deep Learning — PhD-Level Foundations

Taught by a researcher who built ML models for nuclear nonproliferation at DTRA. Theory + implementation + research methodology — from someone who uses these tools in the field.

PhD Nuclear Engineering, AFIT  ·  Research at DTRA  ·  ML for nuclear nonproliferation  ·  HuggingFace: Dr-P

01
Foundation

Linear Algebra for ML

Key Concepts

  • Vectors, matrices, and tensor representations
  • Dot products, projections, and norms
  • Matrix multiplication and composition
  • Eigendecomposition and diagonalization
  • Singular Value Decomposition (SVD)
  • Principal Component Analysis (PCA)
  • Rank, null space, and linear independence
Python — SVD & PCA example
import numpy as np

# SVD decomposition
A = np.array([[3, 1], [2, 4], [0, 5]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(f"Singular values: {S}")

# Reconstruct A from SVD
A_approx = U @ np.diag(S) @ Vt

# PCA via SVD (zero-center first)
X = np.random.randn(100, 5)
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# First 2 principal components
X_pca = X_centered @ Vt[:2].T

Common Misconceptions

Matrix multiplication is not commutative: AB ≠ BA in general. Order matters for every neural network layer transformation.
Eigenvalues measure "stretch factor" in a direction — not the magnitude of the vector. An eigenvalue of zero means the matrix collapses that direction entirely.
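Both points are easy to verify numerically. A minimal NumPy sketch (the matrices here are chosen arbitrarily for illustration):

```python
import numpy as np

# Non-commutativity: AB != BA in general
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
print(np.allclose(A @ B, B @ A))   # False

# A zero eigenvalue collapses a direction: this matrix
# projects everything onto the x-axis
P = np.array([[1.0, 0.0], [0.0, 0.0]])
vals, vecs = np.linalg.eig(P)
print(np.sort(vals))               # [0. 1.]

v = np.array([0.0, 3.0])           # lies along the zero-eigenvalue direction
print(P @ v)                       # [0. 0.]
```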

Practice Problems

A weight matrix W is 512×256. An input batch has shape (32, 512). What is the output shape after the forward pass? Why does this order of multiplication matter?

Given a dataset matrix X (100 samples, 8 features), show how the SVD of the centered X relates to the eigendecomposition of the covariance matrix. Explain what the first singular value tells you about the data's variance.

02
Foundation

Probability & Statistics for ML

Key Concepts

  • Probability distributions (Gaussian, Bernoulli, Categorical)
  • Bayes' theorem and conditional probability
  • Maximum Likelihood Estimation (MLE)
  • Maximum A Posteriori (MAP) estimation
  • Expectation, variance, covariance
  • Central Limit Theorem and its ML implications
  • Hypothesis testing and p-values
Python — scipy.stats distributions
from scipy import stats
import numpy as np

# MLE: fit a normal distribution to data
data = np.random.normal(loc=5.0, scale=2.0, size=1000)
mu, sigma = stats.norm.fit(data)
print(f"MLE estimate: mu={mu:.3f}, sigma={sigma:.3f}")

# Log-likelihood (what MLE maximizes)
log_lik = np.sum(stats.norm.logpdf(data, mu, sigma))

# KL divergence between two Gaussians
p = stats.norm(0, 1)
q = stats.norm(0.5, 1.2)
x = np.linspace(-5, 5, 1000)
kl = np.trapezoid(p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), x)  # np.trapz in NumPy < 2.0
print(f"KL divergence: {kl:.4f}")

Common Misconceptions

Probability and likelihood are not the same. Probability is over outcomes given fixed parameters; likelihood is over parameters given fixed observed data.
MLE and MAP give the same answer only when the prior is uniform. MAP is MLE with regularization baked in — L2 regularization corresponds to a Gaussian prior.
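The shrinkage view of MAP can be made concrete with a conjugate-Gaussian sketch: estimating a mean under a N(0, τ²) prior. The noise variance and prior scale below are illustrative assumptions, not values from any particular dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=20)
sigma2, tau2 = 1.0, 0.5   # known noise variance; N(0, tau2) prior on mu

n = len(data)
mle = data.mean()
# MAP for a Gaussian mean with a Gaussian prior is a precision-weighted
# average of the data mean and the prior mean (0) -- exactly L2 shrinkage.
map_est = (n / sigma2 * mle) / (n / sigma2 + 1 / tau2)

print(f"MLE: {mle:.3f},  MAP: {map_est:.3f} (shrunk toward 0)")
# As tau2 -> infinity (flat prior), the 1/tau2 term vanishes and MAP == MLE
```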

Practice Problems

You flip a biased coin 20 times and observe 14 heads. Derive the MLE estimate for p (probability of heads) from scratch using calculus. Then compute the MAP estimate assuming a Beta(2,2) prior.

A classifier outputs probabilities [0.9, 0.05, 0.05] for three classes. Compute the cross-entropy loss against the true label (class 1). Explain why cross-entropy is the negative log-likelihood of a categorical distribution.

03
Core ML

Classical Machine Learning

Key Concepts

  • Linear and logistic regression
  • Decision trees and information gain
  • Random forests and bagging
  • Gradient boosting (XGBoost, LightGBM)
  • Support Vector Machines and kernels
  • K-means and EM clustering
  • Cross-validation and model selection
Python — scikit-learn pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(
        n_estimators=100,
        max_depth=5,
        random_state=42
    ))
])

# 5-fold cross-validation (X, y: your feature matrix and labels)
scores = cross_val_score(pipeline, X, y, cv=5, scoring='roc_auc')
print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Feature importances (X_train, y_train: a held-out training split)
pipeline.fit(X_train, y_train)
importances = pipeline['clf'].feature_importances_

Common Misconceptions

Overfitting is not the same as high training accuracy. A model can memorize noise. Underfitting is also a failure — a model too simple to capture the true signal.
More data usually beats a better algorithm. A simple logistic regression on 1M examples often outperforms a tuned SVM on 10K examples. Data quality is the multiplier.
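The memorization gap is easy to demonstrate. A sketch on a synthetic noisy dataset (sample counts and noise level here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 20% label noise: an unconstrained tree memorizes it anyway
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
train_acc = tree.score(X, y)   # perfect on training data
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=5).mean()
print(f"train={train_acc:.2f}, cv={cv_acc:.2f}")  # cross-validation exposes the gap
```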

Practice Problems

Explain in your own words why a decision tree without depth limits perfectly memorizes training data but generalizes poorly. What does random forest do differently to fix this?

You train a logistic regression and get 98% training accuracy but 62% validation accuracy. List three concrete things you would do next and explain the reasoning for each.

04
Deep Learning

Deep Learning Foundations

Key Concepts

  • Neural network architecture and layer types
  • Activation functions (ReLU, sigmoid, softmax, GELU)
  • Loss functions and their probabilistic interpretations
  • Gradient descent (SGD, Adam, AdamW)
  • Backpropagation via the chain rule
  • Batch normalization and layer normalization
  • Dropout, L1/L2 regularization
Python — PyTorch neural network
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, out_dim)
        )
    def forward(self, x):
        return self.net(x)

model = MLP(128, 256, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training step
optimizer.zero_grad()
logits = model(x_batch)
loss = criterion(logits, y_batch)
loss.backward()
optimizer.step()

Common Misconceptions

Deeper is not always better. Without skip connections (residual networks) or careful initialization, very deep networks suffer from vanishing or exploding gradients.
Vanishing gradients are not just a training problem — they indicate the model architecture is poorly suited to the task. Switching to ReLU or adding batch norm often fixes this structurally.
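The effect is measurable. A sketch comparing the gradient that reaches the first layer of a 10-layer sigmoid stack versus the same stack with ReLU (width and depth are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(act):
    # Build 10 Linear layers with the given activation, run one
    # backward pass, and measure the gradient at the first layer.
    torch.manual_seed(0)
    layers = []
    for _ in range(10):
        layers += [nn.Linear(64, 64), act()]
    net = nn.Sequential(*layers, nn.Linear(64, 1))
    net(torch.randn(32, 64)).sum().backward()
    return net[0].weight.grad.norm().item()

# Sigmoid's derivative is at most 0.25, so ten layers of chain rule
# shrink the signal by orders of magnitude relative to ReLU.
print(f"sigmoid: {first_layer_grad_norm(nn.Sigmoid):.2e}")
print(f"relu:    {first_layer_grad_norm(nn.ReLU):.2e}")
```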

Practice Problems

Derive the gradient of the cross-entropy loss with respect to the logits (before softmax). Show that it simplifies to (predicted probability - true label). Why does this make gradient descent intuitive?

You have a 10-layer network with sigmoid activations and find gradients near-zero in early layers. List two architectural changes you would make and explain why each helps numerically.

05
Deep Learning

Convolutional Neural Networks

Key Concepts

  • The convolution operation and discrete filters
  • Kernels, feature maps, and receptive fields
  • Pooling layers (max, average, adaptive)
  • Stride, padding, and output dimension formulas
  • Classic architectures: VGG, ResNet, EfficientNet
  • Transfer learning and fine-tuning strategies
  • 1D and 3D convolutions for non-image data
Python — PyTorch CNN
import torch.nn as nn
import torchvision.models as models

# Build a simple CNN
class SimpleCNN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4))
        )
        self.classifier = nn.Linear(64 * 16, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Transfer learning: freeze backbone, train head
n_classes = 5
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False
backbone.fc = nn.Linear(512, n_classes)  # new head trains (requires_grad=True by default)

Common Misconceptions

CNNs are not just for images. 1D convolutions are standard in time-series and NLP; 3D convolutions are used in video understanding and volumetric medical imaging.
Stride and padding are not just aesthetic choices. Output size = floor((W - K + 2P) / S) + 1. Getting this wrong causes silent shape mismatches in deep networks.
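The formula is easy to sanity-check against PyTorch itself (the sizes below are arbitrary):

```python
import torch
import torch.nn as nn

def conv_out(W, K, P, S):
    # floor((W - K + 2P) / S) + 1, the output-size formula above
    return (W - K + 2 * P) // S + 1

conv = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 3, 32, 32)
print(conv(x).shape)           # torch.Size([1, 16, 16, 16])
print(conv_out(32, 3, 1, 2))   # 16
```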

Practice Problems

An input image is 224×224×3. You apply a Conv2d(3, 64, kernel_size=7, stride=2, padding=3). What is the output shape? How many learnable parameters does this layer have?

You are fine-tuning ResNet-18 on a 5-class medical imaging dataset with only 800 training samples. Describe your strategy: which layers do you freeze, which do you train, and what augmentations do you apply? Justify each choice.

06
Modern AI

Transformers & LLMs

Key Concepts

  • Self-attention mechanism and scaled dot-product attention
  • Multi-head attention and why it works
  • Positional encoding (sinusoidal and rotary)
  • Encoder-only, decoder-only, encoder-decoder architectures
  • Pre-training objectives (MLM, CLM, T5-style)
  • Fine-tuning vs. parameter-efficient methods (LoRA, adapters)
  • Prompt engineering and in-context learning
Python — HuggingFace transformers
from transformers import AutoTokenizer, AutoModel
import torch

# Load a pretrained encoder
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Transformers changed NLP forever."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token embedding as sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"Embedding shape: {cls_embedding.shape}")

# Attention weights (requires output_attentions=True)
outputs_attn = model(**inputs, output_attentions=True)
attn = outputs_attn.attentions  # tuple of (batch, heads, seq, seq)

Common Misconceptions

Attention is not literally "looking at" words the way humans read. It is a weighted average of value vectors — a differentiable, soft dictionary lookup. The analogy is useful but breaks down under scrutiny.
Tokens are not words. "Unforgettable" may be 3-4 tokens depending on the tokenizer. Token counts determine cost, context length, and model behavior — always check your tokenization.
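The "soft dictionary lookup" view fits in a few lines. A sketch of single-head scaled dot-product attention (shapes chosen for illustration):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Similarity scores -> softmax weights -> weighted average of values:
    # a differentiable, soft dictionary lookup
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

Q = torch.randn(1, 5, 16)   # (batch, seq, d_k)
K = torch.randn(1, 5, 16)
V = torch.randn(1, 5, 16)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)            # torch.Size([1, 5, 16])
print(w.sum(-1))            # each row of attention weights sums to 1
```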

Practice Problems

Walk through the scaled dot-product attention formula: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Explain what each matrix represents, why we scale by √d_k, and what happens if we skip the scaling.

You want to fine-tune GPT-2 for a domain-specific classification task with a small labeled dataset. Compare full fine-tuning vs. LoRA. What are the trade-offs in terms of parameters updated, memory, and generalization risk?

07
Research Methods

ML Research Methods

Key Concepts

  • Experimental design and controlling confounds
  • Ablation studies: isolating what actually matters
  • Standard benchmark datasets and their limitations
  • Reproducibility: seeds, environment pinning, logging
  • Reading ML papers effectively (title, abstract, results first)
  • Statistical significance in model comparisons
  • Compute budgets and scaling laws

Common Research Mistakes

HARKing (Hypothesizing After Results are Known): reporting post-hoc discoveries as pre-planned hypotheses. This inflates false positives and breaks the scientific method.
Not setting random seeds. A result that only holds for one seed is not a result. Always set torch.manual_seed(42), numpy.random.seed(42), and document your environment.
Evaluating on the test set more than once. Each peek leaks information — the test set is for final evaluation only. Use a held-out validation set during development.
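A minimal seeding helper along these lines (which RNGs matter depends on your stack; this sketch covers the usual three):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG a typical PyTorch experiment touches
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)   # also seeds CUDA generators if available
    # Optional, PyTorch >= 1.11: flag nondeterministic kernels
    torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(42)
a = torch.randn(3)
set_seed(42)
b = torch.randn(3)
print(torch.equal(a, b))  # True: the run is repeatable
```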

Dr. Preston on HuggingFace

Deployed models and Spaces from real research work — including the AFOQT adaptive intelligence app built with ML.

Visit huggingface.co/Dr-P →

Practice Problems

You improve a baseline model's accuracy from 84.2% to 85.1% on a test set. What additional information do you need before concluding this is a meaningful improvement? Sketch the statistical test you would run.

Design an ablation study for a transformer model that uses both multi-head attention and a novel positional encoding scheme. What variants would you train, and what would each ablation tell you about which component drives performance?

Field-Tested Texts

Recommended Textbooks

These are the books Dr. Preston has personally worked through. Every recommendation is field-tested in real research and tutoring sessions.

Affiliate disclosure: links use tag fissionlab-20. You pay nothing extra; commissions help keep this resource free.

📚
Deep Learning
Goodfellow, Bengio & Courville

The definitive graduate-level reference. Covers probability theory, optimization, regularization, CNNs, RNNs, and generative models. Dense but indispensable.

View on Amazon →
🔬
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow
Aurélien Géron

The best practical introduction. Implementation-first approach — every concept is anchored in code. Excellent coverage of end-to-end ML projects.

View on Amazon →
📊
Pattern Recognition and Machine Learning
Christopher Bishop

The probabilistic ML bible. Bayesian methods, graphical models, kernel machines, and the mathematical foundations that most practitioners skip. Read this for depth.

View on Amazon →
The Hundred-Page Machine Learning Book
Andriy Burkov

Surprisingly complete for its length. Great for building intuition fast, reviewing before interviews, or getting a second perspective on algorithms you've already studied.

View on Amazon →

Join the AI/ML Study Group

Bring your implementation questions, paper discussions, and debugging headaches. The Discord community includes students working through these exact topics — plus office hours with Dr. Preston.

Join the Discord Community →

Free. No spam. No upsells.

1:1 Tutoring

Want personalized guidance?

Sessions are structured around your specific goals — whether that's passing a qual exam, understanding backprop from scratch, or getting a research project off the ground. Theory and implementation, not just intuition.

Book a Free Intro Session →

Free Weekly Newsletter

ML Explained by a Practitioner

Implementation tips, paper breakdowns, and research insights from Dr. Preston — every week, free.

No spam. Unsubscribe anytime.