Remove: Delete the entire row (if few values are missing) or column (if many are missing).
Impute: Fill with an “informed guess”, like the mean, median (more robust to outliers), or mode (best for categorical data).
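A minimal pandas sketch of both options (the toy DataFrame and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "city": ["NY", "LA", None, "NY"]})

# Remove: drop rows (or columns) that contain missing values
df_drop_rows = df.dropna(axis=0)   # when only a few rows are affected
df_drop_cols = df.dropna(axis=1)   # when a column is mostly missing

# Impute: fill with an informed guess
df["age"] = df["age"].fillna(df["age"].median())      # median is robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode for categorical data
```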
Outlier Handling
Detection: Find using statistical rules (e.g., outside 3 standard
deviations) or visualization (e.g., box plots).
Removal: Delete the outlier rows. (Use with caution, as outliers can be valid data).
Clipping/Capping: Set a cap/floor. (e.g., set all values > 99th
percentile to the 99th percentile value).
Log Transform: Use for skewed data to pull in high values.
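A brief numpy sketch of detection, capping, and the log transform (the values are made up):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 2.5, 150.0])   # 150 looks like an outlier

# Detection: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
outliers = np.abs(z) > 3

# Clipping/Capping: cap everything above the 99th percentile
cap = np.percentile(values, 99)
clipped = np.clip(values, None, cap)

# Log transform: pulls in high values of skewed, positive data
logged = np.log1p(values)   # log(1 + x) avoids log(0)
```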
Encoding
Categorical Data (text labels)
One-Hot Encoding
Converts k categories into k binary (0/1) columns.
Use for nominal data (e.g., “red”, “green”, “blue”).
Label Encoding
Converts k categories into a single integer column (1, 2, 3…).
Use for ordinal data (e.g., “low”, “medium”, “high”).
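A short sketch of both encodings in pandas (the category values are examples):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
sizes = pd.DataFrame({"size": ["low", "medium", "high", "low"]})

# One-hot encoding for nominal data: k categories -> k binary columns
one_hot = pd.get_dummies(colors, columns=["color"])

# Label encoding for ordinal data: k categories -> one integer column
order = {"low": 0, "medium": 1, "high": 2}
sizes["size_encoded"] = sizes["size"].map(order)
```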
Numerical Data
Standardization (Z-score)
Rescales to mean = 0, std = 1 using (x - \text{mean}) / \text{std}
Normalization (min-max scaling)
Rescales to a fixed range [0, 1] using (x - \text{min}) / (\text{max} - \text{min})
Sensitive to outliers.
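A sketch of both rescalings in numpy (the values are made up):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 100.0])

# Standardization (Z-score): mean 0, std 1
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): fixed range [0, 1]; sensitive to outliers like 100
normalized = (x - x.min()) / (x.max() - x.min())
```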
Cyclical Data (hours of day, months of year)
Sine, Cosine encoding: each cyclical input is encoded into both a sine and a cosine component
Numerically shows that hour 23 and hour 0 are close together
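A sketch of the sine/cosine encoding for hours of the day:

```python
import numpy as np

hours = np.arange(24)

# Map each hour onto the unit circle so that hour 23 and hour 0 end up close together
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
```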
Image Data
Standardize: same dimensions and format
Grey-scale: an image is a single h \times w matrix of pixel values from 0 to 1
Color: an image is a 3 \times h \times w tensor, one channel per color
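A small numpy sketch of the two representations (the 28 \times 28 size is illustrative):

```python
import numpy as np

# Grey-scale: a single h x w matrix of pixel intensities scaled to [0, 1]
grey = np.random.randint(0, 256, size=(28, 28)) / 255.0      # shape (28, 28)

# Color: a 3 x h x w tensor, one matrix per RGB channel
color = np.random.randint(0, 256, size=(3, 28, 28)) / 255.0  # shape (3, 28, 28)
```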
Text
Simplify: e.g., lowercase, remove some punctuation
Tokenization: converts text into numbered subcomponents
Character-level: smaller vocabulary required, but encodes less
information
Word-level: requires large vocabulary
Sub-word: requires a smaller vocabulary than word-level, but encodes more information than character-level
One-hot encoding: tokens can be one-hot encoded, but this is
inefficient
Token embedding: each token is represented using an array of real
numbers
(Rotary) Positional embedding: incorporates the position of each pattern in the input by rotating each token’s embedding vector by an angle that depends on its position
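A minimal character-level sketch of tokenization and embedding lookup in numpy; real pipelines use sub-word tokenizers and learned (not random) embedding tables:

```python
import numpy as np

text = "hello world"

# Character-level tokenization: small vocabulary, one token per character
vocab = sorted(set(text))
token_ids = [vocab.index(ch) for ch in text]

# Token embedding: each token id maps to an array of real numbers
embedding_dim = 8
embedding_table = np.random.randn(len(vocab), embedding_dim)  # normally learned during training
embedded = embedding_table[token_ids]   # shape (len(text), embedding_dim)
```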
Split: randomly shuffle the data first
Training Set (~70%): Data the model learns from.
Validation Set (~15%): Unseen data used during training to tune
hyperparameters and check overfitting.
Test Set (~15%): Unseen data used only at the end to measure the final model’s performance.
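A sketch using scikit-learn’s train_test_split called twice to get a roughly 70/15/15 split (the random data is a placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 10), np.random.randint(0, 2, size=1000)

# Shuffle, then split ~70% train / ~15% validation / ~15% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)
```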
Architecture/Topology
Common Network Types
Feed-forward (MLP): General purpose for
classification/regression.
Convolutional (CNN): Used for image recognition, video, and spatial
data.
AlexNet, VGG, ResNet
Recurrent (RNN): Used for sequential data (time series, text).
has feedback connections (providing context values) from some
neurons to neurons in a previous layer of the network
trained using gradient descent but requires backpropagation through
time
Problem: Vanishing/exploding gradients.
LSTM (long short-term memory): uses output, input, forget, and cell “gates”; solves the vanishing gradient problem and can remember long-term dependencies.
GRU (gated recurrent unit): uses reset, update, and new “gates”; solves the vanishing gradient problem and can remember long-term dependencies.
Bi-Directional RNN: processes the sequence in both forward and backward directions
Self-Organizing Feature Maps (SOM): Unsupervised, used for
clustering
Auto-encoders (AE): Unsupervised, used for dimensionality reduction
and feature learning.
trained to reproduce the network’s inputs as its outputs
inputs pass through an encoder to a small latent vector, which then passes through a decoder to produce the final outputs
prone to overfitting
Variational AEs (VAE)
the VAE encoder produces the mean and variance of a Normal distribution for each element of the latent vector
discourages overfitting
can be used to generate outputs given latent vector inputs
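A minimal PyTorch sketch of a plain auto-encoder; the 784-dimensional input and 32-dimensional latent vector are illustrative sizes:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_inputs=784, n_latent=32):
        super().__init__()
        # encoder compresses the input to a small latent vector
        self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                     nn.Linear(128, n_latent))
        # decoder reconstructs the input from the latent vector
        self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                     nn.Linear(128, n_inputs))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.randn(16, 784)
loss = nn.MSELoss()(model(x), x)   # trained to reproduce its own inputs
```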
Generative Adversarial (GANs): Used for generating new data (e.g.,
images).
Transformer: Used for natural language processing.
Self-Attention Mechanism, considers all patterns in a sequence
uses query, key, and value tensors, with attention calculated as A = f_\text{softmax}(QK^T)V
the QK^T tensor contains the similarities of each pattern in the key tensor to each pattern in the query tensor
Multi-headed attention: allows applying different attention to
different parts of the input
the output of the attention mechanism is the concatenation of the
various attention heads A_i
Attention Masking: used to hide future inputs when calculating
outputs for each input
e.g., BERT, GPT
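A minimal numpy sketch of single-head attention and causal masking, following the formula above (dimensions are illustrative; practical Transformers also scale QK^T by 1/\sqrt{d_k}):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 16
Q = np.random.randn(seq_len, d_model)   # queries
K = np.random.randn(seq_len, d_model)   # keys
V = np.random.randn(seq_len, d_model)   # values

scores = Q @ K.T                   # similarity of each query pattern to each key pattern
A = softmax(scores, axis=-1) @ V   # A = softmax(QK^T)V

# Attention masking: hide future positions by setting their scores to -inf before softmax
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
A_causal = softmax(np.where(mask, -np.inf, scores), axis=-1) @ V
```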
Common Layers
Input Layer: One neuron per input feature.
Hidden (Dense): Fully-connected layer; performs most of the
learning.
Usually the number of neurons should be \le 2 \times (# input neurons) and > (# output neurons)
Convolutional (Conv2D): Applies filters to learn local
patterns (e.g., edges in an image).
Pooling (Max/Avg): Downsamples feature maps to reduce computational
load.
Dropout: Randomly disables neurons during training; usually implemented as its own layer.
Batch Normalization: Normalizes activations within a batch to
stabilize and speed up training.
Output Layer: One neuron per output (e.g., 1 for regression, 10 for
10-class classification).
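An illustrative PyTorch stack of these layers for a 10-class image classifier (input assumed to be 1 \times 28 \times 28; all sizes are arbitrary):

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # convolutional: learns local patterns
    nn.BatchNorm2d(16),                          # batch normalization
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling: downsamples feature maps
    nn.Flatten(),
    nn.Linear(16 * 14 * 14, 64),                 # hidden (dense) layer
    nn.ReLU(),
    nn.Dropout(0.5),                             # dropout as its own layer
    nn.Linear(64, 10),                           # output: one neuron per class
)
```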
Activation Functions
ReLU (Rectified Linear Unit): \max(0, x). Default for hidden layers.
Softplus: differentiable version of ReLU, k\log(1 + e^x)
Leaky ReLU: Variant of ReLU with non-zero gradient for negative
inputs.
Tanh: Scales output to [-1, 1]. Usually used by RNNs.
Linear: y = x. Used for regression output (to output any value).
Sigmoid: Scales output to [0, 1]. Used for binary classification
output.
Softmax: Converts outputs into probabilities (all outputs sum to 1). Used for multi-class classification.
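A compact numpy sketch of these functions (the input values are arbitrary):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

relu = np.maximum(0, x)                    # max(0, x)
leaky_relu = np.where(x > 0, x, 0.01 * x)  # small non-zero slope for negative inputs
softplus = np.log1p(np.exp(x))             # smooth version of ReLU
tanh = np.tanh(x)                          # range (-1, 1)
sigmoid = 1 / (1 + np.exp(-x))             # range (0, 1)
softmax = np.exp(x) / np.exp(x).sum()      # outputs sum to 1
```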
Training
Key Hyperparameters
Learning Rate: How big of a step the optimizer takes.
Batch Size: Number of training samples used in one optimizer
update.
Number of Epochs: Number of full passes through the training
dataset.
Network Structure: # of layers, # of neurons per layer.
Weight Initialization
Uniform / Normal: Random values.
Uniform between \pm 1/\sqrt{I}, where I is the number of inputs to the layer
Can cause vanishing/exploding gradients.
Xavier / Glorot: “Smart” initialization, good for Sigmoid and Tanh
activations.
He: “Smart” initialization, the default choice for ReLU
activations.
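A brief sketch of applying the “smart” initializations with PyTorch’s built-in initializers (the layer sizes are illustrative):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier/Glorot: suited to sigmoid and tanh activations
nn.init.xavier_uniform_(layer.weight)

# He (Kaiming): the usual default for ReLU activations
nn.init.kaiming_uniform_(layer.weight, nonlinearity="relu")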
Loss Functions (measures “how wrong” the model is)
Mean Squared Error (MSE) / RMSE: For regression tasks.
Binary Cross-Entropy (BCE): For binary classification (with
Sigmoid).
Categorical Cross-Entropy: For multi-class classification (with
Softmax).
Optimizers (update the weights to minimize the loss)
SGD + Momentum: Helps accelerate SGD and smooth out oscillations.
RMSprop: Adaptive learning rate optimizer, good for RNNs.
Adam: (Adaptive Moment Estimation) Combines Momentum and RMSprop.
Often the best choice.
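A sketch of the typical loss/optimizer pairings in PyTorch (the placeholder model and learning rates are illustrative):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model

# Loss functions
mse = nn.MSELoss()            # regression
bce = nn.BCELoss()            # binary classification, paired with a sigmoid output
ce = nn.CrossEntropyLoss()    # multi-class classification (applies softmax internally)

# Optimizers
adam = optim.Adam(model.parameters(), lr=1e-3)               # often the best choice
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)   # SGD + momentum
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)         # adaptive, good for RNNs
```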
Regularization (techniques to prevent overfitting)
L1 / L2 (Weight Decay): Adds a penalty to the loss function for
large weights. L1 can lead to sparse weights (features = 0).
Dropout: Randomly turns off neurons during training to prevent
co-adaptation.
Early Stopping: Stop training when the validation loss stops
improving.
Data Augmentation: Creating more training data by modifying existing
data (e.g., rotating, flipping, or cropping images).
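A short sketch of L2 weight decay, dropout, and image augmentation using PyTorch and torchvision (all sizes and values are illustrative):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Dropout as a layer inside the network
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))

# L2 regularization (weight decay) via the optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Data augmentation: random flips/rotations create modified copies of existing images
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ToTensor(),
])
# Early stopping is handled inside the training loop (see the sketch there)
```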
Training Loop
For each epoch:
Training Phase
Forward Pass: Get predictions from the model on the training
data.
Compute Loss: Compare predictions to actual values.
Backward Pass (Backpropagation): Calculate gradients (error) for all
weights.
Optimization Step: Update weights using the optimizer to reduce the
loss.
Validation Phase
Forward Pass: Get predictions from the model on the validation
data.
Compute Loss: Compare predictions to actual values.
Check if the model is overfitting (e.g., training loss down,
validation loss up).
Check if the early stopping condition is met.
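A minimal PyTorch version of this loop; the toy data, model, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup so the loop runs end to end (sizes are illustrative)
X, y = torch.randn(200, 10), torch.randn(200, 1)
train_loader = DataLoader(TensorDataset(X[:140], y[:140]), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X[140:], y[140:]), batch_size=32)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(20):
    # Training phase
    model.train()
    for X_batch, y_batch in train_loader:
        preds = model(X_batch)            # forward pass
        loss = loss_fn(preds, y_batch)    # compute loss
        optimizer.zero_grad()
        loss.backward()                   # backward pass (backpropagation)
        optimizer.step()                  # optimization step: update weights

    # Validation phase
    model.eval()
    with torch.no_grad():
        val_loss = sum(loss_fn(model(Xb), yb).item() for Xb, yb in val_loader)
    # compare training vs. validation loss to spot overfitting and
    # decide whether the early-stopping condition is met
```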
Evaluation
Test Phase
Forward Pass: Get predictions from the final model on the test
data.
Compute Loss: Compare predictions to actual values.
Calculate metrics to report the model’s performance on unseen
data.
Common Metrics
For Classification:
Accuracy: % of correct predictions.
Precision: % of positive predictions that were correct.
Recall: % of actual positives that were correctly identified.
F1-Score: The harmonic mean of Precision and Recall.
Confusion Matrix: Table showing correct vs. incorrect predictions
for each class.
For Regression:
MSE / RMSE: Average squared error / its square root (RMSE is in the units of the output).
MAE (Mean Absolute Error): Average absolute error (easier to
interpret).
R-squared (R²): Proportion of variance in the output that the model
can predict.
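A quick sketch of these metrics with scikit-learn (the labels and predictions are made up):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, mean_squared_error, mean_absolute_error,
                             r2_score)

# Classification metrics
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Regression metrics
y_true_r = np.array([3.0, 5.0, 2.5])
y_pred_r = np.array([2.8, 5.4, 2.1])
print(mean_squared_error(y_true_r, y_pred_r))    # MSE; RMSE is its square root
print(mean_absolute_error(y_true_r, y_pred_r))   # MAE
print(r2_score(y_true_r, y_pred_r))              # R²
```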
Key Concepts
Gradient Descent: The core optimization algorithm. Finds the minimum
of the loss function by iteratively moving in the opposite direction of
the gradient.
Backpropagation: The algorithm used to efficiently calculate the
gradients for all weights in the network, starting from the output layer
and moving backward.
Overfitting: When the model learns the training data too well
(including its noise) and fails to generalize to new data.
Underfitting: When the model is too simple and fails to capture the
underlying patterns in the training data.