Neural Network Fundamentals and Regularization Techniques

Posted by Anonymous and classified in Mathematics


Chapter 1 - 

Linearity: a function is linear in the values that vary (assuming any other values are held constant) as long as those values are not involved in anything more than addition and scalar multiplication.


Chapter 2 -

Shallow network

y = f[x, ϕ]  
y = ϕ₀ + ϕ₁ a[θ₁₀ + θ₁₁x] + ϕ₂ a[θ₂₀ + θ₂₁x] + ϕ₃ a[θ₃₀ + θ₃₁x]

y = ϕ₀ + ϕ₁h₁ + ϕ₂h₂ + ϕ₃h₃
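
The shallow network above can be sketched in a few lines of Python. This is a minimal illustration, assuming the activation a[•] is a ReLU (as in the terminology section) and using made-up parameter values; the names `relu` and `shallow_network` are hypothetical.

```python
def relu(z):
    """ReLU activation a[z] = max(0, z)."""
    return max(0.0, z)


def shallow_network(x, theta, phi):
    """Evaluate y = phi_0 + sum_d phi_d * a[theta_d0 + theta_d1 * x].

    theta: list of (theta_d0, theta_d1) pairs, one per hidden unit.
    phi:   list [phi_0, phi_1, ..., phi_D] (output offset, then weights).
    """
    pre_activations = [t0 + t1 * x for (t0, t1) in theta]
    hidden = [relu(z) for z in pre_activations]        # h_1 ... h_D
    return phi[0] + sum(p * h for p, h in zip(phi[1:], hidden))


# Illustrative (made-up) parameters for three hidden units.
theta = [(-1.0, 2.0), (0.5, -1.0), (0.0, 1.0)]
phi = [0.1, 1.0, -2.0, 0.5]

y = shallow_network(1.0, theta, phi)
print(y)
```

Each hidden unit contributes one "joint" where its ReLU switches on or off, which is why the output is piecewise linear in x.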


Universal Approximation theorem - with enough hidden units, a shallow neural network can describe any continuous function on a compact subset of ℝ^Dᵢ (where Dᵢ is the input dimension) to arbitrary precision

Terminology - The hidden units themselves are sometimes referred to as neurons. The values of the inputs to the hidden layer (i.e., before the ReLU functions are applied) are termed pre-activations. The values at the hidden layer (i.e., after the ReLU functions) are termed activations. For historical reasons, any neural network with at least one hidden layer is also called a multi-layer perceptron, or MLP for short. Networks with one hidden layer (as described in this chapter) are sometimes referred to as shallow neural networks. Networks with multiple hidden layers (as described in the next chapter) are referred to as deep neural networks. Neural networks in which the connections form an acyclic graph (i.e., a graph with no loops, as in all the examples in this chapter) are referred to as feed-forward networks. If every element in one layer connects to every element in the next (as in all the examples in this chapter), the network is fully connected.

Linear regions in a shallow network with a scalar input and D hidden units = at most D + 1

K layers = depth of network
D_k hidden units per layer = width of network

Capacity = number of hidden units

Chapter 13 Measuring performance -


Noise - the data generation process includes the addition of noise, so there is inherent uncertainty in the true mapping from input to output. Noise cannot be reduced.

Bias - a systematic deviation from the mean of the function we are modeling, caused by limitations in the model: it is not flexible enough to fit the true function perfectly. We can reduce this error by making the model more flexible, usually by increasing the model capacity.

Variance - the uncertainty in the fitted model due to the choice of training set; it results from limited, noisy training data. We can reduce the variance by increasing the quantity of training data.

E_D [ E_y [ L[x] ]] = E_D [ (f[x, ϕ[𝒟]] − f_μ[x])² ] + (f_μ[x] − μ[x])² + σ²

Components:
- E_D [ E_y [ L[x] ]]            → Expected loss, averaged over training datasets 𝒟 and over noise in the test output y
- E_D [ (f[x, ϕ[𝒟]] − f_μ[x])² ] → Variance (fitted model vs. the mean model f_μ, averaged over datasets)
- (f_μ[x] − μ[x])²               → Bias² (mean model vs. true function μ)
- σ²                             → Irreducible noise
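
The decomposition can be checked numerically. The sketch below is only an illustration under made-up assumptions: the true function μ[x] = x², Gaussian noise with σ = 0.1, three fixed training inputs, and a deliberately inflexible model that predicts a constant (the mean of its noisy training outputs). It estimates each term at a single test point by sampling many training sets.

```python
import random

random.seed(0)

SIGMA = 0.1                    # noise standard deviation (assumed)
TRAIN_XS = [0.0, 0.5, 1.0]     # fixed training inputs (assumed)
X_TEST = 1.0                   # point at which the loss is decomposed
N_DATASETS = 20000


def mu(x):
    """True function; chosen arbitrarily for this illustration."""
    return x * x


def fit_constant(xs):
    """Inflexible model: predict the mean of the noisy training outputs."""
    ys = [mu(x) + random.gauss(0.0, SIGMA) for x in xs]
    return sum(ys) / len(ys)


fits, losses = [], []
for _ in range(N_DATASETS):
    f = fit_constant(TRAIN_XS)                     # f[x, phi[D]]
    y = mu(X_TEST) + random.gauss(0.0, SIGMA)      # noisy test output
    fits.append(f)
    losses.append((f - y) ** 2)

f_mean = sum(fits) / N_DATASETS                    # f_mu[x]
variance = sum((f - f_mean) ** 2 for f in fits) / N_DATASETS
bias_sq = (f_mean - mu(X_TEST)) ** 2
noise = SIGMA ** 2
expected_loss = sum(losses) / N_DATASETS

print(f"variance={variance:.4f} bias^2={bias_sq:.4f} noise={noise:.4f}")
print(f"sum={variance + bias_sq + noise:.4f} expected loss={expected_loss:.4f}")
```

With this constant model the bias term dominates; a more flexible model would trade some of that bias for variance.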

Inductive Bias: the tendency of a model to choose one solution over another as it extrapolates between data points

Interpolation threshold - the capacity at which the model passes through every point in the training set; the model perfectly memorizes, or interpolates, the training data.

Double descent - the test error peaks near the interpolation threshold and then decreases again as capacity grows further. Instead of the classical picture, we observe strong test performance from very overfit, complex models.

Double descent reasoning - with enough capacity, the model does not just forcefully memorize the training points; it starts connecting those points in a smooth and sensible way. Since we don’t know what the data looks like between training points, assuming the pattern between them is smooth is a reasonable guess, and this smooth interpolation tends to work well when predicting new, unseen data too.

Curse of Dimensionality - as the number of input dimensions grows, the volume of the input space grows exponentially. So even if you have a decent amount of data, it gets very spread out in that large space. As a result, it becomes hard for the model to learn patterns, because it doesn’t see enough examples in any part of the space.

This problem, where adding more dimensions makes learning harder because the data becomes too sparse, is called the curse of dimensionality.
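
A quick arithmetic sketch of this sparsity: assume each input dimension is split into 10 equal bins (an arbitrary choice) and count how many grid cells a fixed dataset could possibly cover. The function names are hypothetical.

```python
def grid_cells(bins_per_dim, dims):
    """Number of cells when every input dimension is split into equal bins."""
    return bins_per_dim ** dims


def max_fraction_covered(n_examples, bins_per_dim, dims):
    """Upper bound on the fraction of cells holding at least one example."""
    return min(1.0, n_examples / grid_cells(bins_per_dim, dims))


# 10,000 examples, 10 bins per dimension, increasing dimensionality.
for dims in (1, 2, 5, 10):
    print(dims, grid_cells(10, dims), max_fraction_covered(10_000, 10, dims))
```

The same dataset that can fill every cell in two dimensions can touch at most one cell in a million in ten dimensions.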

Choosing hyperparameters
In the classical regime: we don’t know the bias (we would need to know the true function) or the variance (we would need multiple independently sampled datasets to estimate it)
In the modern regime: we don’t know how much capacity to add
How do we choose capacity in practice? The hyperparameters include:
a. Model structure (number of hidden layers, number of units per layer)
b. Training algorithm
c. Learning rate
Find the best hyperparameters – chosen empirically (experimentally)
a. Hyperparameter search
b. Neural architecture search (when focused on network structure)
Procedure – use a third data set: the validation set
a. Train models with different hyperparameters on the training set
b. Evaluate each hyperparameter setting (after training) on the validation set
c. Select the best-performing hyperparameters
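
The procedure above can be sketched as a simple search loop. This is a minimal illustration, not a full training pipeline: `select_hyperparameter`, `train_fn` and `val_loss_fn` are hypothetical names, the data are made up, and the toy model y = w·x has a single hyperparameter, a penalty strength `lam`.

```python
def select_hyperparameter(candidates, train_fn, val_loss_fn):
    """Train one model per candidate; keep the one with lowest validation loss."""
    results = []
    for h in candidates:
        model = train_fn(h)                      # fit on the training set only
        results.append((val_loss_fn(model), h, model))
    best_loss, best_h, best_model = min(results, key=lambda r: r[0])
    return best_h, best_model, best_loss


# Made-up training and validation sets of (x, y) pairs.
train = [(1.0, 1.3), (2.0, 1.7), (3.0, 3.4)]
val = [(1.5, 1.5), (2.5, 2.4)]


def train_fn(lam):
    # Closed-form minimizer of sum (w*x - y)^2 + lam * w^2 over the training set.
    return sum(x * y for x, y in train) / (sum(x * x for x, _ in train) + lam)


def val_loss_fn(w):
    # Mean squared error on the validation set.
    return sum((w * x - y) ** 2 for x, y in val) / len(val)


candidates = [0.0, 0.1, 1.0, 10.0]
best_lam, best_w, best_loss = select_hyperparameter(candidates, train_fn, val_loss_fn)
print(best_lam, best_w, best_loss)
```

The validation set is used only for selection; a separate test set would still be needed to report final performance.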

Chapter 14 Regularization -

1] Explicit Regularization - reduces overfitting by adding a penalty term to the loss function. The L2 form of this penalty is also known as weight decay in neural networks.

Regularization is equivalent to adding a prior over parameters

ϕ̂ = argmin_ϕ [ ∑_{i=1}^{I} ℓᵢ[xᵢ, yᵢ] + λ · g[ϕ] ]

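- λ is the regularization coefficient (it plays the role of a Lagrange multiplier in the equivalent constrained problem); λ > 0 controls the strength of the regularization
- g[ϕ] is the penalty function; it returns a larger value for less preferred parameters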

L2 / ridge regularization - discourages overfitting and encourages smoothness

L1 encourages weights to be 0 (i.e., sparse)
L2 encourages weights to be small
Empirically:
L1 is useful for feature selection / model inspection
L2 makes slightly more accurate predictions

L1:  L̃[ϕ] = L[ϕ] + λ ∑ᵢ |ϕᵢ|

L2:  L̃[ϕ] = L[ϕ] + λ ∑ᵢ ϕᵢ²
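
The two penalties translate directly into code. The sketch below uses made-up values for the parameters and the unregularized loss; the function names are hypothetical.

```python
def l1_penalty(phi):
    """g[phi] = sum_i |phi_i|."""
    return sum(abs(p) for p in phi)


def l2_penalty(phi):
    """g[phi] = sum_i phi_i^2."""
    return sum(p * p for p in phi)


def regularized_loss(loss, phi, lam, penalty):
    """L~[phi] = L[phi] + lam * g[phi]."""
    return loss + lam * penalty(phi)


phi = [0.5, -2.0, 0.0, 1.5]    # made-up parameter vector
base_loss = 1.0                # made-up unregularized loss value

l1_total = regularized_loss(base_loss, phi, 0.1, l1_penalty)
l2_total = regularized_loss(base_loss, phi, 0.1, l2_penalty)
print(l1_total, l2_total)
```

The gradient of the L2 term is 2λϕᵢ, so large weights are pushed down hardest, while the gradient of the L1 term has constant magnitude λ, which keeps pushing small weights all the way to zero.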

2] Implicit regularization - Gradient descent and stochastic gradient descent prefer some solutions over others. This preference is called implicit regularization.

Gradient descent disfavors areas where gradients are steep
SGD likes all batches to have similar gradients

The strength of this effect depends on the learning rate, which is perhaps why larger learning rates generalize better

3] Early stopping refers to stopping the training procedure before it has fully converged.
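
One common way to decide when to stop, sketched here as an assumption rather than something these notes prescribe, is to monitor a validation loss and stop once it has not improved for a fixed number of epochs. The `patience` parameter and the loss curve below are made up.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose parameters would be kept.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break                  # stop before full convergence
    return best_epoch


# Made-up validation curve: improves, then starts to overfit.
curve = [1.00, 0.80, 0.65, 0.60, 0.62, 0.66, 0.71, 0.55]
stop = early_stopping_epoch(curve, patience=3)
print(stop)
```

Note that with a small patience the loop never sees the late improvement at the end of this curve, which is the usual trade-off when choosing the patience value.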

4] Ensembling - Use multiple models to predict data, take average or median of all results of the models for the final output

Ways to train the ensemble:
- Different initializations
- Different models
- Different subsets of the data, resampled with replacement (bootstrap aggregating, or bagging)
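
Bagging and the final averaging step can be sketched as follows. The "model" here is a stand-in that just predicts the mean target of its bootstrap sample, and the data are made up; in practice each ensemble member would be a trained network.

```python
import random
import statistics


def bootstrap_sample(data, rng):
    """Resample the dataset with replacement, keeping its size (bagging)."""
    return [rng.choice(data) for _ in data]


def fit_mean_model(sample):
    """Stand-in model: predicts the mean target of its training sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


def ensemble_predict(models, x, combine=statistics.mean):
    """Combine the individual predictions with the mean (default) or median."""
    return combine([m(x) for m in models])


rng = random.Random(0)
data = [(0.0, 1.0), (1.0, 2.0), (2.0, 2.5), (3.0, 10.0)]   # made-up (x, y) pairs
models = [fit_mean_model(bootstrap_sample(data, rng)) for _ in range(25)]

avg_prediction = ensemble_predict(models, 1.5)
median_prediction = ensemble_predict(models, 1.5, combine=statistics.median)
print(avg_prediction, median_prediction)
```

The median is less sensitive than the mean to a few ensemble members whose bootstrap samples happened to over-represent an outlier.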

5] Dropout - clamps a random subset (typically 50%) of hidden units to zero at each iteration of SGD. This makes the network less dependent on any given hidden unit and encourages the weights to have smaller magnitudes so that the change in the function due to the presence or absence of any specific hidden unit is reduced.

Inference after dropout - there are two methods:

Weight Scaling inference - The network now has more hidden units than it was trained with at any given iteration, so we multiply the weights by one minus the dropout probability to compensate.

Monte Carlo dropout - we run the network multiple times with different random subsets of units clamped to zero (as in training) and combine the results.
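
The two inference methods can be compared on a tiny example. This sketch assumes dropout is applied to the last hidden layer only and that the output is a plain weighted sum of those hidden units; the activations, weights and function names are made up.

```python
import random


def dropout_output(hidden, phi, p_drop, rng):
    """Training-style pass: clamp each hidden unit to zero with probability p_drop."""
    kept = [0.0 if rng.random() < p_drop else h for h in hidden]
    return sum(w * h for w, h in zip(phi, kept))


def weight_scaling_output(hidden, phi, p_drop):
    """Inference with all units active and weights scaled by (1 - p_drop)."""
    return sum((1.0 - p_drop) * w * h for w, h in zip(phi, hidden))


def mc_dropout_output(hidden, phi, p_drop, n_runs, rng):
    """Monte Carlo dropout: average many stochastic passes."""
    runs = [dropout_output(hidden, phi, p_drop, rng) for _ in range(n_runs)]
    return sum(runs) / n_runs


hidden = [1.0, 2.0, 3.0]    # made-up activations of the last hidden layer
phi = [0.5, -1.0, 2.0]      # made-up output weights
rng = random.Random(0)

scaled = weight_scaling_output(hidden, phi, 0.5)
mc = mc_dropout_output(hidden, phi, 0.5, 20000, rng)
print(scaled, mc)
```

The two estimates agree closely here because the output is linear in the dropped units; when nonlinear layers follow the dropout, weight scaling is only an approximation to the Monte Carlo average.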

6] Adding noise - add noise to the input data, which smooths out the learned function, or add noise to the weights, which encourages the network to make sensible predictions even for small perturbations of the weights. The result is that the training converges to local minima in the middle of wide, flat regions, where changing the individual weights does not matter much.

7] Bayesian approach - Treats the parameters as unknown variables and computes a distribution Pr(ϕ | {xᵢ, yᵢ}) over these parameters ϕ conditioned on the training data {xᵢ, yᵢ} using Bayes’ rule:

Pr(ϕ | {xᵢ, yᵢ}) = [ ∏ᵢ Pr(yᵢ | xᵢ, ϕ) · Pr(ϕ) ] / [ ∫ ∏ᵢ Pr(yᵢ | xᵢ, ϕ) · Pr(ϕ) dϕ ]

8] Transfer learning - the network is pre-trained to perform a related secondary task for which data are more plentiful. The resulting model is then adapted to the original task. This is typically done by removing the last layer and adding one or more layers that produce a suitable output.

Multi-task learning is a related technique in which the network is trained to solve several problems concurrently.

9] Data Augmentation - expand the dataset. We can often transform each input data example in such a way that the label stays the same. 

e.g., rotate, flip, blur, or manipulate the color balance of the image.
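
Two of these label-preserving transformations can be sketched on a tiny image stored as a list of rows. The image, label and function names below are made up for illustration.

```python
def flip_horizontal(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(example):
    """Return the original example plus transformed copies with the same label."""
    image, label = example
    return [
        (image, label),
        (flip_horizontal(image), label),
        (rotate_90(image), label),
    ]


image = [[1, 2, 3],
         [4, 5, 6]]          # made-up 2x3 grayscale image
augmented = augment((image, "cat"))
for img, label in augmented:
    print(label, img)
```

Whether a transformation really preserves the label depends on the task; flipping a digit image, for example, may not.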

CNN -

Mathematically, convolution is an operation that combines two functions to produce a third function, expressing how the shape of one function is modified by the other. In a CNN, the convolution operation is used to apply filters (or kernels) to input data, such as images. The filters slide over the input and compute the convolution at each position, effectively extracting features from the input.

Invariance - A function is invariant to a transformation if the output doesn’t change when the input is transformed: f[t[x]] = f[x]. E.g., image classification: rotate or shift a picture of a cat, and it's still a cat.
Equivariance - A function is equivariant to a transformation if the output changes in the same way as the input: f[t[x]] = t[f[x]]. E.g., classification of each pixel, such as image segmentation, or regression of each pixel, such as a depth map. If you shift the image to the right, the output also shifts right.
Zero Padding - Treat positions that are beyond the end of the input as zero.


Valid convolution - compute outputs where the kernel fits within the input range. Now, the output will be smaller than the input.
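
Zero padding, valid convolution and shift equivariance can all be seen in a small 1D sketch. It assumes an odd kernel length and, as is usual in CNN layers, slides the kernel without flipping it; the signal and kernel values are made up.

```python
def conv1d(x, kernel, padding="zero"):
    """Slide a kernel over a 1D input (no kernel flip).

    padding="zero":  positions beyond the ends are treated as zero, so the
                     output has the same length as the input.
    padding="valid": outputs are computed only where the kernel fits, so the
                     output is shorter by len(kernel) - 1.
    Assumes an odd kernel length so the kernel has a well-defined centre.
    """
    half = len(kernel) // 2
    if padding == "zero":
        x = [0.0] * half + list(x) + [0.0] * half
    elif padding != "valid":
        raise ValueError("padding must be 'zero' or 'valid'")
    n_out = len(x) - len(kernel) + 1
    return [sum(k * x[i + j] for j, k in enumerate(kernel)) for i in range(n_out)]


signal = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0]     # made-up input
kernel = [0.5, 1.0, 0.5]                         # made-up filter

same = conv1d(signal, kernel, padding="zero")    # same length as the input
valid = conv1d(signal, kernel, padding="valid")  # shorter than the input

# Equivariance check: shifting the input right by one shifts the output too.
shifted = [0.0] + signal[:-1]
shifted_out = conv1d(shifted, kernel, padding="zero")
print(same, valid, shifted_out)
```

The equivariance check holds exactly here because the signal is zero near both ends; content shifted past the border would be lost.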
