Midterm Exam Review


Last modified: 2019-03-09 14:19

Jump to: Exam Format   Practice Exam

Jump to: Regression   Classification   Optimization   Probability   Neural Nets

Exam Format

You'll have the full class period (75 min) to complete the exam. The target time required will be about 60 min.

It will be a pencil & paper exam. We'll provide the instructions and answer sheets.

You should BRING A PENCIL (better than a pen; you might need to erase).

You can BRING 1 sheet of notes on standard 8.5" x 11" paper. Typed or hand-written. Writing can be on both sides.

Keep in mind we will also provide a list of standard formulas on the instruction sheet, as in the practice exam PDF below.

Possible question types:

  • True/false questions
  • Multiple choice questions
  • Short answer concept questions
  • Questions that ask you to read / understand / debug Python code
  • Questions that ask you to read / understand / debug some math
  • Questions that ask you to produce Python code (at most 5-6 lines)
  • Questions that ask you to produce some math (at most 1-2 lines)
    • You'll have access to all needed formulas as part of the instructions
    • Formulas we'll provide: log loss, sigmoid function, needed basic derivatives, sum rule, product rule, etc. (see example in practice exam)

Practice Exam

We've prepared an example practice exam so you can see the kinds of questions that might appear on the real exam.

You can access the practice exam PDF here (requires Piazza credentials): midterm_practice_exam.pdf

Solution PDF: midterm_practice_solutions.pdf

What will be on the exam?

The exam covers everything from our in-class activities and out-of-class readings, starting from our first class and continuing up through and including the class on 2/27 ('Naive Bayes').

Below, we've listed the major concepts/skills that could appear on the exam.

Unit: Regression

Core concepts for regression

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a regression problem

    • Purpose of train/valid/test division of a dataset
    • Underfitting and overfitting
    • Selecting hyperparameters on a fixed validation set
    • Selecting hyperparameters via K-fold cross-validation (see the sketch after this list)
  • Understanding evaluation metrics

    • Mean squared error
    • Mean absolute error
  • 3 possible regression methods

    • K-Nearest Neighbors regression
    • Decision Tree regression
    • Linear regression
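
To make the hyperparameter-selection ideas above concrete, here is one possible sketch (assuming numpy and scikit-learn; the dataset and the candidate n_neighbors values are made up) of choosing k for K-Nearest Neighbors regression with 5-fold cross-validation:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical 1-d regression dataset, just for illustration.
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(120, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(120)

    kfold = KFold(n_splits=5, shuffle=True, random_state=0)

    for n_neighbors in [1, 3, 5, 10, 20]:
        fold_errors = []
        for train_ids, valid_ids in kfold.split(X):
            model = KNeighborsRegressor(n_neighbors=n_neighbors)
            model.fit(X[train_ids], y[train_ids])
            yhat = model.predict(X[valid_ids])
            fold_errors.append(np.mean((y[valid_ids] - yhat) ** 2))  # MSE on held-out fold
        print(n_neighbors, np.mean(fold_errors))

    # Pick the n_neighbors value with the lowest average held-out error,
    # then retrain on the full training set with that value.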

Core skills for regression

  • When should we do cross-validation? When should we use a fixed validation set?
  • When evaluating a regressor, compare and contrast squared error vs absolute error (see the sketch after this list).
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of regression functions can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?
  • What is the computational complexity of training each method?
  • What is the computational complexity of making predictions with each method?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?
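
As a concrete illustration of the squared-error vs absolute-error comparison above, here is a minimal numpy sketch (the y_true and y_pred values are made up) showing how a single large miss affects the two metrics differently:

    import numpy as np

    y_true = np.array([2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([2.5, 3.0, 4.5, 15.0])   # the last prediction is a big miss

    errors = y_true - y_pred
    mse = np.mean(errors ** 2)       # (0.25 + 0 + 0.25 + 100) / 4 = 25.125
    mae = np.mean(np.abs(errors))    # (0.5 + 0 + 0.5 + 10) / 4 = 2.75

    print(mse, mae)   # the squared error is dominated by the single outlier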

Unit: Linear Regression

Core concepts for linear regression

  • Training algorithm 1: Exact formulas to estimate weights to minimize squared error
  • Training algorithm 2: Gradient descent
  • Linear regression applied to transformed features (e.g. polynomial)
  • Regularization via L1 and L2 penalties on the weights (lasso vs ridge regression; see the sketch after this list)
    • 'L1' : sum of absolute values
    • 'L2' : sum of squared values
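
A minimal numpy sketch of the two penalty computations (the weight vector is made up):

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.5])

    l1_penalty = np.sum(np.abs(w))   # sum of absolute values: 0.5 + 2.0 + 0.0 + 1.5 = 4.0
    l2_penalty = np.sum(w ** 2)      # sum of squared values:  0.25 + 4.0 + 0.0 + 2.25 = 6.5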

Core skills for linear regression

  • Compute the sum-of-squares or sum-of-abs-values given a vector.
  • Explain why a bias is needed.
  • When would we prefer gradient descent to the exact 'least squares' formulas?
  • When do zero training error solutions exist for linear regression?
  • Write down the optimization objective for training linear regression to minimize mean-squared error (see the sketch after this list).
  • Why do we add a regularization penalty?
  • When would we prefer an L1 penalty vs an L2 penalty?
  • Which training methods can be applied when using an L2 penalty?
  • Which training methods can be applied when using an L1 penalty?
  • What is the curse of dimensionality (in words)?
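
Here is one possible numpy sketch (with a made-up dataset; the step size and iteration count are arbitrary choices for illustration) of the mean-squared-error objective for linear regression, the exact 'least squares' solution, and gradient descent on the same objective:

    import numpy as np

    rng = np.random.RandomState(0)
    N, F = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.randn(N, F)])   # prepend a constant column for the bias
    true_w = np.array([1.0, 2.0, -3.0, 0.5])
    y = X @ true_w + 0.1 * rng.randn(N)

    def mse_objective(w):
        return np.mean((y - X @ w) ** 2)

    # Exact solution: solve the normal equations (X^T X) w = X^T y.
    w_exact = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent on the same objective with a constant step size.
    w = np.zeros(F + 1)
    step_size = 0.1
    for iteration in range(500):
        grad = -2.0 / N * X.T @ (y - X @ w)   # gradient of the mean squared error
        w = w - step_size * grad

    print(mse_objective(w_exact), mse_objective(w))   # the two should be very close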

Unit: Binary Classification

Core concepts for binary clf

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a classification problem

    • Purpose of train/valid/test sets
    • Understanding underfitting and overfitting
  • Choosing an evaluation metric for predicted binary labels

    • True positive, false positive, true negative, false negatives
    • Confusion matrices
  • Choosing an evaluation metric for predicted probabilities

    • Log loss
    • Receiver Operating Characteristic (ROC) curves
    • Precision-Recall curves
  • 3 possible classification methods

    • K-Nearest Neighbors classification
    • Decision Tree classification
    • Logistic regression

Core skills for binary clf

  • How sensitive is each method (dtree, knn, logistic regression) to input feature preprocessing?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?
  • At training time, what is the computational complexity of each method?
  • At prediction time, what is the computational complexity of each method?
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of decision boundaries can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?

  • How is prediction for a binary classifier different from prediction for a regressor?

  • How to compute TP, FP, TN, and FN given a classifier's predicted labels?
  • How to plot a ROC curve given a classifier's predicted probabilities?
  • How to compute log loss given a classifier's predicted probabilities? (See the sketch after this list.)
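
As one concrete illustration of the three skills just above, here is a minimal numpy sketch with made-up labels and predicted probabilities:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0])                # true binary labels
    proba = np.array([0.9, 0.4, 0.2, 0.8, 0.1, 0.6])     # predicted P(y = 1 | x)

    # Predicted labels at the default 0.5 threshold.
    y_pred = (proba >= 0.5).astype(int)

    TP = np.sum((y_pred == 1) & (y_true == 1))   # 2
    FP = np.sum((y_pred == 1) & (y_true == 0))   # 1
    TN = np.sum((y_pred == 0) & (y_true == 0))   # 2
    FN = np.sum((y_pred == 0) & (y_true == 1))   # 1

    # This threshold gives one (FPR, TPR) point on the ROC curve;
    # sweeping the threshold over all probability values traces the full curve.
    TPR = TP / (TP + FN)   # 2/3
    FPR = FP / (FP + TN)   # 1/3

    # Binary log loss (cross entropy) of the predicted probabilities.
    log_loss = -np.mean(y_true * np.log(proba) + (1 - y_true) * np.log(1 - proba))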

Unit: Logistic Regression

Core concepts for logistic regression

  • Sigmoid function
  • Why find weights that minimize log loss? 3 justifications:
    • Log loss is an upper bound of the error rate
    • Interpret as minimizing cross entropy, a measure of coding error between probabilities and binary labels
    • Interpret as maximizing log 'likelihood', so the labels are more likely given the data
  • Training algorithm 1: Gradient descent
  • Regularization via L2 or L1 penalty terms on the weights

Core skills for logistic regression

  • Provide good estimates of sigmoid output values at inputs like -2, -1, 0, 1, and 2 (see the sketch after this list).
  • Plot (by hand) the sigmoid function.
  • Explain what can go wrong with numerical implementations of the sigmoid function.
  • Explain why a bias feature is needed
  • Given a weight vector (including bias) and a 1-d or 2-d feature space, plot the decision boundary and predicted probabilities.
  • Explain why the decision boundary is linear
  • Explain why the log loss is an upper bound of the error rate.
  • Explain how to avoid underfitting or overfitting with logistic regression.
  • Explain how to interpret the trained parameters of logistic regression.
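
A minimal numpy sketch of the sigmoid values worth remembering and one common way (not the only one) to keep the implementation numerically safe:

    import numpy as np

    def sigmoid(z):
        # Numerically safer sigmoid: never exponentiates a large positive number.
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        exp_z = np.exp(z[~pos])
        out[~pos] = exp_z / (1.0 + exp_z)
        return out

    # Approximate values worth remembering:
    #   sigmoid(-2) ~= 0.12, sigmoid(-1) ~= 0.27, sigmoid(0) = 0.5,
    #   sigmoid(1)  ~= 0.73, sigmoid(2)  ~= 0.88
    print(sigmoid(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))

    # The naive formula 1 / (1 + np.exp(-z)) overflows for very negative z
    # (e.g. z = -1000 makes np.exp(1000) overflow), which is one thing that
    # can go wrong in a numerical implementation.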

Unit: Optimization for Machine Learning

Core concepts for optimization

  • Gradient descent
    • Step sizes
    • Strategies to improve on constant step sizes:
      • Step size decay
      • Line search
      • Second order methods
    • Assessing convergence
      • Trace plots of loss
      • Trace plots of gradient
      • Trace plots of parameters
  • Stochastic gradient descent

Core skills for optimization

  • Describe (in words) how gradient descent works
  • Describe (in words) how stochastic GD works
  • What are signs that a run of gradient descent has diverged?
  • What are signs that a run of gradient descent has converged? (See the trace sketch after this list.)
  • Given a gradient descent task, how do you select an appropriate step size?
  • What evidence should you collect that your step size is good?
  • When will multiple random initializations of gradient descent always converge to the same solution?
  • Why would including second-derivative information help improve optimization?
  • Why don't we always include second-derivative information in a practical application?
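
One possible numpy sketch (minimizing a simple made-up 1-d quadratic loss; the step sizes are arbitrary illustrations) of running gradient descent while recording the traces used to assess convergence or divergence:

    import numpy as np

    def loss(w):
        return 0.5 * (w - 3.0) ** 2   # simple 1-d loss with its minimum at w = 3

    def grad(w):
        return w - 3.0

    def run_gradient_descent(step_size, n_iters=50, w0=0.0):
        w = w0
        loss_trace, grad_trace, w_trace = [], [], []
        for _ in range(n_iters):
            g = grad(w)
            w = w - step_size * g
            loss_trace.append(loss(w))
            grad_trace.append(abs(g))
            w_trace.append(w)
        return np.array(loss_trace), np.array(grad_trace), np.array(w_trace)

    # A good step size: the loss trace decreases and the gradient trace goes toward 0.
    losses, grads, ws = run_gradient_descent(step_size=0.5)
    print(losses[-1], grads[-1], ws[-1])   # ~0, ~0, ~3

    # A step size that is too large: the loss trace grows without bound (divergence).
    losses_bad, _, _ = run_gradient_descent(step_size=2.5)
    print(losses_bad[-1])                  # a huge number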

Unit: Probabilities and Statistical Thinking for Machine Learning

Core concepts for probabilities and statistical thinking

  • Random variable
    • Discrete random variables
    • Uniform random variables
    • Gaussian random variables (univariate, NOT multivariate)
    • Probability mass function
    • Probability density function
    • Expected values
    • Monte Carlo approximations of expected values
  • Distributions on two or more variables

    • Joint probability
    • Conditional probability
    • Marginal probability
  • Rules of Probability

    • Sum Rule
    • Product Rule
  • Statistical inference

    • Bayes Rule / Bayes Theorem (see the worked example after this list)
    • Bayes Rule for classification
    • Naive Bayes: independence assumptions
  • Statistical comparisons of models

    • Bias (approximation error)
    • Variance (estimation error)
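
To make the sum rule, product rule, and Bayes rule concrete, here is a small worked example on a made-up joint distribution over two binary variables (all numbers are invented for illustration):

    import numpy as np

    # Made-up joint distribution P(Y, X) for a binary label Y and binary feature X.
    # Rows index Y in {0, 1}, columns index X in {0, 1}; the entries sum to 1.
    joint = np.array([[0.30, 0.10],
                      [0.20, 0.40]])

    # Sum rule: marginals come from summing the joint over the other variable.
    p_y = joint.sum(axis=1)   # P(Y) = [0.40, 0.60]
    p_x = joint.sum(axis=0)   # P(X) = [0.50, 0.50]

    # Product rule: P(Y, X) = P(X | Y) P(Y).
    p_x_given_y = joint / p_y[:, None]   # each row sums to 1

    # Bayes rule for classification: P(Y | X = 1) = P(X = 1 | Y) P(Y) / P(X = 1).
    p_y_given_x1 = p_x_given_y[:, 1] * p_y / p_x[1]
    print(p_y_given_x1)   # [0.2, 0.8], a valid distribution over Y that sums to 1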

Core skills for probabilities

  • How do we compute the expected value of a weighted dice roll?
  • What is the variance of a random variable (in words)?
  • What are the requirements for a valid probability mass function?
  • What are the requirements for a valid probability density function?
  • How do we compute a Monte Carlo approximation to an expectation? That is, how can drawing samples help us compute expectations? (See the sketch after this list.)
  • What does it mean for a model to have bias (as in, from the bias-variance tradeoff lectures)?
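
A minimal sketch of the weighted die example: compute the exact expected value, then approximate it with Monte Carlo sampling (the face probabilities are made up):

    import numpy as np

    faces = np.array([1, 2, 3, 4, 5, 6])
    probs = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])   # a die weighted toward 6

    # Exact expected value: sum over outcomes of (value * probability).
    exact = np.sum(faces * probs)   # 0.1 * (1 + 2 + 3 + 4 + 5) + 0.5 * 6 = 4.5

    # Monte Carlo approximation: draw many samples, then average them.
    rng = np.random.RandomState(0)
    samples = rng.choice(faces, size=10000, p=probs)
    approx = samples.mean()   # close to 4.5, and it gets closer as the sample size grows

    print(exact, approx)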

Unit: Neural Networks for Supervised Learning

Core concepts for neural nets

  • One-hot encoding of multi-class labels

  • Softmax function

  • Components of a 'neuron'

    • Linear function of input vector parameterized by weights and bias
    • Activation function
  • Design of a neural network predictor (aka multi-layer perceptron; see the forward-pass sketch after this list)

    • Selecting an output layer
      • Regression
      • Binary Classification
      • Multi-class classification
    • Adding hidden units to a layer
    • Adding hidden layers
    • Possible activation functions
      • Zero-to-one step function
      • Logistic sigmoid
      • Rectified linear unit (ReLU)
  • Training Algorithms

    • Back-propagation
  • Regularization to avoid overfitting

    • Penalty terms on weights and biases
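
A minimal numpy sketch (all weights, biases, and layer sizes are made up) of computing the output of a small multi-layer perceptron for regression, one layer at a time:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)   # rectified linear unit activation

    # Made-up parameters: 3 input features -> 4 hidden units -> 1 regression output.
    rng = np.random.RandomState(0)
    W1, b1 = rng.randn(4, 3), np.zeros(4)
    W2, b2 = rng.randn(1, 4), np.zeros(1)

    x = np.array([0.5, -1.0, 2.0])   # one input feature vector

    # Each hidden unit computes a linear function of the input, then an activation.
    hidden = relu(W1 @ x + b1)

    # Regression output layer: a plain linear function (no squashing).
    y_hat = W2 @ hidden + b2
    print(y_hat)   # shape (1,): a single real-valued prediction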

Core skills for neural nets

  • Why do we need one-hot encoding representations?
  • What is the shape of the output of a single 'neuron'?
  • What is the shape of the output of an MLP used for regression?
  • What is the shape and type of the output of an MLP used for multi-class classification?
  • Describe how to compute the output of a multi-layer neural network given an input vector and all the required weights and biases.
  • Why do we need non-linear activation functions for the hidden layers?

  • How do we use the softmax function to build a multi-class classifier?

  • What is the relationship between sigmoid and softmax? (See the sketch below.)

  • At a high-level, how does back-propagation work?
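
A minimal numpy sketch of the softmax function, its relationship to the sigmoid, and one-hot encoding (the scores and the label are made up):

    import numpy as np

    def softmax(scores):
        # Shift by the max score for numerical stability; the result is unchanged.
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / exp_scores.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Multi-class case: softmax turns C real-valued scores into C probabilities that sum to 1.
    scores = np.array([2.0, 0.5, -1.0])
    print(softmax(scores))   # the largest score gets the largest probability

    # Relationship to sigmoid: for two classes with scores [z, 0],
    # the first softmax output equals sigmoid(z).
    z = 1.3
    print(softmax(np.array([z, 0.0]))[0], sigmoid(z))   # the two numbers match

    # One-hot encoding of a multi-class label: class c becomes a vector with a 1 at index c.
    y = 2
    one_hot = np.zeros(3)
    one_hot[y] = 1.0   # [0., 0., 1.]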