Last modified: 2019-04-29 13:32

Jump to: Exam Format Practice Problems

Jump to: Regression Classification Optimization Probability Neural Nets

Exam Date And Time

The exam is on Friday May 3 2019, from 330pm - 530pm.

This is our course's assigned standard exam time for Block I (capital "eye").

Location: In our normal classroom, Halligan 111A.

Exam Format

You'll have the full exam period (2 hours) to complete. Target required time will be about 1:30.

You should BRING A PENCIL (better than a pen, you might need to erase).

You can BRING 1 sheet of notes on standard 8.5" x 11" paper. Typed or hand-written. Writing can be on both sides.

It will be a pencil & paper exam. We'll provide the instructions and answer sheets.

Keep in mind we will also provide a list of standard formulas on the instruction sheet, as in the practice exam PDF below.

Possible question types:

True/false questions
Multiple choice questions
Short answer concept questions
Questions that ask you to read / understand / debug Python code
Questions that ask you to read / understand / debug some math
Questions that ask you to produce Python code (at most 5-6 lines)
Questions that ask you to produce some math (at most 1-2 lines)
- You'll have access to all needed formulas as part of the instructions
- Formulas we'll provide: log loss, sigmoid function, needed basic derivatives, sum rule, product rule, etc. (see example in practice exam)

Practice Exam

We intentionally designed HW5 and HW6 to mimic possible exam questions and formats, so look at those assignments and their solutions to get a sense of material.

You can look at the midterm practice exam to see the kinds of questions we might ask:

You can access the practice exam PDF here (requires Piazza credentials): midterm_practice_exam.pdf

Solution PDF: midterm_practice_solutions.pdf

What will be on the exam?

The exam covers everything from our in-class activities and out-of-class readings, starting from our first class back in January and continuing up thru and including class on 4/24 (last material on clustering). There will NOT be coverage of Reinforcement Learning on the exam.

Below, we've listed the major concepts/skills that could appear on the exam. This is not meant to be completely exhaustive, but should cover the vast majority of exam questions. (Another good place to look for study material are the blue "Unit Objective summaries" on lecture slides).

Unit: Regression

Core concepts for regression

3 stages of ML: Training, Prediction, Evaluation
Setting up a regression problem
- Purpose of train/valid/test division of a dataset
- Underfitting and overfitting
- Selecting hyperparameters on a fixed validation set
- Selecting hyperparameters via K-fold Cross validation
Understanding evaluation metrics
- - Mean squared error
- - Mean absolute error
3 possible regression methods
- K-Nearest Neighbors regression
- Decision Tree regression
- Linear regression

Core skills for regression

When to do cross validation? When to use fixed validation set?
When evaluating a regressor, compare and contrast squared error vs absolute error.
When training, compare and contrast using squared error vs absolute error.
What kinds of regression functions can each method (k-NN, tree, linear models) learn?
What parameters can I tune for each method to avoid under/over-fitting?
What is computational complexity of training each method?
What is computational complexity of making predictions with each method?
How sensitive is each method to noisy or irrelevant features?
How sensitive is each method to the numerical scaling of different features?
How sensitive is each method to having enough training data?

Unit: Linear Regression

Core concepts for linear regression

Training algorithm 1: Exact formulas to estimate weights to minimize squared error
Training algorithm 2: Gradient descent
Linear regression applied to transformed features (e.g. polynomial)
Regularization via L1 and L2 penalties on weights (lasso vs ridge regression)
- 'L1' : sum of absolute values
- 'L2' : sum of squared values

Core skills for linear regression

Compute the sum-of-squares or sum-of-abs-values given a vector.
Explain why a bias is needed.
When would we prefer gradient descent to the exact 'least squares' formulas?
When do zero training error solutions exist for linear regression?
Write down the optimization objective for training linear regression to minimize mean-squared error.
Why do we add a regularization penalty?
When would we prefer an L1 penalty vs an L2 penalty?
Which training methods can be applied when using an L2 penalty?
Which training methods can be applied when using an L1 penalty?
What is the curse of dimensionality (in words)?

Unit: Binary Classification

Core concepts for binary clf

3 stages of ML: Training, Prediction, Evaluation
Setting up a regression problem
- Purpose of train/valid/test sets
- Understanding underfitting and overfitting
Choosing an evaluation metric for predicted binary labels
- True positive, false positive, true negative, false negatives
- Confusion matrices
Choosing an evaluation metric for predicted probabilities
- Log loss
- Receiver Operating Characteristic (ROC) curves
- Precision-Recall curves
3 possible regression methods
- K-Nearest Neighbors regression
- Decision Tree regression
- Logistic regression

Core skills for binary clf

How sensitive is each method (dtree, knn, logistic regression) to input feature preprocessing?
How sensitive is each method to noisy or irrelevant features?
How sensitive is each method to the numerical scaling of different features?
How sensitive is each method to having enough training data?
At training time, what is computational complexity of each method?
At prediction time, what is computational complexity of each method?
When training, compare and contrast using squared error vs absolute error.
What kinds of decision boundaries can each method (k-NN, tree, linear models) learn?
What parameters can I tune for each method to avoid under/over-fitting?
What is computational complexity of training each method?
What is computational complexity of making predictions with each method?
How is prediction for binary classifier different than a regressor?
How to compute TP, FP, TN, and FN given classifier's predicted labels?
How to plot a ROC curve given a classifier's predicted probabilities?
How to compute log loss given classifier's predicted probabilities?

Unit: Logistic Regression

Core concepts for logistic regression

Sigmoid function
Why find weights that minimize log loss? 3 justifications:
- Log loss is an upper bound of the error rate
- Interpret as minimizing cross entropy, a measure of coding error between probabilities and binary labels
- Interpret as maximizing log 'likelihood', so the labels are more likely given the data
Training algorithm 1: Gradient descent
Regularization via L2 or L1 penalty terms on the weights

Core skills for logistic regression

Provide good estimates of sigmoid output values at inputs like -2, -1, 0, 1, and 2.
Plot (by hand) the sigmoid function.
Explain what can go wrong with numerical implementations of the sigmoid function.
Explain why a bias feature is needed
Given a weight vector (including bias) and a 1-d or 2-d feature space, plot the decision boundary and predicted probabilities.
Explain why the decision boundary is linear
Explain why the log loss is an upper bound of the error rate.
Explain how to avoid underfitting or overfitting with logistic regression.
Explain how to interpret the trained parameters of logistic regression.

Unit: Optimization for Machine Learning

Core concepts for optimization

Gradient descent
- Step sizes
- Strategies to improve on constant step sizes:
- - Step size decay
- - Line search
- - Second order methods
- Assessing convergence
- - Trace plots of loss
- - Trace plots of gradient
- - Trace plots of parameters
Stochastic gradient descent

Core skills for optimization

Describe (in words) how gradient descent works
Describe (in words) how stochastic GD works
What are signs that a run of gradient descent has diverged?
What are signs that a run of gradient descent has converged?
Given a gradient descent task, how do you select an appropriate step size?
What evidence should you collect that your step size is good?
When will multiple random initializations of gradient descent always converge to the same solutions?
Why would including second-derivative information help improve optimization?
Why don't we always include second-derivative information in a practical application?

Unit: Probabilities and Statistical Thinking for Machine Learning

Core concepts for probabilities and statistical thinking

Random variable
- - Discrete random variables
- - Uniform random variables
- - Gaussian random variables (univariate, NOT multivariate)
- Probability mass function
- Probability density function
- Expected values
- Monte Carlo approximations of expected values
Distributions on two or more variables
- Joint probability
- Conditional probability
- Marginal probability
Rules of Probability
- Sum Rule
- Product Rule
Statistical inference
- Bayes Rule / Bayes Theorem
- Bayes Rule for classification
- Naive Bayes: independence assumptions
Statistical comparisons of models
- Bias (approximation error)
- Variance (estimation error)

Core skills for probabilities

How do we compute the expected value of a weighted dice roll?
What is the variance of a random variable (in words)?
What are the requirements for a valid probability mass function?
What are the requirements for a valid probability density function?
How do we compute a Monte Carlo approximation to an expectation? That is, how can drawing samples help us compute expectations?
What does it mean for a model to have bias (as in, from the bias-variance tradeoff lectures)?

Neural Networks for Supervised Learning

Core concepts for neural nets

One-hot encoding of multi-class labels
Softmax function
Components of a 'neuron'
- Linear function of input vector parameterized by weights and bias
- Activation function
Design of a neural network predictor (aka multi-layer perceptron)
- Selecting an output layer
- - Regression
- - Binary Classification
- - Multi-class classification
- Adding hidden units to a layer
- Adding hidden layers
- Possible activation functions
- - Zero-to-one step function
- - Logistic sigmoid
- - Rectified linear unit (ReLU)
Training Algorithms
- Back-propagation
Regularization to avoid overfitting
- Penalty terms on weights and biases

Core skills for neural nets

Why do we need one-hot encoding representations?
What is the shape of the output of a single 'neuron'?
What is the shape of the output of an MLP used for regression?
What is the shape and type of the output of an MLP used for multi-class classification?
Describe how to compute the output of a multi-layer neural network given an input vector and all the required weights and biases.
Why do we need non-linear activation functions for the hidden layers?
How do we use the softmax function to build a multi-class classifier?
What is relationship between sigmoid and softmax?
At a high-level, how does back-propagation work?

Practical Prediction Performance and Hyperparameter Optimization

Core concepts

Data Augmentation
Early Stopping
Dropout
Grid Search
Random Search

Core skills

How to apply data augmentation to a prediction task?
When would early stopping be an effective choice?
- Could early stopping help on Project 2? If so, how?
- Could early stopping help on Project 3? If so, how?
What are advantages/disadvantages of random search vs. grid search?

Ensemble Methods: Bagging, Random Forests, and Boosting

Concepts

Bootstrap sampling
Bagging (Bootstrap -aggregating)
Random Forests
Boosting

Skills

How does a boosting classifier use multiple models differently than a bagging classifier?
What kinds of decision boundaries does a random forest have?
When does a random forest overfit?

Kernels for Regression/Classification

Concepts

Kernel function
Kernel matrix
Kernelized linear regression

Skills

What is the runtime cost of using a kernelized algorithm vs. featurized algorithm?
Describe (in words) the advantages of using kernels

Support Vector Machines for Classification

Concepts

Margins of linear classifiers
Maximizing the margin
Support vectors

Skills

Given a margin, how do we identify the support vectors?
Which representation is sparse, the weights (one per feature) or the alphas (one per example)?
Why does knowing which examples are support vectors help reduce runtime of predictions on a test set?

Recommendation Systems and matrix factorization

Concepts

Content-based prediction
Collaborative filtering
Matrix factorization
Precision and recall

Skills

Describe (in words) difference between collaborative filtering and content-based methods
Describe (in words) how you would solve the "warm start" problem of making predictions for a user for whom you have some ratings, but are missing others
Describe (in words) how you would solve the "cold start" problem of making predictions for a new user for whom you have zero ratings

Dimensionality reduction and PCA

Concepts

Eigenvalues and eigenvectors
Projection and Reconstruction
Model selection (how to choose the number of dimensions K)?

Skills

Why do we need to center the data before PCA?
Describe (in words and math) how we do projection to K -dimensions in PCA
Describe (in words and math) how we reconstruct the high-dim. data vector from its K-dim projection
Describe pros/cons of using PCA as a pre-processing step before some supervised prediction task
Describe how PCA is related to matrix factorization we did in Project 3

Clustering: K-means and Gaussian mixtures

Concepts

Hard (exclusive) vs. soft (probabilistic) cluster assignments
Euclidean distances in multi-dimensional spaces
K-means cost function
Coordinate descent algorithms
Random initialization vs. k-means++ initialization
Model selection (how to choose the number of dimensions K)?
Multivariate Gaussian probability distributions
Restrictions on covariance matrices (spherical, diagonal, full)

Skills

Why might multiple runs of k-means algorithm be useful?
How should we initialize k-means?
What are the cluster boundaries of k-means like?
How can we compute the probability of a data vector ("x_n") under a GMM?

Fairness and Ethics

Concepts

Accuracy
True positive rate / False positive rate
Positive predictive value / Negative predictive value
Parity across groups for the metrics above

Skills

Why might ensuring different groups have the same accuracy not be enough to say a classifier is fair?
Is it possible to enforce TPR/FPR and PPV/NPV parity? If so, when?

Final Exam Review

What will be on the exam?

Core concepts for regression

Core skills for regression

Unit: Linear Regression

Core concepts for linear regression

Core skills for linear regression

Core concepts for binary clf

Core skills for binary clf

Unit: Logistic Regression

Core concepts for logistic regression

Core skills for logistic regression

Core concepts for optimization

Core skills for optimization

Core concepts for probabilities and statistical thinking

Core skills for probabilities

Core concepts for neural nets

Core skills for neural nets

Core concepts

Core skills

Concepts

Skills

Concepts

Skills

Concepts

Skills

Concepts

Skills

Concepts

Skills

Concepts

Skills

Concepts

Skills