Final Exam Review


Last modified: 2019-04-29 13:32

Jump to: Exam Format   Practice Problems

Jump to: Regression   Classification   Optimization   Probability   Neural Nets

Exam Date And Time

The exam is on Friday May 3 2019, from 330pm - 530pm.

This is our course's assigned standard exam time for Block I (capital "eye").

Location: In our normal classroom, Halligan 111A.

Exam Format

You'll have the full exam period (2 hours) to complete. Target required time will be about 1:30.

You should BRING A PENCIL (better than a pen, you might need to erase).

You can BRING 1 sheet of notes on standard 8.5" x 11" paper. Typed or hand-written. Writing can be on both sides.

It will be a pencil & paper exam. We'll provide the instructions and answer sheets.

Keep in mind we will also provide a list of standard formulas on the instruction sheet, as in the practice exam PDF below.

Possible question types:

  • True/false questions
  • Multiple choice questions
  • Short answer concept questions
  • Questions that ask you to read / understand / debug Python code
  • Questions that ask you to read / understand / debug some math
  • Questions that ask you to produce Python code (at most 5-6 lines)
  • Questions that ask you to produce some math (at most 1-2 lines)
    • You'll have access to all needed formulas as part of the instructions
    • Formulas we'll provide: log loss, sigmoid function, needed basic derivatives, sum rule, product rule, etc. (see example in practice exam)

Practice Exam

We intentionally designed HW5 and HW6 to mimic possible exam questions and formats, so look at those assignments and their solutions to get a sense of material.

You can look at the midterm practice exam to see the kinds of questions we might ask:

You can access the practice exam PDF here (requires Piazza credentials): midterm_practice_exam.pdf

Solution PDF: midterm_practice_solutions.pdf

What will be on the exam?

The exam covers everything from our in-class activities and out-of-class readings, starting from our first class back in January and continuing up thru and including class on 4/24 (last material on clustering). There will NOT be coverage of Reinforcement Learning on the exam.

Below, we've listed the major concepts/skills that could appear on the exam. This is not meant to be completely exhaustive, but should cover the vast majority of exam questions. (Another good place to look for study material are the blue "Unit Objective summaries" on lecture slides).

Unit: Regression

Core concepts for regression

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a regression problem

    • Purpose of train/valid/test division of a dataset
    • Underfitting and overfitting
    • Selecting hyperparameters on a fixed validation set
    • Selecting hyperparameters via K-fold Cross validation
  • Understanding evaluation metrics

      • Mean squared error
      • Mean absolute error
  • 3 possible regression methods

    • K-Nearest Neighbors regression
    • Decision Tree regression
    • Linear regression

Core skills for regression

  • When to do cross validation? When to use fixed validation set?
  • When evaluating a regressor, compare and contrast squared error vs absolute error.
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of regression functions can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?
  • What is computational complexity of training each method?
  • What is computational complexity of making predictions with each method?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?

Unit: Linear Regression

Core concepts for linear regression

  • Training algorithm 1: Exact formulas to estimate weights to minimize squared error
  • Training algorithm 2: Gradient descent
  • Linear regression applied to transformed features (e.g. polynomial)
  • Regularization via L1 and L2 penalties on weights (lasso vs ridge regression)
    • 'L1' : sum of absolute values
    • 'L2' : sum of squared values

Core skills for linear regression

  • Compute the sum-of-squares or sum-of-abs-values given a vector.
  • Explain why a bias is needed.
  • When would we prefer gradient descent to the exact 'least squares' formulas?
  • When do zero training error solutions exist for linear regression?
  • Write down the optimization objective for training linear regression to minimize mean-squared error.
  • Why do we add a regularization penalty?
  • When would we prefer an L1 penalty vs an L2 penalty?
  • Which training methods can be applied when using an L2 penalty?
  • Which training methods can be applied when using an L1 penalty?
  • What is the curse of dimensionality (in words)?

Unit: Binary Classification

Core concepts for binary clf

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a regression problem

    • Purpose of train/valid/test sets
    • Understanding underfitting and overfitting
  • Choosing an evaluation metric for predicted binary labels

    • True positive, false positive, true negative, false negatives
    • Confusion matrices
  • Choosing an evaluation metric for predicted probabilities

    • Log loss
    • Receiver Operating Characteristic (ROC) curves
    • Precision-Recall curves
  • 3 possible regression methods

    • K-Nearest Neighbors regression
    • Decision Tree regression
    • Logistic regression

Core skills for binary clf

  • How sensitive is each method (dtree, knn, logistic regression) to input feature preprocessing?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?
  • At training time, what is computational complexity of each method?
  • At prediction time, what is computational complexity of each method?
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of decision boundaries can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?
  • What is computational complexity of training each method?
  • What is computational complexity of making predictions with each method?

  • How is prediction for binary classifier different than a regressor?

  • How to compute TP, FP, TN, and FN given classifier's predicted labels?
  • How to plot a ROC curve given a classifier's predicted probabilities?
  • How to compute log loss given classifier's predicted probabilities?

Unit: Logistic Regression

Core concepts for logistic regression

  • Sigmoid function
  • Why find weights that minimize log loss? 3 justifications:
    • Log loss is an upper bound of the error rate
    • Interpret as minimizing cross entropy, a measure of coding error between probabilities and binary labels
    • Interpret as maximizing log 'likelihood', so the labels are more likely given the data
  • Training algorithm 1: Gradient descent
  • Regularization via L2 or L1 penalty terms on the weights

Core skills for logistic regression

  • Provide good estimates of sigmoid output values at inputs like -2, -1, 0, 1, and 2.
  • Plot (by hand) the sigmoid function.
  • Explain what can go wrong with numerical implementations of the sigmoid function.
  • Explain why a bias feature is needed
  • Given a weight vector (including bias) and a 1-d or 2-d feature space, plot the decision boundary and predicted probabilities.
  • Explain why the decision boundary is linear
  • Explain why the log loss is an upper bound of the error rate.
  • Explain how to avoid underfitting or overfitting with logistic regression.
  • Explain how to interpret the trained parameters of logistic regression.

Unit: Optimization for Machine Learning

Core concepts for optimization

  • Gradient descent
    • Step sizes
    • Strategies to improve on constant step sizes:
      • Step size decay
      • Line search
      • Second order methods
    • Assessing convergence
      • Trace plots of loss
      • Trace plots of gradient
      • Trace plots of parameters
  • Stochastic gradient descent

Core skills for optimization

  • Describe (in words) how gradient descent works
  • Describe (in words) how stochastic GD works
  • What are signs that a run of gradient descent has diverged?
  • What are signs that a run of gradient descent has converged?
  • Given a gradient descent task, how do you select an appropriate step size?
  • What evidence should you collect that your step size is good?
  • When will multiple random initializations of gradient descent always converge to the same solutions?
  • Why would including second-derivative information help improve optimization?
  • Why don't we always include second-derivative information in a practical application?

Unit: Probabilities and Statistical Thinking for Machine Learning

Core concepts for probabilities and statistical thinking

  • Random variable
      • Discrete random variables
      • Uniform random variables
      • Gaussian random variables (univariate, NOT multivariate)
    • Probability mass function
    • Probability density function
    • Expected values
    • Monte Carlo approximations of expected values
  • Distributions on two or more variables

    • Joint probability
    • Conditional probability
    • Marginal probability
  • Rules of Probability

    • Sum Rule
    • Product Rule
  • Statistical inference

    • Bayes Rule / Bayes Theorem
    • Bayes Rule for classification
    • Naive Bayes: independence assumptions
  • Statistical comparisons of models

    • Bias (approximation error)
    • Variance (estimation error)

Core skills for probabilities

  • How do we compute the expected value of a weighted dice roll?
  • What is the variance of a random variable (in words)?
  • What are the requirements for a valid probability mass function?
  • What are the requirements for a valid probability density function?
  • How do we compute a Monte Carlo approximation to an expectation? That is, how can drawing samples help us compute expectations?
  • What does it mean for a model to have bias (as in, from the bias-variance tradeoff lectures)?

Neural Networks for Supervised Learning

Core concepts for neural nets

  • One-hot encoding of multi-class labels

  • Softmax function

  • Components of a 'neuron'

    • Linear function of input vector parameterized by weights and bias
    • Activation function
  • Design of a neural network predictor (aka multi-layer perceptron)

    • Selecting an output layer
      • Regression
      • Binary Classification
      • Multi-class classification
    • Adding hidden units to a layer
    • Adding hidden layers
    • Possible activation functions
      • Zero-to-one step function
      • Logistic sigmoid
      • Rectified linear unit (ReLU)
  • Training Algorithms

    • Back-propagation
  • Regularization to avoid overfitting

    • Penalty terms on weights and biases

Core skills for neural nets

  • Why do we need one-hot encoding representations?
  • What is the shape of the output of a single 'neuron'?
  • What is the shape of the output of an MLP used for regression?
  • What is the shape and type of the output of an MLP used for multi-class classification?
  • Describe how to compute the output of a multi-layer neural network given an input vector and all the required weights and biases.
  • Why do we need non-linear activation functions for the hidden layers?

  • How do we use the softmax function to build a multi-class classifier?

  • What is relationship between sigmoid and softmax?

  • At a high-level, how does back-propagation work?

Practical Prediction Performance and Hyperparameter Optimization

Core concepts

  • Data Augmentation
  • Early Stopping
  • Dropout

  • Grid Search

  • Random Search

Core skills

  • How to apply data augmentation to a prediction task?
  • When would early stopping be an effective choice?
    • Could early stopping help on Project 2? If so, how?
    • Could early stopping help on Project 3? If so, how?
  • What are advantages/disadvantages of random search vs. grid search?

Ensemble Methods: Bagging, Random Forests, and Boosting

Concepts

  • Bootstrap sampling
  • Bagging (Bootstrap -aggregating)
  • Random Forests
  • Boosting

Skills

  • How does a boosting classifier use multiple models differently than a bagging classifier?
  • What kinds of decision boundaries does a random forest have?
  • When does a random forest overfit?

Kernels for Regression/Classification

Concepts

  • Kernel function
  • Kernel matrix
  • Kernelized linear regression

Skills

  • What is the runtime cost of using a kernelized algorithm vs. featurized algorithm?
  • Describe (in words) the advantages of using kernels

Support Vector Machines for Classification

Concepts

  • Margins of linear classifiers
  • Maximizing the margin
  • Support vectors

Skills

  • Given a margin, how do we identify the support vectors?
  • Which representation is sparse, the weights (one per feature) or the alphas (one per example)?
  • Why does knowing which examples are support vectors help reduce runtime of predictions on a test set?

Recommendation Systems and matrix factorization

Concepts

  • Content-based prediction
  • Collaborative filtering
  • Matrix factorization
  • Precision and recall

Skills

  • Describe (in words) difference between collaborative filtering and content-based methods

  • Describe (in words) how you would solve the "warm start" problem of making predictions for a user for whom you have some ratings, but are missing others

  • Describe (in words) how you would solve the "cold start" problem of making predictions for a new user for whom you have zero ratings

Dimensionality reduction and PCA

Concepts

  • Eigenvalues and eigenvectors
  • Projection and Reconstruction
  • Model selection (how to choose the number of dimensions K)?

Skills

  • Why do we need to center the data before PCA?
  • Describe (in words and math) how we do projection to K -dimensions in PCA
  • Describe (in words and math) how we reconstruct the high-dim. data vector from its K-dim projection
  • Describe pros/cons of using PCA as a pre-processing step before some supervised prediction task
  • Describe how PCA is related to matrix factorization we did in Project 3

Clustering: K-means and Gaussian mixtures

Concepts

  • Hard (exclusive) vs. soft (probabilistic) cluster assignments
  • Euclidean distances in multi-dimensional spaces
  • K-means cost function
  • Coordinate descent algorithms
  • Random initialization vs. k-means++ initialization

  • Model selection (how to choose the number of dimensions K)?

  • Multivariate Gaussian probability distributions

  • Restrictions on covariance matrices (spherical, diagonal, full)

Skills

  • Why might multiple runs of k-means algorithm be useful?
  • How should we initialize k-means?
  • What are the cluster boundaries of k-means like?

  • How can we compute the probability of a data vector ("x_n") under a GMM?

Fairness and Ethics

Concepts

  • Accuracy
  • True positive rate / False positive rate
  • Positive predictive value / Negative predictive value
  • Parity across groups for the metrics above

Skills

  • Why might ensuring different groups have the same accuracy not be enough to say a classifier is fair?
  • Is it possible to enforce TPR/FPR and PPV/NPV parity? If so, when?