Midterm Exam Review


Last modified: 2019-03-09 14:19

Jump to: Exam Format   Practice Exam

Jump to: Regression   Classification   Optimization   Probability   Neural Nets

Exam Format

You'll have the full class period (75 min) to complete the exam. The target time required will be about 60 min.

It will be a pencil & paper exam. We'll provide the instructions and answer sheets.

You should BRING A PENCIL (better than a pen; you might need to erase).

You can BRING 1 sheet of notes on standard 8.5" x 11" paper. Typed or hand-written. Writing can be on both sides.

Keep in mind we will also provide a list of standard formulas on the instruction sheet, as in the practice exam PDF below.

Possible question types:

  • True/false questions
  • Multiple choice questions
  • Short answer concept questions
  • Questions that ask you to read / understand / debug Python code
  • Questions that ask you to read / understand / debug some math
  • Questions that ask you to produce Python code (at most 5-6 lines)
  • Questions that ask you to produce some math (at most 1-2 lines)
    • You'll have access to all needed formulas as part of the instructions
    • Formulas we'll provide: log loss, sigmoid function, needed basic derivatives, sum rule, product rule, etc. (see example in practice exam)

Practice Exam

We've prepared an example practice exam so you can see the kinds of questions that might appear on the real exam.

You can access the practice exam PDF here (requires Piazza credentials): midterm_practice_exam.pdf

Solution PDF: midterm_practice_solutions.pdf

What will be on the exam?

The exam covers everything from our in-class activities and out-of-class readings, starting from our first class and continuing up through and including the class on 2/27 ('Naive Bayes').

Below, we've listed the major concepts/skills that could appear on the exam.

Unit: Regression

Core concepts for regression

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a regression problem

    • Purpose of train/valid/test division of a dataset
    • Underfitting and overfitting
    • Selecting hyperparameters on a fixed validation set
    • Selecting hyperparameters via K-fold cross-validation (see the sketch after this list)
  • Understanding evaluation metrics

    • Mean squared error
    • Mean absolute error
  • 3 possible regression methods

    • K-Nearest Neighbors regression
    • Decision Tree regression
    • Linear regression
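
To make the hyperparameter-selection ideas above concrete, here is one possible sketch (assuming numpy and scikit-learn; the dataset and the candidate n_neighbors values are made up) of choosing k for K-Nearest Neighbors regression with 5-fold cross-validation:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical 1-d regression dataset, just for illustration.
    rng = np.random.RandomState(0)
    X = rng.uniform(-3, 3, size=(120, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(120)

    kfold = KFold(n_splits=5, shuffle=True, random_state=0)

    for n_neighbors in [1, 3, 5, 10, 20]:
        fold_errors = []
        for train_ids, valid_ids in kfold.split(X):
            model = KNeighborsRegressor(n_neighbors=n_neighbors)
            model.fit(X[train_ids], y[train_ids])
            yhat = model.predict(X[valid_ids])
            fold_errors.append(np.mean((y[valid_ids] - yhat) ** 2))  # MSE on held-out fold
        print(n_neighbors, np.mean(fold_errors))

    # Pick the n_neighbors value with the lowest average held-out error,
    # then retrain on the full training set with that value.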

Core skills for regression

  • When should we do cross-validation? When should we use a fixed validation set?
  • When evaluating a regressor, compare and contrast squared error vs absolute error (see the sketch after this list).
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of regression functions can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?
  • What is the computational complexity of training each method?
  • What is the computational complexity of making predictions with each method?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?
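
As a concrete illustration of the squared-error vs absolute-error comparison above, here is a minimal numpy sketch (the y_true and y_pred values are made up) showing how a single large miss affects the two metrics differently:

    import numpy as np

    y_true = np.array([2.0, 3.0, 4.0, 5.0])
    y_pred = np.array([2.5, 3.0, 4.5, 15.0])   # the last prediction is a big miss

    errors = y_true - y_pred
    mse = np.mean(errors ** 2)       # (0.25 + 0 + 0.25 + 100) / 4 = 25.125
    mae = np.mean(np.abs(errors))    # (0.5 + 0 + 0.5 + 10) / 4 = 2.75

    print(mse, mae)   # the squared error is dominated by the single outlier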

Unit: Linear Regression

Core concepts for linear regression

  • Training algorithm 1: Exact formulas to estimate weights to minimize squared error
  • Training algorithm 2: Gradient descent
  • Linear regression applied to transformed features (e.g. polynomial)
  • Regularization via L1 and L2 penalties on the weights (lasso vs ridge regression; see the sketch after this list)
    • 'L1' : sum of absolute values
    • 'L2' : sum of squared values
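
A minimal numpy sketch of the two penalty computations (the weight vector is made up):

    import numpy as np

    w = np.array([0.5, -2.0, 0.0, 1.5])

    l1_penalty = np.sum(np.abs(w))   # sum of absolute values: 0.5 + 2.0 + 0.0 + 1.5 = 4.0
    l2_penalty = np.sum(w ** 2)      # sum of squared values:  0.25 + 4.0 + 0.0 + 2.25 = 6.5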

Core skills for linear regression

  • Compute the sum-of-squares or sum-of-abs-values given a vector.
  • Explain why a bias is needed.
  • When would we prefer gradient descent to the exact 'least squares' formulas?
  • When do zero training error solutions exist for linear regression?
  • Write down the optimization objective for training linear regression to minimize mean-squared error (see the sketch after this list).
  • Why do we add a regularization penalty?
  • When would we prefer an L1 penalty vs an L2 penalty?
  • Which training methods can be applied when using an L2 penalty?
  • Which training methods can be applied when using an L1 penalty?
  • What is the curse of dimensionality (in words)?
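
Here is one possible numpy sketch (with a made-up dataset; the step size and iteration count are arbitrary choices for illustration) of the mean-squared-error objective for linear regression, the exact 'least squares' solution, and gradient descent on the same objective:

    import numpy as np

    rng = np.random.RandomState(0)
    N, F = 100, 3
    X = np.hstack([np.ones((N, 1)), rng.randn(N, F)])   # prepend a constant column for the bias
    true_w = np.array([1.0, 2.0, -3.0, 0.5])
    y = X @ true_w + 0.1 * rng.randn(N)

    def mse_objective(w):
        return np.mean((y - X @ w) ** 2)

    # Exact solution: solve the normal equations (X^T X) w = X^T y.
    w_exact = np.linalg.solve(X.T @ X, X.T @ y)

    # Gradient descent on the same objective with a constant step size.
    w = np.zeros(F + 1)
    step_size = 0.1
    for iteration in range(500):
        grad = -2.0 / N * X.T @ (y - X @ w)   # gradient of the mean squared error
        w = w - step_size * grad

    print(mse_objective(w_exact), mse_objective(w))   # the two should be very close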

Unit: Binary Classification

Core concepts for binary clf

  • 3 stages of ML: Training, Prediction, Evaluation

  • Setting up a classification problem

    • Purpose of train/valid/test sets
    • Understanding underfitting and overfitting
  • Choosing an evaluation metric for predicted binary labels

    • True positive, false positive, true negative, false negatives
    • Confusion matrices
  • Choosing an evaluation metric for predicted probabilities

    • Log loss
    • Receiver Operating Characteristic (ROC) curves
    • Precision-Recall curves
  • 3 possible classification methods

    • K-Nearest Neighbors classification
    • Decision Tree classification
    • Logistic regression

Core skills for binary clf

  • How sensitive is each method (dtree, knn, logistic regression) to input feature preprocessing?
  • How sensitive is each method to noisy or irrelevant features?
  • How sensitive is each method to the numerical scaling of different features?
  • How sensitive is each method to having enough training data?
  • At training time, what is the computational complexity of each method?
  • At prediction time, what is the computational complexity of each method?
  • When training, compare and contrast using squared error vs absolute error.
  • What kinds of decision boundaries can each method (k-NN, tree, linear models) learn?
  • What parameters can I tune for each method to avoid under/over-fitting?

  • How is prediction for a binary classifier different from prediction for a regressor?

  • How to compute TP, FP, TN, and FN given a classifier's predicted labels?
  • How to plot a ROC curve given a classifier's predicted probabilities?
  • How to compute log loss given a classifier's predicted probabilities? (See the sketch after this list.)
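
As one concrete illustration of the three skills just above, here is a minimal numpy sketch with made-up labels and predicted probabilities:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0])                # true binary labels
    proba = np.array([0.9, 0.4, 0.2, 0.8, 0.1, 0.6])     # predicted P(y = 1 | x)

    # Predicted labels at the default 0.5 threshold.
    y_pred = (proba >= 0.5).astype(int)

    TP = np.sum((y_pred == 1) & (y_true == 1))   # 2
    FP = np.sum((y_pred == 1) & (y_true == 0))   # 1
    TN = np.sum((y_pred == 0) & (y_true == 0))   # 2
    FN = np.sum((y_pred == 0) & (y_true == 1))   # 1

    # This threshold gives one (FPR, TPR) point on the ROC curve;
    # sweeping the threshold over all probability values traces the full curve.
    TPR = TP / (TP + FN)   # 2/3
    FPR = FP / (FP + TN)   # 1/3

    # Binary log loss (cross entropy) of the predicted probabilities.
    log_loss = -np.mean(y_true * np.log(proba) + (1 - y_true) * np.log(1 - proba))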

Unit: Logistic Regression

Core concepts for logistic regression

  • Sigmoid function
  • Why find weights that minimize log loss? 3 justifications:
    • Log loss is an upper bound of the error rate
    • Interpret as minimizing cross entropy, a measure of coding error between probabilities and binary labels
    • Interpret as maximizing log 'likelihood', so the labels are more likely given the data
  • Training algorithm 1: Gradient descent
  • Regularization via L2 or L1 penalty terms on the weights

Core skills for logistic regression

  • Provide good estimates of sigmoid output values at inputs like -2, -1, 0, 1, and 2 (see the sketch after this list).
  • Plot (by hand) the sigmoid function.
  • Explain what can go wrong with numerical implementations of the sigmoid function.
  • Explain why a bias feature is needed
  • Given a weight vector (including bias) and a 1-d or 2-d feature space, plot the decision boundary and predicted probabilities.
  • Explain why the decision boundary is linear
  • Explain why the log loss is an upper bound of the error rate.
  • Explain how to avoid underfitting or overfitting with logistic regression.
  • Explain how to interpret the trained parameters of logistic regression.
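
A minimal numpy sketch of the sigmoid values worth remembering and one common way (not the only one) to keep the implementation numerically safe:

    import numpy as np

    def sigmoid(z):
        # Numerically safer sigmoid: never exponentiates a large positive number.
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        exp_z = np.exp(z[~pos])
        out[~pos] = exp_z / (1.0 + exp_z)
        return out

    # Approximate values worth remembering:
    #   sigmoid(-2) ~= 0.12, sigmoid(-1) ~= 0.27, sigmoid(0) = 0.5,
    #   sigmoid(1)  ~= 0.73, sigmoid(2)  ~= 0.88
    print(sigmoid(np.array([-2.0, -1.0, 0.0, 1.0, 2.0])))

    # The naive formula 1 / (1 + np.exp(-z)) overflows for very negative z
    # (e.g. z = -1000 makes np.exp(1000) overflow), which is one thing that
    # can go wrong in a numerical implementation.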

Unit: Optimization for Machine Learning

Core concepts for optimization

  • Gradient descent
    • Step sizes
    • Strategies to improve on constant step sizes:
      • Step size decay
      • Line search
      • Second order methods
    • Assessing convergence
      • Trace plots of loss
      • Trace plots of gradient
      • Trace plots of parameters
  • Stochastic gradient descent

Core skills for optimization

  • Describe (in words) how gradient descent works
  • Describe (in words) how stochastic GD works
  • What are signs that a run of gradient descent has diverged?
  • What are signs that a run of gradient descent has converged? (See the trace sketch after this list.)
  • Given a gradient descent task, how do you select an appropriate step size?
  • What evidence should you collect that your step size is good?
  • When will multiple random initializations of gradient descent always converge to the same solution?
  • Why would including second-derivative information help improve optimization?
  • Why don't we always include second-derivative information in a practical application?
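
One possible numpy sketch (minimizing a simple made-up 1-d quadratic loss; the step sizes are arbitrary illustrations) of running gradient descent while recording the traces used to assess convergence or divergence:

    import numpy as np

    def loss(w):
        return 0.5 * (w - 3.0) ** 2   # simple 1-d loss with its minimum at w = 3

    def grad(w):
        return w - 3.0

    def run_gradient_descent(step_size, n_iters=50, w0=0.0):
        w = w0
        loss_trace, grad_trace, w_trace = [], [], []
        for _ in range(n_iters):
            g = grad(w)
            w = w - step_size * g
            loss_trace.append(loss(w))
            grad_trace.append(abs(g))
            w_trace.append(w)
        return np.array(loss_trace), np.array(grad_trace), np.array(w_trace)

    # A good step size: the loss trace decreases and the gradient trace goes toward 0.
    losses, grads, ws = run_gradient_descent(step_size=0.5)
    print(losses[-1], grads[-1], ws[-1])   # ~0, ~0, ~3

    # A step size that is too large: the loss trace grows without bound (divergence).
    losses_bad, _, _ = run_gradient_descent(step_size=2.5)
    print(losses_bad[-1])                  # a huge number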

Unit: Probabilities and Statistical Thinking for Machine Learning

Core concepts for probabilities and statistical thinking

  • Random variable
    • Discrete random variables
    • Uniform random variables
    • Gaussian random variables (univariate, NOT multivariate)
    • Probability mass function
    • Probability density function
    • Expected values
    • Monte Carlo approximations of expected values
  • Distributions on two or more variables

    • Joint probability
    • Conditional probability
    • Marginal probability
  • Rules of Probability

    • Sum Rule
    • Product Rule
  • Statistical inference

    • Bayes Rule / Bayes Theorem (see the worked example after this list)
    • Bayes Rule for classification
    • Naive Bayes: independence assumptions
  • Statistical comparisons of models

    • Bias (approximation error)
    • Variance (estimation error)
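
To make the sum rule, product rule, and Bayes rule concrete, here is a small worked example on a made-up joint distribution over two binary variables (all numbers are invented for illustration):

    import numpy as np

    # Made-up joint distribution P(Y, X) for a binary label Y and binary feature X.
    # Rows index Y in {0, 1}, columns index X in {0, 1}; the entries sum to 1.
    joint = np.array([[0.30, 0.10],
                      [0.20, 0.40]])

    # Sum rule: marginals come from summing the joint over the other variable.
    p_y = joint.sum(axis=1)   # P(Y) = [0.40, 0.60]
    p_x = joint.sum(axis=0)   # P(X) = [0.50, 0.50]

    # Product rule: P(Y, X) = P(X | Y) P(Y).
    p_x_given_y = joint / p_y[:, None]   # each row sums to 1

    # Bayes rule for classification: P(Y | X = 1) = P(X = 1 | Y) P(Y) / P(X = 1).
    p_y_given_x1 = p_x_given_y[:, 1] * p_y / p_x[1]
    print(p_y_given_x1)   # [0.2, 0.8], a valid distribution over Y that sums to 1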

Core skills for probabilities

  • How do we compute the expected value of a weighted dice roll?
  • What is the variance of a random variable (in words)?
  • What are the requirements for a valid probability mass function?
  • What are the requirements for a valid probability density function?
  • How do we compute a Monte Carlo approximation to an expectation? That is, how can drawing samples help us compute expectations? (See the sketch after this list.)
  • What does it mean for a model to have bias (as in, from the bias-variance tradeoff lectures)?
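
A minimal sketch of the weighted die example: compute the exact expected value, then approximate it with Monte Carlo sampling (the face probabilities are made up):

    import numpy as np

    faces = np.array([1, 2, 3, 4, 5, 6])
    probs = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])   # a die weighted toward 6

    # Exact expected value: sum over outcomes of (value * probability).
    exact = np.sum(faces * probs)   # 0.1 * (1 + 2 + 3 + 4 + 5) + 0.5 * 6 = 4.5

    # Monte Carlo approximation: draw many samples, then average them.
    rng = np.random.RandomState(0)
    samples = rng.choice(faces, size=10000, p=probs)
    approx = samples.mean()   # close to 4.5, and it gets closer as the sample size grows

    print(exact, approx)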

Unit: Neural Networks for Supervised Learning

Core concepts for neural nets

  • One-hot encoding of multi-class labels

  • Softmax function

  • Components of a 'neuron'

    • Linear function of input vector parameterized by weights and bias
    • Activation function
  • Design of a neural network predictor (aka multi-layer perceptron; see the forward-pass sketch after this list)

    • Selecting an output layer
      • Regression
      • Binary Classification
      • Multi-class classification
    • Adding hidden units to a layer
    • Adding hidden layers
    • Possible activation functions
      • Zero-to-one step function
      • Logistic sigmoid
      • Rectified linear unit (ReLU)
  • Training Algorithms

    • Back-propagation
  • Regularization to avoid overfitting

    • Penalty terms on weights and biases
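
A minimal numpy sketch (all weights, biases, and layer sizes are made up) of computing the output of a small multi-layer perceptron for regression, one layer at a time:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)   # rectified linear unit activation

    # Made-up parameters: 3 input features -> 4 hidden units -> 1 regression output.
    rng = np.random.RandomState(0)
    W1, b1 = rng.randn(4, 3), np.zeros(4)
    W2, b2 = rng.randn(1, 4), np.zeros(1)

    x = np.array([0.5, -1.0, 2.0])   # one input feature vector

    # Each hidden unit computes a linear function of the input, then an activation.
    hidden = relu(W1 @ x + b1)

    # Regression output layer: a plain linear function (no squashing).
    y_hat = W2 @ hidden + b2
    print(y_hat)   # shape (1,): a single real-valued prediction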

Core skills for neural nets

  • Why do we need one-hot encoding representations?
  • What is the shape of the output of a single 'neuron'?
  • What is the shape of the output of an MLP used for regression?
  • What is the shape and type of the output of an MLP used for multi-class classification?
  • Describe how to compute the output of a multi-layer neural network given an input vector and all the required weights and biases.
  • Why do we need non-linear activation functions for the hidden layers?

  • How do we use the softmax function to build a multi-class classifier?

  • What is the relationship between sigmoid and softmax? (See the sketch below.)

  • At a high-level, how does back-propagation work?
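
A minimal numpy sketch of the softmax function, its relationship to the sigmoid, and one-hot encoding (the scores and the label are made up):

    import numpy as np

    def softmax(scores):
        # Shift by the max score for numerical stability; the result is unchanged.
        exp_scores = np.exp(scores - np.max(scores))
        return exp_scores / exp_scores.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Multi-class case: softmax turns C real-valued scores into C probabilities that sum to 1.
    scores = np.array([2.0, 0.5, -1.0])
    print(softmax(scores))   # the largest score gets the largest probability

    # Relationship to sigmoid: for two classes with scores [z, 0],
    # the first softmax output equals sigmoid(z).
    z = 1.3
    print(softmax(np.array([z, 0.0]))[0], sigmoid(z))   # the two numbers match

    # One-hot encoding of a multi-class label: class c becomes a vector with a 1 at index c.
    y = 2
    one_hot = np.zeros(3)
    one_hot[y] = 1.0   # [0., 0., 1.]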