HW6: PCA and K-Means


Last modified: 2019-04-27 15:31

Status: RELEASED.

Due date: Mon. Apr. 29 at 11:59PM ET

Files to Turn In:

PDF report:

  • PDF file containing neatly formatted answers to the conceptual questions below
  • This document will be manually graded
  • Please annotate, within Gradescope during the normal submission process, which page(s) are relevant to each subproblem. This will save your graders a great deal of time!

Updates:

  • 2019-04-27 3:30pm : "K-means algorithm" references below clarified as the "K-means coordinate descent algorithm" discussed in class.

Problem 1: Principal Components Analysis (PCA)

Consider Principal Components Analysis, which is described in the following resources:

You can assume you have a training set \(X\) with \(N\) examples, each one a feature vector with \(F\) features.
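For concreteness, here is one common way to write the PCA projection, reconstruction, and reconstruction error referenced below, assuming the rows of the basis matrix \(W\) (shape \(K \times F\)) are orthonormal and \(m\) is the training mean; your course notes may use slightly different notation or scaling:

\begin{align} z_n &= W (x_n - m) \qquad \text{(project to K dimensions)} \\ \hat{x}_n &= m + W^T z_n \qquad \text{(reconstruct in F dimensions)} \\ \text{error}(X, W) &= \frac{1}{N} \sum_{n=1}^N \| x_n - \hat{x}_n \|_2^2 \end{align}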

1a: Answer the following True/False questions, providing one sentence of explanation for each one

  • 1a(i): We compute principal components by finding eigenvectors of the dataset's feature matrix \(X\).

  • 1a(ii): To select the number of components \(K\) to use, we can find the value of \(K\) that minimizes reconstruction error on the training set. This choice will help us manage accuracy and model complexity.

  • 1a(iii): If we already have fit a PCA model with \(K=10\) components, fitting a PCA model with \(K=9\) components for the same dataset is easy.

1b: Suppose we had a dataset for a medical application, with many measurements describing the patient's height, weight, income, daily food consumption, and many other attributes.

  • 1b(i): Would the PCA components learned be the same if you used feet or centimeters to measure height? Why or why not?

  • 1b(ii): Before applying PCA to this dataset, what (if any) preprocessing would you recommend and why?

1c: Stella has a training dataset \(X = \{ x_n \}_{n=1}^N\), which has \(N\) example feature vectors \(x_n\) each with \(F\) features. She applies PCA to her training set to learn a matrix \(W\) of shape \((K, F)\), where each row represents the basis vector for component \(k\).

She would like to now project her test dataset \(X' = \{ x'_t \}_{t=1}^T\) using the same components \(W\). She remembers that she needs to center her data, so she applies the following 3 steps to project each \(F\)-dimensional test vector \(x'_t\) to a \(K\)-dimensional vector \(z'_t\):

\begin{align} m &= \frac{1}{T} \sum_{t=1}^T x'_t \qquad \text{(compute the mean)} \\ \tilde{x}'_t &= x'_t - m \qquad \text{(center the data by subtracting the mean)} \\ z'_t &= W \tilde{x}'_t \qquad \text{(project down to K dimensions)} \end{align}

Is this correct? If so, explain why. If not, explain what Stella should do differently.
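To make the three steps concrete, here is a literal NumPy transcription of them; project_test_data and the array names are hypothetical, and the transcription deliberately takes no position on whether the steps themselves are the right ones.

    import numpy as np

    def project_test_data(xprime_TF, W_KF):
        ''' Project test vectors using Stella's three steps, exactly as written.

        Args
        ----
        xprime_TF : 2D array, size (T, F)
            Test feature vectors, one per row
        W_KF : 2D array, size (K, F)
            PCA basis vectors learned on the training set

        Returns
        -------
        zprime_TK : 2D array, size (T, K)
            Projected test vectors
        '''
        m_F = np.mean(xprime_TF, axis=0)       # step 1: mean of the test set
        xtilde_TF = xprime_TF - m_F            # step 2: subtract that mean
        zprime_TK = np.dot(xtilde_TF, W_KF.T)  # step 3: z'_t = W xtilde'_t, for every t
        return zprime_TK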

Problem 2: K-Means Clustering

Consider the k-means clustering algorithm, which is described in the following resources:

You can assume you have a training set \(X\) with \(N\) examples, each one a feature vector with \(F\) features.
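For reference, one common way to write the cost function mentioned below is the sum, over examples, of the distance from each example to its nearest centroid; some treatments use the squared Euclidean distance instead, so check your course notes:

\begin{align} \text{cost}(\mu, X) = \sum_{n=1}^N \min_{k \in \{1, \ldots, K\}} \mathrm{dist}(x_n, \mu_k) \end{align}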

2a: Answer the following True/False questions, providing one sentence of explanation for each one

  • 2a(i): We always get the same clustering of dataset \(X\) when applying K-means with \(K=1\), no matter how we initialize the cluster centroids \(\mu\).

  • 2a(ii): We always get the same clustering of dataset \(X\) when applying K-means with \(K=2\), no matter how we initialize the cluster centroids \(\mu\).

  • 2a(iii): The only way to find the cluster centroids \(\mu\) that minimize the K-means cost function (the sum of distances from each example to its nearest centroid) is to apply the K-means coordinate descent algorithm, alternately updating assignments and cluster centroid locations.

  • 2a(iv): The K-means cost function requires computing the Euclidean distance from each example \(x_n\) to its nearest cluster centroid \(\mu_k\). Because the Euclidean distance requires a square root operation (e.g. np.sqrt or np.power(___, 0.5)), no implementation of the K-means coordinate descent algorithm can be correct unless a "sqrt" operation is performed when computing distances between examples and clusters.

2b: Suppose you are given a dataset \(X\) with \(N\) examples, as well as a group of \(K=5\) cluster locations \(\{ \mu_k \}_{k=1}^K\) fit by applying the K-means coordinate descent algorithm to this dataset. You know \(N > 5\). Describe how you could initialize K-means using \(K = 6\) clusters to obtain a better cost than the \(K=5\) solution.

2c: Suppose you are using sklearn's implementation of K-means to fit 10 clusters to a dataset.

You start with code like this:

    import sklearn.cluster

    kmeans = sklearn.cluster.KMeans(
        n_clusters=10, random_state=42, init='random', n_init=1, algorithm='full')
    kmeans.fit(X)

List at least two changes you might make to these keyword arguments to improve the quality of your 10 learned cluster location vectors (as measured by their K-means cost function applied to the training data).

2d: Consider the two steps of the k-means coordinate descent algorithm, assign_examples_to_clusters and update_cluster_locations. We've given docstrings for these steps below, describing inputs and outputs, so you can be sure what happens in each step.

  • 2d(i): What is the big-O runtime of assign_examples_to_clusters? Justify your answer. Express in terms of \(N\), \(K\), and \(F\).

  • 2d(ii): What is the big-O runtime of update_cluster_locations? Justify your answer. Express in terms of \(N\), \(K\), and \(F\).

def assign_examples_to_clusters(x_NF, m_KF):
    ''' Assign each training feature vector to closest of K centroids

    Returned assignments z_N will, for each example n,
    provide the index of the row of m_KF that is closest to vector x_NF[n]

    Args
    ----
    x_NF : 2D array, size n_examples x n_features (N x F)
        Observed data feature vectors, one for each example
    m_KF : 2D array, size n_clusters x n_features (K x F)
        Centroid location vectors, one for each cluster

    Returns
    -------
    z_N : 1D array, size N
        Integer indicator of which cluster each example is assigned to.
        Example n is assigned to cluster k if z_N[n] = k
    '''
    pass


def update_cluster_locations(x_NF, z_N):
    ''' Update the locations of each cluster

    Returned centroid locations will minimize the distance between
    each cluster k's vector m_KF[k] and its assigned data x_NF[z_N == k]

    Args
    ----
    x_NF : 2D array, size n_examples x n_features (N x F)
        Observed data feature vectors, one for each example
    z_N : 1D array, size N
        Integer indicator of which cluster each example is assigned to.
        Example n is assigned to cluster k if z_N[n] = k

    Returns
    -------
    m_KF : 2D array, size n_clusters x n_features (K x F)
        Centroid location vectors, one for each cluster
    '''
    pass
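For intuition about how these two steps interact, here is a minimal sketch of a driver loop that alternates them until the assignments stop changing. It assumes the two functions above are implemented; run_kmeans, n_iters, and seed are hypothetical names, not part of the assignment starter code, and the sketch assumes every cluster keeps at least one assigned example.

    import numpy as np

    def run_kmeans(x_NF, K, n_iters=100, seed=42):
        ''' Hypothetical driver alternating the two coordinate descent steps.

        Assumes assign_examples_to_clusters and update_cluster_locations
        above are implemented.
        '''
        prng = np.random.RandomState(seed)
        N = x_NF.shape[0]
        # Initialize centroids at K distinct randomly chosen examples
        m_KF = x_NF[prng.choice(N, size=K, replace=False)]
        z_N = np.full(N, -1, dtype=np.int32)  # no assignments yet
        for _ in range(n_iters):
            new_z_N = assign_examples_to_clusters(x_NF, m_KF)
            if np.array_equal(new_z_N, z_N):
                break  # assignments unchanged: converged
            z_N = new_z_N
            m_KF = update_cluster_locations(x_NF, z_N)
        return z_N, m_KF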