# HW0: Numerical Programming Fundamentals

Status: RELEASED.

Due date: Wed. 09/16 11:59pm Anywhere on Earth (Thu 09/17 at 07:59am ET in Boston)

Files to Turn In:

ZIP file of source code, containing these files:

• hw0.py : Python file containing your solution code

Evaluation Rubric:

• Problem 1 code: 45 points possible (number of autograder tests passed)
• Problem 2 code: 45 points possible (number of autograder tests passed)
• Completed Reflection: 10 points possible

Prerequisites:

Make sure you've followed the COMP 135-specific Python Setup Instructions.

Be sure to activate the comp135_2020f_env environment whenever you run your code.

Starter code:

See the hw0 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/comp135-20f-assignments/tree/master/hw0

## Problem 1: Finding Nearest Neighbors via Euclidean Distance

Suppose we have a dataset of measurements collected about related entities in the world. These instances could be patients in the same hospital, or birds of the same species, or emails sent to the same person. Given any new "query" instance, we'd like to find the K "nearest neighbor" instances in the dataset. Here, K is an integer that could be 1 or 3 or 500.

We'll define the nearest neighbors of a query feature vector $$\tilde{x}_q$$ as the K vectors in the dataset (denoted $$x_1, \ldots x_N$$) that have the smallest Euclidean distance to $$\tilde{x}_q$$.

Given a specific dataset vector $$x_n$$ (of size $$F$$) and the query vector $$\tilde{x}_q$$ (also of size F), we define the Euclidean distance as:

$$\text{dist}_{\text{Euclidean}}(\tilde{x}_q, x_n) = \sqrt{ \sum_{f=1}^F \Big( \tilde{x}_{qf} - x_{nf} \Big) ^2 }$$
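The distance above translates directly into NumPy. As a minimal sketch (the function name here is illustrative, not part of the starter code):

```python
import numpy as np

def euclidean_dist(query_F, data_F):
    """Euclidean distance between two 1D feature vectors of size F."""
    return np.sqrt(np.sum(np.square(query_F - data_F)))

x_q = np.asarray([0.0, 0.0])
x_n = np.asarray([3.0, 4.0])
print(euclidean_dist(x_q, x_n))  # 5.0
```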

Edge case: When finding the K neighbors with smallest distance, sometimes two data vectors will have exact ties (both will be the same distance from the query). When this happens, you should always return exactly $$K$$ neighbor vectors from among the tied candidates. You should resolve ties by taking the first matches in the original dataset's order.
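One way to satisfy this tie-breaking rule is a stable sort, which keeps entries with equal keys in their original order. A small sketch with a deliberate tie:

```python
import numpy as np

# Distances from a query to N=4 dataset vectors, with a tie
# between indices 1 and 2
dist_N = np.asarray([0.7, 0.5, 0.5, 0.9])

# A stable sort keeps tied entries in original dataset order,
# so index 1 is listed before index 2
nearest_ids = np.argsort(dist_N, kind='stable')
print(nearest_ids[:2])  # [1 2]
```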

We'd like you to solve multiple k-nearest queries in one function call. That is, given a dataset of $$N$$ examples (stored as an N x F 2D NumPy array) and a set of $$Q$$ query examples (stored as a Q x F array), you should write effective NumPy code to return a 3D array (size Q x K x F) whose q-th entry contains the $$K$$ nearest neighbor vectors of the q-th query vector.

We've drafted a function definition and defined a detailed specification in its docstring. You need to fill in the missing code.

https://github.com/tufts-ml-courses/comp135-20f-assignments/blob/master/hw0/hw0.py#L88

Use exclusively NumPy functions. No calls to any functions in sklearn or other libraries.
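Putting the pieces together, here is one possible NumPy-only approach. This is a sketch under assumed names, not the starter code's exact signature or the only valid strategy (a fully vectorized version without the Python loop is also possible):

```python
import numpy as np

def knn_sketch(data_NF, query_QF, K=1):
    """Illustrative K-nearest-neighbor search (names are hypothetical).

    Returns a Q x K x F array of neighbor vectors.
    """
    Q, F = query_QF.shape
    neighbors_QKF = np.zeros((Q, K, F))
    for q in range(Q):
        # Distance from query q to every dataset vector, via broadcasting
        diff_NF = data_NF - query_QF[q]
        dist_N = np.sqrt(np.sum(np.square(diff_NF), axis=1))
        # Stable sort resolves ties by original dataset order
        nearest_ids = np.argsort(dist_N, kind='stable')[:K]
        neighbors_QKF[q] = data_NF[nearest_ids]
    return neighbors_QKF

data_NF = np.asarray([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
query_QF = np.asarray([[0.0, 0.0]])
print(knn_sketch(data_NF, query_QF, K=2).shape)  # (1, 2, 2)
```

Note how the tie between the first two dataset rows (both at distance 1 from the query) is broken in favor of the earlier row.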

## Problem 2: Splitting Datasets into Training and Testing

A common task in ML is to divide a dataset of independent instances into a "training" set and a "test" set. These two sets can then be used to measure how well an ML method generalizes to data it hasn't seen before: we fit the method to the training set, then evaluate the trained method on the test set.

In this problem, you'll demonstrate basic understanding of NumPy array indexing and random number generation by writing a procedure to divide an input dataset of $$L$$ instances into a training set (of size $$M$$) and a test set (of size $$N$$). Each row of the original dataset should be exclusively assigned to either train or test.

How do we set the values of M and N? Your function will take a keyword argument frac_test that specifies the test-set size ($$N$$) as a fraction of the overall dataset size $$L$$. To compute $$N$$, we always round up to the nearest whole number: $$N = \text{ceil}(\textit{frac_test} * L)$$
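For example, with a dataset of 10 instances and frac_test = 0.25:

```python
import numpy as np

L = 10
frac_test = 0.25
N = int(np.ceil(frac_test * L))  # 2.5 rounds up to 3
M = L - N
print(N, M)  # 3 7
```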

We want the test set to be a subset selected uniformly at random. You should look at the NumPy API for Random Sampling. Functions like shuffle or permutation might be helpful.

We also want the test set to be reproducible given a particular random seed. That is, if I run the code now to extract a train/test split, I'd like to be able to recover the exact same train/test assignments by rerunning the code later. With NumPy, the common way to do this is by specifying a random_state keyword argument that can take either an integer seed (0, 42, 1337, etc.) or an instance of the RandomState class. Specifying the same random_state should deliver the same pseudo-randomness across multiple calls to a function.
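The reproducible-split idea can be sketched as follows. The function name and exact signature here are illustrative, assuming a random_state argument like the one described above:

```python
import numpy as np

def split_sketch(x_LF, frac_test=0.5, random_state=None):
    """Illustrative train/test split of an L x F array (names hypothetical).

    random_state may be an int seed, a RandomState instance, or None.
    """
    if not isinstance(random_state, np.random.RandomState):
        # Wrap an int seed (or None) in a RandomState object
        random_state = np.random.RandomState(random_state)
    L = x_LF.shape[0]
    N = int(np.ceil(frac_test * L))
    # Uniformly random ordering of row indices; first N rows become the test set
    shuffled_ids = random_state.permutation(L)
    test_NF = x_LF[shuffled_ids[:N]]
    train_MF = x_LF[shuffled_ids[N:]]
    return train_MF, test_NF

x_LF = np.arange(10.0).reshape(5, 2)
tr1, te1 = split_sketch(x_LF, frac_test=0.4, random_state=0)
tr2, te2 = split_sketch(x_LF, frac_test=0.4, random_state=0)
# Same seed, same split
print(np.allclose(tr1, tr2) and np.allclose(te1, te2))  # True
```

Wrapping the integer seed in a RandomState object inside the function keeps repeated calls with the same seed identical, which is exactly the reproducibility property described above.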