HW0: Numerical Programming Fundamentals


Last modified: 2026-01-16 23:47

Status: RELEASED

Due date: Wed Jan 21, 2026 by end of day (11:59 pm ET) in Medford, MA

Files to Turn In:

Please submit only the following two files on Gradescope:

  • hw0_split.py : Solution code to Problem 1
  • hw0_knn.py : Solution code to Problem 2

After a successful submission, the "Code" view on Gradescope should look like this.

Evaluation Rubric:

  • 95% : score from autograder of Problems 1-2
  • 5% : completing the reflection form

Prerequisites:

Make sure you've followed the CS 135-specific Python Setup Instructions.

Be sure to activate the cs135_26s_env environment whenever you run your code.

Starter code:

See the hw0 folder of the public assignments repo for this class:

https://github.com/tufts-ml-courses/cs135-26s-assignments/tree/main/hw0

Clarification: Your solutions to both problems should use only the packages already imported in the starter code. Do not import any other packages.

Jump to: Problem 1   Problem 2

Problem 1: Split Dataset into Train and Test sets

A common task in ML is to divide a dataset of independent instances into a "training" set and a "test" set. These two sets can then be used to both create an ML prediction model and evaluate how its performance generalizes to data it hasn't seen before. First, fit the predictor using the training set (aka 'train set'). Then, use the predictor to make predictions on the testing set (aka 'test set'), and evaluate those to assess generalization.

In this problem, you'll demonstrate basic understanding of NumPy array indexing and random number generation by writing a procedure to divide an input dataset of \(L\) instances into a training set (of size \(M\)) and a test set (of size \(N\)). Each row of the original dataset should be exclusively assigned to either train or test.

How do we set the values of M and N? Your function will take a keyword argument frac_test that specifies the size of the test set (N) as a float indicating the fraction of the overall dataset size.

To compute N, we always want to round up to the nearest whole number. Rounding up ensures at least one data point is assigned to test. Using np.ceil to round up, we have:

N = int(np.ceil(L * float(frac_test)))
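For instance, with hypothetical values L=10 and frac_test=0.25, rounding up gives a test set of 3 rows rather than 2.5:

```python
import numpy as np

# Illustrative values only: a dataset of L=10 rows with frac_test=0.25
# yields N = ceil(2.5) = 3 test rows, leaving M = 7 training rows.
L = 10
frac_test = 0.25
N = int(np.ceil(L * float(frac_test)))  # test set size, rounded up
M = L - N                               # remaining rows go to train
print(N, M)  # prints: 3 7
```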

We also want the train/test split to be reproducible given a particular random seed. That is, if we run the code now to extract a train/test split, we'd like to be able to rerun the code later and recover exactly the same train/test assignments. With NumPy, the common way to do this is by specifying a random_state keyword argument that can take an integer seed (like 0 or 42 or 1337), which internally defines an instance of the RandomState class. Specifying the same seed should deliver the same pseudo-randomness across multiple calls to a function.
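A quick check of this reproducibility property: two RandomState instances built from the same seed produce identical permutations.

```python
import numpy as np

# Two RandomState instances seeded identically yield the same shuffle
perm_a = np.random.RandomState(42).permutation(5)
perm_b = np.random.RandomState(42).permutation(5)
print(np.array_equal(perm_a, perm_b))  # prints: True
```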

You should determine which inputs get assigned to train and test as follows:

  • Use random_state.permutation to shuffle the row ids of all L input examples
  • Use the first M of the shuffled row ids to build your training array
  • Use the remaining N of the shuffled row ids to build your test array
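The three steps above can be sketched as follows. This is an illustration only, not the graded solution; the function name and signature here are ours, and you should follow the docstring in hw0_split.py for the required interface.

```python
import numpy as np

def split_sketch(x_LF, frac_test=0.25, random_state=0):
    ''' Sketch of the permutation-based split described above.

    Illustrative only; see hw0_split.py's docstring for the
    required signature and behavior.
    '''
    # Accept either an integer seed or a ready-made RandomState
    if not isinstance(random_state, np.random.RandomState):
        random_state = np.random.RandomState(random_state)
    L = x_LF.shape[0]
    N = int(np.ceil(L * float(frac_test)))      # test set size, rounded up
    M = L - N                                   # train set size
    shuffled_ids = random_state.permutation(L)  # shuffle all L row ids
    x_train_MF = x_LF[shuffled_ids[:M]]         # first M shuffled rows
    x_test_NF = x_LF[shuffled_ids[M:]]          # remaining N shuffled rows
    return x_train_MF, x_test_NF
```

Note that because every row id appears exactly once in the permutation, each row lands in exactly one of the two output arrays.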

You can follow this logic in the diagram below.

Figure: Demo of split_into_train_and_test — visualization of how the input should be split into train and test.

Starter code: split_into_train_and_test

We've drafted a function definition and defined a detailed specification in the module's docstring.

https://github.com/tufts-ml-courses/cs135-26s-assignments/blob/main/hw0/hw0_split.py

Your task: Edit hw0_split.py to fill in the missing code.

To help you out, we've provided a doctest in the starter code that will test if your code produces the same output as the visual example above. The doctest can be run by downloading test_split.py and running (in the same directory as your submission file)

python -m doctest test_split.py

Your submitted code will be tested by the autograder on this and several other examples. We strongly urge you to run your own tests to verify correctness.

Problem 2: Finding Nearest Neighbors via Euclidean Distance

Problem Statement

Suppose we have a dataset of measurements collected about related entities in the world. The instances of these entities could be patients in the same hospital, or birds of the same species, or emails sent to the same person. Given any new "query" instance, we'd like to find the \(K\) "nearest neighbor" instances in the dataset. Here, \(K\) is an integer that could be 1 or 3 or 500.

We'll define nearest neighbors of a query feature vector \(\tilde{x}_q\) as the \(K\) vectors in the dataset (denoted \(x_1,\dots,x_N\) with \(N\) total instances) with the smallest Euclidean distance to \(\tilde{x}_q\).

Notation: we'll follow common math convention of assuming a vector is a column vector.

  • Denote the first instance's feature vector as \(x_1 = [ x_{11}, x_{12}, \ldots x_{1F} ]^T\).
  • Denote the \(n\)-th instance's feature vector as \(x_n = [ x_{n1}, x_{n2}, \ldots x_{nF} ]^T\).

The entire dataset is denoted \(X = \{x_1, \ldots x_N\}\).

Euclidean distance: Given a query vector \(\tilde{x}_q\) (of size \(F\)) and a specific dataset vector \(x_n\) (also of size \(F\)), we define the Euclidean distance between them as:

$$ \text{dist}_{\text{Euclidean}}(\tilde{x}_q, x_n) = \sqrt{ \sum_{f=1}^F \Big( \tilde{x}_{qf} - x_{nf} \Big) ^2 } $$
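This distance can be computed for all query/reference pairs at once using NumPy broadcasting; here is a small sketch with made-up 2D data (one of several reasonable approaches, and not required for your solution):

```python
import numpy as np

# Illustrative arrays: Q=2 queries and N=2 reference vectors, F=2 features
query_QF = np.asarray([[0.0, 0.0], [1.0, 1.0]])
data_NF = np.asarray([[0.0, 1.0], [3.0, 4.0]])

# Broadcasting (Q,1,F) against (1,N,F) gives all pairwise differences
diff_QNF = query_QF[:, np.newaxis, :] - data_NF[np.newaxis, :, :]
dist_QN = np.sqrt(np.sum(np.square(diff_QNF), axis=2))
print(dist_QN)  # entry [q, n] is the distance from query q to reference n
```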

Edge case: When finding the \(K\) neighbors with smallest distance, sometimes two data vectors will have exact ties (both will be the same distance from the query). When this happens, you should still return exactly \(K\) neighbor vectors, resolving ties by preferring the earliest matches in the original dataset's order.

Coding instructions

We'd like your code to answer multiple k-nearest-neighbor queries in one function call, all against a common dataset of N examples, each represented as an F-dimensional feature vector.

You'll be provided:

  • a reference array of 2D shape NxF, representing N distinct feature vectors
  • a query array of 2D shape QxF, representing Q distinct query vectors

You'll need to return

  • a result array of 3D shape QxKxF

where the q-th KxF slice contains the \(K\) nearest neighbor reference vectors of the q-th vector in the query array.
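Putting the pieces together, the required computation might be sketched as below. The name and signature here are illustrative (check the hw0_knn.py docstring for the graded interface), and np.argsort with kind='stable' is one way to honor the tie-breaking edge case, since a stable sort keeps the earliest dataset row first among equal distances.

```python
import numpy as np

def find_k_nearest_sketch(data_NF, query_QF, K=1):
    ''' Sketch of the K-nearest-neighbor lookup described above.

    Illustrative only; see hw0_knn.py's docstring for the
    required signature and behavior.
    '''
    # QxN matrix of Euclidean distances, via broadcasting
    diff_QNF = query_QF[:, np.newaxis, :] - data_NF[np.newaxis, :, :]
    dist_QN = np.sqrt(np.sum(np.square(diff_QNF), axis=2))
    # Stable sort breaks exact ties in favor of earlier dataset rows
    nearest_ids_QK = np.argsort(dist_QN, axis=1, kind='stable')[:, :K]
    # Fancy indexing with a QxK id array yields the QxKxF result
    return data_NF[nearest_ids_QK]
```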

Starter code: find_k_nearest_neighbors

We've drafted a function definition and defined a detailed specification in the module's docstring. You need to fill in the missing code.

https://github.com/tufts-ml-courses/cs135-26s-assignments/blob/main/hw0/hw0_knn.py

To help you out, we've provided a doctest in the starter code with a few possible inputs and corresponding expected outputs. The doctest can be run by downloading test_knn.py and running (in the same directory as your submission file)

python -m doctest test_knn.py

Your submitted code will be tested by the autograder on these doctest examples and several other hidden examples. We strongly urge you to run your own tests to verify correctness.

Optional: Math Review

In course assessments, you will not be expected to do a great deal of math. However, math will appear frequently in lecture as a way of motivating different machine learning models' properties. Rather than simply stating those properties as fact, by investigating the underlying mathematics we can better understand when to use each model and why.

This course will make frequent use of vectors and matrices. If you are unfamiliar with them, we strongly encourage you to take some time before day03 to understand basic operations on them (e.g. vector addition, matrix multiplication, etc.). Below, we've provided a couple of resources to refresh your memory or familiarize yourself with the kind of linear algebra we will be using, as well as a short page of practice problems to help you calibrate your level of understanding.

We strongly encourage you to work through these practice problems. Solutions will be posted on Piazza for you to review. Answers to these questions will not be collected, but any member of the course staff can help you with them.

Link to math worksheet

The following resources can be used to help you through the worksheet if vectors and matrices are new to you: