Teaching Staff

Professor: J Singh, Jitendra.Singh@tufts.edu
TA: Lina Li, lina-li99@hotmail.com
TA: Hiba Eltigani, hiba.eltigani@tufts.edu

Class Meetings

Lectures: Tue, Thu 1:30–2:45 pm, synchronously over Zoom. You will need to sign in using your Tufts credentials.

Office Hours:
J: Thu 4:00–5:00 pm. Synchronously over Zoom.
Hiba: Wed 4:30–5:30 pm. Synchronously over Zoom.
Lina: Sat 10:00 am–12:00 pm. Synchronously over Zoom.

Additional Office Hours by Appointment

J's Office: 2nd Floor, Virtual Halligan

Zoom recording policy

To accommodate students who are unable to attend class for any reason, the lectures and class discussions will be recorded.

By participating, you are consenting to the recording. If you have objections to being recorded, please contact me before class.

Description

Big Data deals with emerging applications in science and engineering that generate and collect data at unprecedented speed, scale, and complexity. This data must be managed and analyzed efficiently.

This course introduces the latest techniques and infrastructures developed for big data including parallel and distributed database systems, map-reduce infrastructures, scalable platforms for complex data types, stream processing systems, and cloud-based computing.

The course content will be a blend of theory, algorithms, and practical (hands-on) work.

Prerequisites: Formally, there are no prerequisites for this course. Still, COMP-119 should not be your first programming course. Familiarity with database internals (COMP-115) will be helpful but is not required.

Programming Languages: Big Data work across the industry involves Python, Java, Scala, shell programming, SQL, and Go, in order of popularity. Most of our work in this class will be in Python, but familiarity with the other languages listed will give you a head start.

Reference Textbooks and Resources

It is a long list but all of the books are either on Safari Books Online (available free through the Library) or downloadable free for private use.

Conceptual Sources

The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition
 Luiz André Barroso
 Urs Hölzle
 Parthasarathy Ranganathan
Morgan & Claypool Publishers
Copyright: 2019
ISBN: 978-1681734330
Download for private use only from the publisher.

Data-Intensive Text Processing with MapReduce
 Jimmy Lin
 Chris Dyer
Morgan & Claypool Publishers
Copyright: 2010
ISBN: 978-1608453429
Download for private use only from the publisher.

Mining of Massive Datasets, 2nd edition
 Jure Leskovec
 Anand Rajaraman
 Jeff Ullman
Cambridge University Press
Copyright: 2014
ISBN: 978-1107077232
Download for private use only from Stanford InfoLab.

Sources for Hands-On work

Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale, 4th Edition
 Tom White
O'Reilly Media
Copyright: 2015
ISBN: 978-1491901632

Spark: Cluster Computing with Working Sets
 Matei Zaharia
 Mosharaf Chowdhury
 Michael J. Franklin
 Scott Shenker
 Ion Stoica
Copyright: 2010
Download from MIT CSAIL.

Learning Spark
 Holden Karau
 Andy Konwinski
 Patrick Wendell
 Matei Zaharia
O'Reilly Media
Copyright: 2015
ISBN: 978-1449358624

Seeking Help

Please keep the following in mind when seeking help.

  1. First, please use the discussion boards on Piazza, which are monitored daily. Other students will likely benefit from the Q&A, and some may answer your question even faster!
  2. To get in touch with a TA or the instructor, please send them a message via Piazza. As a second choice, send an email to the address provided above.
  3. To get in touch with the instructor about a matter unrelated to course content, please email the instructor. Please reserve email for confidential matters, not general class discussion.
  4. If you receive no response within 24 hours, or in case of an emergency, please call the instructor at (617) 444-9640.


Grades for the Course

The grades will be allocated as follows:

Item: % score

Class Participation: 10%
  • Turning on your video camera and keeping it turned on!
  • Thoughtful (and helpful) questions/comments in class and on Canvas.
  • Willingness to help peers when they are stuck (without doing the work for them).
  • Participation in office hours.

Take Home Quizzes: 40%
[Quiz 1: 8 points], [Quizzes 2-3: 16 points each]

Five Cs Paper Reviews: 15%
The five Cs of any 5 of the readings for meetings 3 – 18 (except meeting 10) submitted within 1 week of the meeting. [3% each]

Major Paper Review: 15%
A complete review of one paper chosen by you for deeper study. Your paper could be chosen from among:
  • Recipient of the VLDB Best Paper Award
  • Recipient of the SIGMOD Best Paper Award
  • For your convenience in searching, Jeff Huang of Brown University has compiled a fairly comprehensive list of best papers at CS conferences.
  • Recipient of the VLDB Test of Time Award
  • Any other paper of interest to you – please get an OK from one of the TAs before investing significant time on the paper.

Gedankenproject: 20%
Calling it a project would be a bit of a misnomer since you are not required to write any code! So we call it a Gedankenproject in the spirit of a Gedankenexperiment.
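As a quick check on the arithmetic in the breakdown above, the following is a minimal Python sketch. It assumes that the five components combine linearly into a single percentage, which the syllabus implies but does not state. The names `WEIGHTS`, `QUIZ_POINTS`, and `course_score` are hypothetical and used only for illustration.

```python
# Percent of the course grade for each item, copied from the breakdown above.
WEIGHTS = {
    "class_participation": 10,
    "take_home_quizzes": 40,
    "five_cs_reviews": 15,
    "major_paper_review": 15,
    "gedankenproject": 20,
}

# Quiz points as listed: Quiz 1 is worth 8, Quizzes 2 and 3 are worth 16 each.
QUIZ_POINTS = {"quiz_1": 8, "quiz_2": 16, "quiz_3": 16}

# Five Cs reviews: 5 reviews at 3% each.
FIVE_CS_REVIEWS = 5
FIVE_CS_PERCENT_EACH = 3


def course_score(fractions):
    """Return an overall percentage given the fraction (0.0 to 1.0) earned per item.

    Assumption: items are combined as a simple weighted sum.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing items: {sorted(missing)}")
    return sum(WEIGHTS[item] * fractions[item] for item in WEIGHTS)


# The weights add up to 100, the quiz points to the 40% quiz share,
# and the five reviews to the 15% review share.
assert sum(WEIGHTS.values()) == 100
assert sum(QUIZ_POINTS.values()) == WEIGHTS["take_home_quizzes"]
assert FIVE_CS_REVIEWS * FIVE_CS_PERCENT_EACH == WEIGHTS["five_cs_reviews"]

# Hypothetical student: full marks everywhere except half credit on quizzes.
example = {item: 1.0 for item in WEIGHTS}
example["take_home_quizzes"] = 0.5
print(course_score(example))
```

Under this assumption, each quiz point corresponds to one percentage point of the course grade, and each Five Cs review to three.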

Week-by-week Schedule

Mtg | Tu | Th | Topic | Reading | Quiz
1 | | 9/9 | Course Intro | How to Read a Paper; also watch How to Write a Great Research Paper |
2 | 9/14 | | Petabyte Scale File Systems | Barroso et al, chapters 1 and 8 | Quiz 1, The Opioid Files
3 | | 9/16 | Google File System, BigQuery | The Google File System; An Inside Look at Google BigQuery |
4 | 9/21 | | Redshift | AWS Redshift WhitePaper |
5 | | 9/23 | Page Rank Algorithm | The anatomy of a large-scale hypertextual Web search engine (paper downloadable from the Tufts Library); Leskovec et al, chapter 5 | Quiz 1 due
6 | 9/28 | | Map Reduce Concepts | MapReduce: Simplified Data Processing on Large Clusters |
7 | | 9/30 | Map Reduce Implementation | Lin et al | Quiz 2 assigned
8 | 10/5 | | Resilient Distributed Datasets | Resilient Distributed Datasets |
9 | | 10/7 | Spark Architecture | Spark: Cluster Computing with Working Sets |
10 | 10/12 | | Project Ideas | Everybody Lies |
11 | | 10/14 | Spark SQL | Spark SQL |
12 | 10/19 | | Map Reduce Algorithms | Leskovec et al, section 2.3 |
13 | | 10/21 | Latent Dirichlet Allocation | Latent Dirichlet Allocation | Quiz 2 due 10/23
14 | 10/26 | | Streaming 101 | Streaming 101 | Quiz 3 assigned
15 | | 10/28 | Streaming 102 | Streaming 102 |
16 | 11/2 | | Sketching Algorithms | Watch sketching; Leskovec et al, chapter 4 |
17 | | 11/4 | Spark Streaming | Spark Discretized Streams |
18 | 11/9 | | Apache Beam | The Dataflow Model |
 | | 11/11 | Veterans Day | |
19 | 11/16 | | Spark Libraries | Spark MLlib |
20 | | 11/18 | Frequent Itemsets | Leskovec et al, chapter 6 |
21 | 11/23 | | Algorithms for Text Processing | Lin et al, chapter 6 |
 | | 11/25 | Thanksgiving Holiday | |
22 | 11/30 | | Big Data Algorithms III | |
23 | | 12/2 | Multimaster Updates, CAP Theorem | | Quiz 3 due
24 | 12/7 | | Multimaster Updates, Raft Algorithm | |
25 | | 12/9 | Probabilistic Methods, Bloom Filters | |
26 | 12/14 | | High-Dimensional Search | Leskovec et al, chapter 3 |
Finals | 12/21 | | Paper/Project Report Due | |


Project

Note: This section has been superseded by updated guidance.

About the Instructor

I have worked in Cloud Computing, Big Data and Python since 2008. Python has been my programming language of choice ever since!

I received my Ph.D. in Electrical Engineering, working on solving large-scale matrix problems in Electromagnetics. I was initially on the EE faculty. I spent a major part of my career in industry, mostly in Systems Architect roles, first in Computer-Aided Design and later in Finance. Throughout my career, I have stayed close to data and databases as my area of focus.

Please call me Jitendra or J or Prof. J, whichever you prefer. (No period after the J)


Policies

Late Work Policy

Please plan to submit your work early and often! The last version submitted by the due date will be considered. Submissions after the due date may be considered, but this is not guaranteed.

Academic Integrity

You are expected to be familiar with the Student Guide to Academic Integrity at Tufts.

Academic Accommodations

If you need course adaptations or accommodations because of a disability, or if you have medical information to share with us that may impact your performance or participation in this course, please make an appointment with us as soon as possible.  

If you have approved accommodations, please request your accommodation letters online through the Office of Disability Services student portal. If you have a disability and need accommodations for this course but have not yet contacted the Office of Disability Services, you are encouraged to do so as soon as possible so that your accommodations can be implemented in a timely fashion.