- understand applications of statistical methods in bioinformatics and biomedicine
- design and implement such approaches in R
- estimate probabilities of events through simulation
- perform exploratory probabilistic data analysis to identify outliers, artifacts, and patterns
- formulate and assess hypotheses about heterogeneous high dimensional data
- appropriately visualize data to assess quality and support interpretation
- present statistical data analyses clearly and contextually to specific audiences
These aims will be achieved through problem sets, in-class exercises and discussions, readings, and a course project. There will be weekly assignments throughout the term.
Grading will be based on five problem set homework assignments, a course project broken up into several other homework assignments, and class participation.
Professor Donna Slonim (she/her) is the course instructor.
Computer Science PhD student Gizem Cicekli will be our primary teaching assistant, assisted by Nick Camarda from the MD/PhD program.
Email addresses are firstname dot lastname at tufts dot edu (Nick's first name is Nicholas for this purpose), but you can reach all the course staff via Piazza at all times.
Instructor Office Hours: Weds. 2-3:15 on Zoom; Thurs. 1:30-3 in-person; or by appointment. Dr. Slonim's office is currently in Halligan 107B (until we move into the Cummings Center mid-semester).
TA Office Hours: Mon. (Gizem) 1-2PM on Zoom; Tue. (Nick) 3-4PM on Zoom; Fri. (Gizem) 12-1PM in-person at Paige Hall 007; or by appointment.
We are expecting students to come in with some basic familiarity with using R for data analysis (as opposed to actual software development experience in R, which is unnecessary for this course).
If you've never worked with R before, a good introduction is Rafael Irizarry's Introduction to Data Science book. Think about getting comfortable with the content in R basics (Ch 2) and Programming Basics (Ch 3) as a prerequisite. If you are a computer science or data science major with substantial programming experience, this will take you very little time. If you come from another field and have limited programming experience, you should make sure you understand variables, R data types including matrices and data frames, if and ifelse statements, for loops and why we avoid them in favor of vectorization whenever possible, and how to write and call a function with arguments. Some understanding of basic concepts of object-oriented programming, including classes and inheritance, may also be helpful.
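As a rough self-check on those prerequisites, the short sketch below touches each of them: a function with arguments, a for loop next to its vectorized equivalent, ifelse, and a data frame. The function name `standardize`, the variable names, and the numbers are all made up for illustration; nothing here comes from the course materials.

```r
# Hypothetical helper: write and call a function with arguments.
# Centers a numeric vector (optionally) and scales it by its standard deviation.
standardize <- function(x, center = TRUE) {
  if (center) {
    x <- x - mean(x)
  }
  x / sd(x)
}

# Made-up measurements, used only for this example.
expr <- c(2.1, 3.4, 5.9, 4.2, 8.8)

# For-loop version: square each element one at a time.
squares_loop <- numeric(length(expr))
for (i in seq_along(expr)) {
  squares_loop[i] <- expr[i]^2
}

# Vectorized version: same result, shorter and usually faster.
squares_vec <- expr^2

# ifelse() works elementwise on a whole vector.
level <- ifelse(expr > 4, "high", "low")

# Collect everything in a data frame, one row per measurement.
df <- data.frame(value = expr, level = level, z = standardize(expr))
print(df)
```

If every line of this makes sense to you, including why `expr^2` needs no loop, you are likely in good shape on the R side.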
The biology background of our students will be varied, but we expect most students taking this class have some reason to be interested in bioinformatics. We will be focusing on examples mostly from genomics and medicine, but many of the methods discussed can be used for a variety of problems in biology and beyond.
The course textbook is Modern Statistics for Modern Biology by Susan Holmes and Wolfgang Huber, published in 2019 by Cambridge University Press. You can find the textbook online here, or you can order a physical copy of it from your favorite bookstore.
Note that at the bottom of the textbook page linked above is a section called "Book supplements." This includes a zipped data directory of all the data sets used in the text, and an "Rfiles folder" containing all the code used in the examples in the book. This is extremely useful for trying out the code snippets without having to re-type everything. You will still have to install the R packages used on your own, as we do not have access to the authors' own configuration files that are loaded in the examples. But in my experience, once you've installed the relevant packages, the code provided works as advertised.
You will need access to a computer with an internet connection and the ability to install R and RStudio (there are instructions for doing so here). Problem sets will be completed in R Markdown notebooks. The software for doing this is part of your RStudio installation. If you need help in obtaining computational resources, please contact the course staff as soon as possible.
You will need to create a Gradescope account to submit your work and receive feedback on it. The link to join the class in Gradescope and the entry code are on the private class page.
There is also a class Piazza site linked here that you are welcome to use to ask questions and discuss topics with your classmates and the course staff. We created it in response to student requests, and midterm evaluations from prior courses/terms suggest that using it more is something the students felt would help them get more out of the class. Please take advantage of this resource. You will get faster answers to your questions if you ask the entire group of students and staff at once than if you just email one of us individually. The code to join the class is on the private class page.
A note on reopening:
This semester, we plan to meet in person in Halligan 108. Masks will be required in our class, as in all public indoor spaces at Tufts this fall. We will continue to follow the university's public health guidance throughout the semester; if conditions change, we will adjust as needed.
We also understand that class attendance will likely be more variable than usual. We encourage connecting with others in class early in the semester, or even reaching out on Piazza to obtain notes or discuss material with others if you have to miss class. Slides from all lectures will be posted on the private web site, and staff office hours will be available both in person and online. Quarantine- and COVID-related absences will be dealt with individually as circumstances require, unless we find that we need a more comprehensive protocol. In that case, though, the entire university will probably be going through similar issues.
We appreciate your flexibility and resilience in the face of continuous changes, and we will try to work with you to enable your learning as well as possible under the circumstances. Feel free to suggest ways we can do this better.
Many students come to this class with ongoing research projects. Our course project provides an opportunity to continue working on your research with some help from the course staff and applications of the course content.
If you don't have anything you're currently working on, we will encourage you to select a paper that involves substantial bioinformatics or biomedical data analysis in R and to attempt to replicate a part of the results, and perhaps to extend them. We will help you refine the scope of the project to make this manageable. We encourage you to attend office hours to discuss this process within the first month of class.
In either case, this will be a substantial piece of work, going into more depth about computational methods for a particular problem than we will likely be able to cover in class. The project will be broken up into small pieces across the term, starting very early with choosing and describing the proposed project. On the last day, we will have a poster session at which students present their project work to the class. Students will also submit a final written report on the full project, which will combine and extend the content from the written submissions throughout the term.
The basic project components appear on the course schedule below. More details about each piece of the project will be released later.
- Grading: Grades will be based on five problem set assignments (50%), a course project broken up across several other assignments (45%), and class participation (5%).
- Late policy: Most assignments will be due on Fridays.
Submissions are due by midnight on the indicated date;
Gradescope's timestamp is official. For late work,
we are going to use a token policy.
You will have 10 tokens for the term. You may use up to
3 tokens per assignment; each token gets you an extra
day (24 hours as counted by Gradescope). You don't need to tell anyone,
just submit and we will count the number of late days as the number of tokens
used. It is your job to keep track of your token usage.
Beyond the 10 tokens, we will not accept late submissions; submit what you have
by the deadline for partial credit. Turning work in on time is important both for consistency
in grading, and because it allows us to discuss homework in class in
a timely fashion.
As usual, in the case of serious illness or other truly exceptional circumstances (e.g., situations where your Academic Dean is involved), let us know and we will work something out.
- Diversity, Inclusion, and Collegiality:
Tufts, the Computer Science Department, and the course staff
intend to create a welcoming environment in which
all students feel supported and believe that their learning
needs and perspectives are valued. We
intend to present materials in ways that are respectful to
students of any background, ethnicity, race,
culture, gender, sexual orientation, or age.
We welcome your suggestions on how to improve course effectiveness for
yourself or others.
If you have religious conflicts with
class meetings or requirements, please connect with the course staff.
In this class, we will encourage questions, discussions, and some assignments that involve interacting in groups. While disagreements and differing opinions can be an important part of the learning experience, we expect all students to treat each other with collegiality and respect. Please reach out to course staff if there are any issues with inter-student interactions. While we do not expect this will be necessary, please be reminded that we will, if needed, follow the steps outlined in Tufts' sexual misconduct and non-discrimination policies.
Please also be aware that Tufts faculty are "mandated reporters": if we see, hear, or learn about any kind of discrimination or sexual misconduct, we are required to report it to the university. If you would like to access confidential counseling for an issue, you can find relevant resources here.
- Accommodation for Students with Disabilities:
Tufts University values the diversity of our
students, staff, and faculty, recognizing the important contribution
each student makes to our unique community. Tufts is committed to
providing equal access and support to all qualified students through
the provision of reasonable accommodations, so that each student may
fully participate in the Tufts experience.
If you have a disability that requires reasonable accommodations, please contact the Student Accessibility Services (SAS) office at Accessibility@tufts.edu or 617-627-4539 to make an appointment with an SAS representative to determine appropriate accommodations. Please be aware that accommodations cannot be enacted retroactively, making timeliness a critical aspect for their provision.
You can find more information on Tufts accessibility policies and procedures here.
In addition to following the standard procedures, if you have a disability and would like to discuss how we can better support your learning, please feel free to set up an appointment with course staff.
- Academic Integrity:
The Tufts academic integrity policy and code of conduct appear
here. In particular,
plagiarism will not be tolerated.
Please see our collaboration policy below describing what is and is not acceptable in the context of this course. If you are not certain what constitutes plagiarism, please see the academic integrity resources at the link above.
Please be aware that if Tufts faculty find evidence of academic misconduct, we are required to report it to the university.
- Collaboration Policy:
All written work and all substantive pieces of code submitted
should be your own unless you obtain
We understand that short code
snippets for common tasks will often be minor modifications of those in the
textbook, the manual, or from class; that will not count as plagiarism
in this context.
You are free to discuss assignments
with others in the class, but
you must write up your answers and code yourself. You are encouraged
to post questions on Piazza, to reach out to course staff, or
to research R and data science methods online (and, of course,
in manuals and vignettes as well as the course text).
However, if you don't know how to approach
a homework question and you need to ask someone else where to start,
you should cite their idea in your submission.
For the course project assignments, all writing must be your own. If you are (as in one project option) trying to replicate the work in a particular paper, the analysis plan will likely not be your own, but it should be clear that this is what you are doing. Please put all primary sources away before you start to write, so that all content will be in your own words. We reserve the right to use computational tools to identify instances of plagiarism or materials first written by someone else - whether published online or previously or concurrently submitted at Tufts.
All sources used should be cited. Directly copying text from any source without attribution is plagiarism and will be dealt with accordingly.
Private course materials page: You will need to log in using your CS department account and password. An account will be created for all students registered for the course in SIS who do not already have one.
Tentative Course Schedule: Shaded rows refer to past dates.
|DATE||TOPICS||READING||DUE ON FRIDAY|
|Thurs., Sept. 9||Course introduction. Generative models. Probability distribution functions in R. Discrete and continuous distributions.|| || |
|Tues., Sept. 14||Poisson and normal distribution functions; uses in Monte Carlo simulation. Extreme values.||Holmes and Huber (all readings are Chapter/Section numbers in this textbook), 1.1-1.3|| |
|Thurs., Sept. 16||R vectors, lists, data frames. Problem set basics. Empirical cumulative distribution function. Multinomials.|| || |
|Tues., Sept. 21||Simulations to estimate power. Quantiles. Fitting distributions with VCD.||1.4-1.5|| |
|Thurs., Sept. 23||Course project introduction. Data science principles. Fitting distributions to data in VCD.||3.1-3.3||Problem Set 1|
|Tues., Sept. 28||Grammars. Graphics: base R and ggplot introduction. Maximum likelihood intro. Data file for class: MFgenes.txt|| || |
|Thurs., Sept. 30||MAP vs MLE, Bayes' theorem. More on maximum likelihood.||2.9||Project Introduction|
|Tues., Oct. 5||The incidentalome. Mixture models.||4.1-4.2, Incidentalome paper|| |
|Thurs., Oct. 7||More about mixture models. Zero-inflation. Dirichlet mixtures. In-class exercise on mixtures.||4.2; optional reading: Brown, et al. on Dirichlet mixture models for protein families.||Problem Set 2|
|Tues., Oct. 12||Infinite mixtures; non-parametric bootstrap. Files for class exercises: dbtable.txt and instruction files.|| || |
|Thurs., Oct. 14||Variance stabilizing transformations; problem set 3. Files for class exercises: vardata.txt, classexercise.rds, and instructions.|| ||Project Methods and Data|
|Tues., Oct. 19||Single cell sequencing intro; hypothesis testing.||4.4.3-4.4.4|| |
|Thurs., Oct. 21||Dimension reducing projections.||5.3, 7.1-7.5||Problem Set 3|
|Tues., Oct. 26||Modeling: linear regression. Data filtering. Files for possible in-class exercises: diabetes.rds, Longley economic data, and data2filter.txt.|| || |
|Thurs., Oct. 28||Dplyr, tidy data. Logistic regression.|| ||R/stats Memes; Project Methods and Data Revisions (only if requested)|
|Tues., Nov. 2||Copy number variation. More on dplyr and the tidyverse.|| || |
|Thurs., Nov. 4||RNA-sequencing: Intro and normalization.||8.7-8.8||Problem Set 4|
|Tues., Nov. 9||CNV, CNV+PCA, RNA-sequencing.||9.1-9.3||DUE TUESDAY: Project schedule/implementation plan|
|Thurs., Nov. 11||NO CLASS: VETERANS' DAY|| || |
|Tues., Nov. 16||RNA-sequencing, design matrices.||8.9-8.10|| |
|Thurs., Nov. 18||Flow cytometry. Communicating across boundaries.|| ||Initial project results|
|Tues., Nov. 23||Correspondence analysis; presentations.||9.5||Preliminary integrated paper draft (combine and edit previous submissions as needed to ensure coherence)|
|Thurs., Nov. 25||NO CLASS: THANKSGIVING BREAK|| || |
|Tues., Nov. 30||Functional enrichment; canonical correlation analysis.||9.6|| |
|Thurs., Dec. 2||Linear discriminant analysis.||12.1-12.3||Problem Set 5|
|Tues., Dec. 7||Experimental design. Communicating visually.||13.1-13.6|| |
|Thurs., Dec. 9||Problem sets 4 and 5; moving forward.|| ||Posters|
|Tues., Dec. 14||Poster session.|| ||Remaining project results and completed paper revisions, submitted by last day of class|