Course Aims and Description
- understand applications of statistical methods in bioinformatics and biomedicine
- design and implement such approaches in R
- estimate probabilities of events through simulation
- perform exploratory data analysis to identify outliers, artifacts, and patterns
- formulate and assess hypotheses about heterogeneous high dimensional data
- appropriately visualize data to assess quality and support interpretation
- present statistical data analyses clearly and contextually to specific audiences
These aims will be achieved through problem sets, in-class exercises and discussions, readings, and a course project. There will be weekly assignments throughout the term.
Grading will be based on five problem set homework assignments, a course project broken up into several other homework assignments, online quizzes (probably not-quite-weekly), and class participation.
Professor Donna Slonim (she/her) is the course instructor.
Computer Science PhD student Hao Zhu will be our primary teaching assistant.
Email addresses are firstname dot lastname at tufts dot edu, but you can reach all the course staff via Piazza at all times.
- Mondays: Megan Gelement, 11am-12:30pm, JCC (3rd floor kitchen area)
- Wednesdays: Prof. Slonim, 10:30-12pm, JCC 322; Hao Zhu, 1-3pm, Zoom (link on private class page)
- Fridays: Prof. Slonim, 1-2:30pm, JCC 322; Hao Zhu, 11:30-1pm, JCC (3rd floor kitchen area)
- or any of us by appointment (we suggest using Piazza to reach all staff at once)
We are expecting students to come in with some basic familiarity with using R for data analysis (as opposed to actual software development experience in R, which is unnecessary for this course).
If you've never worked with R before, a good introduction is Rafael Irizarry's R for Data Science book. Think about getting comfortable with the content in R basics (Ch 2) and Programming Basics (Ch 3) as a prerequisite. If you are a computer science or data science major with substantial programming experience, this will take you very little time. If you come from another field and have limited programming experience, you should make sure you understand variables, R data types including matrices and data frames, if and ifelse statements, for loops and why we avoid them in favor of vectorization whenever possible, and how to write and call a function with arguments. Some understanding of basic concepts of object-oriented programming, including classes and inheritance, may also be helpful.
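As a rough self-check on the prerequisites listed above, the following sketch touches each of them once: variables, a for loop next to its vectorized equivalent, `ifelse`, a function with arguments, and a data frame. The gene names and values are made up for illustration and do not come from any course data set.

```r
# Hypothetical data: expression values for five made-up genes (a named numeric vector).
expr <- c(geneA = 2.5, geneB = 0.0, geneC = 7.1, geneD = 1.2, geneE = 4.8)

# Loop version: preallocate the result, then fill it one element at a time.
log_loop <- numeric(length(expr))
for (i in seq_along(expr)) {
  log_loop[i] <- log2(expr[i] + 1)
}

# Vectorized version: the same computation applied to the whole vector at once.
log_vec <- log2(expr + 1)

# ifelse() is the vectorized counterpart of an if/else statement.
status <- ifelse(expr > 2, "high", "low")

# A function with two arguments; `center` has a default value.
scale_values <- function(x, center = TRUE) {
  if (center) {
    x <- x - mean(x)
  }
  x / sd(x)
}
scaled <- scale_values(expr)

# A data frame combines columns of different types; rows can be filtered
# with a logical condition.
genes <- data.frame(gene = names(expr), expr = unname(expr), status = unname(status))
high <- genes[genes$status == "high", ]
```

If every line here reads naturally to you, you likely have the R background this course assumes.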
The biology background of our students will be varied, but we expect most students taking this class have some reason to be interested in bioinformatics. We will be focusing on examples mostly from genomics and medicine, but many of the methods discussed can be used for a variety of problems in biology and beyond.
The course textbook is Modern Statistics for Modern Biology by Susan Holmes and Wolfgang Huber, published in 2019 by Cambridge University Press. You can find the textbook online here, or you can order a physical copy of it from your favorite bookstore.
Note that at the bottom of the textbook page linked above is a section called "Book supplements." This includes a zipped data directory of all the data sets used in the text, and an "Rfiles folder" containing all the code used in the examples in the book. This is extremely useful for trying out the code snippets without having to re-type everything. You will still have to install the R packages used on your own, as we do not have access to the authors' own configuration files that are loaded in the examples. But in my experience, once you've installed the relevant packages, the code provided works as advertised.
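One way to find out which packages you still need is sketched below. The package names are only examples suggested by topics on the course schedule (ggplot, dplyr, VCD, DESeq), not a complete list of what the textbook uses, and `missing_packages` is a hypothetical helper written for this illustration. The install commands are commented out so that nothing is installed unless you choose to run them.

```r
# Example package names only (an assumption, not the full list used in the book).
cran_pkgs <- c("ggplot2", "dplyr", "vcd")  # distributed on CRAN
bioc_pkgs <- c("DESeq2")                   # distributed through Bioconductor

# Hypothetical helper: return the subset of `pkgs` that cannot be loaded
# in the current R installation.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

to_install_cran <- missing_packages(cran_pkgs)
to_install_bioc <- missing_packages(bioc_pkgs)

# Uncomment to install whatever is missing.
# if (length(to_install_cran) > 0) install.packages(to_install_cran)
# if (length(to_install_bioc) > 0) {
#   if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
#   BiocManager::install(to_install_bioc)
# }
```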
You will need access to a computer with an internet connection and the ability to install R and RStudio (there are instructions for doing so here). We would like to encourage you to bring to class a machine with RStudio on it when possible, as we will do in-class exercises at times, and these will work best if at least one student in each group has a functional machine. If you don't have an easily portable machine or a functional battery, don't worry; you can team up with someone who does or we'll provide alternative approaches. Problem sets will be completed in R Markdown notebooks. The software for doing this is part of your RStudio installation. If you need help in obtaining computational resources, please contact the course staff or the CS/ECE help desk as soon as possible.
You will need to create a Gradescope account to submit your work and receive feedback on it. The link to join the class in Gradescope and the entry code are on the private class page.
There is also a class Piazza site linked here that you are welcome to use to ask questions and discuss topics with your classmates and the course staff. We created it in response to student requests, and midterm evaluations from prior courses/terms suggest that using it more is something the students felt would help them get more out of the class. Please take advantage of this resource. You will get faster answers to your questions if you ask the entire group of students and staff at once than if you just email one of us individually. The code to join the class is on the private class page.
A note on COVID policies:
This semester, we will plan to meet in person in JCC 140. I will be masking, and I'd like to encourage students to do so in class, at least initially. I understand that this is inconvenient, but the classroom is likely to be relatively full, and I have previously had highly vulnerable students in class who did not themselves initially know they were at increased risk. So out of consideration for others, we will try to minimize in-class risk for all.
We will also continue to follow the university's public health guidance throughout the semester; if conditions change, we will adjust as needed. If you do get sick, please do not attend class in person, but please also let the course staff know so that we can give you access to additional course materials.
We also understand that class attendance will likely be more variable than usual. We encourage connecting with others in class early in the semester, and/or reaching out on Piazza to obtain notes or discuss material with others if you have to miss class. Slides from all lectures will be posted on the private web site, and staff office hours will be available both in person and online. Health-related absences will be dealt with individually as circumstances require.
We appreciate your flexibility and resilience in the face of continuous changes, and we will try to work with you to enable your learning as well as possible under the circumstances. Feel free to suggest ways we can do this better.
Many students come to this class with ongoing research projects. Our course project provides an opportunity to continue working on your research with some help from the course staff and applications of the course content.
If you don't have anything you're currently working on, we will encourage you to select a paper that involves substantial bioinformatics or biomedical data analysis in R and to attempt to replicate a part of the results, and perhaps to extend them. We will help you refine the scope of the project to make this manageable. We encourage you to attend office hours to discuss this process within the first month of class.
In either case, this will be a substantial piece of work, going into more depth about computational methods for a particular problem than we will likely be able to cover in class. The project will be broken up into small pieces across the term, starting very early with choosing and describing the proposed project. On the last day, we will have a poster session at which students present their project work to the class. Students will also submit a final written report on the full project, which will combine and extend the content from the written submissions throughout the term.
The basic project components appear on the course schedule below. More details about each piece of the project will be released later.
- Grading: Grades will be based on five problem set assignments (45%), a course project broken up across several other assignments (45%), online quizzes (5%), and class participation (5%).
- Late policy: Most assignments will be due on Fridays, except for the online quizzes, which will likely be due on Wednesdays. The quizzes should not be time-consuming; they are intended to solidify understanding of the reading or of concepts covered in class. Late quizzes will not be accepted and there will be no make-ups, but the two lowest quiz scores will be dropped. We suggest you plan to complete all of them, so that if you occasionally miss a deadline for whatever reason, it won't affect your grade at all.
Submissions are due by midnight on the indicated date; Gradescope's timestamp is official. For late assignments (other than quizzes) we are going to use a token policy. You will have 10 tokens for the term. You may use up to 2 tokens per problem set or project assignment; each token gets you an extra day (24 hours as counted by Gradescope). You don't need to tell anyone, just submit and we will count the number of late days as the number of tokens used. It is your job to keep track of your token usage. Beyond the 10 tokens, we will not accept late submissions; submit what you have by the deadline for partial credit. Turning work in on time is important both for consistency in grading, and because it allows us to discuss homework in class in a timely fashion.
As always, in the case of serious illness or other truly exceptional circumstances (e.g., situations where your Academic Dean is involved), let us know and we will work something out.
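To keep track of your own token usage under the late policy above, a small tally like the following sketch may help. The submission names and hours are invented, and two details are assumptions made for this example rather than statements of policy: any fraction of a late day is counted as a full token, and a submission needing more than 2 tokens is simply flagged rather than counted.

```r
# Tokens needed for a submission, given hours past the deadline
# (zero or negative means on time). Assumption: partial days round up.
tokens_needed <- function(hours_late) {
  ifelse(hours_late <= 0, 0, ceiling(hours_late / 24))
}

# Hypothetical record of how late each submission was, in hours.
hours_late <- c(ps1 = 0, ps2 = 5, project_intro = 30, ps3 = 0)

used <- tokens_needed(hours_late)

# At most 2 tokens may be applied to any one problem set or project assignment.
within_limit <- used <= 2

# Tokens left out of the 10 available for the term.
remaining <- 10 - sum(used[within_limit])
```

Gradescope's timestamp remains the official record; a tally like this is only for your own planning.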
- Diversity, Inclusion, and Collegiality: Tufts, the Computer Science Department, and the course staff intend to create a welcoming environment in which all students feel supported and believe that their learning needs and perspectives are valued. We intend to present materials in ways that are respectful to students of any background, ethnicity, race, culture, gender, sexual orientation, or age. We welcome your suggestions on how to improve course effectiveness for yourself or others. If you have religious conflicts with class meetings or requirements, please connect with the course staff.
In this class, we will encourage questions, discussions, and some assignments that involve interacting in groups. While disagreements and differing opinions can be an important part of the learning experience, we expect all students to treat each other with collegiality and respect. Please reach out to course staff if there are any issues with inter-student interactions. While we do not expect this will be necessary, please be reminded that we will, if needed, follow the steps outlined in Tufts' sexual misconduct and non-discrimination policies.
Please also be aware that Tufts faculty are "mandated reporters": if we see, hear, or learn about any kind of discrimination or sexual misconduct, we are required to report it to the university. If you would like to access confidential counseling for an issue, you can find relevant resources here.
- Accommodation for Students with Disabilities: Tufts University values the diversity of our students, staff, and faculty, recognizing the important contribution each student makes to our unique community. Tufts is committed to providing equal access and support to all qualified students through the provision of reasonable accommodations, so that each student may fully participate in the Tufts experience.
If you have a disability that requires reasonable accommodations, please contact the StAAR Center at StaarCenter@tufts.edu or 617-627-4539. Please be aware that accommodations cannot be enacted retroactively, making timeliness a critical aspect for their provision.
Academic Support at the StAAR Center: The StAAR Center offers a variety of FREE resources to all students. Students may make an appointment to work on any writing-related project or assignment, attend subject tutoring in a variety of disciplines, or meet with an academic coach to hone skills like time management and navigating procrastination. Students can make an appointment for any of these services by visiting https://students.tufts.edu/staar-center.
In addition to following the standard procedures, if you have a disability and would like to discuss how we can better support your learning, please feel free to set up an appointment with course staff.
- Policy on sharing class materials: This course is designed for everyone to feel comfortable participating in discussion, asking questions, learning, and facilitating the learning of others. In order for that atmosphere to be maintained, any recordings of our conversations will only be shared with the enrolled students in the class (i.e., not posted publicly). Similarly, we have specifically designed the syllabus, quizzes, handouts, and lectures for the people who are enrolled in the course this term and those may not be shared outside this course. It is against Tufts policy for anyone to share any content made available in this course, including course syllabi, reading materials, problem sets, videos, handouts, and quizzes, with anyone outside of the course without the express permission of the instructor. This especially includes any posting or sharing of videos or other recordings on publicly accessible websites or forums. Any such sharing or posting could violate copyright law or law that protects the privacy of student educational records.
- Academic Integrity: The Tufts academic integrity policy and code of conduct appear here. In particular, plagiarism will not be tolerated.
Please see our collaboration policy below describing what is and is not acceptable in the context of this course. If you are not certain what constitutes plagiarism, please see the academic integrity resources at the link above. We will spend time on this early in the semester as well.
Please be aware that if Tufts faculty find evidence of academic misconduct, we are required to report it to the university.
- Collaboration Policy: All written work and all substantive pieces of code submitted should be your own unless you obtain explicit permission from the course staff.
We understand that short code snippets for common tasks will often be minor modifications of those in the textbook, the manual, or from class; that will not count as plagiarism in this context. You are free to discuss assignments with others in the class, but you must write up your answers and code yourself. You are encouraged to post questions on Piazza, to reach out to course staff, or to research R and data science methods online (and, of course, in manuals and vignettes as well as the course text). However, if you don't know how to approach a homework question and you need to ask someone else where to start, you should cite their idea in your submission.
For the course project assignments, all writing must be your own. If you are (as in one project option) trying to replicate the work in a particular paper, the analysis plan will likely not be your own, but it should be clear that this is what you are doing. Please put all primary sources away before you start to write, so that all content will be in your own words. We reserve the right to use computational tools to identify instances of plagiarism or materials first written by someone else - whether published online or previously or concurrently submitted at Tufts.
All sources used should be cited. Directly copying text from any source without attribution is plagiarism and will be dealt with accordingly.
Course Materials
For homeworks, slides, and other class information, go to the private course materials page. You will need to log in using your CS department account and password. An account will be created for all students registered for the course in SIS who do not already have one.
Tentative Course Schedule:
Shaded rows refer to past dates.
| DATE | TOPICS | READING | DUE ON FRIDAY (unless otherwise stated) |
|---|---|---|---|
| Weds., Sept. 7 | Course introduction. Generative models. Probability distribution functions in R. Discrete and continuous distributions. | | |
| Mon., Sept. 12 | Poisson and normal distribution functions; uses in Monte Carlo simulation. Extreme values. | Holmes and Huber (all readings are Chapter/Section numbers in this textbook), 1.1-1.3 | |
| Weds., Sept. 14 | R vectors, lists, data frames. Problem set basics. Empirical cumulative distribution function. Multinomials. | Plagiarism workshop from University of Guelph | |
| Mon., Sept. 19 (ADD DATE: Sept. 20) | Simulations to estimate power. Quantiles. Fitting distributions with VCD. | 1.4-1.5 | |
| Weds., Sept. 21 | Course project introduction. Data science principles. Fitting distributions to data in VCD. | 3.1-3.3 | Problem Set 1 |
| Mon., Sept. 26 | Grammars. Graphics: base R and ggplot introduction. Maximum likelihood intro. Data file for class: MFgenes.txt | | |
| Weds., Sept. 28 | MAP vs MLE, Bayes' theorem. More on maximum likelihood. | 2.9 | Project Introduction |
| Mon., Oct. 3 | Mixture models. Zero-inflation. Non-parametric bootstrap. | 4.1-4.4 | |
| Weds., Oct. 5 | Incidentalome. Paper discussion questions. Bootstrapping. | Incidentalome paper | Problem Set 2 |
| Mon., Oct. 10 | NO CLASS: INDIGENOUS PEOPLES' DAY | | |
| Weds., Oct. 12 | Problem Set 3 prep: grep, RDS; transcriptomics intro. Classexercise.rds. and instructions. Also, dbtable.txt, Bootstrap exercise instructions | | Project Methods and Data |
| Mon., Oct. 17 | Dimension reducing projections. Hypothesis testing | 7.1-7.5; 6.1-6.9 | |
| Weds., Oct. 19 | RNA sequencing data; Variance stabilizing transformation; heteroskedasticity. Files for class exercises: vardata.txt, pcaExercise.pdf, data2filter.txt | 4.4.3-4.4.4; 8.1-8.5 | Problem Set 3 |
| Mon., Oct. 24 | RNA-sequencing, design matrices. DESeq exercise, ratcounts.txt, ratmorphineannot.csv | | |
| Weds., Oct. 26 | Quiz 4 review; flow cytometry / PS 4 prep. Data filtering with dplyr. Instructions and data2filter.txt for possible in-class exercise. | | R/stats Memes; Project Methods and Data Revisions (only if requested) |
| Mon., Oct. 31 | dplyr and tidy data; faceting. diabetes.rds and db.instructions.pdf | | |
| Weds., Nov. 2 | Problem set 3. Project time management. Functional enrichment. | | Problem Set 4 |
| Mon., Nov. 7 | Functional enrichment; Linear regression. | 10.3; 7.3-7.4 | |
| Weds., Nov. 9 | Linear regression. Project planning. Effective data presentation. Longley economic data and instructions for in-class exercise. | | DUE WEDNESDAY, 11/9: Project schedule/implementation plan |
| Mon., Nov. 14 | Logistic regression, CNVs. diabetes.rds, logistic regression exercise instructions | | |
| Weds., Nov. 16 | Multitable canonical correlation analysis; research and project discussion | 9.6; (optional: Kashyap paper) | Initial Project Results done by Friday |
| Mon., Nov. 21 | PS 4 follow-up; correspondence analysis. preterm.txt and Instructions for in class exercise | 9.4-9.5 | DUE MONDAY, 11/21: Preliminary integrated paper draft (combine and edit previous submissions as needed to ensure coherence; include online initial project results) |
| Weds., Nov. 23 | NO CLASS: THANKSGIVING BREAK | | |
| Mon., Nov. 28 | Correspondence analysis trajectories, temporal gradients. In-class poster design questions. Poster template critique | | |
| Weds., Nov. 30 | Linear discriminant analysis and more about classification | 12.1-12.4 | Problem Set 5 |
| Mon., Dec. 5 | Outlier detection, Quiz 6, more on functional programming | | |
| Weds., Dec. 7 | Problem set follow up; Experimental design. Moving forward. | 13.1-13.6 | DUE WEDNESDAY, 12/7: Posters |
| Mon., Dec. 12 | Poster session | | DUE MONDAY, 12/12: Completed paper revisions, including remaining project results |