(12/19) Exam 2 and the project have been graded and can be
picked up in the CS main office.
Syllabus:
Description:
The course covers the main paradigms in machine learning, including
supervised learning, unsupervised learning, and reinforcement learning.
The focus is on practical aspects: ideas underlying various methods,
design of algorithms using these ideas, and their empirical
evaluation. We will discuss well-established techniques as well as
new developments from recent research.
The course mark will be determined by a combination of the following:
Homework assignments (30%)
These will include both "pencil and paper" exercises reviewing the
material and experimental machine learning work. The latter will
include both programming assignments and use of existing machine
learning software.
Rules for late submissions: All work must be turned in on the date
specified. Please notify me of special circumstances
at least two days in advance.
Otherwise, if you haven't
finished an assignment, turn in what you have on the due date, and it will
be evaluated for partial credit.
Final project (30%)
A large experimental project, done individually or in a group.
The project will be graded based on the quality of
work/experiments/programming as well as a final project report.
Details to be announced.
In-class exam (Tuesday, Oct 25, 20%)
In-class exam (Thursday, Dec 8, 20%)
Collaboration:
Unless you are doing a group project, all work must be done and
written up individually. However,
I encourage discussion among students on the topics of exercises as
this often improves the learning experience. If you discuss the work
with other students, please state briefly but clearly,
at the top of your writeup, whom you discussed the
work with and to what extent.
Tentative List of Topics (8/05)
[We are likely to skip one or two sub-topics]
Supervised Learning Basics: Introduction, decision trees, linear
threshold elements and neural networks. Experimental evaluation.
Unsupervised learning and clustering: simple clustering
algorithms, statistical (maximum likelihood and Bayesian) models of
learning, k-means clustering, and the EM algorithm.
Related supervised methods: the Naive Bayes classifier and
instance-based learning.
Data Mining: association rules
Supervised Learning Revisited:
Utilizing relations among examples.
Learning rules and Inductive Logic Programming.
Kernel methods and support vector machines.
Attribute selection.
Multi-class problems.
Boosting and related methods.
Reinforcement learning: Markov Decision Processes. Temporal difference
and Q learning.
Textbooks and Material Covered
No text covers all the material.
The Mitchell text covers a large portion
and I will be assigning readings from it.
Some of the material is from recent research articles
(see reading list below).
Tom M. Mitchell.
Machine Learning.
McGraw-Hill, 1997.
N. Cristianini and J. Shawe-Taylor.
An Introduction to Support Vector Machines and Other Kernel-Based
Learning Methods.
Cambridge University Press, 2000.
Ian H. Witten and Eibe Frank.
Data Mining: Practical Machine Learning Tools and Techniques with
Java Implementations.
[Describes algorithms and background on the Weka system.]
R. Duda, P. Hart, and D. Stork.
Pattern Classification (2nd edition).
John Wiley & Sons, 2001.
Software
Weka: we will be using the Weka Java package for various things.
Weka is accessible on our Linux and Sun machines.
You can also download
and install it on your own computer from
the weka home page.
Using Weka: To set up the java and weka environments
run "source /comp/150ML/files/setup/setup.weka.sun"
or "source /comp/150ML/files/setup/setup.weka.linux".
See the bottom of the setup files for examples of running the system.
Here is a local version (with added local instructions) of the
README
file of the system.
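As a small sanity check of your setup, here is a minimal sketch of calling
Weka directly from Java (this assumes a Weka 3.4-style package layout; the
class name WekaSmokeTest and the file weather.arff are only placeholders):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.Random;

  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;

  // Load an ARFF file, build a J48 decision tree, and report
  // 10-fold cross-validation results.
  public class WekaSmokeTest {
      public static void main(String[] args) throws Exception {
          // "weather.arff" is a placeholder; use any ARFF file you have.
          Instances data =
              new Instances(new BufferedReader(new FileReader("weather.arff")));
          data.setClassIndex(data.numAttributes() - 1);  // last attribute is the class

          J48 tree = new J48();
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(tree, data, 10, new Random(1));
          System.out.println(eval.toSummaryString());
      }
  }

Equivalent runs from the command line are shown at the bottom of the setup
files mentioned above.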
(topic 6) Unsupervised learning and clustering + related supervised methods
We are not following any text very closely.
Here are the topics covered, in order, and readings for them:
Clustering: sections 10.6, 10.7, 10.9 of [Duda et al] text.
[Optional additional reading: weka book section 6.6]
Maximum likelihood and Bayesian estimation:
Section 6.2 of Mitchell's text.
[Optional additional reading [Duda et al] sections 3.1-3.4.]
The EM algorithm: Slide copies from class; Section 6.12 of Mitchell's text.
Bayes/ML methods for classification problems: Sections 6.2, 6.9, 6.10 of Mitchell's text.
[Optional (not covered in class): MAP decisions over labels (in contrast with
hypotheses): Sections 6.3, 6.7 of Mitchell's text.]
Instance based learning: Sections 8.1-8.4 of Mitchell's text.
Association Rules
Read (at least) pages 1-16 of the following survey:
Survey on Frequent Pattern Mining
B. Goethals.
Manuscript, 2003
You should at least skim through the following article that introduced
the association rules problem:
Mining Association Rules between Sets of Items in Large Databases
Rakesh Agrawal, Tomasz Imielinski, Arun Swami
Proceedings of the 1993 ACM SIGMOD International Conference on
Management of Data, 1993.
Several other sources mentioned in class are:
Real World Performance of Association Rule Algorithms
Zijian Zheng, Ron Kohavi, Llew Mason
Proceedings of the Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2001.
Dynamic Itemset Counting and Implication Rules for Market Basket Data
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, Shalom Tsur.
SIGMOD 1997, Proceedings ACM SIGMOD International Conference on
Management of Data, 1997.
Mining the Most Interesting Rules
R. J. Bayardo Jr. and R. Agrawal.
In Proc. of the 5th ACM SIGKDD Int'l Conf. on Knowledge Discovery and
Data Mining, pages 145-154, August 1999.
The algorithm using transversals is from the following paper (see pages 17-22):
Discovering All Most Specific Sentences
D. Gunopulos, R. Khardon, H. Mannila, S. Saluja,
H. Toivonen, and R. S. Sharma.
ACM Transactions on Database Systems, Vol 28, pages
140-174, June 2003.
Learning Rules and Inductive Logic Programming
Reading: Chapter 10 of Mitchell's text.
Some of the material discussed in class is from
I. Bratko and S.H. Muggleton,
Applications of Inductive Logic
Programming.
Communications of the ACM, 38(11):65-70, 1995.
Additional Information about RIPPER (special pruning) can be found in
William W. Cohen,
Fast Effective Rule Induction
Proc. of the 12th International Conference on Machine Learning, 1995.
Additional Information about FOIL (greedy method) can be found in
Ross Quinlan,
Learning First-Order Definitions of Functions,
JAIR vol 5, 139-161, October 1996.
Additional Information about LogAnH (LGG based method) can be found in
Marta Arias, Roni Khardon and Jerome Maloberti,
Learning Horn Expressions with LogAn-H
Technical Report (Long version of papers from ICML 2000 and ILP 2004)
Additional Information about Progol (Inverse Entailment method)
can be found in
S.H. Muggleton,
Inverse entailment and Progol.
New Generation Computing, 13:245-286, 1995.
Kernel Methods and Support Vector Machines
Reading: the N. Cristianini and J. Shawe-Taylor text listed above (on
reserve in the library), pages 9-19 and 26-32 (you may skip the proofs
on pages 14 and 17).
You may also want to skim Chapter 8, which describes various
applications.
Additional information, tutorials, etc. are available through the
kernel-machines site.
Topics for Projects:
The projects are intended to provide experience with the complete
life-cycle of experimental machine learning work.
So you should
design experiments, carry them out and analyze the results.
Different types of projects are possible.
You could introduce a new problem or data set (ideally one that
requires thinking and experimentation about the best representation
and/or task for it) and apply some machine learning methods to it.
You could investigate some less well understood properties or variants
of learning algorithms.
You could try to reproduce results from the literature.
Scanning through recent ICML papers (proceedings linked above)
can be a good source for inspiration.
You can also choose one of the projects listed below.
I can help you explore ideas to find a project topic and delimit
a reasonable scope for it. Please come talk to me to discuss your
project.
Proposal:
Once we agree on the topic and scope, you should write a short "project
proposal" (at most 1/2 page of text) describing the work you intend to
do, and hand in a copy
by Thursday, Nov 10 (preferably earlier, but no later than this date).
Project Presentations:
The last one or two classes will be devoted to project presentations
where you will give a short talk to describe your project and results.
Project Reports:
Your project writeup (5-10 pages) is due at noon on Monday, Dec 12.
The report should explain what problem you were
studying, what you were trying to test, discover, or verify, the
software written for this purpose, and the results. You should provide
numerical results (e.g. accuracies) as well as a qualitative
evaluation/summary of these. Wherever possible you should qualify
comparisons (e.g. "algorithm A is better than B for this task") using
statistical tests/intervals (a small example is sketched below).
In addition, you should submit copies of your programs, and samples of
data where appropriate.
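As one simple illustration of such a comparison, a paired t test over
per-fold accuracies can be computed directly. The fold accuracies below are
made up purely for illustration; the critical value 2.262 is for a two-sided
test at the 0.05 level with 9 degrees of freedom.

  // Paired t test over per-fold accuracies of two algorithms
  // (use the same cross-validation folds for both algorithms).
  public class PairedTTest {
      public static void main(String[] args) {
          double[] accA = {0.91, 0.88, 0.90, 0.93, 0.89, 0.92, 0.90, 0.87, 0.91, 0.90};
          double[] accB = {0.89, 0.86, 0.90, 0.90, 0.88, 0.89, 0.91, 0.85, 0.88, 0.87};

          int n = accA.length;
          double meanDiff = 0.0;
          for (int i = 0; i < n; i++) meanDiff += accA[i] - accB[i];
          meanDiff /= n;

          double var = 0.0;
          for (int i = 0; i < n; i++) {
              double d = (accA[i] - accB[i]) - meanDiff;
              var += d * d;
          }
          var /= (n - 1);

          double t = meanDiff / Math.sqrt(var / n);
          // t distribution, 9 degrees of freedom, two-sided 0.05 level: 2.262.
          System.out.println("t = " + t + (Math.abs(t) > 2.262
              ? "  (difference is significant at the 0.05 level)"
              : "  (difference is not significant at the 0.05 level)"));
      }
  }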
Project Option I: feature selection
In this project you will experiment with two methods for feature
subset selection. The first method applies the following
simple idea or a variant of it: we can compute the information gain
for each feature on the data (N.B. you will have to decide how to handle
numeric data; one option is to translate each such feature into a set
of binary features, based on plausible split points,
and run the method using the binary features).
Then we choose the top n features and
run the algorithm only with these. The number of features to use, n, can
be chosen using some iterative method and cross-validation
with a validation set (not the test set); a Weka-based sketch of this
step is given at the end of this option.
The second method uses the wrapper approach, first proposed in:
Irrelevant Features and the Subset Selection Problem
George H. John, Ron Kohavi, Karl Pfleger
International Conference on Machine Learning, 1994.
and expanded on in
Wrappers for Feature Subset Selection
Ron Kohavi, George H. John
Artificial Intelligence, 1996.
You should pick one variant (forward/backward search, the type of
search) and run experiments with it.
You should pick several datasets (from WEKADATA, ones used in the
homeworks, or others of your choice).
In order to test the sensitivity of the methods and/or learning
algorithms, you can artificially add irrelevant features to these
datasets.
You should then test the
effect of feature selection
on the Perceptron and EG (your code from homework 2) and two
additional algorithms (using Weka).
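As a rough sketch of the first (information gain) method using Weka's
attribute selection classes (assuming a Weka 3.4-style API; the file name
train.arff and the choice of 20 features are placeholders, and n should
really be tuned on a validation set as described above):

  import java.io.BufferedReader;
  import java.io.FileReader;

  import weka.attributeSelection.AttributeSelection;
  import weka.attributeSelection.InfoGainAttributeEval;
  import weka.attributeSelection.Ranker;
  import weka.core.Instances;

  // Rank features by information gain and keep the top n.
  public class InfoGainSelect {
      public static void main(String[] args) throws Exception {
          Instances data =
              new Instances(new BufferedReader(new FileReader("train.arff")));
          data.setClassIndex(data.numAttributes() - 1);

          AttributeSelection selector = new AttributeSelection();
          InfoGainAttributeEval evaluator = new InfoGainAttributeEval();
          Ranker search = new Ranker();
          search.setNumToSelect(20);          // placeholder value of n
          selector.setEvaluator(evaluator);
          selector.setSearch(search);
          selector.SelectAttributes(data);    // note the capital S in Weka's method name

          Instances reduced = selector.reduceDimensionality(data);
          System.out.println(reduced.numAttributes()
              + " attributes kept (including the class).");
      }
  }

Note that InfoGainAttributeEval discretizes numeric attributes internally,
whereas the "simple idea" above asks you to binarize them yourself. The
wrapper method can be set up along the same lines by substituting
WrapperSubsetEval (which takes a base classifier) and a subset search such
as BestFirst for the Ranker.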
Project Option II: study discretization methods
Repeat and/or modify some of the experiments reported in
Supervised and unsupervised discretization of continuous features.
James Dougherty, Ron Kohavi, and Mehran Sahami.
International Conference on Machine Learning, 1995.
You may also want to look at the following for reference:
Error-Based and Entropy-Based Discretization of Continuous Features.
Ron Kohavi and Mehran Sahami,
Knowledge Discovery in Databases 1996.
In particular, you should compare Fayyad and Irani's entropy-based
method (section 3.3) with equal frequency (see the first paragraph of
section 2 and then cf. section 3.1), or with a method of your choice.
For methods that require the number of partitions as a parameter
(as in equal frequency), you should use an iterative method
to choose the best value using a
validation set (not the test set).
You should then test the
effect of feature discretization on Naive Bayes (using Weka),
the Perceptron and EG (your code from homework 2), and one
additional algorithm (using Weka); a sketch of applying the Weka
discretization filters is given below.
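Purely as a sketch of how the two discretization schemes might be applied
with Weka filters (the file name and the number of bins are placeholders;
I am assuming a Weka release whose unsupervised Discretize filter supports
equal-frequency binning via setUseEqualFrequency):

  import java.io.BufferedReader;
  import java.io.FileReader;

  import weka.core.Instances;
  import weka.filters.Filter;

  // Apply entropy-based (Fayyad & Irani) and equal-frequency discretization.
  public class DiscretizeExample {
      public static void main(String[] args) throws Exception {
          // In a real experiment the filters should be built on the
          // training data only and then applied to the test data.
          Instances data =
              new Instances(new BufferedReader(new FileReader("train.arff")));
          data.setClassIndex(data.numAttributes() - 1);

          // Supervised (entropy/MDL) discretization.
          weka.filters.supervised.attribute.Discretize entropy =
              new weka.filters.supervised.attribute.Discretize();
          entropy.setInputFormat(data);
          Instances dataEntropy = Filter.useFilter(data, entropy);

          // Unsupervised equal-frequency discretization; choose the number
          // of bins with a validation set, not the test set.
          weka.filters.unsupervised.attribute.Discretize eqFreq =
              new weka.filters.unsupervised.attribute.Discretize();
          eqFreq.setUseEqualFrequency(true);
          eqFreq.setBins(10);                 // placeholder value
          eqFreq.setInputFormat(data);
          Instances dataEqFreq = Filter.useFilter(data, eqFreq);

          System.out.println(dataEntropy.numAttributes() + " "
              + dataEqFreq.numAttributes());
      }
  }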
Project Option IV: online algorithms for spam filtering
Devise a setup, feature construction, and training algorithms for spam
filtering with small storage and training costs; one possible direction
is sketched below.
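Purely as an illustration and not as the required design: one option is to
combine hashed bag-of-words features (so memory is bounded by a fixed table
size rather than by the vocabulary) with a mistake-driven online update such
as the perceptron. All names below are hypothetical.

  import java.util.HashSet;
  import java.util.Set;

  // Illustrative sketch: an online perceptron over hashed binary
  // bag-of-words features with bounded memory.
  public class HashedPerceptron {
      private final double[] w;

      public HashedPerceptron(int tableSize) {
          w = new double[tableSize];
      }

      // Map a token to a fixed-size table ("hashing trick").
      private int index(String token) {
          return (token.hashCode() & 0x7fffffff) % w.length;
      }

      public double score(String[] tokens) {
          double s = 0.0;
          Set<Integer> seen = new HashSet<Integer>();
          for (String t : tokens) {
              int i = index(t);
              if (seen.add(i)) s += w[i];     // binary features
          }
          return s;
      }

      // One online update; label is +1 for spam, -1 for non-spam.
      public void update(String[] tokens, int label) {
          int prediction = score(tokens) >= 0 ? 1 : -1;
          if (prediction != label) {          // mistake-driven update
              Set<Integer> seen = new HashSet<Integer>();
              for (String t : tokens) {
                  int i = index(t);
                  if (seen.add(i)) w[i] += label;
              }
          }
      }
  }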
Assignments
(Weeks 1/2) Please familiarize yourself with using the Weka system.
Read the README file above and try the system on the given
examples to make sure you can use it.
Homework Assignment 1 (due 9/27):
Text Classification using Decision Trees:
postscript,
pdf
Homework Assignment 2 (due 10/18):
GD and EG algorithms for single neural unit
postscript,
pdf
Homework Assignment 3 (due 11/3):
Evaluating hypotheses and algorithms
postscript,
pdf