# Comprehensive exam study guide

This post is going to make a boring read.

As I explained in my last post, my program’s comprehensive exams are bespoke for the student. I’m going to use this post as a high-level study guide to keep me on track. I’ll update it as I get more clarification.

The basic structure of the exam is as follows:

There is a one-day (up to six-hour) written exam taken in person at the university and proctored by one committee member. The results are classified as “Pass”, “Fail”, or “Oral Exam Needed”. The last is used if I need to clarify or expand on my written answers. The exam questions are submitted by three of my five committee members and are designed to track the courses I took for my degree “concentration”.

After the written portion, there is a computational exam which I will have two weeks to complete. Two of my committee members are taking on putting together this portion of the exam.

Finally, I will convene in person with my committee and (a) present the results of the computational portion of the exam and (b) undergo any oral examination needed to follow up on my written answers.

### Supervised learning

This section is still a bit “TBD” but what I know for sure follows:

The books I’m using here are

1. The Elements of Statistical Learning by Hastie et al.
2. TBD, likely either Introduction to Data Mining by Tan et al. (see below) or Data Mining by Witten et al.
3. I likely won’t be using Deep Learning by Goodfellow et al. but in case someone comes across this page as a resource to study ML themselves, I feel I’d be remiss not to mention it. This is an excellent book for those of us with a more mathematical bent.

The topics I’ll be focusing on are

1. SVMs from Hastie et al. (Ch. 12)
2. TBD

### Unsupervised learning

The books I’m using here are

1. The Elements of Statistical Learning by Hastie et al.
2. Introduction to Data Mining by Tan et al.

The topics I’ll be focusing on are

1. Clustering from Hastie et al. (Ch. 14, specifically section 14.3)
2. Clustering from Tan et al. (Ch. 8 - 9 in 1st ed.; TBD on whether I use the 1st or 2nd ed.)
3. Anomaly detection from Tan et al. (Ch. 10 in 1st ed.)

#### Clustering

1. K-means & K-medoids
2. Agglomerative hierarchical clustering
3. DBSCAN
4. Gaussian mixture model
5. Other clustering algorithms
   - Prototype-based clustering
   - Density-based clustering (non-DBSCAN)
   - Graph-based clustering (including PageRank)
6. Cluster evaluation
7. Which clustering algorithm?

#### Dimensionality reduction

1. Principal component analysis
2. Non-negative matrix factorization
3. Independent component analysis
4. Multidimensional scaling

#### Anomaly detection

1. Statistical approaches
2. Proximity-based outlier detection
3. Density-based outlier detection
4. Clustering-based techniques

### Other topics in machine learning

Other topics will be

1. Overfitting
2. Generalization
3. Model selection

These are covered in Hastie et al. (Ch. 7).

I am also encouraged to look into the introductory chapters of Pattern Recognition and Machine Learning by Bishop.

## The computational portion

I don’t really need to “study” for the computational portion. From discussions so far, the plan is to do some flag planting for a paper that I intend to write as part of my dissertation. I’ve implemented a couple of forms of transfer learning for LDA (or an LDA-like model) in a new R package I’m working on, so we’d be looking at a preliminary study of that. The thing is, we don’t know how much LDA retains or forgets in this paradigm, and I don’t know that the way I implemented it is optimal. Topics to consider might be

1. Weighting/reweighting of the previous model’s topics in the prior of a new model. This tunes how much the new model “remembers” the old model.
2. Initialization strategies. Algorithms for LDA shuffle around counts of document-token-topic assignments. If you want to “transfer”, you should initialize your counts in proportion to those of the previously trained model. Unfortunately, corpora don’t have the same number of tokens overall or per document. So how do you initialize those counts? (FWIW, I am not convinced that the way I currently do it in my in-development R package is the best strategy.)
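
To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not how my R package does it, just one possible approach under stated assumptions. The function names (`transfer_prior`, `init_assignments`) and the parameters `weight`, `base_beta`, and `smooth` are all hypothetical. The main assumption is that the previous model is summarized by a topic-by-word count matrix, which gets normalized row by row so that the size of the old corpus doesn't matter.

```python
import random


def transfer_prior(prev_topic_word_counts, weight, base_beta=0.05):
    """Build a topic-word Dirichlet prior for a new model (hypothetical helper).

    Each previous topic is normalized to a distribution over words, scaled by
    `weight` pseudo-counts, and added to a flat base prior. A larger `weight`
    means the new model "remembers" more of the old one.
    Assumes every previous topic has at least one count.
    """
    prior = []
    for counts in prev_topic_word_counts:
        total = sum(counts)
        prior.append([base_beta + weight * c / total for c in counts])
    return prior


def init_assignments(docs, prev_topic_word_counts, rng, smooth=1e-12):
    """Sample an initial topic for every token in a new corpus.

    For a token with word id w, topic k is drawn with probability proportional
    to the previous model's normalized weight of w in topic k. Counts are
    built from the new corpus's own tokens, so the old and new corpora may
    differ in size. `smooth` avoids all-zero weights for unseen words.
    """
    n_topics = len(prev_topic_word_counts)
    vocab_size = len(prev_topic_word_counts[0])
    totals = [sum(row) for row in prev_topic_word_counts]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    doc_topic = [[0] * n_topics for _ in docs]
    for d, doc in enumerate(docs):
        for w in doc:
            weights = [
                prev_topic_word_counts[k][w] / totals[k] + smooth
                for k in range(n_topics)
            ]
            k = rng.choices(range(n_topics), weights=weights)[0]
            topic_word[k][w] += 1
            doc_topic[d][k] += 1
    return doc_topic, topic_word


# Toy example: 2 previous topics over a 3-word vocabulary.
prev_counts = [[8, 2, 0], [0, 2, 8]]
new_docs = [[0, 0, 1], [2, 2, 1, 2]]  # documents as lists of word ids

prior = transfer_prior(prev_counts, weight=10.0, base_beta=0.1)
doc_topic, topic_word = init_assignments(new_docs, prev_counts, random.Random(0))
```

Whether sampling from the normalized old topics is actually a good initialization is exactly the open question; this sketch only shows that it is one way to get around the mismatch in token counts.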