Comprehensive exam study guide

This post is going to make for a boring read.

As I explained in my last post, my program’s comprehensive exams are bespoke for each student. I’m going to use this post as a high-level study guide to keep me on track. I’ll update it as I get more clarification, etc.

The basic structure of the exam is as follows:

There is a one-day (up to six-hour) written exam, taken in person at the university and proctored by one committee member. The results are classified as “Pass”, “Fail”, or “Oral Exam Needed”; the latter is used if I need to clarify or expand on a written answer. The exam questions are submitted by 3 of my 5 committee members and are designed to track the courses that make up my degree “concentration”.

After the written portion, there is a computational exam, which I will have two weeks to complete. Two of my committee members are putting together this portion of the exam.

Finally, I will convene in person with my committee to (a) present the results of the computational portion of the exam and (b) undergo any oral examination arising from my written answers.

As a personal aside/musing: looking at what I’m being tested on, it feels like my degree concentration is more “machine learning” than “computational statistics”. Though, TBH, these days that feels like more of a cultural statement than a mathematical one. I still self-identify as a statistician and regularly attend JSM. ¯\_(ツ)_/¯

The written portion

The courses we chose to base the written portion of the exam on were

  1. Bayesian Inference and Decision Theory, which I actually took at Georgetown through the Consortium of Universities in the Washington Metropolitan Area (it lets me take comparable courses at other consortium universities with less bureaucracy than transferring credits)
  2. Principles of Knowledge Mining, which is really a machine learning course focused on data mining
  3. Computational Learning and Discovery, which is also a machine learning course, but focused more on the mathematics of the various methods

Since Principles of Knowledge Mining and Computational Learning and Discovery had significant overlap in material, we decided to split the exam into supervised learning and unsupervised learning, without distinguishing which course the material came from.

Bayesian stats

Technically, my main study resource for this will be the notes, homework, and exams from my Bayesian stats class. (This is an advantage of having the instructor for the course on your committee.) However, two books that I may use as additional reference are

  1. Bayesian Data Analysis by Gelman et al.
  2. A First Course in Bayesian Statistical Methods by Hoff

And I’ve been told that everything in the course is fair game. That said, I’m going to focus on the areas that feel the most rusty to me, namely

  1. Metropolis-Hastings (this still seems like magic to me; see the sketch below)
  2. Gibbs sampling
  3. Model checking and evaluation
  4. Bayesian regression - linear and ridge regression
  5. The Dirichlet multinomial
  6. The Dirichlet multinomial with a hierarchical uniform prior
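
For my own reference, the conjugacy that makes the Dirichlet multinomial tractable is the standard update: a Dirichlet prior with a multinomial likelihood yields a Dirichlet posterior with the observed counts added to the prior,

\[
\boldsymbol\theta \sim \text{Dirichlet}(\boldsymbol\alpha), \quad \mathbf{x} \mid \boldsymbol\theta \sim \text{Multinomial}(n, \boldsymbol\theta) \implies \boldsymbol\theta \mid \mathbf{x} \sim \text{Dirichlet}(\boldsymbol\alpha + \mathbf{x}).
\]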

WRT those last two, I know they’re going to come into play in the dissertation itself. I’m planning to (a) implement a NUTS sampler for LDA as a derivative of the MH sampler in the WarpLDA algorithm and (b) implement an LDA derivative that puts a hierarchical uniform prior on \(\boldsymbol\alpha\).
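
Since MH is the piece that still feels like magic, here’s a minimal random-walk Metropolis-Hastings sketch to demystify it for myself. This is a toy example of my own (a normal target with a symmetric proposal), not anything from the course:

```r
# Toy target: an unnormalized log-density we want to sample from
log_post <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)

# Random-walk Metropolis-Hastings: propose from a symmetric normal kernel,
# accept with probability min(1, p(proposal) / p(current))
mh_sample <- function(n_iter, init = 0, prop_sd = 1) {
  draws <- numeric(n_iter)
  current <- init
  for (i in seq_len(n_iter)) {
    proposal <- rnorm(1, mean = current, sd = prop_sd)
    if (log(runif(1)) < log_post(proposal) - log_post(current)) {
      current <- proposal
    }
    draws[i] <- current
  }
  draws
}

draws <- mh_sample(10000)
mean(draws)  # should be near 2
```

The “magic” is just the accept/reject step: it enforces detailed balance with respect to the target, so the chain’s stationary distribution is the posterior even though we only ever evaluate it up to a normalizing constant.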

Supervised learning

This section is still a bit “TBD”, but here’s what I know for sure:

The books I’m using here are

  1. The Elements of Statistical Learning by Hastie et al.
  2. TBD: likely either Tan et al. (see below) or Data Mining by Witten et al.
  3. I likely won’t be using Deep Learning by Goodfellow et al., but in case someone comes across this page as a resource for studying ML themselves, I’d be remiss not to mention it. It’s an excellent book for those of us with a more mathematical bent.

The topics I’ll be focusing on are

  1. SVMs from Hastie et al. (Ch. 12); a quick hands-on example follows this list
  2. TBD
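
In the meantime, a quick way to get hands-on with SVMs in R is the e1071 package (my own choice of toy example here, not exam material):

```r
library(e1071)  # install.packages("e1071") if needed

# Fit a radial-kernel SVM on the iris data and check training accuracy
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
mean(predict(fit, iris) == iris$Species)
```

Here `cost` is the soft-margin penalty from Hastie et al. Ch. 12; larger values punish margin violations more heavily.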

Unsupervised learning

The books I’m using here are

  1. The Elements of Statistical Learning by Hastie et al.
  2. Introduction to Data Mining by Tan et al.

The topics I’ll be focusing on are

  1. Clustering from Hastie et al. (Ch. 14, specifically section 14.3)
  2. Clustering from Tan et al. (Ch. 8 - 9 in the 1st ed.; TBD whether I’ll use the 1st or 2nd ed.)
  3. Anomaly detection from Tan et al. (Ch. 10 in the 1st ed.)

Clustering

  1. K-means & K-medoids
  2. Agglomerative hierarchical clustering
  3. DBSCAN
  4. Gaussian mixture model
  5. Other clustering algorithms
    • Prototype-based clustering
    • Density-based clustering (non-DBSCAN)
    • Graph-based clustering (including PageRank)
  6. Cluster evaluation
  7. Which clustering algorithm?
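
To compare a few of these side by side, here’s a small sketch on a toy dataset of my own choosing (kmeans() and hclust() are base R; dbscan is a CRAN package):

```r
x <- scale(iris[, 1:4])  # standardize features before clustering

km <- kmeans(x, centers = 3, nstart = 25)                 # K-means
hc <- cutree(hclust(dist(x), method = "average"), k = 3)  # agglomerative
db <- dbscan::dbscan(x, eps = 0.6, minPts = 5)            # DBSCAN; cluster 0 is noise

# Eyeball agreement with the known labels
table(km$cluster, iris$Species)
table(hc, iris$Species)
table(db$cluster, iris$Species)
```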

Dimensionality reduction

  1. Principal component analysis
  2. Non-negative matrix factorization
  3. Independent component analysis
  4. Multidimensional scaling
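
For PCA, at least, base R has everything needed (again a toy example of my own):

```r
# PCA via singular value decomposition on standardized features
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pc)        # proportion of variance explained by each component
head(pc$x[, 1:2])  # observation scores on the first two PCs
```

Classical MDS is also in base R as cmdscale(); NMF and ICA live in contributed packages, but I’ll mostly study those from the books.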

Anomaly detection

  1. Statistical approaches
  2. Proximity-based outlier detection
  3. Density-based outlier detection
  4. Clustering-based techniques
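
As a concrete instance of the statistical approach, a Mahalanobis-distance screen takes only a few lines of base R (my own toy example):

```r
x <- as.matrix(iris[, 1:4])

# Squared Mahalanobis distance of each row from the multivariate mean
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under approximate multivariate normality, d2 is roughly chi-squared
# with df = ncol(x), so flag points in the far upper tail
which(d2 > qchisq(0.975, df = ncol(x)))
```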

Other topics in machine learning

Other topics will be

  1. Overfitting
  2. Generalization
  3. Bias/variance tradeoff
  4. Model selection

These are covered in Hastie et al., Ch. 7.
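
For quick reference, the decomposition at the heart of items 2 and 3 is the usual split of expected squared prediction error at a point into irreducible noise, squared bias, and variance:

\[
\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \sigma^2 + \text{Bias}^2\left[\hat{f}(x)\right] + \text{Var}\left[\hat{f}(x)\right].
\]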

I am encouraged to look into the introductory chapters of Pattern Recognition and Machine Learning by Bishop.

The computational portion

I don’t really need to “study” for the computational portion. From our discussions, we are going to do some flag planting for a paper that I intend to write as part of the dissertation. I’ve implemented a couple of forms of transfer learning for LDA (or an LDA-like model) in a new R package I’m working on, so we’d be looking at a preliminary study of that. The thing is, we don’t know how permanent/forgetful LDA is in this paradigm, and I don’t know that the way I implemented it is optimal. Topics to consider might be

  1. Weighting/reweighting of the previous model’s topics in the prior of a new model; this tunes how much the new model “remembers” the old one (see the sketch after this list).
  2. Initialization strategies. Algorithms for LDA shuffle around counts of document-token-topic assignments, so if you want to “transfer”, you should initialize those counts in proportion to the previously-trained model’s. Unfortunately, corpora don’t have the same number of tokens overall or per document, so how do you initialize the counts then? (FWIW, I’m not convinced that the way I currently do it in my in-development R package is the best strategy.)
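
To make item 1 concrete, here’s a hypothetical sketch of the weighting idea. This is not how my package actually implements it; `phi_old`, `seed_eta`, and the parameter names are all inventions for illustration:

```r
# Hypothetical: seed the topic-word prior of a new LDA model with the
# previous model's topics. phi_old is a K x V matrix of topic-word
# probabilities from the old model; 'weight' converts each topic into
# pseudo-counts, tuning how much the new model "remembers" the old one.
seed_eta <- function(phi_old, weight = 100, base = 0.05) {
  weight * phi_old + base  # weighted old topics plus a flat smoothing term
}
```

The larger `weight` is relative to the new corpus’s token count, the more the old topics dominate the new fit; sweeping it is one way to quantify how “forgetful” the model is.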