Comprehensive exam study guide

This post is going to make for a boring read.

As I explained in my last post, my program’s comprehensive exams are bespoke for each student. I’m going to use this post as a high-level study guide to keep me on track. I’ll update it as I get more clarification, etc.

The basic structure of the exam is as follows:

There is a one-day (up to six-hour) written exam, taken in person at the university and proctored by one committee member. The results are classified as “Pass”, “Fail”, or “Oral Exam Needed”; the latter is used if I need to clarify or expand on a written answer. The exam questions are submitted by 3 of my 5 committee members and are designed to track the courses that make up my degree “concentration”.

After the written portion, there is a computational exam, which I will have two weeks to complete. Two of my committee members are putting together this portion of the exam.

Finally, I will convene in person with my committee to (a) present the results of the computational portion of the exam and (b) undergo any oral examination arising from my written answers.

As a personal aside/musing: looking at what I’m being tested on, it feels like my degree concentration is more “machine learning” than “computational statistics”. Though, TBH, these days that feels like more of a cultural statement than a mathematical one. I still self-identify as a statistician and regularly attend JSM. ¯\_(ツ)_/¯

The written portion

The courses we chose to base the written portion of the exam on were

  1. Bayesian Inference and Decision Theory, which I actually took at Georgetown through the Consortium of Universities in the Washington Metropolitan Area (it lets me take comparable courses at other consortium universities with less bureaucracy than transferring credits)
  2. Principles of Knowledge Mining, which is really a machine learning course focused on data mining
  3. Computational Learning and Discovery, which is also a machine learning course, but focused more on the mathematics of the various methods

Since Principles of Knowledge Mining and Computational Learning and Discovery had significant overlap in material, we decided to split the exam into supervised learning and unsupervised learning, without distinguishing which course the material came from.

Bayesian stats

Technically, my main study resource for this will be the notes, homework, and exams from my Bayesian stats class. (This is an advantage of having the instructor for the course on your committee.) However, two books that I may use as additional reference are

  1. Bayesian Data Analysis by Gelman et al.
  2. A First Course in Bayesian Statistical Methods by Hoff

And I’ve been told that everything in the course is fair game. That said, I’m going to focus on the areas that feel the most rusty to me, namely

  1. Metropolis-Hastings (this still seems like magic to me; see the sketch below)
  2. Gibbs sampling
  3. Model checking and evaluation
  4. Bayesian regression - linear and ridge regression
  5. The Dirichlet multinomial
  6. The Dirichlet multinomial with a hierarchical uniform prior
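
For my own reference, the conjugacy that makes the Dirichlet multinomial tractable is the standard update: a Dirichlet prior with a multinomial likelihood yields a Dirichlet posterior with the observed counts added to the prior,

\[
\boldsymbol\theta \sim \text{Dirichlet}(\boldsymbol\alpha), \quad \mathbf{x} \mid \boldsymbol\theta \sim \text{Multinomial}(n, \boldsymbol\theta) \implies \boldsymbol\theta \mid \mathbf{x} \sim \text{Dirichlet}(\boldsymbol\alpha + \mathbf{x}).
\]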

WRT those last two, I know they’re going to come into play in the dissertation itself. I’m planning to (a) implement a NUTS sampler for LDA as a derivative of the MH sampler in the WarpLDA algorithm and (b) implement an LDA derivative that puts a hierarchical uniform prior on \(\boldsymbol\alpha\).
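
Since MH is the piece that still feels like magic, here’s a minimal random-walk Metropolis-Hastings sketch to demystify it for myself. This is a toy example of my own (a normal target with a symmetric proposal), not anything from the course:

```r
# Toy target: an unnormalized log-density we want to sample from
log_post <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)

# Random-walk Metropolis-Hastings: propose from a symmetric normal kernel,
# accept with probability min(1, p(proposal) / p(current))
mh_sample <- function(n_iter, init = 0, prop_sd = 1) {
  draws <- numeric(n_iter)
  current <- init
  for (i in seq_len(n_iter)) {
    proposal <- rnorm(1, mean = current, sd = prop_sd)
    if (log(runif(1)) < log_post(proposal) - log_post(current)) {
      current <- proposal
    }
    draws[i] <- current
  }
  draws
}

draws <- mh_sample(10000)
mean(draws)  # should be near 2
```

The “magic” is just the accept/reject step: it enforces detailed balance with respect to the target, so the chain’s stationary distribution is the posterior even though we only ever evaluate it up to a normalizing constant.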

Supervised learning

This section is still a bit “TBD”, but here’s what I know for sure:

The books I’m using here are

  1. The Elements of Statistical Learning by Hastie et al.
  2. TBD: likely either Tan et al. (see below) or Data Mining by Witten et al.
  3. I likely won’t be using Deep Learning by Goodfellow et al., but in case someone comes across this page as a resource for studying ML themselves, I’d be remiss not to mention it. It’s an excellent book for those of us with a more mathematical bent.

The topics I’ll be focusing on are

  1. SVMs from Hastie et al. (Ch. 12); a quick hands-on example follows this list
  2. TBD
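
In the meantime, a quick way to get hands-on with SVMs in R is the e1071 package (my own choice of toy example here, not exam material):

```r
library(e1071)  # install.packages("e1071") if needed

# Fit a radial-kernel SVM on the iris data and check training accuracy
fit <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
mean(predict(fit, iris) == iris$Species)
```

Here `cost` is the soft-margin penalty from Hastie et al. Ch. 12; larger values punish margin violations more heavily.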

Unsupervised learning

The books I’m using here are

  1. The Elements of Statistical Learning by Hastie et al.
  2. Introduction to Data Mining by Tan et al.

The topics I’ll be focusing on are

  1. Clustering from Hastie et al. (Ch. 14, specifically section 14.3)
  2. Clustering from Tan et al. (Ch. 8 - 9 in the 1st ed.; TBD whether I’ll use the 1st or 2nd ed.)
  3. Anomaly detection from Tan et al. (Ch. 10 in the 1st ed.)

Clustering

  1. K-means & K-medoids
  2. Agglomerative hierarchical clustering
  3. DBSCAN
  4. Gaussian mixture model
  5. Other clustering algorithms
    • Prototype-based clustering
    • Density-based clustering (non-DBSCAN)
    • Graph-based clustering (including PageRank)
  6. Cluster evaluation
  7. Which clustering algorithm?
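
To compare a few of these side by side, here’s a small sketch on a toy dataset of my own choosing (kmeans() and hclust() are base R; dbscan is a CRAN package):

```r
x <- scale(iris[, 1:4])  # standardize features before clustering

km <- kmeans(x, centers = 3, nstart = 25)                 # K-means
hc <- cutree(hclust(dist(x), method = "average"), k = 3)  # agglomerative
db <- dbscan::dbscan(x, eps = 0.6, minPts = 5)            # DBSCAN; cluster 0 is noise

# Eyeball agreement with the known labels
table(km$cluster, iris$Species)
table(hc, iris$Species)
table(db$cluster, iris$Species)
```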

Dimensionality reduction

  1. Principal component analysis
  2. Non-negative matrix factorization
  3. Independent component analysis
  4. Multidimensional scaling
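
For PCA, at least, base R has everything needed (again a toy example of my own):

```r
# PCA via singular value decomposition on standardized features
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

summary(pc)        # proportion of variance explained by each component
head(pc$x[, 1:2])  # observation scores on the first two PCs
```

Classical MDS is also in base R as cmdscale(); NMF and ICA live in contributed packages, but I’ll mostly study those from the books.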

Anomaly detection

  1. Statistical approaches
  2. Proximity-based outlier detection
  3. Density-based outlier detection
  4. Clustering-based techniques
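
As a concrete instance of the statistical approach, a Mahalanobis-distance screen takes only a few lines of base R (my own toy example):

```r
x <- as.matrix(iris[, 1:4])

# Squared Mahalanobis distance of each row from the multivariate mean
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under approximate multivariate normality, d2 is roughly chi-squared
# with df = ncol(x), so flag points in the far upper tail
which(d2 > qchisq(0.975, df = ncol(x)))
```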

Other topics in machine learning

Other topics will be

  1. Overfitting
  2. Generalization
  3. Bias/variance tradeoff
  4. Model selection

These are covered in Hastie et al., Ch. 7.
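
For quick reference, the decomposition at the heart of items 2 and 3 is the usual split of expected squared prediction error at a point into irreducible noise, squared bias, and variance:

\[
\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \sigma^2 + \text{Bias}^2\left[\hat{f}(x)\right] + \text{Var}\left[\hat{f}(x)\right].
\]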

I am encouraged to look into the introductory chapters of Pattern Recognition and Machine Learning by Bishop.

The computational portion

I don’t really need to “study” for the computational portion. From our discussions, we are going to do some flag planting for a paper that I intend to write as part of the dissertation. I’ve implemented a couple of forms of transfer learning for LDA (or an LDA-like model) in a new R package I’m working on, so we’d be looking at a preliminary study of that. The thing is, we don’t know how permanent/forgetful LDA is in this paradigm, and I don’t know that the way I implemented it is optimal. Topics to consider might be

  1. Weighting/reweighting of the previous model’s topics in the prior of a new model; this tunes how much the new model “remembers” the old one (see the sketch after this list).
  2. Initialization strategies. Algorithms for LDA shuffle around counts of document-token-topic assignments, so if you want to “transfer”, you should initialize those counts in proportion to the previously-trained model’s. Unfortunately, corpora don’t have the same number of tokens overall or per document, so how do you initialize the counts then? (FWIW, I’m not convinced that the way I currently do it in my in-development R package is the best strategy.)
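
To make item 1 concrete, here’s a hypothetical sketch of the weighting idea. This is not how my package actually implements it; `phi_old`, `seed_eta`, and the parameter names are all inventions for illustration:

```r
# Hypothetical: seed the topic-word prior of a new LDA model with the
# previous model's topics. phi_old is a K x V matrix of topic-word
# probabilities from the old model; 'weight' converts each topic into
# pseudo-counts, tuning how much the new model "remembers" the old one.
seed_eta <- function(phi_old, weight = 100, base = 0.05) {
  weight * phi_old + base  # weighted old topics plus a flat smoothing term
}
```

The larger `weight` is relative to the new corpus’s token count, the more the old topics dominate the new fit; sweeping it is one way to quantify how “forgetful” the model is.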