Amateur Hour

The Statistical Wasteland

Or an alternative syllabus for Stats 101
May 29, 2016. Filed under Technical
Tags: statistics, teaching

TLDR: Columbia’s introductory statistics classes are horrible, and I propose a syllabus for a revamped intro to stats class.

In case it isn’t obvious, people suck at statistics. Even professional researchers mess up sometimes (and that’s not even touching the current replication crisis).

The amount of basic statistical ignorance among otherwise intelligent and informed people is astounding. I’ve lost track of how many times I’ve needed to point out that “correlation is not causation”, or “median and mean measure different things, especially in skewed data”, or “statistically significant does not mean true”, or something equally basic.
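To make the mean/median point concrete, here’s a quick Python sketch (the numbers are randomly generated, but the right-skewed shape is typical of things like incomes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed "income" data: lots of modest values
# plus a handful of enormous ones.
incomes = rng.lognormal(mean=10, sigma=1.5, size=10_000)

print(f"mean:   {np.mean(incomes):,.0f}")    # dragged upward by the huge values
print(f"median: {np.median(incomes):,.0f}")  # the "typical" value is far smaller
```

The mean comes out several times larger than the median, so which one you report completely changes the story the data tells.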

It’s not just that good statistics is hard to do, although that’s probably part of it. No, I’m convinced that statistical ignorance is so pervasive because people’s statistical education is terrible.

Part of this is because of rampant misinformation and deliberate oversimplification. Last year, for example, Frontiers of Science (Columbia’s required science class) managed to teach a factually incorrect definition of a confidence interval (we essentially learned that they were Bayesian credible intervals). Even introductory statistics courses aren’t necessarily much better – I’ve heard absolutely terrible things about Columbia’s introduction to probability and statistics (SIEO 4150), which is required for the computer science major in SEAS.
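For the record, a frequentist confidence interval makes a claim about the procedure over repeated samples, not about any single interval. A quick simulation sketch (made-up parameters, using numpy and scipy) shows what the “95%” actually refers to:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, trials = 5.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    # 95% t-interval for the mean, computed from this one sample
    lo, hi = stats.t.interval(0.95, n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    covered += lo <= true_mean <= hi

# Roughly 95% of the intervals contain the true mean: the "95%" describes
# the long-run behavior of the procedure, not any individual interval.
print(covered / trials)
```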

But the problems go a lot deeper than that. I had a phenomenal statistics teacher in high school – but even with some of the best teaching I’ve ever had, my introductory stats class seems hopelessly lacking in retrospect. Even when it’s properly taught, the standard introductory statistics curriculum is just poorly designed. Instead of predictive modeling or causal inference (the applications people actually want statistics for), the typical Stats 101 course focuses mainly on memorizing a bunch of different null-hypothesis significance testing procedures. Even within hypothesis testing, there’s no discussion of bootstrapping, permutation tests, or other resampling techniques. There’s no non-parametric hypothesis testing introduced. And I don’t think we even mentioned the word “overfitting”, much less discussed cross-validation or other model validation techniques.
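To show how little machinery some of these missing topics actually need, here’s a two-sample permutation test in about a dozen lines of Python (a sketch, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements from two groups we want to compare.
group_a = np.array([12.1, 9.8, 11.5, 10.2, 13.0, 9.5])
group_b = np.array([8.7, 10.1, 9.2, 8.0, 9.9, 8.4])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

# Under the null hypothesis the group labels are meaningless, so shuffle
# them many times and see how often a difference this extreme shows up.
n_perms = 10_000
count = 0
for _ in range(n_perms):
    rng.shuffle(pooled)
    diff = pooled[:len(group_a)].mean() - pooled[len(group_a):].mean()
    count += abs(diff) >= abs(observed)

print(f"permutation p-value: {count / n_perms:.4f}")
```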

A New Hope

As you might imagine from my strong objections, I think that we can do a lot better than our current introductory stats classes. In fact, I’ve taken the liberty of drafting an entirely new Stats 101 syllabus, one that’s much more practical than the current ones.

Obviously, I don’t think it’s perfect. I’m not even a statistics major, let alone an actually qualified professional. Still, I think it represents a big improvement over traditional Stats 101 curricula (or at least the one I went through).

Philosophically, the biggest difference is that this class is about understanding and interpreting statistics instead of doing statistics. Frankly, if you want to become a practitioner and actually do stats, you should probably take more advanced courses than just the intro class. This class focuses on the logic and intuitions behind statistics, not the formalisms or mathematics (the only concession to mathematics is the first unit on probability, which I feel is unavoidable). Stats 101 is most people’s only introduction to statistics, so it’s important that they learn the big ideas first.

As part of the emphasis on big ideas and intuitions, there’s also a much stronger focus on modeling assumptions. Essentially all statistical procedures assume a simplified model of the world, and it’s important to remember what those simplifying assumptions are. Go look at Li’s Gaussian copula formula and the 2007 financial crisis for an example of the consequences if you don’t. To highlight just how important I think this is, I’ve included a lesson in almost every unit specifically devoted to the pitfalls and horrors that incorrect assumptions can lead you into.
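For a tiny numerical illustration of how much a distributional assumption can matter, compare the tail probability of an extreme event under a normal model versus a heavy-tailed one (a sketch; Student’s t with 3 degrees of freedom stands in for a fat-tailed distribution):

```python
from scipy import stats

# Probability of an observation beyond a cutoff of 6, under two models.
p_normal = stats.norm.sf(6)  # survival function = 1 - CDF
p_fat = stats.t.sf(6, df=3)  # Student's t with 3 df has fat tails

print(f"normal model:     {p_normal:.1e}")  # ~1e-9: "never happens"
print(f"fat-tailed model: {p_fat:.1e}")     # ~5e-3: millions of times likelier
```

If your risk model assumes the first number and the world delivers the second, you’re in for a very bad decade.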

As you might have guessed, this curriculum also covers many more topics than before. To gain time for causal inference and predictive modeling, I’ve lumped together all the traditional hypothesis tests – z-tests, t-tests, \(\chi^2\)-tests, etc. – into one lesson. They all rely on pretty similar ideas and assumptions, so there’s really no reason to go into the mathematical details distinguishing them until an upper-level course. Honestly, I’d be pretty happy if students just got a flow chart telling them when to use each test and what the appropriate Python/R/Excel code is to run it.
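And honestly, the code is the easy part. Here’s what two of the classic tests look like in Python with scipy (a sketch with made-up data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(loc=5.0, scale=1.0, size=40)  # hypothetical group A
b = rng.normal(loc=5.5, scale=1.0, size=40)  # hypothetical group B

# Two-sample t-test: do the two group means differ?
t_stat, p_val = stats.ttest_ind(a, b)
print(f"t-test: t = {t_stat:.2f}, p = {p_val:.4f}")

# Chi-squared test of independence on a hypothetical 2x2 table of counts.
table = np.array([[20, 30],
                  [35, 15]])
chi2, p_val, dof, expected = stats.chi2_contingency(table)
print(f"chi-squared: chi2 = {chi2:.2f}, p = {p_val:.4f}")
```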

One objection to this compacting is that there’s really not enough time to teach students how each test works. We’ll essentially just tell them – “because of limiting theorems (like the central limit theorem), this statistic will have this sampling distribution, which we can use to calculate p-values”. Although I agree that this is wholly intellectually unsatisfying, I don’t really see an alternative. It’s not like you get real explanations in the current system either – any real explanation for why the t-statistic has a Student’s t distribution relies on some fairly heavy-duty probability which beginners won’t have.
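The one consolation is that simulation makes the “trust the theorem” story checkable, even if it isn’t a proof. Students can brute-force the sampling distribution of the t-statistic and watch it line up with Student’s t (a sketch):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, trials = 10, 100_000

# Brute-force the sampling distribution of the one-sample t-statistic:
# draw many samples from a null (mean 0) population, compute t for each.
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))
t_stats = samples.mean(axis=1) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# The simulated tail frequencies match Student's t with n - 1 df.
for cutoff in (1.0, 2.0, 3.0):
    simulated = np.mean(t_stats > cutoff)
    theory = stats.t.sf(cutoff, df=n - 1)
    print(f"P(t > {cutoff}): simulated {simulated:.4f}, theory {theory:.4f}")
```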

Homework-wise, I’d want this class to be as hands-on and practical as possible. That means actually doing analysis (preferably in Python or R, but SPSS or Excel or Minitab would be acceptable too) and reading/critiquing existing analyses. It’s hard to see all the nuances in data analysis until you actually have to deal with them yourself.

Obviously, this curriculum isn’t perfect. In particular, my own knowledge of causal inference is pretty spotty, so I don’t really know what to put there (as evidenced by the lack of detail in the pitfalls section). Another worry that I have is that this covers too much material and would be too hard for people who’ve never done stats before. Someone with more experience actually teaching a college course would have to be the judge of that.

Well, with all that out of the way, here’s the actual syllabus:

Syllabus

  1. Course overview: syllabus and logistics, what this course is about, why you should care

  2. Probability 101: what is probability? Bayesianism vs Frequentism, Informal Kolmogorov Axioms and consequences.

  3. Probabilistic Reasoning: Conditional vs unconditional probability, Bayes Rule, Gambler’s Fallacy, Base Rate Fallacy

  4. Probability Distributions: Informal random variables, binomial distribution, geometric distribution, normal distribution, Pareto distribution

  5. Exploratory Data Analysis (1 Variable): Center, shape, and spread. Charts (Histogram, Bar Chart, Boxplot). Descriptive statistics (mean, median, mode, outliers, standard deviation, quartiles).

  6. Exploratory Data Analysis (2+ Variables): Correlation (Pearson’s linear coefficient, Kendall’s Tau, Mutual Information), Limitations (Anscombe’s Quartet, linear vs nonlinear associations, correlation vs causation), Charts (scatterplots, heatmaps, pair plots, violin plots)

  7. Effective Plotting: Misleading graphs, Chartjunk, Color maps. Multiple case studies from mainstream news and scientific journals.

  8. Review Session

  9. Midterm I

  10. Sampling: Simple Random Samples, Stratified Samples, Convenience Samples, Census taking

  11. Biases and Error: Sampling error, non-response bias, exclusion bias, survivor bias, Berkson’s Fallacy. Mention techniques to correct for unrepresentative samples.

  12. Sampling Distributions: Informal intro to bootstrapping, central limit theorem, fat tails.

  13. Bayesian Statistical Inference: Prior vs Posterior distributions, choosing a prior. Credible intervals and Bayes factors using simple proportion tests. Emphasis on goals and interpretation of statistical inference.

  14. Frequentist Statistical Inference: Logic of confidence intervals and p-values using proportion (z-tests). Interpretation of confidence intervals and p-values (see http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-interval-and-a-credible-interval and http://www.ejwagenmakers.com/inpress/HoekstraEtAlPBR.pdf).

  15. Hypothesis Testing: Type I and Type II Errors, null hypothesis testing. Type-S and type-M errors.

  16. Pitfalls and Errors: Case studies of incorrect assumptions (e.g. fat tails). Garden of forking paths and p-hacking. Andrew Gelman’s garden of forking paths paper should be mandatory reading.

  17. Review Session

  18. Midterm II

  19. Linear Regression: Correlation and \(r^2\). Interaction between terms. Residual analysis. Linear regression t-test and assumptions. Mathematical transforms.

  20. Logistic Regression: Intuition. Interpretation of output and coefficients. Accuracy, precision, recall. ROC Curve.

  21. Pitfalls: Bias-variance trade-off. Overfitting and cross-validation. Regularization. No free lunch theorem.

  22. Experimental Design: Definition of causal inference based on counterfactuals. Confounding variables. Randomization. Control group. Double-blind trials and Placebo effects.

  23. Observational Studies: Why they’re necessary (ethics & feasibility). Case-control vs longitudinal vs cross-sectional studies. Propensity score matching and alternatives.

  24. Pitfalls: Lots of case studies.

  25. Review Session

  26. Final