Causal Inference: Notes from a Cornell PhD-Level Econometrics Course (1)
Starter notes from a PhD-level econometrics course on how to think clearly about causality in observational data.
Course taken: Cornell PAM 6090 | Time taken: Fall 2021
This article gives an overview of causal inference, covering:
How to Read an Economics Paper
Research Design vs. Econometric Technique
Goals of Estimation (And Their Tradeoffs)
Learning Econometrics with Monte Carlo Experiments
Core Statistical Concepts That Commonly Mislead Applied Researchers
Appendix:
A Researcher’s Checklist for Interpreting Statistical Significance
Add-on to 3.2. Correct Inference: A Practical Rule of Thumb
Why This Post Exists
These notes summarize how economists evaluate causal claims, drawn from a PhD-level causal inference course at Cornell University.
This is written for:
Applied researchers
Policy analysts
Economists
Data scientists working with observational data
1. How to Read an Economics Paper
Before touching equations, ask:
What is the research question?
What variation identifies the effect?
What is the punchline table or figure?
What could falsify the claim?
Key Papers
Freedman, David A. "Statistical Models and Shoe Leather." Sociological Methodology (1991): 291-313.
DiNardo, John E., and Jörn-Steffen Pischke. "The Returns to Computer Use Revisited: Have Pencils Changed the Wage Structure Too?" The Quarterly Journal of Economics 112.1 (1997): 291-303.
What kind of falsification/placebo test can we do? In the returns-to-computer paper, the placebo test is to run the same OLS wage regression on other attributes common to white-collar jobs that require no particular skill, such as sitting at a desk or using a pencil. These "placebo" regressors show similarly large positive returns, which suggests the measured wage premium reflects unobservable characteristics of white-collar workers (e.g., innate ability, personality) rather than a causal return to being able to use a computer.
2. Research Design vs. Econometric Technique
A recurring mistake: defending methods instead of assumptions.
Techniques don’t rescue bad design
“Robust” estimators only help when models are nearly correct
When assumptions fail, technical fixes distract from real problems
original notes: Some readers may be concerned to defend the technique of regression modeling: according to them, the technique is sound and only the applications are flawed. Other readers may think that the criticisms of regression modeling are merely technical, so that technical fixes (e.g., robust estimators, generalized least squares, and specification tests) will make the problems go away.
original notes: Are the assumptions valid? Moreover, technical fixes become relevant only when models are nearly right. For instance, robust estimators may be useful if the error terms are independent, identically distributed, and symmetric but long-tailed. If the error terms are neither independent nor identically distributed, and there is no way to find out whether they are symmetric, robust estimators probably distract from the real issues.
3. Goals of Estimation (And Their Tradeoffs)
Hierarchy of Goals
Identification: what parameter are we estimating, and does the estimator recover it (unbiasedness / consistency)?
When we compare estimators, there is often a tradeoff between robustness and efficiency (a small simulation right after this list illustrates it):
Robustness: consistency under weaker assumptions; the estimator keeps its desirable properties when parts of the model change
Efficiency: estimating the quantity of interest in some "best possible" manner; the smaller the sampling variation, the more efficient the estimator
Correct inference: the standard error should be an accurate approximation of the true sampling variation, with the correct degrees of freedom (see the Add-on to 3.2, Correct Inference: A Practical Rule of Thumb, at the bottom of this post)
Transparency & reproducibility
e.g., a clean, transparent design (easier to assess validity); ease of coding; not too computationally burdensome
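A minimal sketch of the robustness-efficiency tradeoff (my own hypothetical simulation, comparing the sample mean and the sample median as estimators of a location parameter; the sample size and distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 5000

for label, draw in [("normal data", lambda: rng.normal(size=n)),
                    ("heavy-tailed data (t, 2 df)", lambda: rng.standard_t(2, size=n))]:
    means = np.array([draw().mean() for _ in range(reps)])
    medians = np.array([np.median(draw()) for _ in range(reps)])
    # Efficiency: smaller sampling SD. Robustness: stable behavior under heavy tails.
    print(f"{label}: SD of mean = {means.std():.3f}, SD of median = {medians.std():.3f}")
```

Under normal data the mean has the smaller sampling SD (it is more efficient); under heavy tails the mean's SD blows up while the median stays stable (it is more robust).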
4. Learning Econometrics with Monte Carlo Experiments
Monte Carlo (MC) simulations help build intuition when theory falls short.
Why MC?
Understand small-sample behavior: Reveal bias, variance, and inference failures in the sample sizes researchers actually use.
Explore asymptotics: Show how quickly (or slowly) large-sample theory becomes a good approximation as N grows.
Power calculations before running studies: Estimate the likelihood of detecting true effects under realistic data-generating processes before collecting data.
Basic MC Workflow
Step 1 – Design the DGP (data-generating process) and choose the parameters
Step 2 – Create a sample of data
Step 3 – Estimate the model (if comparing two models, estimate both)
Step 4 – Repeat steps (2) and (3) many times (1,000 replications is common; sometimes more are needed)
Step 5 – Examine the results of (4) to make judgments
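A minimal sketch of this workflow in Python (my own illustration, assuming a simple linear DGP y = β0 + β1·x + e with arbitrary parameter values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Step 1: design the DGP and choose the parameters (values here are arbitrary)
beta0, beta1, n, reps = 1.0, 2.0, 100, 1000

estimates, ses = [], []
for _ in range(reps):                     # Step 4: repeat steps 2 and 3 many times
    # Step 2: create a sample of data
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)

    # Step 3: estimate the model
    res = sm.OLS(y, sm.add_constant(x)).fit()
    estimates.append(res.params[1])
    ses.append(res.bse[1])

# Step 5: look at the results to make judgments
estimates, ses = np.array(estimates), np.array(ses)
print("bias:", estimates.mean() - beta1)
print("SD of estimates:", estimates.std())
print("mean reported SE:", ses.mean())
```

If the mean estimate is close to β1 and the SD of the estimates matches the mean reported SE, the estimator and its inference are behaving as advertised under this particular DGP.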
Disadvantages of Monte Carlo Experiments
1. Not Generalizable
Monte Carlo results are specific to the chosen data-generating process (DGP), parameter values, and assumptions. Unlike economic theory, which aims to produce results that hold across broad classes of environments, Monte Carlo simulations only inform us about the particular scenario being simulated.
As a result, conclusions from a Monte Carlo study do not automatically extend to other settings, designs, or populations without rerunning the simulations under alternative assumptions.
2. Coding and Implementation Costs
Designing a credible Monte Carlo experiment requires careful specification of the DGP, repeated simulation, estimation, and analysis. This can be time-consuming and computationally intensive, especially when models are complex or when many alternative scenarios must be explored.
Poorly written or misunderstood code can also introduce errors, making Monte Carlo results only as reliable as their implementation.
5. Core Statistical Concepts That Commonly Mislead Applied Researchers
SD vs. SE
SD: how the data are spread around the mean
SE: how the estimated mean is distributed across repeated samples; the SE is the SD of the sampling distribution of the mean
In a simulation, the SD of the estimates across replications should be close to the mean of the estimated SEs; a large gap signals that the SE formula is wrong
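A quick check of this relationship (a hypothetical simulation with arbitrary parameter values): the SD of the sample means across replications should be close to both SD(data)/√n and the average estimated SE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000

means, ses = [], []
for _ in range(reps):
    x = rng.normal(loc=10, scale=3, size=n)
    means.append(x.mean())
    ses.append(x.std(ddof=1) / np.sqrt(n))    # estimated SE of the sample mean

print("SD of the sample means:", np.std(means))      # ~ 3 / sqrt(50) ~ 0.42
print("mean of the estimated SEs:", np.mean(ses))    # should be about the same
```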
R² as Model Quality
Misconception:
A higher R² means a better causal model.
Reality:
R² measures fit, not identification.
Bad controls often increase R² while worsening bias.
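A hypothetical illustration of the "bad control" point: below, the control m is caused by both the (randomly assigned) treatment d and an unobservable u, so adding it raises R² while biasing the coefficient on d.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000

d = rng.normal(size=n)                    # treatment (as good as randomly assigned)
u = rng.normal(size=n)                    # unobserved determinant of the outcome
m = d + u + rng.normal(size=n)            # "bad control": caused by both d and u
y = 1.0 * d + u + rng.normal(size=n)      # true causal effect of d is 1.0

short = sm.OLS(y, sm.add_constant(d)).fit()
long_ = sm.OLS(y, sm.add_constant(np.column_stack([d, m]))).fit()

print(f"d only:  beta_d = {short.params[1]:.2f}, R2 = {short.rsquared:.2f}")   # ~1.0, ~0.33
print(f"d and m: beta_d = {long_.params[1]:.2f}, R2 = {long_.rsquared:.2f}")   # ~0.5, ~0.50
```

The "better-fitting" regression is the more biased one: conditioning on m links d to the unobservable u.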
“No Effect” Means Zero
Misconception:
Insignificant results imply no effect.
Reality:
Insignificance often reflects:
Low power
Noisy outcomes
Weak first stages
Absence of evidence is not evidence of absence.
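A hypothetical power simulation makes the point concrete: with a true effect of 0.2 SD and 50 observations per arm, a two-sided 5% test detects the effect only about one time in six, so an insignificant result says little about whether the effect is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, effect, reps = 50, 0.2, 2000          # 50 per arm; true effect of 0.2 SD

pvals = []
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)  # the effect is real by construction
    pvals.append(stats.ttest_ind(treated, control).pvalue)

print("share of replications with p < 0.05:", np.mean(np.array(pvals) < 0.05))  # roughly 0.15-0.20
```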
Balance Implies Identification
Misconception:
Covariate balance guarantees causal validity.
Reality:
Balance on observables says nothing about unobservables.
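A hypothetical illustration: below, treatment is selected on an unobservable u, yet the observed covariate x is perfectly balanced across groups, and the regression estimate is badly biased anyway.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100_000

x = rng.normal(size=n)                           # observed covariate
u = rng.normal(size=n)                           # unobserved confounder
d = (u + rng.normal(size=n) > 0).astype(float)   # selection into treatment on the UNobservable
y = 1.0 * d + u + rng.normal(size=n)             # true effect of d is 1.0

print("balance on x (treated minus control mean):", x[d == 1].mean() - x[d == 0].mean())  # ~0
res = sm.OLS(y, sm.add_constant(np.column_stack([d, x]))).fit()
print("estimated effect of d:", res.params[1])   # roughly 2, far above the true 1.0
```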
In the next article, I will revisit my notes on the Toolkit for Causal Identification, which answer the following questions:
What are the main sources of bias that invalidate OLS estimates?
How do we choose between methods like DiD, IV, RD, matching, and synthetic control in practice?
What kinds of robustness checks meaningfully increase credibility?
What do different methods actually identify (ATE, LATE, ATT)?
How should we think about internal versus external validity when interpreting results?
Subscribe and stay tuned :)
XOXO
Appendix:
A Researcher’s Checklist for Interpreting Statistical Significance
(Pin This Before Running Regressions)
I. Functional Form & Estimation
Have I imposed linearity without justification?
Did I check sensitivity to alternative specifications?
Would a nonparametric or semi-parametric approach change the story?
II. Dependence & Clustering
What is the level of identifying variation?
Are regressors and errors correlated within groups?
Am I clustering at the correct level?
Do I have enough clusters for asymptotic theory to apply?
III. Inference Validity
Are my standard errors robust to heteroskedasticity?
If clusters are few, did I use:
Wild cluster bootstrap?
Bias-corrected CRVE?
Cluster-level degrees of freedom?
IV. Finite-Sample Concerns
Am I relying on asymptotic results with small samples?
Would bootstrapping improve inference?
Does my bootstrap respect the data structure (e.g., clustered bootstrap)?
V. Interpretation Discipline
Would my conclusion survive wider confidence intervals?
Am I interpreting statistical significance as economic importance?
Have I clearly stated what assumptions make my inference valid?
Add-on to 3.2. Correct Inference: A Practical Rule of Thumb
Before trusting your standard errors, ask yourself:
What variation is left after fitting the model?
How many independent pieces of information remain?
Does my SE formula reflect that?
If the answer to the last question is no, your inference is almost surely wrong.
Below I explain what each question means in practice—and how ignoring it leads to invalid SEs, CIs, and hypothesis tests.
1. What variation is left after fitting the model?
What this really means
Once you condition on:
regressors
fixed effects
trends
controls
what randomness is actually left?
Inference is about unexplained variation, not total variation.
Common failures
Mistake A: Treating residuals as i.i.d. when they aren’t
You include unit and time fixed effects, but residuals are still:
serially correlated
spatially correlated
correlated within clusters
Example:
DiD with state-year data
Shocks persist within states
You treat each state-year as independent
Result: you vastly overstate the amount of information.
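A sketch of this failure (my own placebo simulation in the spirit of Bertrand, Duflo, and Mullainathan (2004), with made-up parameter values): state shocks follow an AR(1), the "treatment" has no effect by construction, and default OLS standard errors reject far more than 5% of the time, while clustering by state roughly restores the nominal size.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
S, T, reps = 50, 20, 200                  # states, years, replications (kept modest for runtime)
reject_ols, reject_cluster = 0, 0

for _ in range(reps):
    treated_states = rng.choice(S, S // 2, replace=False)
    rows = []
    for s in range(S):
        shock = 0.0
        for t in range(T):
            shock = 0.8 * shock + rng.normal()               # persistent (AR(1)) state shock
            d = float(s in treated_states and t >= T // 2)   # placebo "treatment", true effect = 0
            rows.append({"state": s, "year": t, "d": d, "y": shock + rng.normal()})
    df = pd.DataFrame(rows)

    # Two-way fixed effects DiD regression
    m = smf.ols("y ~ d + C(state) + C(year)", data=df)
    reject_ols += m.fit().pvalues["d"] < 0.05
    reject_cluster += m.fit(cov_type="cluster",
                            cov_kwds={"groups": df["state"]}).pvalues["d"] < 0.05

print("rejection rate, default SEs:  ", reject_ols / reps)      # far above 0.05
print("rejection rate, clustered SEs:", reject_cluster / reps)  # close to 0.05
```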
Mistake B: Overfitting absorbs noise
You add:
many controls
flexible polynomials
rich interactions
Residuals shrink mechanically. That does not mean uncertainty disappeared—you just fit noise.
What breaks
SEs too small
Inflated t-stats
Excess false positives
2. How many independent pieces of information remain?
What this really means
After accounting for dependence and estimation:
What is my effective sample size?
This is a degrees-of-freedom question, not a raw-n question.
Common failures
Mistake A: Counting observations instead of clusters
10,000 students
50 schools
Treatment varies at school level
You compute SEs as if n = 10,000.
Reality: independent variation ≈ 50.
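A rough way to quantify this is the standard Moulton / design-effect approximation, which applies when the regressor is constant within clusters (as with school-level treatment). The intra-school correlation ρ = 0.1 below is an assumed, illustrative value:

```python
m, rho = 200, 0.10            # students per school; assumed intra-school correlation
deff = 1 + (m - 1) * rho      # Moulton / design-effect variance inflation factor
print("variance inflation:", deff)          # ~21
print("SEs understated by:", deff ** 0.5)   # naive SEs are ~4.6x too small
```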
Mistake B: Few clusters
Cluster-robust SEs rely on:
number of clusters → ∞
With 10–20 clusters, even clustered SEs are biased.
Mistake C: Ignoring estimated fixed effects
Each fixed effect:
consumes degrees of freedom
constrains residuals
Ignoring this mechanically understates variance.
What breaks
Wrong reference distribution (normal instead of t)
Over-rejection
Poor CI coverage
3. Does my SE formula reflect that?
What this really means
Does your variance estimator match:
the dependence structure?
estimation uncertainty?
the relevant asymptotics?
SE formulas are not plug-and-play.
Common failures
Mistake A: Default SEs for non-default problems
Robust SEs when clustering is needed
Cluster SEs with too few clusters
OLS SEs in panel settings
Mistake B: Wrong asymptotics
You assume:
observations → ∞
But reality is:
clusters fixed
short panels
rare treatment
The theory behind the SE no longer applies.
Mistake C: Weak identification
Weak IV
Near collinearity
Rare treatment events
SE formulas assume strong identification—when that fails, inference collapses.
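A hypothetical weak-IV simulation: with a very weak first stage, the just-identified IV estimate is wildly dispersed and its median is pulled toward the (biased) OLS value rather than the truth, so conventional SEs and confidence intervals are unreliable.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1000, 2000
beta, pi = 1.0, 0.05                             # true effect; very weak first stage

iv_estimates = []
for _ in range(reps):
    z = rng.normal(size=n)                       # instrument
    u = rng.normal(size=n)                       # unobserved confounder
    x = pi * z + u + rng.normal(size=n)          # endogenous regressor
    y = beta * x + u + rng.normal(size=n)
    # Just-identified IV (Wald) estimate: Cov(z, y) / Cov(z, x)
    iv_estimates.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

iv_estimates = np.array(iv_estimates)
print("median IV estimate:", np.median(iv_estimates))  # pulled toward the OLS value (~1.5), not 1.0
print("5th-95th percentile range:", np.percentile(iv_estimates, [5, 95]))  # extremely wide
```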
What breaks
“95%” CIs that miss far more than 5%
Meaningless p-values
Results that don’t replicate

