Causal Inference: Notes from a Cornell PhD-Level Econometrics Course (1)
Starter notes from a PhD-level econometrics course on how to think clearly about causality in observational data.
Course taken: Cornell PAM 6090 | Time taken: Fall 2021
This article gives an overview of causal inference, covering:
How to Read an Economics Paper
Research Design vs. Econometric Technique
Goals of Estimation (And Their Tradeoffs)
Learning Econometrics with Monte Carlo Experiments
Core Statistical Concepts That Commonly Mislead Applied Researchers
Appendix:
A Researcher’s Checklist for Interpreting Statistical Significance
Add-on to 3.2. Correct Inference: A Practical Rule of Thumb
Why This Post Exists
These notes summarize how economists evaluate causal claims, drawn from a PhD-level causal inference course at Cornell University.
This is written for:
Applied researchers
Policy analysts
Economists
Data scientists working with observational data
1. How to Read an Economics Paper
Before touching equations, ask:
What is the research question?
What variation identifies the effect?
What is the punchline table or figure?
What could falsify the claim?
Key Papers
Freedman, David A. "Statistical Models and Shoe Leather." Sociological Methodology (1991): 291-313.
DiNardo, John E., and Jörn-Steffen Pischke. "The Returns to Computer Use Revisited: Have Pencils Changed the Wage Structure Too?" The Quarterly Journal of Economics 112.1 (1997): 291-303.
What kind of falsification/placebo test can we do? In the returns-to-computer paper, the placebo test is to run the same OLS wage regression on other attributes common to white-collar jobs that require no particular skill, such as sitting at a desk or using a pencil. These "placebo" regressors show similarly large positive returns, which suggests the measured wage premium reflects unobservable characteristics of white-collar workers (e.g., innate ability, personality) rather than a causal return to being able to use a computer.
2. Research Design vs. Econometric Technique
A recurring mistake: defending methods instead of assumptions.
Techniques don’t rescue bad design
“Robust” estimators only help when models are nearly correct
When assumptions fail, technical fixes distract from real problems
original notes: Some readers may be concerned to defend the technique of regression modeling: according to them, the technique is sound and only the applications are flawed. Other readers may think that the criticisms of regression modeling are merely technical, so that technical fixes (e.g., robust estimators, generalized least squares, and specification tests) will make the problems go away.
original notes: Are the assumptions valid? Moreover, technical fixes become relevant only when models are nearly right. For instance, robust estimators may be useful if the error terms are independent, identically distributed, and symmetric but long-tailed. If the error terms are neither independent nor identically distributed, and there is no way to find out whether they are symmetric, robust estimators probably distract from the real issues.
3. Goals of Estimation (And Their Tradeoffs)
Hierarchy of Goals
Identification: what parameter are we estimating, and does the estimator recover it (unbiasedness / consistency)?
When we compare estimators, there is often a tradeoff between robustness and efficiency (a small simulation right after this list illustrates it):
Robustness: consistency under weaker assumptions; the estimator keeps its desirable properties when parts of the model change
Efficiency: estimating the quantity of interest in some "best possible" manner; the smaller the sampling variation, the more efficient the estimator
Correct inference: the standard error should be an accurate approximation of the true sampling variation, with the correct degrees of freedom (see the Add-on to 3.2, Correct Inference: A Practical Rule of Thumb, at the bottom of this post)
Transparency & reproducibility
e.g., a clean, transparent design (easier to assess validity); ease of coding; not too computationally burdensome
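A minimal sketch of the robustness-efficiency tradeoff (my own hypothetical simulation, comparing the sample mean and the sample median as estimators of a location parameter; the sample size and distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 200, 5000

for label, draw in [("normal data", lambda: rng.normal(size=n)),
                    ("heavy-tailed data (t, 2 df)", lambda: rng.standard_t(2, size=n))]:
    means = np.array([draw().mean() for _ in range(reps)])
    medians = np.array([np.median(draw()) for _ in range(reps)])
    # Efficiency: smaller sampling SD. Robustness: stable behavior under heavy tails.
    print(f"{label}: SD of mean = {means.std():.3f}, SD of median = {medians.std():.3f}")
```

Under normal data the mean has the smaller sampling SD (it is more efficient); under heavy tails the mean's SD blows up while the median stays stable (it is more robust).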
4. Learning Econometrics with Monte Carlo Experiments
Monte Carlo (MC) simulations help build intuition when theory falls short.
Why MC?
Understand small-sample behavior: Reveal bias, variance, and inference failures in the sample sizes researchers actually use.
Explore asymptotics: Show how quickly (or slowly) large-sample theory becomes a good approximation as N grows.
Power calculations before running studies: Estimate the likelihood of detecting true effects under realistic data-generating processes before collecting data.
Basic MC Workflow
Step 1 – Design the DGP (data-generating process) and choose the parameters
Step 2 – Create a sample of data
Step 3 – Estimate the model (if comparing two models, estimate both)
Step 4 – Repeat steps (2) and (3) many times (1,000 replications is common; sometimes more are needed)
Step 5 – Examine the results of (4) to make judgments
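A minimal sketch of this workflow in Python (my own illustration, assuming a simple linear DGP y = β0 + β1·x + e with arbitrary parameter values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Step 1: design the DGP and choose the parameters (values here are arbitrary)
beta0, beta1, n, reps = 1.0, 2.0, 100, 1000

estimates, ses = [], []
for _ in range(reps):                     # Step 4: repeat steps 2 and 3 many times
    # Step 2: create a sample of data
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)

    # Step 3: estimate the model
    res = sm.OLS(y, sm.add_constant(x)).fit()
    estimates.append(res.params[1])
    ses.append(res.bse[1])

# Step 5: look at the results to make judgments
estimates, ses = np.array(estimates), np.array(ses)
print("bias:", estimates.mean() - beta1)
print("SD of estimates:", estimates.std())
print("mean reported SE:", ses.mean())
```

If the mean estimate is close to β1 and the SD of the estimates matches the mean reported SE, the estimator and its inference are behaving as advertised under this particular DGP.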
Disadvantages of Monte Carlo Experiments
1. Not Generalizable
Monte Carlo results are specific to the chosen data-generating process (DGP), parameter values, and assumptions. Unlike economic theory, which aims to produce results that hold across broad classes of environments, Monte Carlo simulations only inform us about the particular scenario being simulated.
As a result, conclusions from a Monte Carlo study do not automatically extend to other settings, designs, or populations without rerunning the simulations under alternative assumptions.
2. Coding and Implementation Costs
Designing a credible Monte Carlo experiment requires careful specification of the DGP, repeated simulation, estimation, and analysis. This can be time-consuming and computationally intensive, especially when models are complex or when many alternative scenarios must be explored.
Poorly written or misunderstood code can also introduce errors, making Monte Carlo results only as reliable as their implementation.
5. Core Statistical Concepts That Commonly Mislead Applied Researchers
SD vs. SE
SD: how the data are spread around the mean
SE: how the estimated mean is distributed across repeated samples; the SE is the SD of the sampling distribution of the mean
In a simulation, the SD of the estimates across replications should be close to the mean of the estimated SEs; a large gap signals that the SE formula is wrong
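A quick check of this relationship (a hypothetical simulation with arbitrary parameter values): the SD of the sample means across replications should be close to both SD(data)/√n and the average estimated SE.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000

means, ses = [], []
for _ in range(reps):
    x = rng.normal(loc=10, scale=3, size=n)
    means.append(x.mean())
    ses.append(x.std(ddof=1) / np.sqrt(n))    # estimated SE of the sample mean

print("SD of the sample means:", np.std(means))      # ~ 3 / sqrt(50) ~ 0.42
print("mean of the estimated SEs:", np.mean(ses))    # should be about the same
```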
R² as Model Quality
Misconception:
A higher R² means a better causal model.
Reality:
R² measures fit, not identification.
Bad controls often increase R² while worsening bias.
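A hypothetical illustration of the "bad control" point: below, the control m is caused by both the (randomly assigned) treatment d and an unobservable u, so adding it raises R² while biasing the coefficient on d.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000

d = rng.normal(size=n)                    # treatment (as good as randomly assigned)
u = rng.normal(size=n)                    # unobserved determinant of the outcome
m = d + u + rng.normal(size=n)            # "bad control": caused by both d and u
y = 1.0 * d + u + rng.normal(size=n)      # true causal effect of d is 1.0

short = sm.OLS(y, sm.add_constant(d)).fit()
long_ = sm.OLS(y, sm.add_constant(np.column_stack([d, m]))).fit()

print(f"d only:  beta_d = {short.params[1]:.2f}, R2 = {short.rsquared:.2f}")   # ~1.0, ~0.33
print(f"d and m: beta_d = {long_.params[1]:.2f}, R2 = {long_.rsquared:.2f}")   # ~0.5, ~0.50
```

The "better-fitting" regression is the more biased one: conditioning on m links d to the unobservable u.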
“No Effect” Means Zero
Misconception:
Insignificant results imply no effect.
Reality:
Insignificance often reflects:
Low power
Noisy outcomes
Weak first stages
Absence of evidence is not evidence of absence.
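A hypothetical power simulation makes the point concrete: with a true effect of 0.2 SD and 50 observations per arm, a two-sided 5% test detects the effect only about one time in six, so an insignificant result says little about whether the effect is zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, effect, reps = 50, 0.2, 2000          # 50 per arm; true effect of 0.2 SD

pvals = []
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)  # the effect is real by construction
    pvals.append(stats.ttest_ind(treated, control).pvalue)

print("share of replications with p < 0.05:", np.mean(np.array(pvals) < 0.05))  # roughly 0.15-0.20
```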
Balance Implies Identification
Misconception:
Covariate balance guarantees causal validity.
Reality:
Balance on observables says nothing about unobservables.
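A hypothetical illustration: below, treatment is selected on an unobservable u, yet the observed covariate x is perfectly balanced across groups, and the regression estimate is badly biased anyway.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 100_000

x = rng.normal(size=n)                           # observed covariate
u = rng.normal(size=n)                           # unobserved confounder
d = (u + rng.normal(size=n) > 0).astype(float)   # selection into treatment on the UNobservable
y = 1.0 * d + u + rng.normal(size=n)             # true effect of d is 1.0

print("balance on x (treated minus control mean):", x[d == 1].mean() - x[d == 0].mean())  # ~0
res = sm.OLS(y, sm.add_constant(np.column_stack([d, x]))).fit()
print("estimated effect of d:", res.params[1])   # roughly 2, far above the true 1.0
```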
In the next article, I will revisit my notes on the Toolkit for Causal Identification, which answer the following questions:
What are the main sources of bias that invalidate OLS estimates?
How do we choose between methods like DiD, IV, RD, matching, and synthetic control in practice?
What kinds of robustness checks meaningfully increase credibility?
What do different methods actually identify (ATE, LATE, ATT)?
How should we think about internal versus external validity when interpreting results?
Subscribe and stay tuned :)
XOXO
Appendix:
A Researcher’s Checklist for Interpreting Statistical Significance
(Pin This Before Running Regressions)
I. Functional Form & Estimation
Have I imposed linearity without justification?
Did I check sensitivity to alternative specifications?
Would a nonparametric or semi-parametric approach change the story?
II. Dependence & Clustering
What is the level of identifying variation?
Are regressors and errors correlated within groups?
Am I clustering at the correct level?
Do I have enough clusters for asymptotic theory to apply?
III. Inference Validity
Are my standard errors robust to heteroskedasticity?
If clusters are few, did I use:
Wild cluster bootstrap?
Bias-corrected CRVE?
Cluster-level degrees of freedom?
IV. Finite-Sample Concerns
Am I relying on asymptotic results with small samples?
Would bootstrapping improve inference?
Does my bootstrap respect the data structure (e.g., clustered bootstrap)?
V. Interpretation Discipline
Would my conclusion survive wider confidence intervals?
Am I interpreting statistical significance as economic importance?
Have I clearly stated what assumptions make my inference valid?
Add-on to 3.2. Correct Inference: A Practical Rule of Thumb
Before trusting your standard errors, ask yourself:
What variation is left after fitting the model?
How many independent pieces of information remain?
Does my SE formula reflect that?
If the answer to the last question is no, your inference is almost surely wrong.
Below I explain what each question means in practice—and how ignoring it leads to invalid SEs, CIs, and hypothesis tests.
1. What variation is left after fitting the model?
What this really means
Once you condition on:
regressors
fixed effects
trends
controls
what randomness is actually left?
Inference is about unexplained variation, not total variation.
Common failures
Mistake A: Treating residuals as i.i.d. when they aren’t
You include unit and time fixed effects, but residuals are still:
serially correlated
spatially correlated
correlated within clusters
Example:
DiD with state-year data
Shocks persist within states
You treat each state-year as independent
Result: you vastly overstate the amount of information.
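A sketch of this failure (my own placebo simulation in the spirit of Bertrand, Duflo, and Mullainathan (2004), with made-up parameter values): state shocks follow an AR(1), the "treatment" has no effect by construction, and default OLS standard errors reject far more than 5% of the time, while clustering by state roughly restores the nominal size.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
S, T, reps = 50, 20, 200                  # states, years, replications (kept modest for runtime)
reject_ols, reject_cluster = 0, 0

for _ in range(reps):
    treated_states = rng.choice(S, S // 2, replace=False)
    rows = []
    for s in range(S):
        shock = 0.0
        for t in range(T):
            shock = 0.8 * shock + rng.normal()               # persistent (AR(1)) state shock
            d = float(s in treated_states and t >= T // 2)   # placebo "treatment", true effect = 0
            rows.append({"state": s, "year": t, "d": d, "y": shock + rng.normal()})
    df = pd.DataFrame(rows)

    # Two-way fixed effects DiD regression
    m = smf.ols("y ~ d + C(state) + C(year)", data=df)
    reject_ols += m.fit().pvalues["d"] < 0.05
    reject_cluster += m.fit(cov_type="cluster",
                            cov_kwds={"groups": df["state"]}).pvalues["d"] < 0.05

print("rejection rate, default SEs:  ", reject_ols / reps)      # far above 0.05
print("rejection rate, clustered SEs:", reject_cluster / reps)  # close to 0.05
```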
Mistake B: Overfitting absorbs noise
You add:
many controls
flexible polynomials
rich interactions
Residuals shrink mechanically. That does not mean uncertainty disappeared—you just fit noise.
What breaks
SEs too small
Inflated t-stats
Excess false positives
2. How many independent pieces of information remain?
What this really means
After accounting for dependence and estimation:
What is my effective sample size?
This is a degrees-of-freedom question, not a raw-n question.
Common failures
Mistake A: Counting observations instead of clusters
10,000 students
50 schools
Treatment varies at school level
You compute SEs as if n = 10,000.
Reality: independent variation ≈ 50.
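A rough way to quantify this is the standard Moulton / design-effect approximation, which applies when the regressor is constant within clusters (as with school-level treatment). The intra-school correlation ρ = 0.1 below is an assumed, illustrative value:

```python
m, rho = 200, 0.10            # students per school; assumed intra-school correlation
deff = 1 + (m - 1) * rho      # Moulton / design-effect variance inflation factor
print("variance inflation:", deff)          # ~21
print("SEs understated by:", deff ** 0.5)   # naive SEs are ~4.6x too small
```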
Mistake B: Few clusters
Cluster-robust SEs rely on:
number of clusters → ∞
With 10–20 clusters, even clustered SEs are biased.
Mistake C: Ignoring estimated fixed effects
Each fixed effect:
consumes degrees of freedom
constrains residuals
Ignoring this mechanically understates variance.
What breaks
Wrong reference distribution (normal instead of t)
Over-rejection
Poor CI coverage
3. Does my SE formula reflect that?
What this really means
Does your variance estimator match:
the dependence structure?
estimation uncertainty?
the relevant asymptotics?
SE formulas are not plug-and-play.
Common failures
Mistake A: Default SEs for non-default problems
Robust SEs when clustering is needed
Cluster SEs with too few clusters
OLS SEs in panel settings
Mistake B: Wrong asymptotics
You assume:
observations → ∞
But reality is:
clusters fixed
short panels
rare treatment
The theory behind the SE no longer applies.
Mistake C: Weak identification
Weak IV
Near collinearity
Rare treatment events
SE formulas assume strong identification—when that fails, inference collapses.
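A hypothetical weak-IV simulation: with a very weak first stage, the just-identified IV estimate is wildly dispersed and its median is pulled toward the (biased) OLS value rather than the truth, so conventional SEs and confidence intervals are unreliable.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 1000, 2000
beta, pi = 1.0, 0.05                             # true effect; very weak first stage

iv_estimates = []
for _ in range(reps):
    z = rng.normal(size=n)                       # instrument
    u = rng.normal(size=n)                       # unobserved confounder
    x = pi * z + u + rng.normal(size=n)          # endogenous regressor
    y = beta * x + u + rng.normal(size=n)
    # Just-identified IV (Wald) estimate: Cov(z, y) / Cov(z, x)
    iv_estimates.append(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])

iv_estimates = np.array(iv_estimates)
print("median IV estimate:", np.median(iv_estimates))  # pulled toward the OLS value (~1.5), not 1.0
print("5th-95th percentile range:", np.percentile(iv_estimates, [5, 95]))  # extremely wide
```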
What breaks
“95%” CIs that miss far more than 5%
Meaningless p-values
Results that don’t replicate

