The problem: a researcher is interested in studying an outcome, Y, which is difficult to measure due to practical constraints such as
time or cost. But they do have access to relatively cheap predictions of Y. They hypothesize that Y is associated with X,
a set of features which are easier to measure. Their goal is to estimate a parameter of scientific interest, θ, which describes the
relationship between X and Y. How can the researcher obtain valid estimates of θ while relying mostly on predicted outcomes of Y?
Regression we want (expensive but precise): Y = θ₁X
Regression we have (cheaper but noisier): Ŷ = θ₂X
Importantly, θ₁ is not the same as θ₂!
In machine learning, "predicted data" are often thought of as the outputs from a complicated algorithm. I opt for an even broader definition: any measure of a conceptual variable where a better, more direct measure exists. This definition includes predictions from black-box AI models like ChatGPT, but it also includes other data we rely upon as social scientists, like survey responses, interviews, imputations, statistical estimates, derived measures, and a whole host of other proxies. Below is a table with some examples I have come across in my own research.
Every conceptual variable comes with different measurement challenges, but in general, more precise measurements are also more expensive to collect. The stylized image below shows that expensive and precise ground truth measures tend to live in the light blue region, with predicted data everywhere else. Because of this cost-quality tradeoff, we often resort to working with predicted data in practice. But not all predicted data are created equal! The best are those in the green region (relatively precise and cheap), as opposed to the red region (noisy and expensive).
[Figure: Measurements vary in both cost and precision]
| Variable | Ground Truth | Predicted |
|---|---|---|
| Cause of Death | Vital Registration | Verbal Autopsy |
| Obesity | Fat Percentage | BMI |
| Income | Admin Data | Self-Reported |
| Environmental Attitude | Questionnaire | NLP Sentiment |
What does it mean for inference on predicted data to be invalid?
In this context, valid statistical inference refers to both unbiased point estimates and well-calibrated uncertainty bounds. Relative to inference performed with "ground truth" outcomes, inference on predicted data may have biased point estimates due to systematic differences between predictions and the ground truth, and the reported uncertainty will be deceptively narrow because it does not account for any of the prediction error.
Why does this matter? Consider a very simple hypothesis test where the p-value tells us whether or not an observed relationship between X and Y is statistically significant. This conclusion is a function of both the point estimate and the uncertainty around that point estimate. The stylized diagram below demonstrates how bias and overly narrow uncertainty might lead to very different scientific conclusions.
[Figure: Inference can have bias and/or misleading uncertainty]
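To make this concrete, here is a toy calculation (the numbers are invented purely for illustration, not taken from any real analysis): a naive analysis of predicted outcomes can return a biased estimate with an overly narrow standard error and declare significance, while a corrected analysis of the same relationship may not.

```python
from scipy import stats

# Invented (estimate, standard error) pairs, for illustration only
naive     = (0.42, 0.08)   # biased point estimate, overly narrow SE
corrected = (0.15, 0.12)   # debiased estimate, honest (wider) SE

for label, (est, se) in [("naive", naive), ("corrected", corrected)]:
    z = est / se
    p = 2 * stats.norm.sf(abs(z))   # two-sided p-value
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{label:9s} estimate={est:.2f}  se={se:.2f}  p={p:.3f}  -> {verdict}")
```

The same underlying relationship can look convincingly significant or plainly null depending on whether the bias and the missing prediction error are accounted for.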
So how do you perform valid inference on predicted data?
There are several existing methods for performing this bias correction for valid inference with predicted data. While the technical details differ, these methods are built upon the same intuition: at its simplest, you incorporate what you learn when you have access to both ground truth and predicted outcomes into downstream inference, where you rely solely on predicted outcomes.
The two-step procedure looks like this:
1. Using side-by-side ground truth and predicted measures of the outcome variable, estimate the IPD rectifier, Δ. This tells you how differences between Y and Ŷ are associated with covariates X for the same observation:

   (Yᵢ - Ŷᵢ) = ΔXᵢ
2. Now, when you perform inference with predicted outcomes in the absence of ground-truth measured outcomes, incorporate the rectifier Δ into the naive parameters you estimate to recover valid IPD estimates (the subscript 2 denotes the unlabeled sample):

   Invalid IPD → Ŷ₂ = θX₂
   Correct IPD → Y₂ = (θ + Δ)X₂
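Why does adding Δ recover the right parameter? A quick sanity check under the simple linear setup above, assuming the prediction errors behave the same way in the labeled and unlabeled samples: the rectifier says Y - Ŷ = ΔX, and the naive regression says Ŷ₂ = θX₂, so Y₂ = Ŷ₂ + ΔX₂ = θX₂ + ΔX₂ = (θ + Δ)X₂. The corrected coefficient θ + Δ therefore describes the relationship between X and the unobserved ground truth outcome.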
Cartoon example: height and basketball ability
We are interested in the association between a person's height and an index of their basketball ability on a scale from 1-10. Height can be measured directly or from a self-report. Some people might report correctly, others might not, so we consider self-reported height the predicted data and directly measured height the ground truth data.
Oops! It looks like some people report being a couple inches taller than they actually are...
How does this affect our conclusion about the association between height and basketball ability when we are relying mostly on self-reported height outcomes? Let's see. This is what it looks like to learn the rectifier Δ from the labeled data to correct inference performed on the unlabeled data.
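Here is a rough sketch of that two-step correction in Python. This is a toy simulation written for illustration, not the `ipd` package itself; the specific numbers, and the assumption that it is the shorter people who round their height up, are mine.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit(x, y):
    """Least-squares [intercept, slope] for y ~ x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n):
    """Basketball ability (1-10), true height, and self-reported height."""
    ability = rng.uniform(1, 10, n)
    height = 62 + 1.2 * ability + rng.normal(0, 2, n)   # ground truth (inches)
    # Assumption for this demo: shorter people are the ones who round up ~2".
    reported = height + 2.0 * (height < 68)
    return ability, height, reported

# Small labeled sample: measured AND self-reported height
x1, y1, yhat1 = simulate(200)
# Large unlabeled sample: self-reported height only
x2, _, yhat2 = simulate(3000)

# Step 1: learn the rectifier Delta from the labeled data, (Y - Yhat) ~ X
delta = fit(x1, y1 - yhat1)

# Step 2: naive fit on the predicted outcomes, then add the rectifier
naive = fit(x2, yhat2)
corrected = naive + delta

print("true slope:      1.20")
print(f"naive slope:     {naive[1]:.2f}")    # attenuated by the misreporting
print(f"corrected slope: {corrected[1]:.2f}")
```

The naive slope is attenuated because the reporting error is related to height (and therefore to ability); adding the rectifier learned from the labeled sample approximately recovers the slope we would have estimated with measured heights.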
IPD References
(in order of publication date)
Methods for correcting inference based on outcomes predicted by machine learning (PostPI). Wang, McCormick, and Leek. 2020. PNAS.
Prediction-powered inference (PPI). Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic. 2023a. Science.
PPI++: Efficient Prediction-Powered Inference (PPI++). Angelopoulos, Duchi, and Zrnic. 2023b. arXiv.
Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA). Miao, Miao, Wu, Zhao, and Lu. 2023. arXiv.
Do We Really Even Need Data? 🦏 Hoffman, Salerno, Afiaz, Leek, and McCormick. 2024. arXiv.
From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsies (multiPPI++). Fan, Visokay, Hoffman, Salerno, Liu, Leek, and McCormick. 2024. COLM.
The code repository for the `ipd` package can be found here.