What is "predicted data"?
In machine learning, "predicted data" are often thought of as the outputs from
a complicated algorithm. I opt for an even broader definition:
any
measure of a conceptual variable where a better, more direct measure exists.
This definition includes predictions from black-box AI models like chatGPT, but also includes other
data we rely upon as social scientists, like survey responses, interviews, imputations, statistical estimates, derived measures,
and a whole host of other proxies. Below is a table with some examples I have come across in my own research.
Every conceptual variable comes with its own measurement challenges, but in general, more precise measurements are also more expensive to collect. The stylized image below shows that expensive, precise ground truth measures tend to live in the light blue region, with predicted data everywhere else. Because of this cost-quality tradeoff, we often end up working with predicted data in practice. But not all predicted data are created equal! The best are those in the green region (relatively precise and cheap); the worst are those in the red region (noisy and expensive).
[Figure: Predicted vs Ground Truth Data. Measurements vary in both cost and precision.]
| Variable | Ground Truth | Predicted |
| --- | --- | --- |
| Cause of Death | Vital Registration | Verbal Autopsy |
| Obesity | Fat Percentage | BMI |
| Income | Admin Data | Self Reported |
| Environmental Attitude | Questionnaire | NLP Sentiment |
What does it mean for inference on predicted data to be invalid?
In this context, valid statistical inference means both unbiased point estimates and uncertainty intervals that achieve their nominal coverage. Relative to inference performed with "ground truth" outcomes, inference on predicted data can produce biased point estimates, due to systematic differences between the predictions and the ground truth, and its reported uncertainty will be deceptively narrow because it doesn't account for any of the prediction error.
Why does this matter? Consider a simple hypothesis test in which the p-value tells us whether an observed relationship between X and Y is statistically significant. That conclusion is a function of both the point estimate and the uncertainty around it. The stylized diagram below shows how bias and misleadingly narrow uncertainty can lead to very different scientific conclusions.
[Figure: Inference can have bias and/or misleading uncertainty.]
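To make this concrete, here is a minimal simulation in Python. The numbers (a true slope of 2, predictions attenuated by a factor of 0.6) are purely illustrative and not taken from anything in this post; the point is only that the naive fit on predicted outcomes lands in the wrong place, with an interval that is too tight to warn you.

```python
# Illustrative simulation (made-up numbers): regress the same covariate on a
# ground truth outcome and on a systematically distorted prediction of it.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                           # covariate
y = 2.0 * x + rng.normal(scale=1.0, size=n)      # ground truth outcome
y_hat = 0.6 * y + rng.normal(scale=0.5, size=n)  # systematically distorted prediction

def slope_and_se(x, y):
    """Slope and classical standard error for a no-intercept OLS fit."""
    beta = x @ y / (x @ x)
    resid = y - beta * x
    se = np.sqrt(resid @ resid / (len(x) - 1) / (x @ x))
    return beta, se

b_true, se_true = slope_and_se(x, y)      # inference on ground truth outcomes
b_pred, se_pred = slope_and_se(x, y_hat)  # naive inference on predicted outcomes

print(f"ground truth: {b_true:.2f} +/- {1.96 * se_true:.2f}")
print(f"predicted:    {b_pred:.2f} +/- {1.96 * se_pred:.2f}")
# The naive fit is pulled toward 0.6 * 2.0 = 1.2, and its interval is narrower than
# the ground-truth interval, so it can confidently point at the wrong answer.
```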
So how do you perform valid inference on predicted data?
Several existing methods perform the bias correction needed for valid inference with predicted data. While the technical details differ, they are all built on the same intuition: incorporate what you learn when you have access to both ground truth and predicted outcomes into the downstream inference where you rely solely on predicted outcomes. The two-step procedure looks like this:
- Using side-by-side ground truth and predicted measures of the outcome variable, estimate the IPD rectifier, Δ. This tells you how differences between Ŷ and Y are associated with the covariates X for the same observations:
(Ŷᵢ − Yᵢ) = Δ Xᵢ
- Now, when you perform inference with predicted outcomes in the absence of ground-truth measured outcomes, incorporate the rectifier Δ into the naive parameter you estimate to recover a valid IPD estimate (a code sketch follows this list).
Naive (invalid) → fit Ŷ = θ X on the unlabeled data
IPD corrected (valid) → β_IPD = θ − Δ
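Here is a rough numpy sketch of that recipe on simulated data, following the same sign convention as above (the rectifier is fit to Ŷ − Y and then subtracted). Everything in it, including the data-generating numbers and function names, is my own illustration, not the interface of the `ipd` package or of any of the papers listed at the end.

```python
# Two-step IPD sketch on simulated data: learn the rectifier where ground truth
# is available, then correct the naive estimate where it is not.
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    """Ordinary least squares: (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated labeled set (Y and Y_hat both observed) and unlabeled set (Y_hat only).
n_lab, n_unlab = 200, 2000
X_lab = rng.normal(size=(n_lab, 1))
X_unlab = rng.normal(size=(n_unlab, 1))
beta_true = np.array([2.0])

Y_lab = X_lab @ beta_true + rng.normal(size=n_lab)
Y_hat_lab = 0.6 * Y_lab + rng.normal(scale=0.5, size=n_lab)        # imperfect predictions

Y_unlab = X_unlab @ beta_true + rng.normal(size=n_unlab)           # never observed in practice
Y_hat_unlab = 0.6 * Y_unlab + rng.normal(scale=0.5, size=n_unlab)  # only these are observed

# Step 1: estimate the rectifier on the labeled data by regressing
# the prediction error (Y_hat - Y) on the covariates.
delta = ols(X_lab, Y_hat_lab - Y_lab)

# Step 2: fit the naive model on the unlabeled predictions, then remove the rectifier.
theta_naive = ols(X_unlab, Y_hat_unlab)
beta_ipd = theta_naive - delta

print("naive:", theta_naive, "rectifier:", delta, "corrected:", beta_ipd)
```

With these simulated numbers, the naive slope should land near 1.2 while the corrected slope should land near the true value of 2.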
Cartoon example: height and basketball ability
We are interested in the association between a person's height and an index of their basketball ability on a scale from 1 to 10. Height can be measured directly or taken from a self-report. Some people might report correctly, others might not, so we treat self-reported height as predicted data and directly measured height as ground truth data.
Oops! It looks like some people report being a couple of inches taller than they actually are... How does this affect our conclusion about the association between height and basketball ability when we are relying mostly on self-reported height outcomes? Let's see. This is what it looks like to learn the rectifier Δ from the labeled data to correct inference performed on the unlabeled data.
First, we have some labeled data, ℓ, with observed basketball ability 🏀,
with both measured height 📏 and self-reported height ✏️. We also have some unlabeled data,
μℓ, with observed basketball ability 🏀 and only self-reported height ✏️.
For the labeled data ℓ, we can specify the relationship, β, between height (measured 📏: y_m, or reported ✏️: y_r) and basketball ability (🏀: X_ℓ) with the following equations:
y📏 = β📏 🏀ℓ, or y_m = β_m X_ℓ
y✏️ = β✏️ 🏀ℓ, or y_r = β_r X_ℓ
The solution for β_m is written as β_m = (X_ℓᵀ X_ℓ)⁻¹ X_ℓᵀ y_m, the solution for β_r is written as β_r = (X_ℓᵀ X_ℓ)⁻¹ X_ℓᵀ y_r, and the rectifier is estimated from the residuals of reported height y✏️ and measured height y📏. After the matrix multiplication, the rectifier Δ works out to −1.375.
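One way to see where the −1.375 comes from: because the OLS formula is linear in the outcome, regressing the difference between reported and measured height on X_ℓ is the same as differencing the two fitted coefficients. Written out (this rearrangement is mine, for clarity):

$$
\hat{\Delta} = (X_\ell^{\top} X_\ell)^{-1} X_\ell^{\top} (y_r - y_m) = \beta_r - \beta_m
$$

Plugging in the rounded labeled-data coefficients reported below (β_r = 1 and β_m = 2.37) gives roughly −1.37, which matches the −1.375 above up to rounding of the displayed values.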
To recover the IPD-corrected estimate β_IPD from the unlabeled data μℓ, we first estimate β_r from y_r = β_r X_μℓ, just like above. We find that the unlabeled-data estimate is β_r = 1.41. Then, we subtract the rectifier Δ from β_r to find β_IPD:
β_IPD = β_r − Δ = 1.41 − (−1.375) ≈ 2.78
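The same arithmetic in a couple of lines of Python, using the rounded values quoted above (so the printed result differs from 2.78 only by rounding):

```python
# Correction step from the cartoon, using the rounded values quoted in the text.
delta = -1.375            # rectifier learned from the labeled data
beta_r_unlabeled = 1.41   # naive slope on self-reported height in the unlabeled data

beta_ipd = beta_r_unlabeled - delta
print(beta_ipd)           # 2.785, i.e. roughly the 2.78 reported above
```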
Below, you can see how the estimated relationships (β_r and β_m in the labeled data ℓ, and β_r and β_IPD in the unlabeled data μℓ) compare to each other.
In the labeled data ℓ, we see that the ground truth relationship between basketball ability and directly measured height is β_m = 2.37. Because some people self-reported being taller than they actually are, the estimated relationship is much weaker, with β_r = 1. Moving to the unlabeled data μℓ, we see that the naive estimate of the relationship between self-reported height and basketball ability is similarly weak, with β_r = 1.41. Leveraging the relationship between directly measured height 📏 and self-reported height ✏️ in the labeled dataset ℓ enables us to produce a valid estimate β_IPD in the unlabeled dataset μℓ. β_IPD = 2.78 is much closer to the ground truth β_m = 2.37, and we arrive at this conclusion even in the absence of ground truth measures in the unlabeled dataset μℓ.
The ground truth relationship between height and basketball ability is much stronger than we would have concluded from naive inference on the predicted data. For more discussion of how IPD correction can alter your scientific conclusions, check out Section 4.3 of my 2024 paper, From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsies.
How to determine if IPD correction makes sense for your problem:
IPD References (in order of publication date)
- Methods for correcting inference based on outcomes predicted by machine learning (PostPI). Wang, McCormick, and Leek. PNAS, 2020.
- Prediction-powered inference (PPI). Angelopoulos, Bates, Fannjiang, Jordan, and Zrnic. Science, 2023a.
- PPI++: Efficient Prediction-Powered Inference (PPI++). Angelopoulos, Duchi, and Zrnic. arXiv, 2023b.
- Assumption-Lean and Data-Adaptive Post-Prediction Inference (PSPA). Miao, Miao, Wu, Zhao, and Lu. arXiv, 2023.
- Do We Really Even Need Data? 🦏 Hoffman, Salerno, Afiaz, Leek, and McCormick. arXiv, 2024.
- From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsies (multiPPI++). Fan, Visokay, Hoffman, Salerno, Liu, Leek, and McCormick. COLM, 2024.
- The code repository for the `ipd` package can be found here.