| Issue | Approach | Rationale |
|---|---|---|
| Primary analysis | Use the expanded covariate model after multiple imputation as the primary estimate. | It preserves the full baseline cohort and handles missing outcomes and incomplete covariates together. |
| Supportive sensitivity check | Keep IPCW as a follow-up selection diagnostic and supportive comparison. | It checks whether outcome attrition is materially shifting the estimand among those retained in follow-up. |
| What to avoid as the lead result | Do not treat strict complete-case analysis as the default main result. | It changes the represented population and can discard informative baseline variables when they are incomplete. |
| Main assumption to say out loud | State that the primary analysis relies on a missing-at-random story conditional on the imputation model variables. | That assumption is doing real work and readers need to know which variables make it more plausible. |
Real-World Missing Data Workflow
NHEFS smoking cessation study with incomplete covariates and follow-up loss
Data Source
Data came from causaldata::nhefs, an extract of the NHANES I Epidemiologic Follow-up Study (NHEFS) covering baseline smokers. For the present analysis, treatment was defined as smoking cessation and the outcome as 11-year weight change (1971 to 1982) in kilograms.
The analytic cohort comprised 1,629 observations. Follow-up weight change was observed for 96.1% of the cohort, and several baseline covariates in the expanded adjustment set also contained missing values.
Research Question
Among baseline smokers, what would the average 11-year weight change have been under smoking cessation versus continued smoking?
The objective was to evaluate the extent to which incomplete follow-up and incomplete baseline covariates influenced estimation of this contrast under complete-case analysis, inverse probability of censoring weights, and multiple imputation.
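In potential-outcome notation, the contrast in the research question can be written compactly. The symbols are assumptions introduced here for illustration and do not appear in the original analysis: $A$ denotes smoking cessation (1 = quit, 0 = continued) and $Y^{a}$ denotes the 11-year weight change that would be observed under treatment level $a$.

```latex
\psi = \mathrm{E}\left[ Y^{a=1} \right] - \mathrm{E}\left[ Y^{a=0} \right]
```

Each missing-data strategy discussed in this report is an attempt to estimate the same $\psi$ for the full baseline cohort rather than for a selected subset.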
Analytic Overview
The primary analysis was specified as the expanded covariate model after multiple imputation. Inverse probability of censoring weights were used as a supportive analysis for attrition-related selection, and complete-case analysis was retained as a comparison rather than the principal result.
Study Setting
Several features of the data are directly relevant to the missing-data problem:
- 1982 weight change is missing for 63 of 1,629 baseline smokers (3.9%)
- income, systolic and diastolic blood pressure, and cholesterol also have missing values (1.0% to 5.0% each)
- follow-up differs by treatment group: the outcome is observed for 94.2% of quitters versus 96.8% of continuing smokers
Taken together, these features distinguish two related but analytically separate problems:
- missing outcomes due to attrition or censoring
- missing baseline covariates that matter for confounding control
Cohort Characteristics
| Characteristic | Overall, N = 1,629¹ | Continued smoking, N = 1,201¹ | Quit smoking, N = 428¹ |
|---|---|---|---|
| AGE IN 1971 | 43.9 (12.2) | 42.9 (11.9) | 46.7 (12.5) |
| sex | |||
| Female | 799 / 1,629 (49%) | 562 / 1,201 (47%) | 237 / 428 (55%) |
| Male | 830 / 1,629 (51%) | 639 / 1,201 (53%) | 191 / 428 (45%) |
| race | |||
| White | 1,414 / 1,629 (87%) | 1,024 / 1,201 (85%) | 390 / 428 (91%) |
| Black | 215 / 1,629 (13%) | 177 / 1,201 (15%) | 38 / 428 (8.9%) |
| HIGHEST GRADE OF REGULAR SCHOOL EVER IN 1971 | 11.1 (3.1) | 11.1 (3.0) | 11.2 (3.3) |
| NUMBER OF CIGARETTES SMOKED PER DAY IN 1971 | 20.6 (11.8) | 21.2 (11.6) | 18.8 (12.3) |
| YEARS OF SMOKING | 24.9 (12.2) | 24.3 (11.8) | 26.6 (13.0) |
| WEIGHT IN KILOGRAMS IN 1971 | 71.1 (15.7) | 70.5 (15.6) | 72.6 (16.1) |
| Baseline BMI | 1.7 (0.3) | 1.7 (0.3) | 1.8 (0.4) |
| exercise | |||
| 0 | 317 / 1,629 (19%) | 247 / 1,201 (21%) | 70 / 428 (16%) |
| 1 | 677 / 1,629 (42%) | 496 / 1,201 (41%) | 181 / 428 (42%) |
| 2 | 635 / 1,629 (39%) | 458 / 1,201 (38%) | 177 / 428 (41%) |
| active | |||
| 0 | 729 / 1,629 (45%) | 547 / 1,201 (46%) | 182 / 428 (43%) |
| 1 | 738 / 1,629 (45%) | 540 / 1,201 (45%) | 198 / 428 (46%) |
| 2 | 162 / 1,629 (9.9%) | 114 / 1,201 (9.5%) | 48 / 428 (11%) |
| income | |||
| 11 | 29 / 1,567 (1.9%) | 24 / 1,164 (2.1%) | 5 / 403 (1.2%) |
| 12 | 60 / 1,567 (3.8%) | 47 / 1,164 (4.0%) | 13 / 403 (3.2%) |
| 13 | 68 / 1,567 (4.3%) | 52 / 1,164 (4.5%) | 16 / 403 (4.0%) |
| 14 | 63 / 1,567 (4.0%) | 44 / 1,164 (3.8%) | 19 / 403 (4.7%) |
| 15 | 83 / 1,567 (5.3%) | 67 / 1,164 (5.8%) | 16 / 403 (4.0%) |
| 16 | 78 / 1,567 (5.0%) | 65 / 1,164 (5.6%) | 13 / 403 (3.2%) |
| 17 | 65 / 1,567 (4.1%) | 46 / 1,164 (4.0%) | 19 / 403 (4.7%) |
| 18 | 284 / 1,567 (18%) | 202 / 1,164 (17%) | 82 / 403 (20%) |
| 19 | 417 / 1,567 (27%) | 295 / 1,164 (25%) | 122 / 403 (30%) |
| 20 | 220 / 1,567 (14%) | 172 / 1,164 (15%) | 48 / 403 (12%) |
| 21 | 114 / 1,567 (7.3%) | 87 / 1,164 (7.5%) | 27 / 403 (6.7%) |
| 22 | 86 / 1,567 (5.5%) | 63 / 1,164 (5.4%) | 23 / 403 (5.7%) |
| Missing | 62 | 37 | 25 |
| SYSTOLIC BLOOD PRESSURE IN 1982 | 128.7 (19.1) | 127.7 (18.8) | 131.7 (19.6) |
| Missing | 77 | 42 | 35 |
| DIASTOLIC BLOOD PRESSURE IN 1982 | 77.7 (10.6) | 77.4 (10.5) | 78.9 (10.8) |
| Missing | 81 | 44 | 37 |
| SERUM CHOLESTEROL (MG/100ML) IN 1971 | 220.0 (45.4) | 218.9 (45.1) | 223.0 (46.2) |
| Missing | 16 | 14 | 2 |
| follow_up_status | |||
| Observed | 1,566 / 1,629 (96%) | 1,163 / 1,201 (97%) | 403 / 428 (94%) |
| Missing | 63 / 1,629 (3.9%) | 38 / 1,201 (3.2%) | 25 / 428 (5.8%) |
| ¹ Mean (SD); n / N (%) | |||
Within this analysis, the built-in censored field in nhefs aligned with rows missing 1982 weight change, so missing wt82_71 was treated as the attrition indicator for outcome follow-up.
Missing Data Patterns
The principal sources of incompleteness were:
- missing 1982 weight change, which affects whether a participant contributes to the observed-outcome analysis
- missing baseline covariates such as income, blood pressure, and cholesterol, which shrink a complete-case adjustment set even further
| Variable | Missing N | Missing % | Domain |
|---|---|---|---|
| Diastolic blood pressure | 81 | 5.0% | Baseline covariates with missingness |
| Systolic blood pressure | 77 | 4.7% | Baseline covariates with missingness |
| 11-year weight change | 63 | 3.9% | Outcome follow-up |
| Income | 62 | 3.8% | Baseline covariates with missingness |
| Cholesterol | 16 | 1.0% | Baseline covariates with missingness |
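A per-variable missingness summary like the table above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the report. The toy records are hypothetical, and the column names other than wt82_71 are assumed to match the nhefs field names.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"wt82_71": 2.1, "income": 19, "sbp": 120, "dbp": 80, "cholesterol": 210},
    {"wt82_71": None, "income": 18, "sbp": None, "dbp": None, "cholesterol": 190},
    {"wt82_71": -1.4, "income": None, "sbp": 135, "dbp": 85, "cholesterol": 240},
    {"wt82_71": 5.0, "income": 20, "sbp": 128, "dbp": None, "cholesterol": None},
]


def missing_summary(rows):
    """Return (variable, n_missing, pct_missing) tuples, most missing first."""
    n = len(rows)
    summary = []
    for var in rows[0].keys():
        n_missing = sum(1 for row in rows if row[var] is None)
        summary.append((var, n_missing, 100.0 * n_missing / n))
    # Sort by descending count; break ties alphabetically so output is deterministic
    return sorted(summary, key=lambda item: (-item[1], item[0]))


for var, n_missing, pct in missing_summary(records):
    print(f"{var:12s} {n_missing:3d} {pct:5.1f}%")
```

Sorting by descending count mirrors the ordering of the table, which makes the most affected variables visible first.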
Common Patterns of Missingness
mice::md.pattern() is a compact way to show whether missingness tends to occur one variable at a time or in recurring bundles.
These patterns show that a strict complete-case analysis discards observations for more than one reason, including both outcome attrition and missing baseline covariates.
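The idea behind a missingness-pattern table can be illustrated without any imputation library. The sketch below is a rough Python analogue of what a pattern display reports, not the implementation of mice::md.pattern(). The toy rows and the helper name are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"wt82_71": 2.0, "income": 19, "sbp": 120, "dbp": 80},
    {"wt82_71": 1.5, "income": 18, "sbp": 130, "dbp": 85},
    {"wt82_71": -0.5, "income": 20, "sbp": 125, "dbp": 78},
    {"wt82_71": None, "income": 17, "sbp": None, "dbp": None},  # lost to follow-up
    {"wt82_71": 3.0, "income": None, "sbp": 140, "dbp": 90},    # income missing only
    {"wt82_71": 4.0, "income": 21, "sbp": None, "dbp": None},   # blood pressure bundle
]
variables = ["wt82_71", "income", "sbp", "dbp"]


def missingness_patterns(data, columns):
    """Count rows by pattern; each pattern is a tuple of 1 (observed) / 0 (missing)."""
    counts = Counter(
        tuple(0 if row[col] is None else 1 for col in columns) for row in data
    )
    return counts.most_common()


patterns = missingness_patterns(rows, variables)
for pattern, count in patterns:
    print(dict(zip(variables, pattern)), count)

# Rows that survive a strict complete-case filter
complete_cases = dict(patterns).get((1,) * len(variables), 0)
print("complete cases:", complete_cases)
```

In this toy example three of six rows are dropped, each for a different reason, which is the same point the pattern display makes for the real cohort.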
Predictors of Missingness
Missingness is easier to discuss when it is turned into an explicit outcome. The table below fits separate logistic models for whether each key variable is missing, using the same fully observed baseline predictors. This helps show whether missingness is plausibly related to treatment status or baseline prognosis rather than being completely haphazard.
| Variable | Missing % | Quitter OR for missingness | 95% CI | Strongest baseline signal |
|---|---|---|---|---|
| Diastolic blood pressure | 5.0% | 2.18 | 1.35 to 3.50 | Baseline BMI (OR 0.49) |
| Systolic blood pressure | 4.7% | 2.17 | 1.34 to 3.53 | Baseline BMI (OR 0.36) |
| 11-year weight change | 3.9% | 1.64 | 0.95 to 2.83 | Some exercise vs much exercise (OR 0.37) |
| Income | 3.8% | 1.86 | 1.07 to 3.22 | Baseline BMI (OR 3.14) |
| Cholesterol | 1.0% | 0.45 | 0.10 to 2.03 | Black vs white (OR 0.00) |
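These results suggest that missingness is related to treatment status and baseline covariates. Quitters had roughly twice the odds of missing blood pressure (ORs 2.18 and 2.17) and higher odds of missing income (OR 1.86), all with confidence intervals that exclude 1. That weakens the case for treating the data as missing completely at random.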
Attrition Analyses
Complete-case analysis on the outcome implicitly changes the target population from the full baseline cohort to the people who both remained in follow-up and had an observed 1982 weight change. That population shift matters if the probability of remaining observed depends on treatment or baseline prognosis.
Follow-Up by Treatment Group
| Treatment group | N | Observed | Missing | Observed % |
|---|---|---|---|---|
| Continued smoking | 1201 | 1163 | 38 | 96.8% |
| Quit smoking | 428 | 403 | 25 | 94.2% |
The difference is not huge, but it is directionally important: quitters are somewhat less likely to have observed follow-up. That alone does not prove problematic selection bias, but it is enough to justify checking whether the observed-outcome analysis is drifting away from the original baseline cohort.
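As a quick check on the table above, the unadjusted odds ratio for a missing outcome can be computed directly from the four counts. This is a minimal sketch using a Wald interval on the log scale. It is not the adjusted logistic model reported elsewhere in this document, and the function name is hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d) with a Wald confidence interval on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Counts from the follow-up table: missing and observed outcomes by treatment group
quit_missing, quit_observed = 25, 403
cont_missing, cont_observed = 38, 1163

or_, lo, hi = odds_ratio_wald(quit_missing, quit_observed, cont_missing, cont_observed)
print(f"Unadjusted OR for a missing outcome, quit vs continued: {or_:.2f} ({lo:.2f} to {hi:.2f})")
```

The crude odds ratio is close to 1.9, somewhat larger than the covariate-adjusted value of 1.64 (0.95 to 2.83) reported in the predictors-of-missingness table. That gap is expected when baseline covariates explain part of the association.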
Model for Outcome Observation
The logistic model below predicts whether the outcome is observed using fully observed baseline covariates plus treatment status. This is a diagnostic tool, not a proof that missingness is MAR.
| Predictor | Odds ratio | 95% CI |
|---|---|---|
| Inactive vs very active | 0.49 | 0.23 to 1.03 |
| Quit smoking | 0.61 | 0.35 to 1.06 |
| Baseline BMI | 0.75 | 0.12 to 4.78 |
| Black vs white | 0.93 | 0.41 to 2.10 |
| Male vs female | 0.94 | 0.43 to 2.03 |
| Smoking years | 0.97 | 0.93 to 1.01 |
| Baseline weight | 0.99 | 0.94 to 1.03 |
| Age (per year) | 0.99 | 0.95 to 1.03 |
| Cigarettes per day | 0.99 | 0.97 to 1.02 |
| Years of education | 1.03 | 0.95 to 1.12 |
| Moderately active vs very active | 1.15 | 0.63 to 2.12 |
| Little or no exercise vs much exercise | 1.88 | 0.93 to 3.80 |
| Some exercise vs much exercise | 2.71 | 1.29 to 5.70 |
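The clearest signal here is baseline exercise rather than baseline weight: participants reporting some exercise had higher odds of an observed outcome than those reporting much exercise (OR 2.71, 95% CI 1.29 to 5.70), the only interval in the table that excludes 1. Quitting (OR 0.61) and inactivity (OR 0.49) point toward lower odds of observation, although both intervals include 1, and baseline weight shows essentially no association (OR 0.99). Patterns like these make naive observed-outcome analyses hard to defend without further discussion.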
Inverse Probability of Censoring Weights
Inverse probability of censoring weights are one way to reweight the observed-outcome sample back toward the baseline cohort, as long as the follow-up model includes the variables that jointly explain observation and outcome.
| IPCW summary statistic | Value |
|---|---|
| Minimum | 0.966 |
| 25th percentile | 0.976 |
| Median | 0.987 |
| Mean | 0.999 |
| 75th percentile | 1.008 |
| Maximum | 1.317 |
The stabilized weights are fairly well behaved, which suggests that IPCW is acting mainly as a diagnostic correction rather than creating a dramatic reweighting of the cohort.
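The construction of stabilized censoring weights can be sketched as the ratio of the probability of being observed given treatment alone to the probability given treatment and covariates. The example below is a minimal illustration under assumptions: the fitted probabilities are hypothetical placeholders for the output of a censoring model, and the function name is not taken from the original analysis.

```python
def stabilized_ipcw(treatment, observed, p_obs_given_covariates):
    """Stabilized censoring weights: P(observed | A) / P(observed | A, L).

    Rows with an unobserved outcome get weight 0 because they do not
    contribute to the weighted outcome model.
    """
    # Numerator: observed proportion within each treatment arm
    totals, seen = {}, {}
    for a, obs in zip(treatment, observed):
        totals[a] = totals.get(a, 0) + 1
        seen[a] = seen.get(a, 0) + (1 if obs else 0)
    p_obs_given_a = {a: seen[a] / totals[a] for a in totals}

    weights = []
    for a, obs, p in zip(treatment, observed, p_obs_given_covariates):
        weights.append(p_obs_given_a[a] / p if obs else 0.0)
    return weights


# Hypothetical toy inputs: 1 = quit smoking, 0 = continued smoking
treatment = [1, 1, 1, 1, 0, 0, 0, 0]
observed = [True, True, True, False, True, True, True, True]
# Hypothetical fitted probabilities of being observed from a censoring model
p_hat = [0.80, 0.70, 0.75, 0.75, 0.98, 0.99, 1.00, 1.00]

weights = stabilized_ipcw(treatment, observed, p_hat)
print([round(w, 3) for w in weights])
```

Because the numerator and denominator are both close to 1 when attrition is low, stabilized weights cluster near 1, which is consistent with the narrow range in the summary table.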
Comparison of Missing-Data Strategies
The comparison below keeps the estimand fixed while changing how missingness is handled.
| Analysis | Population | N | Estimate | 95% CI |
|---|---|---|---|---|
| Observed outcome model with baseline demographic and smoking-history covariates | People with observed 1982 weight change | 1,566 | 3.32 kg | 2.46 to 4.18 kg |
| Observed outcome model with baseline demographic and smoking-history covariates plus IPCW | Observed-outcome sample reweighted toward the full baseline cohort | 1,566 | 3.31 kg | 2.45 to 4.18 kg |
| Complete-case model with expanded baseline covariates including income and clinical measures | Subset complete on outcome and all expanded covariates | 1,461 | 3.12 kg | 2.22 to 4.02 kg |
| Multiple-imputation model with expanded baseline covariates including income and clinical measures | Full eligible baseline cohort under the imputation model | 1,629 | 3.24 kg | 2.36 to 4.11 kg |
A few things stand out:
- the observed-outcome model with baseline demographic and smoking-history covariates and the corresponding IPCW analysis are very similar, which suggests attrition reweighting is not radically changing the answer here
- the complete-case model with expanded baseline covariates uses a smaller subset because it requires observed outcome plus complete income and clinical measures
- the multiple-imputation model with the same expanded covariate set recovers the full baseline cohort under the imputation model and produces a slightly larger estimated effect than the expanded complete-case fit
That pattern is exactly why complete-case analysis should rarely be treated as the default primary analysis when a plausible imputation model can be justified.
Rationale for Multiple Imputation as the Primary Analysis
Multiple imputation is the more natural primary strategy here for two reasons:
- It addresses both baseline covariate gaps and missing outcomes in one coherent model.
- It keeps the analysis closer to the eligible baseline cohort rather than redefining the estimand around a fully observed subset.
The imputation analysis shown here uses 5 imputations and 5 iterations.
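Pooling a coefficient across completed datasets follows Rubin's rules: average the estimates, then combine within-imputation and between-imputation variance. The sketch below shows that arithmetic in plain Python. The per-imputation numbers are hypothetical and are not the report's actual results, and the sketch omits the degrees-of-freedom adjustment used for the reference t distribution.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m completed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Hypothetical per-imputation estimates and squared standard errors (m = 5)
estimates = [3.1, 3.3, 3.2, 3.4, 3.0]
variances = [0.20, 0.20, 0.20, 0.20, 0.20]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled_estimate:.2f}, pooled SE = {pooled_se:.3f}")
```

The (1 + 1/m) factor inflates the between-imputation variance to account for using a finite number of imputations, which is one reason a larger m is usually preferred outside a demonstration run.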
Imputation Diagnostics
The first diagnostic question is not whether imputation is perfect. It is whether the completed datasets are producing obviously implausible shifts relative to the observed data.
| Variable | Missing % | Observed mean | Average completed mean | Shift | SD across imputations |
|---|---|---|---|---|---|
| 11-year weight change | 3.9% | 2.6 | 2.6 | 0.0 | 0.04 |
| Systolic blood pressure | 4.7% | 128.7 | 129.0 | 0.3 | 0.11 |
| Diastolic blood pressure | 5.0% | 77.7 | 77.8 | 0.0 | 0.09 |
| Cholesterol | 1.0% | 220.0 | 220.0 | 0.0 | 0.10 |
Those small shifts are reassuring. They do not prove the MAR assumption, but they do show that the imputed values are not forcing the completed data into an obviously different scale or location.
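The columns of the diagnostic table can be reproduced with simple summaries: the observed-data mean, the mean of each completed dataset, their average, and the spread across imputations. The following is a minimal sketch with hypothetical toy values. It is not the diagnostic code behind the table.

```python
from statistics import mean, stdev


def imputation_shift(observed_values, completed_datasets):
    """Compare the observed-data mean with the mean of each completed dataset."""
    observed_mean = mean(observed_values)
    completed_means = [mean(dataset) for dataset in completed_datasets]
    average_completed = mean(completed_means)
    return {
        "observed_mean": observed_mean,
        "average_completed_mean": average_completed,
        "shift": average_completed - observed_mean,
        "sd_across_imputations": stdev(completed_means),
    }


# Hypothetical toy variable: four observed values and one missing row
observed = [120.0, 130.0, 125.0, 135.0]
# One imputed draw for the missing row in each of three imputations
imputed_draws = [128.0, 131.0, 126.0]
completed = [observed + [draw] for draw in imputed_draws]

diagnostics = imputation_shift(observed, completed)
for name, value in diagnostics.items():
    print(f"{name}: {value:.3f}")
```

A shift near zero with a small standard deviation across imputations is the pattern described above as reassuring, although it says nothing about whether the MAR assumption holds.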
Primary Imputed Analysis
| Term | Estimate (kg) | 95% CI |
|---|---|---|
| Quit smoking | 3.24 | 2.36 to 4.11 |
| Age (per year) | -0.22 | -0.28 to -0.15 |
| Smoking years | 0.07 | 0.00 to 0.13 |
| Baseline BMI | -3.92 | -6.74 to -1.10 |
| Moderately active vs very active | -1.18 | -1.99 to -0.38 |
These pooled coefficients come from the expanded regression fitted to each completed dataset; only selected terms are shown.
Assumptions and Interpretation
None of the missing-data strategies below are assumption-free. The right question is not “Which method is unbiased by construction?” but “Which method makes the most defensible assumptions for the scientific context?”
| Strategy | What it targets | Key assumption | Main risk if wrong |
|---|---|---|---|
| Observed-outcome analysis | Association in the subset with observed follow-up and complete data for the fitted model. | People retained in the observed-data subset are still a reasonable stand-in for the target population after adjustment. | The estimate quietly shifts to a selected subpopulation without making that change explicit. |
| Observed-outcome analysis with IPCW | Selection into observed follow-up when outcome observation is explainable from included baseline variables. | Outcome observation is conditionally independent of the missing outcome given measured baseline predictors in the censoring model. | If missingness still depends on unmeasured or unmodeled factors, censoring weights may not remove selection bias. |
| Multiple imputation | Missing baseline covariates and missing outcomes under an imputation model that makes MAR plausible. | Given the imputation model variables, the remaining missingness mechanism is close enough to missing at random to support inference. | If missing values depend on unobserved quantities even after conditioning on the model, imputed values may look precise but still be biased. |
Transportability and Selection Bias
Missingness changes interpretation even before it changes coefficients.
If we analyze only people with observed outcome data, our estimate is implicitly about the follow-up-complete subset. That may be close to the baseline cohort, or it may not. IPCW tries to recover the original cohort by reweighting observed participants. Multiple imputation instead tries to reconstruct the missing pieces directly under a model-based MAR story. Both approaches are attempts to preserve the link back to the original eligible population.
That matters for transportability because the reported effect may otherwise be interpreted as though it still refers to the original study cohort. If attrition or complete-case restriction has silently narrowed the represented population, the interpretation shifts even when the effect estimate itself does not move very much.
Methods
Analytic Workflow
- Define the smoking-cessation estimand in the NHEFS cohort.
- Summarize where missingness appears across the outcome and expanded baseline covariates.
- Model outcome observation as a function of fully observed baseline predictors and treatment to diagnose attrition.
- Fit a stabilized IPCW comparison among participants with observed follow-up.
- Fit an expanded complete-case regression that requires observed outcome plus complete baseline covariates.
- Run multiple imputation with mice and pool the expanded regression across completed datasets.
- Compare how the estimated smoking-cessation contrast changes across those strategies.
Additional Analyses
- increase the number of imputations beyond this lightweight demonstration run
- inspect trace plots and Monte Carlo error more formally
- consider passive imputation or tighter structural handling of derived variables if the model becomes more complex
- add sensitivity analyses for departures from MAR, such as delta-adjusted outcome imputation
- pair the missing-data workflow with a weighted causal estimand rather than only regression adjustment
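The delta-adjustment idea in the list above can be sketched as follows: shift the imputed outcomes in one arm by a fixed amount and see how far the contrast moves. This is a minimal illustration on a single hypothetical completed dataset, using an unadjusted mean difference in place of the regression model. In practice the shift would be applied within each imputation and the results pooled.

```python
from statistics import mean


def delta_adjusted_contrast(rows, delta):
    """Mean outcome difference (treated minus untreated) after shifting imputed outcomes.

    Each row is (treated, outcome, was_imputed). Imputed outcomes in the
    treated arm are shifted by `delta` to mimic a departure from MAR.
    """
    adjusted = [
        (treated, y + delta if (was_imputed and treated) else y)
        for treated, y, was_imputed in rows
    ]
    treated_mean = mean(y for treated, y in adjusted if treated)
    control_mean = mean(y for treated, y in adjusted if not treated)
    return treated_mean - control_mean


# Hypothetical completed dataset: (treated, weight change in kg, was_imputed)
rows = [
    (True, 4.0, False), (True, 5.0, False), (True, 3.0, True), (True, 4.0, True),
    (False, 1.0, False), (False, 2.0, False), (False, 3.0, True), (False, 2.0, False),
]

for delta in (0.0, -1.0, -2.0):
    print(f"delta = {delta:+.1f} kg -> contrast = {delta_adjusted_contrast(rows, delta):.2f} kg")
```

Scanning a range of delta values shows how strong a departure from MAR would be needed to change the substantive conclusion.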
References
- van Buuren S. Flexible Imputation of Missing Data. This is the main practical reference for multiple imputation and the logic used by mice.
- Rubin DB. Multiple Imputation for Nonresponse in Surveys. This remains the canonical reference for pooled inference after multiple imputation.
- Hernán MA, Robins JM. Causal Inference: What If. This is the main causal framing reference for exchangeability, positivity, and selection into observed follow-up.
- Seaman SR, White IR. Review articles on inverse probability weighting for missing data provide the conceptual basis for the IPCW comparison shown here.
- mice package documentation is the direct software reference for the imputation workflow used here.
Conclusions
- Outcome attrition and baseline covariate missingness both affect this analysis.
- IPCW is useful here as an attrition diagnostic and as a reweighting correction when the outcome-observation model is credible.
- Multiple imputation is the more complete primary strategy because it can handle both missing outcomes and incomplete baseline confounders together.
- Neither IPCW nor MI removes the need to state assumptions clearly; each only moves the problem into a more explicit model.