Real-World Missing Data Workflow

NHEFS smoking cessation study with incomplete covariates and follow-up loss

Data Source

Data came from causaldata::nhefs, a longitudinal follow-up study of baseline smokers. For the present analysis, treatment was defined as smoking cessation and the outcome was defined as 11-year weight change.

The analytic cohort comprised 1,629 observations. Follow-up weight change was observed for 96.1% of the cohort, and several baseline covariates in the expanded adjustment set also contained missing values.

Research Question

Among baseline smokers, what would the average 11-year weight change have been under smoking cessation versus continued smoking?

The objective was to evaluate the extent to which incomplete follow-up and incomplete baseline covariates influenced estimation of this contrast under complete-case analysis, inverse probability of censoring weights, and multiple imputation.

Analytic Overview

Issue | Approach | Rationale
Primary analysis | Use the expanded covariate model after multiple imputation as the primary estimate. | It preserves the full baseline cohort and handles missing outcomes and incomplete covariates together.
Supportive sensitivity check | Keep IPCW as a follow-up selection diagnostic and supportive comparison. | It checks whether outcome attrition is materially shifting the estimand among those retained in follow-up.
What to avoid as the lead result | Do not treat strict complete-case analysis as the default main result. | It changes the represented population and can discard informative baseline variables when they are incomplete.
Main assumption to say out loud | State that the primary analysis relies on a missing-at-random story conditional on the imputation model variables. | That assumption is doing real work and readers need to know which variables make it more plausible.

The primary analysis was specified as the expanded covariate model after multiple imputation. Inverse probability of censoring weights were used as a supportive analysis for attrition-related selection, and complete-case analysis was retained as a comparison rather than the principal result.

Study Setting

Several features of the data are directly relevant to the missing-data problem:

  • 1982 weight change is missing for 63 of 1,629 baseline smokers (3.9%)
  • income, blood pressure, and cholesterol also have missing values (1.0% to 5.0% each)
  • observed follow-up differs by treatment group: 94.2% of quitters versus 96.8% of continuing smokers have an observed outcome

Taken together, these features distinguish two related but analytically separate problems:

  1. missing outcomes due to attrition or censoring
  2. missing baseline covariates that matter for confounding control

Cohort Characteristics

Table 1. Baseline characteristics and follow-up status
Overall N = 1,629; observed follow-up outcome = 96.1%. Values are mean (SD) or n / N (%).

Characteristic | Overall (N = 1,629) | Continued smoking (N = 1,201) | Quit smoking (N = 428)
Age in 1971 | 43.9 (12.2) | 42.9 (11.9) | 46.7 (12.5)
Sex | | |
    Female | 799 / 1,629 (49%) | 562 / 1,201 (47%) | 237 / 428 (55%)
    Male | 830 / 1,629 (51%) | 639 / 1,201 (53%) | 191 / 428 (45%)
Race | | |
    White | 1,414 / 1,629 (87%) | 1,024 / 1,201 (85%) | 390 / 428 (91%)
    Black | 215 / 1,629 (13%) | 177 / 1,201 (15%) | 38 / 428 (8.9%)
Highest grade of regular school ever in 1971 | 11.1 (3.1) | 11.1 (3.0) | 11.2 (3.3)
Number of cigarettes smoked per day in 1971 | 20.6 (11.8) | 21.2 (11.6) | 18.8 (12.3)
Years of smoking | 24.9 (12.2) | 24.3 (11.8) | 26.6 (13.0)
Weight in kilograms in 1971 | 71.1 (15.7) | 70.5 (15.6) | 72.6 (16.1)
Weight in kilograms in 1971 | 1.7 (0.3) | 1.7 (0.3) | 1.8 (0.4)
Exercise | | |
    0 | 317 / 1,629 (19%) | 247 / 1,201 (21%) | 70 / 428 (16%)
    1 | 677 / 1,629 (42%) | 496 / 1,201 (41%) | 181 / 428 (42%)
    2 | 635 / 1,629 (39%) | 458 / 1,201 (38%) | 177 / 428 (41%)
Active | | |
    0 | 729 / 1,629 (45%) | 547 / 1,201 (46%) | 182 / 428 (43%)
    1 | 738 / 1,629 (45%) | 540 / 1,201 (45%) | 198 / 428 (46%)
    2 | 162 / 1,629 (9.9%) | 114 / 1,201 (9.5%) | 48 / 428 (11%)
Income | | |
    11 | 29 / 1,567 (1.9%) | 24 / 1,164 (2.1%) | 5 / 403 (1.2%)
    12 | 60 / 1,567 (3.8%) | 47 / 1,164 (4.0%) | 13 / 403 (3.2%)
    13 | 68 / 1,567 (4.3%) | 52 / 1,164 (4.5%) | 16 / 403 (4.0%)
    14 | 63 / 1,567 (4.0%) | 44 / 1,164 (3.8%) | 19 / 403 (4.7%)
    15 | 83 / 1,567 (5.3%) | 67 / 1,164 (5.8%) | 16 / 403 (4.0%)
    16 | 78 / 1,567 (5.0%) | 65 / 1,164 (5.6%) | 13 / 403 (3.2%)
    17 | 65 / 1,567 (4.1%) | 46 / 1,164 (4.0%) | 19 / 403 (4.7%)
    18 | 284 / 1,567 (18%) | 202 / 1,164 (17%) | 82 / 403 (20%)
    19 | 417 / 1,567 (27%) | 295 / 1,164 (25%) | 122 / 403 (30%)
    20 | 220 / 1,567 (14%) | 172 / 1,164 (15%) | 48 / 403 (12%)
    21 | 114 / 1,567 (7.3%) | 87 / 1,164 (7.5%) | 27 / 403 (6.7%)
    22 | 86 / 1,567 (5.5%) | 63 / 1,164 (5.4%) | 23 / 403 (5.7%)
    Missing | 62 | 37 | 25
Systolic blood pressure in 1982 | 128.7 (19.1) | 127.7 (18.8) | 131.7 (19.6)
    Missing | 77 | 42 | 35
Diastolic blood pressure in 1982 | 77.7 (10.6) | 77.4 (10.5) | 78.9 (10.8)
    Missing | 81 | 44 | 37
Serum cholesterol (mg/100 mL) in 1971 | 220.0 (45.4) | 218.9 (45.1) | 223.0 (46.2)
    Missing | 16 | 14 | 2
Follow-up status | | |
    Observed | 1,566 / 1,629 (96%) | 1,163 / 1,201 (97%) | 403 / 428 (94%)
    Missing | 63 / 1,629 (3.9%) | 38 / 1,201 (3.2%) | 25 / 428 (5.8%)

Within this analysis, the built-in censored field in nhefs aligned with rows missing 1982 weight change, so missing wt82_71 was treated as the attrition indicator for outcome follow-up.

Missing Data Patterns

The principal sources of incompleteness were:

  • missing 1982 weight change, which affects whether a participant contributes to the observed-outcome analysis
  • missing baseline covariates such as income, blood pressure, and cholesterol, which shrink a complete-case adjustment set even further

Variable | Missing N | Missing % | Domain
Diastolic blood pressure | 81 | 5.0% | Baseline covariates with missingness
Systolic blood pressure | 77 | 4.7% | Baseline covariates with missingness
11-year weight change | 63 | 3.9% | Outcome follow-up
Income | 62 | 3.8% | Baseline covariates with missingness
Cholesterol | 16 | 1.0% | Baseline covariates with missingness
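
The percentages in this table follow directly from the missing counts and the cohort size of 1,629. The following is a minimal sketch of that arithmetic in Python, using only the standard library. It is an illustration, not the code that produced the table, and the helper name missing_percent is hypothetical.

```python
# Missing counts copied from the table above; the cohort size is 1,629.
COHORT_N = 1629

missing_counts = {
    "Diastolic blood pressure": 81,
    "Systolic blood pressure": 77,
    "11-year weight change": 63,
    "Income": 62,
    "Cholesterol": 16,
}


def missing_percent(n_missing: int, n_total: int) -> float:
    """Return the share of missing rows as a percentage, rounded to one decimal."""
    return round(100 * n_missing / n_total, 1)


missing_pct = {name: missing_percent(n, COHORT_N) for name, n in missing_counts.items()}

for name, pct in missing_pct.items():
    print(f"{name}: {pct}%")
```

Each variable is missing for 5% of the cohort or less, but the gaps fall on different participants, which is why the complete-case subset shrinks by more than any single row suggests.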

Common Patterns of Missingness

mice::md.pattern() is a compact way to show whether missingness tends to occur one variable at a time or in recurring bundles.

These patterns show that a strict complete-case analysis discards observations for more than one reason, including both outcome attrition and missing baseline covariates.
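
The idea behind a missingness-pattern table can be sketched without any imputation software. The example below is written in Python rather than R and is not a reimplementation of mice::md.pattern(). It simply counts how often each combination of missing variables occurs. The rows are hypothetical toy values, and the column names income and sbp are assumed for illustration.

```python
from collections import Counter

# Hypothetical toy rows (not NHEFS records); None marks a missing value.
rows = [
    {"wt82_71": 2.5, "income": 19, "sbp": 120},
    {"wt82_71": None, "income": 18, "sbp": 131},
    {"wt82_71": 4.1, "income": None, "sbp": None},
    {"wt82_71": -1.0, "income": 20, "sbp": 118},
    {"wt82_71": 0.3, "income": None, "sbp": None},
    {"wt82_71": None, "income": 17, "sbp": None},
]


def missing_pattern(row: dict) -> tuple:
    """Return the sorted names of the variables that are missing in one row."""
    return tuple(sorted(name for name, value in row.items() if value is None))


pattern_counts = Counter(missing_pattern(row) for row in rows)

# Most frequent patterns first; the empty tuple is the complete-case pattern.
for pattern, count in pattern_counts.most_common():
    label = ", ".join(pattern) if pattern else "(complete)"
    print(f"{count} row(s) missing: {label}")

complete_cases = pattern_counts[()]
```

In this toy data only two of six rows are complete, even though no single variable is missing in more than half of the rows. A strict complete-case analysis loses rows for the same reason.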

Predictors of Missingness

Missingness is easier to discuss when it is turned into an explicit outcome. The table below fits separate logistic models for whether each key variable is missing, using the same fully observed baseline predictors. This helps show whether missingness is plausibly related to treatment status or baseline prognosis rather than being completely haphazard.

Variable | Missing % | Quitter OR for missingness | 95% CI | Strongest baseline signal
Diastolic blood pressure | 5.0% | 2.18 | 1.35 to 3.50 | Baseline BMI (OR 0.49)
Systolic blood pressure | 4.7% | 2.17 | 1.34 to 3.53 | Baseline BMI (OR 0.36)
11-year weight change | 3.9% | 1.64 | 0.95 to 2.83 | Some exercise vs much exercise (OR 0.37)
Income | 3.8% | 1.86 | 1.07 to 3.22 | Baseline BMI (OR 3.14)
Cholesterol | 1.0% | 0.45 | 0.10 to 2.03 | Black vs white (OR 0.00)

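These results suggest that missingness is related to treatment status and baseline covariates: quitters had roughly twice the odds of missing blood pressure and income values, with confidence intervals that exclude 1. That weakens the case for treating the missing-data process as negligible. The cholesterol row rests on only 16 missing values, so its estimates, including the odds ratio of 0.00 for Black versus white participants, are unstable and should not be over-interpreted.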

Attrition Analyses

Complete-case analysis on the outcome implicitly changes the target population from the full baseline cohort to the people who both remained in follow-up and had an observed 1982 weight change. That population shift matters if the probability of remaining observed depends on treatment or baseline prognosis.

Follow-Up by Treatment Group

Treatment group | N | Observed | Missing | Observed %
Continued smoking | 1,201 | 1,163 | 38 | 96.8%
Quit smoking | 428 | 403 | 25 | 94.2%

The difference is modest (94.2% of quitters versus 96.8% of continuing smokers have an observed outcome), but it is directionally important: quitters are somewhat less likely to have observed follow-up. That alone does not prove problematic selection bias, but it is enough to justify checking whether the observed-outcome analysis is drifting away from the original baseline cohort.
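
As a rough check on the size of this difference, the counts in the follow-up table can be turned into a crude (unadjusted) odds ratio for a missing outcome. The Python sketch below uses only the standard library and a Wald interval on the log-odds scale. It is an illustration under those assumptions, not the adjusted logistic model reported elsewhere in this analysis, so its result should not be expected to match the adjusted odds ratio of 1.64.

```python
import math

# Counts from the follow-up table above.
quit_missing, quit_observed = 25, 403
cont_missing, cont_observed = 38, 1163

# Crude odds ratio for a missing outcome, quitters vs continued smokers.
odds_ratio = (quit_missing / quit_observed) / (cont_missing / cont_observed)

# Wald interval on the log-odds scale (Woolf's method).
se_log_or = math.sqrt(
    1 / quit_missing + 1 / quit_observed + 1 / cont_missing + 1 / cont_observed
)
z = 1.96  # approximate 97.5th percentile of the standard normal
ci_low = math.exp(math.log(odds_ratio) - z * se_log_or)
ci_high = math.exp(math.log(odds_ratio) + z * se_log_or)

print(f"Crude OR = {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The crude odds ratio comes out near 1.9. Adjusting for baseline covariates pulls it toward the adjusted value, which is one reason to model outcome observation explicitly rather than rely on the two-by-two table alone.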

Model for Outcome Observation

The logistic model below predicts whether the outcome is observed using fully observed baseline covariates plus treatment status. This is a diagnostic tool, not a proof that missingness is MAR.

Predictor | Odds ratio | 95% CI
Inactive vs very active | 0.49 | 0.23 to 1.03
Quit smoking | 0.61 | 0.35 to 1.06
Baseline BMI | 0.75 | 0.12 to 4.78
Black vs white | 0.93 | 0.41 to 2.10
Male vs female | 0.94 | 0.43 to 2.03
Smoking years | 0.97 | 0.93 to 1.01
Baseline weight | 0.99 | 0.94 to 1.03
Age (per year) | 0.99 | 0.95 to 1.03
Cigarettes per day | 0.99 | 0.97 to 1.02
Years of education | 1.03 | 0.95 to 1.12
Moderately active vs very active | 1.15 | 0.63 to 2.12
Little or no exercise vs much exercise | 1.88 | 0.93 to 3.80
Some exercise vs much exercise | 2.71 | 1.29 to 5.70

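The strongest visible pattern here is exercise level: participants reporting some exercise had higher odds of an observed outcome than those reporting much exercise (OR 2.71, 95% CI 1.29 to 5.70), the only interval in the table that excludes 1. Quitting smoking (OR 0.61) and inactivity (OR 0.49) point toward lower odds of observation, but their intervals include 1, and baseline weight shows essentially no association (OR 0.99). Baseline characteristics that predict follow-up are exactly the kind of pattern that makes naive observed-outcome analyses hard to defend without further discussion.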

Inverse Probability of Censoring Weights

Inverse probability of censoring weights are one way to reweight the observed-outcome sample back toward the baseline cohort, as long as the follow-up model includes the variables that jointly explain observation and outcome.

IPCW summary statistic | Value
Minimum | 0.966
25th percentile | 0.976
Median | 0.987
Mean | 0.999
75th percentile | 1.008
Maximum | 1.317

The stabilized weights are well behaved: they range from 0.966 to 1.317 with a mean of 0.999, so no participant receives an extreme weight. This suggests that IPCW is acting mainly as a diagnostic correction rather than creating a dramatic reweighting of the cohort.
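
To make the mechanics concrete, the following Python sketch computes stabilized censoring weights in a deliberately simple setting: one treatment arm, one categorical baseline covariate, and observation probabilities taken from stratum proportions instead of a fitted logistic model. All counts and outcome means are hypothetical. The sketch also assumes that, within a stratum, participants with and without follow-up share the same mean outcome, which is the missing-at-random condition the weights rely on.

```python
# Hypothetical strata for one treatment arm (not NHEFS counts): each entry is
# (baseline covariate level, n at baseline, n with observed outcome,
#  mean outcome in the stratum).
strata = [
    ("low exercise", 100, 95, 4.0),
    ("high exercise", 100, 80, 1.0),
]

n_baseline = sum(n for _, n, _, _ in strata)
n_observed = sum(n_obs for _, _, n_obs, _ in strata)

# Numerator of the stabilized weight: marginal probability of being observed.
p_observed_marginal = n_observed / n_baseline

weights = {}
weighted_sum = 0.0
weight_total = 0.0
unweighted_sum = 0.0
for level, n, n_obs, mean_outcome in strata:
    # Denominator: probability of being observed given the covariate level.
    p_observed_given_level = n_obs / n
    sw = p_observed_marginal / p_observed_given_level
    weights[level] = sw
    weighted_sum += sw * n_obs * mean_outcome
    weight_total += sw * n_obs
    unweighted_sum += n_obs * mean_outcome

baseline_mean = sum(n * m for _, n, _, m in strata) / n_baseline
observed_mean = unweighted_sum / n_observed
ipcw_mean = weighted_sum / weight_total
mean_weight = weight_total / n_observed

print(f"Stabilized weights by stratum: {weights}")
print(f"Baseline cohort mean: {baseline_mean:.3f}")
print(f"Observed-only mean:   {observed_mean:.3f}")
print(f"IPCW-weighted mean:   {ipcw_mean:.3f}")
print(f"Mean weight:          {mean_weight:.3f}")
```

In this toy setting the observed-only mean drifts away from the baseline cohort mean because the stratum with more attrition has a different outcome level, and the weighted mean recovers the baseline value. The weights also average 1 among observed participants, which is the property the summary table checks.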

Comparison of Missing-Data Strategies

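The comparison below keeps the target contrast (quitting versus continued smoking) fixed while changing how missingness is handled. The first two rows adjust for baseline demographic and smoking-history covariates; the last two use the expanded covariate set, so differences between the two pairs reflect the covariate set as well as the missing-data strategy.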

Analysis | Population | N | Estimate | 95% CI
Observed outcome model with baseline demographic and smoking-history covariates | People with observed 1982 weight change | 1,566 | 3.32 kg | 2.46 to 4.18 kg
Observed outcome model with baseline demographic and smoking-history covariates plus IPCW | People with observed 1982 weight change, reweighted toward the full baseline cohort | 1,566 | 3.31 kg | 2.45 to 4.18 kg
Complete-case model with expanded baseline covariates including income and clinical measures | Subset complete on outcome and all expanded covariates | 1,461 | 3.12 kg | 2.22 to 4.02 kg
Multiple-imputation model with expanded baseline covariates including income and clinical measures | Full eligible baseline cohort under the imputation model | 1,629 | 3.24 kg | 2.36 to 4.11 kg

A few things stand out:

  • the observed-outcome model with baseline demographic and smoking-history covariates and the corresponding IPCW analysis are very similar, which suggests attrition reweighting is not radically changing the answer here
  • the complete-case model with expanded baseline covariates uses a smaller subset because it requires observed outcome plus complete income and clinical measures
  • the multiple-imputation model with the same expanded covariate set recovers the full baseline cohort under the imputation model and produces a slightly larger estimated effect than the expanded complete-case fit

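That pattern illustrates why complete-case analysis should rarely be treated as the default primary analysis when a plausible imputation model can be justified: here it drops 168 of 1,629 participants (10.3%), even though the point estimates stay within about 0.2 kg of one another.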

Rationale for Multiple Imputation as the Primary Analysis

Multiple imputation is the more natural primary strategy here for two reasons:

  1. It addresses both baseline covariate gaps and missing outcomes in one coherent model.
  2. It keeps the analysis closer to the eligible baseline cohort rather than redefining the estimand around a fully observed subset.

The imputation analysis shown here uses 5 imputations and 5 iterations.

Imputation Diagnostics

The first diagnostic question is not whether imputation is perfect. It is whether the completed datasets are producing obviously implausible shifts relative to the observed data.

Variable | Missing % | Observed mean | Average completed mean | Shift | SD across imputations
11-year weight change | 3.9% | 2.6 | 2.6 | 0.0 | 0.04
Systolic blood pressure | 4.7% | 128.7 | 129.0 | 0.3 | 0.11
Diastolic blood pressure | 5.0% | 77.7 | 77.8 | 0.0 | 0.09
Cholesterol | 1.0% | 220.0 | 220.0 | 0.0 | 0.10

Those small shifts are reassuring. They do not prove the MAR assumption, but they do show that the imputed values are not forcing the completed data into an obviously different scale or location.
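
A diagnostic of this kind can be sketched in a few lines. The Python example below uses hypothetical toy values, not NHEFS data, and assumes that "SD across imputations" means the standard deviation of the completed-data means across imputed datasets. That reading is an assumption about how the table was built.

```python
from statistics import mean, stdev

# Hypothetical toy values (not NHEFS data). None marks a missing value.
observed = [120.0, None, 135.0, 128.0, None, 117.0]

# Hypothetical imputed values for the two missing positions, one pair per
# imputed dataset (m = 3 here for brevity).
imputed_draws = [
    [126.0, 131.0],
    [122.0, 129.0],
    [133.0, 128.0],
]

observed_values = [v for v in observed if v is not None]
observed_mean = mean(observed_values)


def completed_mean(draws):
    """Mean of one completed dataset: observed values plus one set of imputations."""
    return mean(observed_values + draws)


completed_means = [completed_mean(draws) for draws in imputed_draws]
average_completed_mean = mean(completed_means)
shift = average_completed_mean - observed_mean
sd_across_imputations = stdev(completed_means)

print(f"Observed mean:          {observed_mean:.2f}")
print(f"Average completed mean: {average_completed_mean:.2f}")
print(f"Shift:                  {shift:.2f}")
print(f"SD across imputations:  {sd_across_imputations:.2f}")
```

A third of the toy values are imputed, so the shift is much larger than anything in the table above. With only 1% to 5% of values imputed, shifts near zero are what one would expect unless the imputation model is badly misspecified.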

Primary Imputed Analysis

Term | Estimate | 95% CI
Quit smoking | 3.24 | 2.36 to 4.11
Age (per year) | -0.22 | -0.28 to -0.15
Smoking years | 0.07 | 0.00 to 0.13
Baseline BMI | -3.92 | -6.74 to -1.10
Moderately active vs very active | -1.18 | -1.99 to -0.38

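These are selected coefficients from the primary model, pooled across the 5 imputed datasets. The quit-smoking row is the primary estimate: quitters gained an estimated 3.24 kg more than continuing smokers over follow-up (95% CI 2.36 to 4.11), conditional on the expanded covariate set.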
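
Pooling across imputed datasets follows Rubin's rules: the pooled estimate is the average of the per-imputation estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The Python sketch below applies those rules to hypothetical per-imputation values, not the fitted NHEFS results. It uses Rubin's large-sample degrees of freedom and a normal-approximation interval. Imputation software may apply a small-sample adjustment and a t reference distribution instead, so this is an illustration and not a reproduction of the pooled table.

```python
from statistics import NormalDist, mean, variance

# Hypothetical per-imputation results for one coefficient (not the fitted
# NHEFS values): point estimates and their squared standard errors.
estimates = [2.0, 2.4, 2.2, 1.8, 2.1]
within_variances = [0.25, 0.24, 0.26, 0.25, 0.25]

m = len(estimates)
pooled_estimate = mean(estimates)        # Q-bar: average point estimate
within = mean(within_variances)          # U-bar: average sampling variance
between = variance(estimates)            # B: sample variance across imputations
total_variance = within + (1 + 1 / m) * between
pooled_se = total_variance ** 0.5

# Rubin's large-sample degrees of freedom for the t reference distribution.
relative_increase = (1 + 1 / m) * between / within
degrees_of_freedom = (m - 1) * (1 + 1 / relative_increase) ** 2

# Normal approximation to the 95% interval (an assumption made to stay within
# the standard library; a t quantile would be slightly wider).
z = NormalDist().inv_cdf(0.975)
ci_low = pooled_estimate - z * pooled_se
ci_high = pooled_estimate + z * pooled_se

print(f"Pooled estimate: {pooled_estimate:.2f} (SE {pooled_se:.3f})")
print(f"Approximate 95% CI: {ci_low:.2f} to {ci_high:.2f}")
print(f"Degrees of freedom: {degrees_of_freedom:.1f}")
```

The between-imputation term is what keeps the pooled interval from being overconfident: it grows when the imputed datasets disagree, which is also why more than a handful of imputations is usually preferable for final results.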

Assumptions and Interpretation

None of the missing-data strategies below are assumption-free. The right question is not “Which method is unbiased by construction?” but “Which method makes the most defensible assumptions for the scientific context?”

Strategy | What it targets | Key assumption | Main risk if wrong
Observed-outcome analysis | Association in the subset with observed follow-up and complete data for the fitted model. | People retained in the observed-data subset are still a reasonable stand-in for the target population after adjustment. | The estimate quietly shifts to a selected subpopulation without making that change explicit.
Observed-outcome analysis with IPCW | Selection into observed follow-up when outcome observation is explainable from included baseline variables. | Outcome observation is conditionally independent of the missing outcome given measured baseline predictors in the censoring model. | If missingness still depends on unmeasured or unmodeled factors, censoring weights may not remove selection bias.
Multiple imputation | Missing baseline covariates and missing outcomes under an imputation model that makes MAR plausible. | Given the imputation model variables, the remaining missingness mechanism is close enough to missing at random to support inference. | If missing values depend on unobserved quantities even after conditioning on the model, imputed values may look precise but still be biased.

Transportability and Selection Bias

Missingness changes interpretation even before it changes coefficients.

If we analyze only people with observed outcome data, our estimate is implicitly about the follow-up-complete subset. That may be close to the baseline cohort, or it may not. IPCW tries to recover the original cohort by reweighting observed participants. Multiple imputation instead tries to reconstruct the missing pieces directly under a model-based MAR story. Both approaches are attempts to preserve the link back to the original eligible population.

That matters for transportability because the reported effect may otherwise be interpreted as though it still refers to the original study cohort. If attrition or complete-case restriction has silently narrowed the represented population, the interpretation shifts even when the effect estimate itself does not move very much.

Methods

Analytic Workflow

  1. Define the smoking-cessation estimand in the NHEFS cohort.
  2. Summarize where missingness appears across the outcome and expanded baseline covariates.
  3. Model outcome observation as a function of fully observed baseline predictors and treatment to diagnose attrition.
  4. Fit a stabilized IPCW comparison among participants with observed follow-up.
  5. Fit an expanded complete-case regression that requires observed outcome plus complete baseline covariates.
  6. Run multiple imputation with mice and pool the expanded regression across completed datasets.
  7. Compare how the estimated smoking-cessation contrast changes across those strategies.

Additional Analyses

  • increase the number of imputations beyond this lightweight demonstration run
  • inspect trace plots and Monte Carlo error more formally
  • consider passive imputation or tighter structural handling of derived variables if the model becomes more complex
  • add sensitivity analyses for departures from MAR, such as delta-adjusted outcome imputation
  • pair the missing-data workflow with a weighted causal estimand rather than only regression adjustment
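
The delta-adjustment idea in the list above can be illustrated with a small Python sketch. It shifts values imputed under MAR by a fixed offset delta and tracks how the overall mean responds. The data are hypothetical toy values and the function name delta_adjusted_mean is illustrative. In a full workflow the shift would be applied within each imputed dataset before refitting and pooling the outcome model.

```python
from statistics import mean

# Hypothetical toy data (not NHEFS records): observed outcomes plus values
# imputed under MAR for participants lost to follow-up.
observed_outcomes = [2.0, 4.0, 3.0, 1.0]
imputed_under_mar = [3.0, 2.0]


def delta_adjusted_mean(delta: float) -> float:
    """Mean outcome after shifting every imputed value by delta (in kg)."""
    shifted = [value + delta for value in imputed_under_mar]
    return mean(observed_outcomes + shifted)


# delta = 0 reproduces the MAR analysis; other values encode departures from it.
sensitivity = {
    delta: delta_adjusted_mean(delta) for delta in (-2.0, -1.0, 0.0, 1.0, 2.0)
}

for delta, value in sensitivity.items():
    print(f"delta = {delta:+.1f} kg -> mean = {value:.3f}")
```

Each unit of delta moves the overall mean by the fraction of values that were imputed (one third in this toy example). With under 4% of outcomes missing, as in this cohort, a fairly large delta would be needed to move the estimate materially.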

References

  • van Buuren S. Flexible Imputation of Missing Data. This is the main practical reference for multiple imputation and the logic used by mice.
  • Rubin DB. Multiple Imputation for Nonresponse in Surveys. This remains the canonical reference for pooled inference after multiple imputation.
  • Hernán MA, Robins JM. Causal Inference: What If. This is the main causal framing reference for exchangeability, positivity, and selection into observed follow-up.
  • Seaman SR, White IR. Review articles on inverse probability weighting for missing data provide the conceptual basis for the IPCW comparison shown here.
  • mice package documentation is the direct software reference for the imputation workflow used here.

Conclusions

  • Outcome attrition and baseline covariate missingness both affect this analysis.
  • IPCW is useful here as an attrition diagnostic and as a reweighting correction when the outcome-observation model is credible.
  • Multiple imputation is the more complete primary strategy because it can handle both missing outcomes and incomplete baseline confounders together.
  • Neither IPCW nor MI removes the need to state assumptions clearly; each only moves the problem into a more explicit model.