The goal of this package is to provide R users access to modern methods for non-probability samples when auxiliary information from the population or probability sample is available:
- inverse probability weighting estimators with possible calibration constraints (Chen et al. 2020),
- mass imputation estimators based on nearest neighbours (Yang et al. 2021), predictive mean matching (Chlebicki et al. 2025), non-parametric (Chen et al. 2022) and regression imputation (Kim et al. 2021),
- doubly robust estimators (Chen et al. 2020) with bias minimization (Yang et al. 2020).
The package allows for:
- variable section in high-dimensional space using SCAD (Yang et al.
2020), Lasso and MCP penalty (via the
ncvreg,Rcpp,RcppArmadillopackages), - estimation of variance using analytical and bootstrap approach (see Wu (2023)),
- integration with the
surveyandsrvyrpackages when probability sample is available (Lumley 2004, 2023; Freedman Ellis and Schneider 2024), - different links for selection (
logit,probitandcloglog) and outcome (gaussian,binomialandpoisson) variables.
Details on the use of the package can be found:
- see the working paper Chrostowski, Ł., Chlebicki, P., & Beręsewicz, M. (2025). nonprobsvy–An R package for modern methods for non-probability surveys. arXiv preprint arXiv:2504.04255 – forthcoming to the Journal of Statistical Software
- in the draft (and not proofread) version of the book Modern inference methods for non-probability samples with R,
- in example codes that reproduce papers available on github in the repository software tutorials.
You can install the recent version of nonprobsvy package from main
branch Github with:
pak::pkg_install("ncn-foreigners/nonprobsvy")or install the stable version from CRAN
install.packages("nonprobsvy")or development version from the dev branch
pak::pkg_install("ncn-foreigners/nonprobsvy@dev")Consider the following setting where two samples are available:
non-probability (denoted as
| Sample | Auxiliary variables |
Target variable |
Design ( |
|
|---|---|---|---|---|
|
|
1 | ? | ||
| … | ? | |||
| ? | ||||
|
|
? | |||
| … | ? | |||
| ? |
The current implementation does not use target-variable values from the
probability sample. Data structures where dependence and key arguments in
control_sel() are reserved for future overlap handling and currently
raise a “not yet implemented” error if set.
Suppose
- unit-level data is available for the non-probability sample
$S_{\text{NP}}$ , i.e. $(y_k,\boldsymbol{x}k)$ is available for all units $k \in S{\text{NP}}$, and population-level data is available for $\boldsymbol{x}1,\ldots,\boldsymbol{x}p$, denoted as $\tau{x{1}},\tau_{x_{2}},\ldots,\tau_{x_{p}}$ and population size$N$ is known. We can also consider situations where population data are estimated (e.g. on the basis of a survey to which we do not have access), - unit-level data is available for the non-probability sample
$S_{\text{NP}}$ and the probability sample$S_{\text{P}}$ , i.e. $(\boldsymbol{x}k,I{\text{NP}, k})$ is determined by the data:$I_{\text{NP}, k}=1$ if$k \in S_{\text{NP}}$ otherwise$I_{\text{NP}, k}=0$ ,$y_k$ is observed only for sample$S_{\text{NP}}$ and $\boldsymbol{x}k$ is observed in both $S{\text{NP}}$ and$S_{\text{P}}$ .
Supported target-variable types depend on the estimator family:
| Estimator family | Supported target variable Y |
|---|---|
| IPW | Numeric targets whose population mean is meaningful, including continuous, count, and 0/1 binary variables. No outcome model is fitted. |
Mass imputation with method_outcome = "glm" |
Continuous, count, or binary variables through family_outcome = "gaussian", "poisson", or "binomial". |
Mass imputation with method_outcome = "nn", "pmm", or "npar" |
Numeric targets; categorical, ordinal, survival, and other structured outcomes are not supported. |
| Doubly robust | GLM outcome models only; use family_outcome = "gaussian", "poisson", or "binomial". |
The compact examples below use the built-in admin non-probability
sample and jvs probability sample.
library(survey)
library(nonprobsvy)
data(admin)
data(jvs)
prob <- svydesign(
ids = ~1,
weights = ~weight,
strata = ~size + nace + region,
data = jvs
)
pop_totals <- colSums(model.matrix(~region + private + nace + size, jvs) * jvs$weight)|
Estimator |
Example code |
|---|---|
|
Mass imputation based on regression imputation |
nonprob(
outcome = single_shift ~ region + private + nace + size,
data = admin,
pop_totals = pop_totals,
method_outcome = "glm",
family_outcome = "binomial",
se = FALSE
) |
|
Inverse probability weighting |
nonprob(
selection = ~region + private + nace + size,
target = ~single_shift,
data = admin,
pop_totals = pop_totals,
method_selection = "logit",
se = FALSE
) |
|
Inverse probability weighting with calibration constraint |
nonprob(
selection = ~region + private + nace + size,
target = ~single_shift,
data = admin,
pop_totals = pop_totals,
method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1),
se = FALSE
) |
|
Doubly robust estimator |
nonprob(
selection = ~region + private + nace + size,
outcome = single_shift ~ region + private + nace + size,
data = admin,
pop_totals = pop_totals,
method_outcome = "glm",
family_outcome = "binomial",
se = FALSE
) |
|
Estimator |
Example code |
|---|---|
|
Mass imputation based on regression imputation |
nonprob(
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "glm",
family_outcome = "binomial",
se = FALSE
) |
|
Mass imputation based on nearest neighbour imputation |
nonprob(
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "nn",
control_outcome = control_out(k = 2),
se = FALSE
) |
|
Mass imputation based on predictive mean matching |
nonprob(
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "pmm",
se = FALSE
) |
|
Mass imputation based on regression imputation with variable selection (LASSO) |
nonprob(
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "glm",
family_outcome = "binomial",
control_outcome = control_out(penalty = "lasso"),
control_inference = control_inf(vars_selection = TRUE),
se = FALSE
) |
|
Inverse probability weighting |
nonprob(
selection = ~region + private + nace + size,
target = ~single_shift,
data = admin,
svydesign = prob,
method_selection = "logit",
se = FALSE
) |
|
Inverse probability weighting with calibration constraint |
nonprob(
selection = ~region + private + nace + size,
target = ~single_shift,
data = admin,
svydesign = prob,
method_selection = "logit",
control_selection = control_sel(est_method = "gee", gee_h_fun = 1),
se = FALSE
) |
|
Inverse probability weighting with calibration constraint with variable selection (SCAD) |
nonprob(
selection = ~region + private + nace + size,
target = ~single_shift,
data = admin,
svydesign = prob,
method_selection = "logit",
control_selection = control_sel(penalty = "SCAD"),
control_inference = control_inf(vars_selection = TRUE),
se = FALSE
) |
|
Doubly robust estimator |
nonprob(
selection = ~region + private + nace + size,
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "glm",
family_outcome = "binomial",
se = FALSE
) |
|
Doubly robust estimator with variable selection (SCAD) and bias minimization |
nonprob(
selection = ~region + private + nace + size,
outcome = single_shift ~ region + private + nace + size,
data = admin,
svydesign = prob,
method_outcome = "glm",
family_outcome = "binomial",
control_inference = control_inf(
vars_selection = TRUE,
vars_combine = TRUE,
bias_correction = TRUE
),
se = FALSE
) |
Simulate example data from the following paper: Kim, Jae Kwang, and Zhonglei Wang. “Sampling techniques for big data analysis.” International Statistical Review 87 (2019): S177-S191 [section 5.2]
library(survey)
library(nonprobsvy)
set.seed(1234567890)
N <- 1e6 ## 1000000
n <- 1000
x1 <- rnorm(n = N, mean = 1, sd = 1)
x2 <- rexp(n = N, rate = 1)
epsilon <- rnorm(n = N) # rnorm(N)
y1 <- 1 + x1 + x2 + epsilon
y2 <- 0.5*(x1 - 0.5)^2 + x2 + epsilon
p1 <- exp(x2)/(1+exp(x2))
p2 <- exp(-0.5+0.5*(x2-2)^2)/(1+exp(-0.5+0.5*(x2-2)^2))
flag_bd1 <- rbinom(n = N, size = 1, prob = p1)
flag_srs <- as.numeric(1:N %in% sample(1:N, size = n))
base_w_srs <- N/n
population <- data.frame(x1,x2,y1,y2,p1,p2,base_w_srs, flag_bd1, flag_srs, pop_size = N)
base_w_bd <- N/sum(population$flag_bd1)Declare svydesign object with survey package
sample_prob <- svydesign(ids= ~1, weights = ~ base_w_srs,
data = subset(population, flag_srs == 1),
fpc = ~ pop_size)
sample_prob
#> Independent Sampling design
#> svydesign(ids = ~1, weights = ~base_w_srs, data = subset(population,
#> flag_srs == 1), fpc = ~pop_size)or with the srvyr package
sample_prob <- srvyr::as_survey_design(.data = subset(population, flag_srs == 1),
weights = base_w_srs)
sample_probIndependent Sampling design (with replacement)
Called via srvyr
Sampling variables:
Data variables:
- x1 (dbl), x2 (dbl), y1 (dbl), y2 (dbl), p1 (dbl), p2 (dbl), base_w_srs (dbl), flag_bd1 (int), flag_srs (dbl)Estimate population mean of y1 based on doubly robust estimator using
IPW with calibration constraints and we specify that auxiliary variables
should not be combined for the inference.
result_dr <- nonprob(
selection = ~ x2,
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)Results
result_dr
#> A nonprob object
#> - estimator type: doubly robust
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9500 (se=0.0414, ci=(2.8689, 3.0312))
#> - variable y2: 1.5762 (se=0.0313, ci=(1.5150, 1.6375))Mass imputation estimator
result_mi <- nonprob(
outcome = y1 + y2 ~ x1 + x2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob
)Results
result_mi
#> A nonprob object
#> - estimator type: mass imputation
#> - method: glm (gaussian)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9498 (se=0.0420, ci=(2.8675, 3.0321))
#> - variable y2: 1.5760 (se=0.0326, ci=(1.5122, 1.6398))Inverse probability weighting estimator
For IPW, MLE without a fixed pop_size, pop_totals, or pop_means
uses the Hajek-type estimator. Supplying a fixed population size uses
the Horvitz-Thompson-type estimator, while IPW-GEE with a reference
survey uses sum(weights(svydesign)) as the denominator.
result_ipw <- nonprob(
selection = ~ x2,
target = ~y1+y2,
data = subset(population, flag_bd1 == 1),
svydesign = sample_prob)Results
result_ipw
#> A nonprob object
#> - estimator type: inverse probability weighting
#> - method: logit (mle)
#> - IPW point estimator: Hajek (denominator: estimated IPW weights = 1025062.6981)
#> - auxiliary variables source: survey
#> - vars selection: false
#> - variance estimator: analytic
#> - population size fixed: false
#> - naive (uncorrected) estimators:
#> - variable y1: 3.1817
#> - variable y2: 1.8087
#> - selected estimators:
#> - variable y1: 2.9248 (se=0.0500, ci=(2.8269, 3.0227))
#> - variable y2: 1.5517 (se=0.0499, ci=(1.4539, 1.6496))Work on this package is supported by the National Science Centre, OPUS 20 grant no. 2020/39/B/HS4/00941.
Chen, Sixia, Shu Yang, and Jae Kwang Kim. 2022. “Nonparametric Mass Imputation for Data Integration.” Journal of Survey Statistics and Methodology 10 (1): 1–24.
Chen, Yilin, Pengfei Li, and Changbao Wu. 2020. “Doubly Robust Inference with Nonprobability Survey Samples.” Journal of the American Statistical Association 115 (532): 2011–21. https://doi.org/10.1080/01621459.2019.1677241.
Chlebicki, Piotr, Łukasz Chrostowski, and Maciej Beręsewicz. 2025. Data Integration of Non-Probability and Probability Samples with Predictive Mean Matching. https://arxiv.org/abs/2403.13750.
Freedman Ellis, Greg, and Ben Schneider. 2024. Srvyr: ’Dplyr’-Like Syntax for Summary Statistics of Survey Data. https://CRAN.R-project.org/package=srvyr.
Kim, Jae Kwang, Seho Park, Yilin Chen, and Changbao Wu. 2021. “Combining Non-Probability and Probability Survey Samples Through Mass Imputation.” Journal of the Royal Statistical Society Series A: Statistics in Society 184 (3): 941–63. https://doi.org/10.1111/rssa.12696.
Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.
Lumley, Thomas. 2023. Survey: Analysis of Complex Survey Samples.
Wu, Changbao. 2023. “Statistical Inference with Non-Probability Survey Samples.” Survey Methodology 48 (2): 283–311. https://www150.statcan.gc.ca/n1/pub/12-001-x/2022002/article/00002-eng.htm.
Yang, Shu, Jae Kwang Kim, and Youngdeok Hwang. 2021. “Integration of Data from Probability Surveys and Big Found Data for Finite Population Inference Using Mass Imputation.” Survey Methodology 47 (1): 29–58. https://www150.statcan.gc.ca/n1/pub/12-001-x/2021001/article/00004-eng.htm.
Yang, Shu, Jae Kwang Kim, and Rui Song. 2020. “Doubly Robust Inference When Combining Probability and Non-Probability Samples with High Dimensional Data.” Journal of the Royal Statistical Society Series B: Statistical Methodology 82 (2): 445–65. https://doi.org/10.1111/rssb.12354.
