Consider a scenario with

  • N observations
  • P predictors (explanatory variables)
    • some categorical predictors have >20 categories

where the response is is bimodal with one mode at zero and the other is a positive, right-skewed distribution. The mode at zero accounts for the vast majority of observations.

The client states they want a

predictive model for the data.

Currently, their plan is to remove all the zero response observations and build a model for the remaining observations. They are interested in a SAS solution.


We discussed prediction for binary and continuous responses previously. The difficulty here is with the zero-inflation.

Although the client states they want a predictive model, we wonder whether they really want a model or whether they just want predictions and we proceed as if they just want predictions.

We can let the client know that if they throw out all the zero responses and build a model for the remaining observations, they will have a predictive model that is conditional on the response being zero. We believe this is unlikely to be what the client wants, but that they are moving this direction because they don’t know how to deal with the zero inflation.

Hurdle models

Our general suggestion is to split the problem up: 1) Build a model to predict whether a new response will be zero and 2) Conditional on the response not being zero, build a model to predict the non-zero response. These models are called hurdle models.

With these models, we can report three point estimates for each new observation: 1) the probability of being zero, 2) the expected value of the response if the observation is non-zero, and 3) the product of these two provide the marginal (as opposed to conditional) expected value.

Our suggestion is to use random forests for the prediction in both steps, but random forests are not available in our SAS license (to our knowledge). As the code to fit the models in R is pretty simple, we suggest using R to perform the analysis and education the client on how to use an R script.