R code

library("tidyverse")
theme_set(theme_bw())
library("randomForest")
library("nnet")
library("xgboost")
library("DT")

Let’s take a stab at predicting Wild Blueberry Yield. Here I will walk through the process I used to build a prediction method.

Data

Let’s read the training data and get an understanding of it. I am caching this chunk so that I can depend on it in future chunks. I am also making the caching depend on the file.
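
In knitr, one way to do this (a sketch; the chunk label is hypothetical) is the cache.extra chunk option, whose value invalidates the cache whenever it changes:

# Hypothetical chunk header: re-run this chunk whenever train.csv changes
# {r read-data, cache = TRUE, cache.extra = tools::md5sum("data/train.csv")}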

d <- read_csv("data/train.csv",
  # Good practice to formally define variable types.
  col_types = cols(
    id          = col_character(),
    .default    = col_double()
  )
)

Check to see if there are any missing values. Please note that this may depend on how “not available” data are recorded in the data set.

anyNA(d)
## [1] FALSE

It appears there are no missing data here.
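
If we were worried about sentinel codes for missingness, a quick check might look like this (a sketch; -999 and -1 are just guesses at conventions this data set does not appear to use):

# Count suspicious sentinel values that anyNA() would miss
d %>%
  summarize(across(where(is.numeric), ~ sum(.x %in% c(-999, -1))))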

The training data have 15289 observations on 18 variables. The variable names are

names(d)
##  [1] "id"                   "clonesize"            "honeybee"             "bumbles"              "andrena"             
##  [6] "osmia"                "MaxOfUpperTRange"     "MinOfUpperTRange"     "AverageOfUpperTRange" "MaxOfLowerTRange"    
## [11] "MinOfLowerTRange"     "AverageOfLowerTRange" "RainingDays"          "AverageRainingDays"   "fruitset"            
## [16] "fruitmass"            "seeds"                "yield"

which includes the response variable yield and the unique identifier id. The id is not going to be important for prediction; it is simply an identifier. I infer this because the ids are just sequential numbers:

all(diff(as.numeric(d$id)) == 1)
## [1] TRUE

Yield

Let’s take a look at yield first since it is our response variable.

ggplot(d, aes(x = yield)) +
  geom_histogram(aes(y = after_stat(density)),
    fill = "gray"
  ) +
  stat_function(
    fun = dnorm,
    args = list(
      mean = mean(d$yield),
      sd   = sd(d$yield)
    ),
    color = "blue",
    linewidth = 2
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This histogram looks slightly skewed to the left. Given the left-skewness and a max-to-min ratio below 10, I am not initially considering using the logarithm of yield as my response variable, but I may revisit that in the future.
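
For reference, the max-to-min ratio mentioned above:

# Rule of thumb: consider a log transform when this ratio exceeds ~10
max(d$yield) / min(d$yield)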

Apparently multiple observations take exactly the minimum or maximum yield value:

d %>% filter(yield == min(yield))
## # A tibble: 26 × 18
##    id    clonesize honeybee bumbles andrena osmia MaxOfUpperTRange MinOfUpperTRange AverageOfUpperTRange
##    <chr>     <dbl>    <dbl>   <dbl>   <dbl> <dbl>            <dbl>            <dbl>                <dbl>
##  1 287        25       0.5     0.38    0.5   0.63             69.7             42.1                 58.2
##  2 327        12.5     0.25    0.25    0.5   0.63             94.6             57.2                 79  
##  3 614        25       0.5     0.25    0.5   0.5              69.7             42.1                 58.2
##  4 1203       25       0.5     0.38    0.5   0.63             94.6             57.2                 79  
##  5 1503       25       0.5     0.25    0.75  0.75             94.6             57.2                 79  
##  6 2075       25       0.5     0.38    0.5   0.5              69.7             42.1                 58.2
##  7 3056       25       0.5     0.25    0.5   0.75             94.6             57.2                 79  
##  8 4159       12.5     0.25    0.25    0.25  0.25             94.6             57.2                 79  
##  9 4419       25       0.5     0.25    0.5   0.63             86               52                   71.9
## 10 4965       25       0.5     0.25    0.63  0.5              69.7             42.1                 58.2
## # ℹ 16 more rows
## # ℹ 9 more variables: MaxOfLowerTRange <dbl>, MinOfLowerTRange <dbl>, AverageOfLowerTRange <dbl>, RainingDays <dbl>,
## #   AverageRainingDays <dbl>, fruitset <dbl>, fruitmass <dbl>, seeds <dbl>, yield <dbl>
d %>% filter(yield == max(yield))
## # A tibble: 19 × 18
##    id    clonesize honeybee bumbles andrena osmia MaxOfUpperTRange MinOfUpperTRange AverageOfUpperTRange
##    <chr>     <dbl>    <dbl>   <dbl>   <dbl> <dbl>            <dbl>            <dbl>                <dbl>
##  1 749        12.5     0.25    0.25    0.5   0.5              94.6             57.2                 79  
##  2 771        12.5     0.25    0.25    0.63  0.75             69.7             42.1                 58.2
##  3 956        12.5     0.25    0.38    0.38  0.75             94.6             57.2                 79  
##  4 1353       25       0.5     0.25    0.5   0.75             77.4             46.8                 64.7
##  5 2063       12.5     0.25    0.25    0.5   0.63             86               52                   71.9
##  6 2646       12.5     0.25    0.38    0.38  0.75             69.7             42.1                 58.2
##  7 3324       25       0.5     0.38    0.5   0.75             69.7             42.1                 58.2
##  8 4519       12.5     0.25    0.25    0.5   0.5              86               52                   71.9
##  9 4795       25       0.5     0.25    0.38  0.5              69.7             42.1                 58.2
## 10 6240       12.5     0.25    0.25    0.63  0.75             94.6             57.2                 79  
## 11 7608       12.5     0.25    0.38    0.38  0.5              86               52                   71.9
## 12 8416       12.5     0.25    0.38    0.38  0.75             86               52                   71.9
## 13 9841       25       0.5     0.38    0.63  0.75             69.7             42.1                 58.2
## 14 12141      12.5     0.25    0.25    0.25  0.75             77.4             46.8                 64.7
## 15 12653      25       0.5     0.25    0.5   0.75             86               52                   71.9
## 16 13680      12.5     0.25    0.38    0.38  0.63             77.4             46.8                 64.7
## 17 13826      25       0.5     0.25    0.5   0.75             86               52                   71.9
## 18 13911      12.5     0.25    0.38    0.38  0.75             94.6             57.2                 79  
## 19 15267      12.5     0.25    0.25    0.75  0.75             86               52                   71.9
## # ℹ 9 more variables: MaxOfLowerTRange <dbl>, MinOfLowerTRange <dbl>, AverageOfLowerTRange <dbl>, RainingDays <dbl>,
## #   AverageRainingDays <dbl>, fruitset <dbl>, fruitmass <dbl>, seeds <dbl>, yield <dbl>

Thus these seem like forced min/max values. When it comes to prediction, we should probably ensure that we never predict values outside this range.
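
One way to enforce that at prediction time is a small clamping helper (a sketch; the submission code later uses ifelse directly):

# Clamp predictions to the observed training range of yield
clamp <- function(p, lo = min(d$yield), hi = max(d$yield)) {
  pmin(pmax(p, lo), hi)
}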

Explanatory variables

The original blueberry yield contest contains more information on the explanatory variables. The variables can be grouped by type:

  • clonesize: average size of shrub in the field (I think)
  • bees: density of bees in the field
    • honeybee
    • bumbles
    • andrena
    • osmia
  • temperature: summaries of temperatures during the season
    • MaxOfUpperTRange
    • MinOfUpperTRange
    • AverageOfUpperTRange
    • MaxOfLowerTRange
    • MinOfLowerTRange
    • AverageOfLowerTRange
  • rain:
    • RainingDays: the number of days when rain was greater than 0
    • AverageRainingDays: average amount of rain on rainy days (I think)

The original data did not include the remaining variables, but they are present in the test data set and therefore are features that can be used in the prediction. Other than the names, I have no more information about these variables, but they all seem related to the fruit (since seeds are inside the fruit):

  • fruitset
  • fruitmass
  • seeds

Clonesize

ggplot(d, aes(x = clonesize, y = yield)) +
  geom_point(position = position_jitter(width = 0.5))

I thought this was a continuous variable, but apparently it takes only a few distinct values. I went back and changed how I read the data in.

d %>%
  group_by(clonesize) %>%
  summarize(
    n = n(),
    mean = mean(yield),
    sd = sd(yield),
    max = max(yield),
    min = min(yield)
  )
## # A tibble: 6 × 6
##   clonesize     n  mean     sd   max   min
##       <dbl> <int> <dbl>  <dbl> <dbl> <dbl>
## 1      10       4 5424. 1222.  6773. 3826.
## 2      12.5  6717 6569. 1238.  8969. 1946.
## 3      20      56 5350. 1453.  8466. 2385.
## 4      25    8245 5646. 1225.  8969. 1946.
## 5      37.5   265 4193. 1285.  8248. 1946.
## 6      40       2 4003.   76.8 4058. 3949.

It seems a bit weird that multiple observations have exactly the max/min values. I went back and pointed this out in the yield section.

Bee

bee <- d %>% select(honeybee:osmia, yield)
summary(bee)
##     honeybee          bumbles          andrena           osmia            yield     
##  Min.   : 0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1946  
##  1st Qu.: 0.2500   1st Qu.:0.2500   1st Qu.:0.3800   1st Qu.:0.5000   1st Qu.:5128  
##  Median : 0.5000   Median :0.2500   Median :0.5000   Median :0.6300   Median :6117  
##  Mean   : 0.3893   Mean   :0.2868   Mean   :0.4927   Mean   :0.5924   Mean   :6025  
##  3rd Qu.: 0.5000   3rd Qu.:0.3800   3rd Qu.:0.6300   3rd Qu.:0.7500   3rd Qu.:7020  
##  Max.   :18.4300   Max.   :0.5850   Max.   :0.7500   Max.   :0.7500   Max.   :8969
bee_long <- bee %>%
  pivot_longer(honeybee:osmia)

ggplot(bee_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is clear discreteness in these data. Presumably this is due to counting the number of individual bees observed and dividing by the number of plants in the field (or by the number of days). Perhaps there were 50 plants but, in some fields, some plants died?

sort(unique(bee_long$value)) * 50
##  [1]   0.00   1.00   1.05   2.10   2.90   3.25   3.90   5.05   5.85   7.35  11.45  11.70  11.75  12.00  12.50  13.00
## [17]  14.65  19.00  20.45  24.50  25.00  26.85  28.00  29.25  30.30  31.00  31.50  35.35  37.50 332.00 921.50

Some of the honeybee observations are much larger than the rest.

bee %>% filter(honeybee > 15)
## # A tibble: 5 × 5
##   honeybee bumbles andrena osmia yield
##      <dbl>   <dbl>   <dbl> <dbl> <dbl>
## 1     18.4   0       0     0     5010.
## 2     18.4   0.293   0.234 0.058 4357.
## 3     18.4   0.042   0.63  0.021 4480.
## 4     18.4   0       0     0     5741.
## 5     18.4   0       0     0     2825.
pairs(bee)

cor(bee) %>% round(3)
##          honeybee bumbles andrena  osmia  yield
## honeybee    1.000  -0.018   0.031 -0.010 -0.118
## bumbles    -0.018   1.000  -0.165  0.158  0.161
## andrena     0.031  -0.165   1.000  0.310  0.074
## osmia      -0.010   0.158   0.310  1.000  0.198
## yield      -0.118   0.161   0.074  0.198  1.000

Temperature

Let’s take a look at the temperature variables.

temp <- d %>% select(MaxOfUpperTRange:AverageOfLowerTRange, yield)
temp_long <- temp %>%
  select(-yield) %>%
  pivot_longer(everything())

ggplot(temp_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Again, some clear discretization. There are basically 4-5 unique values for each temperature variable. Probably the fields are in 4-5 spatial locations that share the same temperature stations.

with(d, table(MaxOfUpperTRange, MinOfUpperTRange))
##                 MinOfUpperTRange
## MaxOfUpperTRange   39 42.1 46.8   52 57.2
##             69.7    0 3562    0    0    2
##             77.4    0    0 3788    0    0
##             79      0    0    0    0    1
##             86      0    0    0 4200    0
##             89      2    0    0    0    0
##             94.6    1    0    0    0 3733

Are the temperatures with only a few counts real or typos?

temp %>%
  # select(MaxOfUpperTRange:AverageOfLowerTRange) %>%
  cor() %>%
  round(3)
##                      MaxOfUpperTRange MinOfUpperTRange AverageOfUpperTRange MaxOfLowerTRange MinOfLowerTRange
## MaxOfUpperTRange                1.000            0.999                1.000            1.000            1.000
## MinOfUpperTRange                0.999            1.000                0.999            0.998            0.999
## AverageOfUpperTRange            1.000            0.999                1.000            0.999            1.000
## MaxOfLowerTRange                1.000            0.998                0.999            1.000            0.999
## MinOfLowerTRange                1.000            0.999                1.000            0.999            1.000
## AverageOfLowerTRange            1.000            0.999                1.000            0.999            1.000
## yield                          -0.023           -0.022               -0.022           -0.022           -0.022
##                      AverageOfLowerTRange  yield
## MaxOfUpperTRange                    1.000 -0.023
## MinOfUpperTRange                    0.999 -0.022
## AverageOfUpperTRange                1.000 -0.022
## MaxOfLowerTRange                    0.999 -0.022
## MinOfLowerTRange                    1.000 -0.022
## AverageOfLowerTRange                1.000 -0.022
## yield                              -0.022  1.000

These variables are highly correlated with each other and have low correlation with yield.

Perhaps we should construct a variable for the region of the field based on temperature. But then we should make sure these same temperature values are observed in the test data set.
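
A sketch of such a region variable (the name region is hypothetical; any one temperature column identifies the profile since they are nearly perfectly correlated):

# Hypothetical region feature: one level per distinct temperature profile
d %>%
  mutate(region = factor(MaxOfUpperTRange)) %>%
  count(region)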

Fruit

Let’s take a look at the remaining variables.

fruit <- d %>% select(fruitset:yield)
fruit_long <- fruit %>%
  pivot_longer(-yield)

ggplot(fruit_long, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~name, scales = "free")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

fruit %>%
  cor() %>%
  round(3)
##           fruitset fruitmass seeds yield
## fruitset     1.000     0.937 0.930 0.886
## fruitmass    0.937     1.000 0.932 0.826
## seeds        0.930     0.932 1.000 0.869
## yield        0.886     0.826 0.869 1.000
pairs(fruit)

Modeling

I wanted to use the caret package as it provides a consistent interface to fit a number of different methods, along with automatic parameter tuning. Unfortunately, for some methods it was taking too long to run. So, where possible, this package was used; elsewhere other packages were used with their default settings, i.e., no tuning.

Here we will set up a train/test split based on the Kaggle training data. We will use 80% of the observations for training and the remaining 20% for testing.

set.seed(20230503)

# Construct our own train and test
u <- sample(c(TRUE, FALSE),
  size = nrow(d),
  prob = c(.8, .2), # 80% for training, 20% for testing
  replace = TRUE
)

train <- d[u, ] %>% select(-id) # remove id to exclude it as an explanatory variable
test <- d[!u, ] %>% select(-id)

We can also set up a function to calculate the metric used to compare models. Since the competition uses mean absolute deviation/error, we will use this as our metric.

# Note: this masks stats::mad (median absolute deviation)
mad <- function(p) {
  mean(abs(p - test$yield))
}

LASSO

For all of the methods, we will train in one code chunk and cache the results.

m_lasso <- caret::train(yield ~ ., data = train, method = "lasso")

Then we will predict in another code chunk.

p_lasso <- predict(m_lasso, newdata = test)

Random forest

m_rf <- randomForest(yield ~ ., data = train)
p_rf <- predict(m_rf, newdata = test)

Neural network

m_nnet <- nnet(yield ~ .,
  data = train,
  size = 5,
  decay = 5e-4,
  rang = .01
) # rang*max(abs(x)) ~= 1
## # weights:  91
## initial  value 464010139325.433655 
## final  value 463936148972.903809 
## converged
# Code modified from
# https://stats.stackexchange.com/questions/209678/nnet-package-is-it-neccessary-to-scale-data
# m_nnet <- caret::train(yield ~ ., data = train, method = "nnet",
#                        preProcess = c("center","scale"),
#                          trace = FALSE,
#                         tuneGrid = expand.grid(size=1:8, decay=3**(-6:1)),
#                         trControl = trainControl(method = 'repeatedcv',
#                                                 number = 10,
#                                                 repeats = 10,
#                                                 returnResamp = 'final'))
# Need to unscale predictions
p_nnet <- predict(m_nnet, newdata = test)
unique(p_nnet)
##   [,1]
## 1    1

The predictions from this neural network are constant. I tried a variety of approaches and, so far, have failed to get anything other than a constant prediction.
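
My best guess is that the unscaled inputs, combined with a response in the thousands, saturate the network. A sketch of a possible fix, scaling both sides and back-transforming the predictions (untested here):

# Sketch (untested): standardize inputs, rescale yield to [0, 1],
# then undo the scaling on the predictions
x_train <- scale(train %>% select(-yield) %>% as.matrix())
y_rng   <- range(train$yield)

m_nnet2 <- nnet(
  x = x_train,
  y = (train$yield - y_rng[1]) / diff(y_rng),
  size = 5, decay = 5e-4
)

# Apply the training centers/scales to the test inputs
x_test <- scale(test %>% select(-yield) %>% as.matrix(),
  center = attr(x_train, "scaled:center"),
  scale  = attr(x_train, "scaled:scale")
)
p_nnet2 <- predict(m_nnet2, x_test) * diff(y_rng) + y_rng[1]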

xgbTree

m_xgbtree <- xgboost(
  data = train %>% select(-yield) %>% as.matrix(),
  label = train$yield,
  params = list(booster = "gbtree"),
  nrounds = 20, # Increasing this reduces train RMSE
  verbose = 0
)

# This code only increased MAD by 1
# m_xgbtree <- caret::train(yield ~ ., data = train, method = "xgbTree",
#                        verbose = 0)
p_xgbtree <- predict(m_xgbtree, newdata = as.matrix(test %>% select(-yield)))

xgbLinear

# m_xgblinear <- xgboost(data = train %>% select(-yield) %>% as.matrix,
#                      label = train$yield,
#                      params = list(booster = "gblinear"),
#                      nrounds = 10, # Increasing this reduces train RMSE
#                      verbose = 0)

m_xgblinear <- caret::train(yield ~ .,
  data = train, method = "xgbLinear",
  verbose = 0
)
p_xgblinear <- predict(m_xgblinear, newdata = as.matrix(test %>% select(-yield)))

Summary

Comparison

error <- tribble(
  ~method, ~`test-mad`,
  "lasso", mad(p_lasso),
  "rf", mad(p_rf),
  "nnet", mad(p_nnet),
  "xgbtree", mad(p_xgbtree),
  "xgblinear", mad(p_xgblinear)
)
error |>
  datatable(
    rownames = FALSE,
    caption = "Test Mean Absolute Deviation (MAD)"
  ) |>
  formatRound(columns = c("test-mad"), digits = 0)

It seems that the tree-based methods (rf and xgbtree) out-performed the linear methods (lasso and xgblinear). This could be due to some large values in the explanatory variables, e.g. honeybee, that would be hugely influential in any regression-type analysis.

Feature creation

For these slides, we did not modify the explanatory variables (features) at all. But we could consider

  1. removing highly correlated features
  2. creating additional features

The temperature variables have correlation that is close to 1 and thus we could eliminate all but one of these variables.
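
For example, caret’s findCorrelation flags columns to drop above a correlation cutoff (a sketch):

# Which temperature columns could be dropped as near-duplicates?
temp_cor <- cor(d %>% select(MaxOfUpperTRange:AverageOfLowerTRange))
caret::findCorrelation(temp_cor, cutoff = 0.99, names = TRUE)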

We could consider adding additional features, e.g. (a brief sketch follows this list)

  • principal components analysis
    • temperatures
    • fruit
  • differences in max/min temperature
  • total bee density
  • total rain
  • logarithms of explanatory variables
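
A sketch of a few of these (the new column names are hypothetical; rain_total assumes AverageRainingDays really is the average rain on rainy days):

# Hypothetical derived features; none of these were used in the models above
d_feat <- d %>%
  mutate(
    bee_total  = honeybee + bumbles + andrena + osmia,
    temp_range = MaxOfUpperTRange - MinOfLowerTRange,
    rain_total = RainingDays * AverageRainingDays
  )

# Principal components of the near-duplicate temperature variables
pc_temp <- prcomp(d %>% select(MaxOfUpperTRange:AverageOfLowerTRange),
  scale. = TRUE
)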

Tuning

For some of the methods, we tuned parameters to better fit our held-out test set. For the other methods, we could probably improve performance by tuning parameters; this was not done due to time constraints.
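
For example, the random forest’s mtry could be tuned by cross-validation (a sketch; this is the kind of run that was too slow to include here):

# Sketch: tune random forest mtry with 5-fold cross-validation
m_rf_tuned <- caret::train(yield ~ .,
  data = train, method = "rf",
  tuneGrid = expand.grid(mtry = c(2, 5, 8, 11, 14)),
  trControl = caret::trainControl(method = "cv", number = 5)
)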

Kaggle competition entry

For the Kaggle competition, we’ll take our best-performing method, fit it using the entire training data, and then predict for the test data.

train <- read_csv("data/train.csv") %>% select(-id)
## Rows: 15289 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (18): id, clonesize, honeybee, bumbles, andrena, osmia, MaxOfUpperTRange, MinOfUpperTRange, AverageOfUpperTRange...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
m <- xgboost(
  data = train %>% select(-yield) %>% as.matrix(),
  label = train$yield,
  params = list(booster = "gbtree"),
  nrounds = 20, # Increasing this reduces train RMSE
  verbose = 0
)
test <- read_csv("data/test.csv")
## Rows: 10194 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (17): id, clonesize, honeybee, bumbles, andrena, osmia, MaxOfUpperTRange, MinOfUpperTRange, AverageOfUpperTRange...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
prediction <- test %>%
  mutate(
    yield = predict(m, newdata = as.matrix(test %>% select(-id))),

    # Since the training yield data seem to have a fixed min/max
    # truncate data to that range
    yield = ifelse(yield > max(train$yield), max(train$yield), yield),
    yield = ifelse(yield < min(train$yield), min(train$yield), yield)
  )

write_csv(prediction %>% select(id, yield),
  file = "submission.csv"
)

Our out-of-sample error is estimated to be around 360 based on our study above, and our observed error on the 20% (see below) is around 354. This is much larger than the errors of the top predictions on the leaderboard. Note that the leaderboard is calculated with approximately 20% of the test data and the final results will be based on the other 80%. Thus, the top entries on the leaderboard are most likely overfitting that 20% and will therefore perform poorly on the remaining 80%.