In this chapter we’ll take a tour through the world of generalized additive models (GAMs) and see how these flexible models work in practice. Note that although this chapter gives a broad overview of GAMs, it does not cover anything close to “everything” one needs to know. I highly recommend CITE WOOD 2017 both as a starting point and as a useful reference for more complex modelling techniques. That book alone is a few hundred pages; what I can condense into a chapter here is the bare bones in comparison.

With that in mind, here we’ll look into some of the mathematics behind generalized additive models while keeping an eye on the practical implications that these technical details have on modelling biological populations.

What is a GAM?

Chances are, you’ve heard of the term “GAM” before. You might have seen it in a paper or had it suggested as an approach to an analysis. In common parlance, we often use the term “GAM” to mean “GLM with extra wiggly bits in it”, or smoothing. In this chapter we’ll take this definition and not explore too far outside it. However, it’s important to note that GAMs really encapsulate a large class of models and can include much more than just smoothing.

As a quick example of smoothing and to give an idea of the kinds of models we’ll fit, let’s see a simple example of fitting a smoother to some data. We can generate some sample data (left panel of Figure XXXX) using the gamSim function in the mgcv package, then use gam to fit a model to that data.

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
# simulate some data... (don't worry about understanding this right now)
set.seed(2)
dat <- gamSim(1, n=100, dist="normal", verbose=FALSE)
dat$y <- dat$f2 + rnorm(nrow(dat), sd=1)

# fit a model
b <- gam(y~s(x2), data=dat)

The syntax for fitting the model is very similar to that of a GLM; we just wrapped an s() around our covariate x2. A plot of our model is shown in the right panel of Figure XXXX. There’s a lot more going on inside this call to gam() than the simple syntax suggests, and we’ll explore this below.

Figure 1: An example of smoothing some data. Left shows the generating function and the data generated from it (function values plus some normal error). Right shows the fitted smooth (black line) with uncertainty (gray band) with the same data as points.

We’ll start thinking about GAMs1 by dissecting the term “generalized additive model”:

Response distributions

One of the things that was swept under the carpet in our simple example above is the assumption we made about the response distribution of the data. Let’s begin by thinking about the response, as this gets to the heart of what a GAM (and, indeed, a GLM or other model) is trying to do. What we’re really talking about is modelling the mean (or expected value) of the data. The “fancy” bits of our model (the smooths, random effects, etc.) help us explain the variation in the mean.

As we talked about in the previous chapter, the response for a DSM is a count or estimated abundance in a given segment. I’ll use the term “count” very generally for the rest of this section to mean both count and estimated abundance. Counts should be modelled using a “count distribution”: a distribution that can handle non-negative numbers (though they don’t necessarily need to be whole numbers).

What do we mean by “should be modelled using a count distribution”? We want to say that, given particular covariate values, we expect a particular count on average. We model the relationship between the counts and the covariates using the smooths (or other additive terms), which we’ll get to in the next section.

The classic count distribution you’ve probably heard of is the Poisson, but for the kind of data we collect from distance sampling surveys, we’ll require more flexible distributions. Figure XXXX shows a histogram of the counts per segment for the Gulf of Mexico dolphin data. We can observe some important (and common) features of the data:

  • Counts are mostly zero. Though this can differ based on the species, observers and land/seascape, one should expect many of the segments to have zero counts.
  • Overdispersion: for our canonical count distribution, the Poisson, we assume that the mean and variance are equal (usually quite a restrictive assumption). This is clearly not the case if we have many zero counts and then a series of non-zero (perhaps large) counts. To model data where the variance is greater than the mean (“overdispersed”), we require flexible mean-variance relationships — the three distributions we look at below have this property. A quick numerical check for overdispersion is sketched below.
Figure 2: Frequency of per-segment counts for the Gulf of Mexico dolphin data. Note the spike at zero (though the bin goes from 0 to 10, it contains only zeros).
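If you want to check for overdispersion in your own data, a quick (if rough) numerical check is to compare the mean and variance of the per-segment counts. Here is a minimal sketch; the counts vector below is made up purely for illustration and stands in for whatever holds your segment counts.

# a rough overdispersion check: compare the mean and variance of the
# per-segment counts (`counts` here is made-up illustrative data)
counts <- c(rep(0, 90), 1, 1, 2, 3, 5, 7, 12, 25, 40, 60)
mean(counts)
var(counts)
# for Poisson data these would be roughly equal; a variance much larger
# than the mean suggests overdispersion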

Generally speaking, the response is a count but is not always an integer. This can be because we used the estimated abundance as the response (which doesn’t guarantee whole numbers of animals), but can also be because we averaged group sizes from multiple observations (multi-observer cetacean cruises often average group sizes recorded by different observers to obtain a more robust estimate). Our response distribution should take this into account (which means that the Poisson distribution cannot be used, as it only accepts integer values).

We’re going to focus on three response distributions here: the quasi-Poisson, the Tweedie and the negative binomial distribution2.

Figure 3: Tweedie and negative binomial distribution densities over a range of counts. Different colours indicate different power parameters (Tweedie) or scale parameters (negative binomial). Both distributions are able to model spikes at zero.

Quasi-Poisson

The quasi-Poisson “distribution” is not really a distribution but a quick way of fitting data that have a particular mean-variance relationship3. Specifically, the quasi-Poisson assumes only that Var(count) = ϕ𝔼(count), where ϕ is referred to as a dispersion parameter that scales the mean so that the variance is large enough. One can think of the quasi-Poisson as a more flexible version of the Poisson distribution, but one that still only allows linear scaling between the mean and variance.

The quasi-Poisson distribution’s quasi nature is a bit of a drawback in practice. As quasi distributions don’t have probability density functions (CITE XXXX), it’s rather difficult to do some of the model checking that we’d really like to do for these models (see the following chapters for more on this). On the other hand, quasi-Poisson models are usually quick to put together and can be a good first step when building a model.
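To make this concrete, here is a minimal sketch of fitting a smooth with a quasi-Poisson response in mgcv; the data frame count_dat (with a count column and a covariate x) is hypothetical.

# a minimal sketch, assuming a hypothetical data frame `count_dat` with
# columns `count` and `x`
b_qp <- gam(count ~ s(x), data=count_dat, family=quasipoisson(link="log"))
b_qp$sig2  # the estimated dispersion parameter, phi-hat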

Tweedie

The Tweedie distribution (Tweedie, 1984; Candy, 2004; Shono, 2008) is not a single distribution at all, but rather a family of distributions, each obtained by setting a specific parameter (the power parameter). The Tweedie family incorporates the Poisson, gamma and normal distributions and things in between.

The Tweedie distributions we care about here are often referred to in the statistical literature as “Poisson sums of gamma random variables” (compound Poisson-gamma distributions), which may seem a little opaque. Intuitively, we can think of our survey platform (ship, person or plane) travelling to a given segment; at that segment we encounter some number of groups (Poisson distributed), each group contributes some amount to the count (gamma distributed), and adding those contributions together gives the count for the segment.

The power parameter, q, dictates which distribution we get and also tells us what the mean-variance relationship will be. The mean-variance relationship is given by Var(count) = ϕ𝔼(count)^q, where ϕ is a scale parameter (similar to the quasi-Poisson above). Setting q = 1 gives a Poisson distribution, q = 2 gives a gamma distribution and q = 0 gives a normal distribution. Once q drops below about 1.2, we see some odd behaviour from the distribution (it becomes multimodal, which seems unrealistic for our count data), so we are only interested in distributions with q between about 1.2 and 2. In practice there is not much difference between distributions whose q values differ beyond the first decimal place, so we really only need to think about q = 1.2, 1.3, …, 1.9, as shown in Figure XXXX. Luckily, we can estimate q during the fitting of our model4.
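In mgcv this translates into a choice of family: tw() estimates q as part of model fitting, while Tweedie() fixes it at a value you supply. A minimal sketch, again using the hypothetical count_dat from above:

# tw() estimates the power parameter q during fitting...
b_tw <- gam(count ~ s(x), data=count_dat, family=tw(link="log"))
b_tw$family$getTheta(trans=TRUE)  # the estimated q
# ...while Tweedie() fixes it at a supplied value
b_fixed <- gam(count ~ s(x), data=count_dat, family=Tweedie(p=1.5, link="log"))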

Negative binomial

The negative binomial distribution assumes a different mean-variance relationship to the Tweedie and quasi-Poisson: Var(count) = 𝔼(count) + κ𝔼(count)^2, where we estimate κ. The negative binomial therefore assumes a quadratic relationship between the mean and variance, which may be a rather strong assumption to make (though you might argue that the linear assumption of the quasi-Poisson is also strong). Ver Hoef and Boveng (2007) note that the negative binomial tends to up-weight observations with small (compared to the mean) counts relative to the quasi-Poisson; this property seems to be useful when fitting models where there are many large groups (for example seabirds and dolphins), and we’ll look into this in more detail in the following chapters. As with the Tweedie, we estimate κ during model fitting.
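Again, in mgcv terms this is just another family: nb() estimates the extra parameter during fitting (mgcv parameterises it as θ, with Var(count) = 𝔼(count) + 𝔼(count)^2/θ, so θ corresponds to 1/κ in the notation above). A minimal sketch with the hypothetical count_dat:

# a minimal sketch: negative binomial with the extra parameter estimated
b_nb <- gam(count ~ s(x), data=count_dat, family=nb(link="log"))
b_nb$family$getTheta(trans=TRUE)  # the estimated theta, on the natural scale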

Additive terms

Let’s start by thinking about the good old statistical workhorse, the linear model. Figure XXXX shows plots of two models fitted to the same data; the left panel shows a linear fit through the data (y~x), which seems to struggle a bit, while the right shows a better model with an additional term: the explanatory variable squared (y~x+I(x^2)). The true function that generates these data is exp(2x). The model with the squared term is a better fit, as it has some curvature (though we’re cheating here, as we know the data-generating process).

Figure 4: Predictions from fits to data generated from the function exp(2x) (with additional noise). Left: a linear regression, right: linear with an additional quadratic term.
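The two fits in Figure XXXX can be reproduced with something like the following sketch (the sample size and noise level here are made up):

# simulate data from exp(2x) with some noise (noise level is made up)
set.seed(1)
x <- seq(0, 1, length.out=100)
poly_dat <- data.frame(x=x, y=exp(2*x) + rnorm(100, sd=0.5))
# left panel: a straight line; right panel: adding a quadratic term
lin_fit  <- lm(y ~ x, data=poly_dat)
quad_fit <- lm(y ~ x + I(x^2), data=poly_dat)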

This “trick” of adding polynomial terms seems to work here, and could be extended to higher order terms (cubic, quartic, etc. polynomials), but is it sustainable? Could we fit something close to the model in Figure XXXX(first fig)? In short: no. There are a number of issues with this kind of approach. First, it feels somewhat ad hoc to add terms like this; the choice of which terms to add is arbitrary and, although it’s easy for simple examples, knowing which order of terms to choose for more complex relationships is down to the expertise and imagination of the modeller. Second, the polynomials are defined over the whole range of the covariate; one can’t (easily) restrict their effect to a subset of the range, so if the line we need to fit looks like pieces of different functions, a model built from simple polynomials won’t work. Third, even if we want the extra terms to affect the whole range of the covariate, these (simple) polynomials are not terribly efficient at doing so — they lack the mathematical property of orthogonality and tend to “overlap” in their effects, so each additional term doesn’t add much new5. Fourth, there is no control over how additional polynomials affect the model fit, so the model will tend to overfit, making it less general and useful. With these points in mind, wouldn’t it be nice to add wiggles to our models through a framework rather than via ad hoc additions?

What if we could introduce flexible wiggles into our models, built from simple, (mostly) locally-acting functions that add together to create complicated effects? Enter those s terms we saw above, which we’ll refer to here as smooths or (more correctly) splines. Splines are based on a simple idea: you can build complicated functions from lots of simple functions, and we can apply this to making wiggles in our models. What is s(x) made of then?
$$ s(x) = \sum_{k=1}^K \beta_k b_k(x), $$
where we estimate K parameters, {βk;k=1,…,K}, which multiply the basis functions bk of the covariate x. We’ll ignore the exact form of the basis functions for now, but bear in mind that there are many options when it comes to which basis to use, and they can be tailored to the specific modelling task. For now let’s also simplify and only think about the case where s is a function of one covariate, but later we’ll see more complex formulations for s. Figure XXXX shows some basis functions (dashed) and how they can add (when scaled by appropriate βks) to create another function (solid line).

Figure 5: Eight basis functions (grey, dashed lines) used to build the smooth depicted by the solid line.
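If you want to see basis functions like those in Figure XXXX for yourself, mgcv’s smoothCon() function builds a smooth and returns (among other things) the matrix of basis function evaluations. A minimal sketch, using a cubic regression spline basis of size 8 (the basis used for the figure may differ):

# evaluate 8 basis functions over a grid of x2 values
x_grid <- data.frame(x2=seq(0, 1, length.out=200))
sm <- smoothCon(s(x2, bs="cr", k=8), data=x_grid)[[1]]
# each column of sm$X is one basis function; a smooth is a weighted sum of them
matplot(x_grid$x2, sm$X, type="l", lty=2, xlab="x2", ylab="basis functions")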

When thinking about linear models, we generally talk about the design matrix for our model, and call it X. Each row of X corresponds to a sample in our data (in our DSM case, a segment) and each column relates to a covariate we will multiply by the coefficients we’ll estimate when fitting the model (β). When we add polynomials or splines to the model, we just augment the design matrix with extra columns. Each column is the evaluation of the appropriate basis function (or polynomial).

If we have a model with a smooth of x, a linear x term and an intercept (β0 + βxx + s(x) mathematically, ~x+s(x) in R), we have the following design matrix (for n samples):
$$ \mathbf{X} = \begin{pmatrix} 1 & x_1 & b_1(x_1) & b_2(x_1) & \ldots & b_K(x_1)\\ 1 & x_2 & b_1(x_2) & b_2(x_2) & \ldots & b_K(x_2)\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_n & b_1(x_n) & b_2(x_n) & \ldots & b_K(x_n) \end{pmatrix}. $$
So we can write the linear predictor of our model as η = Xβ (if we put all our βs together in a vector β). Note that Xβ is on the link function scale; we need to apply the inverse link to get to the scale of the observations (usually we use log links with the above distributions).
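We can pull this design matrix out of a fitted gam: predict() with type="lpmatrix" returns the matrix which, multiplied by the coefficient vector, gives the linear predictor. A quick sketch using the model b from the start of the chapter (b has an identity link, so the inverse link step does nothing here, but the same code works for log-link models):

# the design matrix evaluated at the observed covariate values
X <- predict(b, type="lpmatrix")
dim(X)  # one row per observation, one column per coefficient
# linear predictor (link scale), then inverse link to the response scale
eta <- X %*% coef(b)
mu <- b$family$linkinv(eta)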

How do we estimate smooths?

Now we know we can build complicated functions from lots of simple ones, how can we make the result fit well? The danger with something that can fit a complex function is that it may well always do so, which we don’t want. Generally, in fitting a statistical model, we want to maximise a likelihood (or something like a likelihood, at least6) and find a set of parameters such that our model is a good approximation of the process that generated our data. Consider the fits shown in Figure XXXX: data are generated from the blue line with some (normal) errors. Looking at the left plot (and ignoring the others for now), we see that our model (the black line), in an effort to get as close to the data as possible, has attempted to join the dots. Although an admirable try, this is not the behaviour we seek — we would like our models to tell us something general about the data, not precisely reproduce what we saw (we can look at the raw data for that!). Getting very close to the data is especially problematic when we know there is some randomness in the response (which we believe there always is). Using likelihood maximisation alone to guide us would lead to this kind of model, which would not be helpful.

Figure 6: Smoothers fitted to data (dots) generated from the blue line (plus some noise). Left plot shows what happens when the GAM has its smoothing parameter set to zero and the model has no penalty and can try to “join the dots”. The middle plot shows when the penalty dominates the fit and a linear model fit is produced. The right plot shows the correct smoothing parameter has been selected and a parsimonious fit has been found.

We’d prefer a model like the one on the right of Figure XXXX, but how can we achieve this? We need to penalize our fit to stop our model from being too wiggly. But what should we use for this penalty? Well, the answer comes from thinking about wiggles. We can stop our model from overfitting by suppressing the number of wiggles it can have. Effectively we are saying to the model: “get as close to the data as possible, but only be this wiggly”. Before we start thinking about how wiggly “this” should be, we first need to think about how we measure wigglyness.

Mathematically, we can think of wigglyness in terms of the derivatives of a function, as they measure change in the function. In one dimension, the second derivative of the function tells us about the wiggles (if we think of wiggles as “changes in direction of the function”). Figure XXXX shows a function and its first and second derivatives. Since our function is continuous, we can find its first and second derivatives at any point. Knowing the second derivative at a given point is useful, but we really need a single number that tells us how wiggly the whole function is. That’s where calculus comes in useful again: we can integrate the square of the second derivative to get a single number. Large values of this integral correspond to very wiggly functions; small values correspond to very flat functions (which we might also call “very smooth” functions). Mathematically we can write such a penalty as7
$$ \int_A \left(\frac{\partial^2 s(x)}{\partial x^2}\right)^2 \text{d}x, $$
where again s(x) is our smooth, from which we’ve calculated second derivatives. We square the second derivative and integrate over the domain A (which depends on the exact formulation of s(x); it could be the range of the data or the whole real line). Now a useful property of our basis decomposition of the smooth comes into play. Since we can write s(x) as a sum and the βks don’t involve x at all, we can rewrite our penalty as8


$$ \boldsymbol{\beta}^\text{T} \mathbf{S} \boldsymbol{\beta}, $$
where the ijth element of the penalty matrix, S, is:
$$ S_{ij} = \int_A \frac{\partial^2 b_i(x)}{\partial x^2} \frac{\partial^2 b_j(x)}{\partial x^2} \text{d}x. $$
This quadratic form for the penalty turns out to be very useful. The integrals and the model parameters are separable, so we only need to compute S once, at the start of model fitting; once it is computed, calculating the penalty for different parameter values requires only matrix multiplication.
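To see this in action, we can pull the penalty matrix out of the model b fitted at the start of the chapter and compute the quadratic form directly (a sketch; the first coefficient is the unpenalised intercept, so we drop it):

# the penalty matrix for s(x2) and the corresponding coefficients
S <- b$smooth[[1]]$S[[1]]
beta_s <- coef(b)[-1]   # drop the intercept; the remaining coefficients belong to s(x2)
# the wigglyness penalty, a single number
t(beta_s) %*% S %*% beta_s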

Animation showing a function together with its first and second derivatives (one frame of this animation will appear in the paper copy of the book).


So far we have a measure of how close the model is to the data (something like a likelihood) and the penalty (a function of the model parameters that measures the wigglyness of the model). But how do we control how much influence the penalty has once we add those things together? We can add another parameter to the model (one per smooth term) and estimate it while we’re estimating the βks. This smoothing parameter (often denoted λ) dictates the influence of the penalty. Setting the smoothing parameter to zero means the penalty has no influence and the model is free to overfit (back to the left panel in Figure XXXX). If the smoothing parameter is set to be very large, the penalty will have a very big effect and we’ll remove all the wiggles from the model. Depending on the basis we use, the resulting model might be a constant line or a line with a slope (or something else); this is because what is left are the parts of s(x) that aren’t affected by the penalty: the parts without second derivatives (middle plot of Figure XXXX).

In order to fit these models we don’t set the smoothing parameter(s), we estimate them. Luckily there is a fairly large literature on which methods work well (both in practice and theoretically). Generally, REstricted Maximum Likelihood (REML) gives good results in the kind of models we’ll fit here. We’ll cover REML in more detail in later chapters.
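To see the effect of the smoothing parameter on the simple model from the start of the chapter, we can either fix it by hand via the sp argument or let REML estimate it; a sketch reproducing the behaviour in Figure XXXX:

# sp=0: no penalty, free to overfit; large sp: wiggles suppressed
b_wiggly <- gam(y ~ s(x2), data=dat, sp=0)
b_flat   <- gam(y ~ s(x2), data=dat, sp=1e6)
# let REML estimate the smoothing parameter for us
b_reml <- gam(y ~ s(x2), data=dat, method="REML")
b_reml$sp  # the estimated smoothing parameter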

Let’s try that out

Before we go further, let’s try out fitting a very simple model using the dsm package and see how the concepts we’ve looked at above translate into code. First let’s write down a simple model:
$$ \mathbb{E}(n_j) = a_j\hat{p}_j \exp\left[ \beta_0 + s(\texttt{y}_j) \right], $$
where nj ∼ Tweedie.

This model specifies that the expected count in segment j is a function of y (Northing) and an intercept (β0). After applying the (inverse) link function9 (exp), we scale the model by the effective area $a_j\hat{p}_j$, where aj is the segment area and $\hat{p}_j$ is estimated from the detection function. We can fit this model in R using the dsm() function in the dsm package:

# load packages and the dolphin data
library(dsm)
library(Distance)
data(mexdolphins)
# fit a detection function
dolphin_df <- ds(distdata, truncation=6000)
# fit a dsm
simple_dsm <- dsm(formula=count~s(y), family=tw(), ddf.obj=dolphin_df,
                  segment.data=segdata, observation.data=obsdata)

This code specifies our mathematical model: the formula (formula=count~s(y)) gives the terms inside the link function, family=tw() gives the response distribution, ddf.obj=dolphin_df supplies the detectability information in the form of the fitted detection function from Distance, segment.data=segdata supplies the offset and segment-level data, and finally observation.data=obsdata is the table linking observations to the segments. The last of these is not, strictly speaking, in the formula above, but is implied.

The dsm package is based on the popular mgcv package by Simon Wood. The main thing that the dsm() function does is take the data, formatted as detection function, segments and observations, and aggregate it appropriately. The rest of the package provides the utility functions we’ll use later on. Most of the output from dsm is actually generated by mgcv in the background at some point, and most of the syntax we’ll learn for dsm() can be applied to gam() models too.

Having fitted this model, we can inspect the returned object and find the mathematical objects we saw above. First, we can look at the model graphically, by calling plot as usual:

plot(simple_dsm, main="Smooth of y, k default")
Figure 7: Plot of the smooth of Northing on the Gulf of Mexico dolphins for a model with the default basis size (10, left) and a basis size of 20 (right).

In Figure XXXX, the dashed lines indicate +/- 2 standard errors around the fitted smooth (solid line), and the small lines along the horizontal axis are a rug plot indicating the locations of the observed data. The label on the vertical axis gives the model term’s name and an indication of how complex the fitted function is (more on this in a moment). Importantly, the plot is on the link scale; we could exponentiate it to get a plot on the response scale, but generally we look at these plots on the link scale10.

The plot shows that there is a range of values of Northing that seems to be appealing to the dolphins, while the extreme north and south appear to be less appealing. Going back to the plot of the data in the “Why do spatial modelling?” chapter, this seems believable.
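As an aside, if we do want a response-scale version of this plot, plot() for these models has trans and shift arguments that apply a transformation before plotting. A sketch (adding the intercept via shift before exponentiating, so the values are interpretable):

# a sketch: view the smooth on the response scale
plot(simple_dsm, trans=exp, shift=coef(simple_dsm)[1])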

Now let’s look at extracting the mathematical objects we talked about above. First we can look at the coefficients we estimated (the βks):

coef(simple_dsm)
##  (Intercept)       s(y).1       s(y).2       s(y).3       s(y).4 
## -16.86252328  -0.55370495  -0.28110898   0.08177955  -0.22026833 
##       s(y).5       s(y).6       s(y).7       s(y).8       s(y).9 
##   0.08781170  -0.57486697   0.24403478   2.79348437  -1.07253867

and also we can see what the smoothing parameter was estimated to be:

simple_dsm$sp
##    s(y) 
## 2.50573

We can also look at the penalty matrix:

simple_dsm$smooth[[1]]$S[[1]]
##                [,1]          [,2]          [,3]          [,4]
##  [1,]  1.329512e+01 -2.117904e+00 -7.568354e+00  1.213621e+00
##  [2,] -2.117904e+00  2.963613e+01  1.148526e+00 -1.304797e+01
##  [3,] -7.568354e+00  1.148526e+00  1.265049e+02  4.593004e+00
##  [4,]  1.213621e+00 -1.304797e+01  4.593004e+00  1.163063e+02
##  [5,] -8.523928e+00  3.164233e+00 -9.368721e+00  6.464227e+00
##  [6,]  6.000226e+00 -2.458128e+01  9.280728e+00 -3.554152e+01
##  [7,] -1.099882e+01  1.628152e+01 -1.270771e+01  2.293885e+01
##  [8,]  2.981452e+00  3.508034e-01 -1.602983e+00  1.384243e+01
##  [9,]  1.752331e-15  2.061826e-16 -9.421434e-16  8.135808e-15
##                [,5]          [,6]          [,7]          [,8]
##  [1,] -8.523928e+00  6.000226e+00 -1.099882e+01  2.981452e+00
##  [2,]  3.164233e+00 -2.458128e+01  1.628152e+01  3.508034e-01
##  [3,] -9.368721e+00  9.280728e+00 -1.270771e+01 -1.602983e+00
##  [4,]  6.464227e+00 -3.554152e+01  2.293885e+01  1.384243e+01
##  [5,]  2.319356e+02  1.080374e+01 -1.358976e+01 -4.887908e+00
##  [6,]  1.080374e+01  2.282082e+02  3.066465e+01  3.166317e+01
##  [7,] -1.358976e+01  3.066465e+01  3.600327e+02 -2.303989e+01
##  [8,] -4.887908e+00  3.166317e+01 -2.303989e+01  1.637532e+01
##  [9,] -2.872839e-15  1.860984e-14 -1.354156e-14  9.624496e-15
##                [,9]
##  [1,]  1.752331e-15
##  [2,]  2.061826e-16
##  [3,] -9.421434e-16
##  [4,]  8.135808e-15
##  [5,] -2.872839e-15
##  [6,]  1.860984e-14
##  [7,] -1.354156e-14
##  [8,]  9.624496e-15
##  [9,]  5.656740e-30

Note that the penalty matrix is 9 by 9, but there are 10 coefficients. This is because the constant function is removed from the smooth’s basis during fitting (an identifiability constraint, since it would be confounded with the intercept, handily labelled as such above); being constant, the intercept has no second derivatives, so it has no associated entry in the penalty matrix.

Unlike linear regression, these numbers are not really directly interpretable, but it’s useful to know that they are there and we will use them later on.

Now, we’ve said there are 10 coefficients in the model and this, of course, relates to the basis size K that we defined above. We can increase K per term in the model, giving it more potential complexity (we often refer to K as the basis complexity or basis size), and we can control this in R using the k argument to s(). So, we can double the basis size:

double_k_dsm <- dsm(formula=count~s(y, k=20),
                    family=tw(), ddf.obj=dolphin_df,
                    segment.data=segdata, observation.data=obsdata)

We can see that this changed the number of coefficients and the size of the penalty matrix (let’s not print out all those numbers again):

length(coef(double_k_dsm))
## [1] 20
dim(double_k_dsm$smooth[[1]]$S[[1]])
## [1] 19 19

Looking at Figure XXXX, we don’t see a difference between the plots. So, what’s going on here? We increased the number of basis functions we use to represent the smooth, but the smooth didn’t change, meaning that we already had a good idea of what the function should look like when we used just 10 basis functions (we’ll use this trick extensively later to see how complex we should let things be). Looking back at the coefficients, we see that some of them are quite small and some are negative; can we measure how many of the basis functions are really being used? Yes: the number reported in the “s()” label in the plots gives us the estimated degrees of freedom (EDF) – the amount of the basis complexity that is used by the smooth. Handily, we can compute this:

sum(influence(simple_dsm))
## [1] 4.047065
sum(influence(double_k_dsm))
## [1] 4.094203

where influence() extracts the diagonal elements of the influence or hat matrix of the model (CITE XXXX), that is, the matrix $\hat{\mathbf{A}}$ such that $\hat{\boldsymbol{\mu}} = \hat{\mathbf{A}}\mathbf{y}$ (it “puts the hats on the ys”, turning the observed responses into fitted values). Note these numbers are one bigger than the number in the axis labels of the plots in Figure XXXX, as the numbers above are for the whole model, including the intercept (which adds 1 whole parameter to the total). It might seem weird at first that the EDF is not an integer, but remember that the penalty doesn’t suppress whole basis functions, just their wiggliness.
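The per-coefficient EDFs are also stored in the fitted model object, which gives another route to the same numbers (a sketch):

# total EDF, as above
sum(simple_dsm$edf)
# EDF for the smooth alone (dropping the intercept), matching the plot label
sum(simple_dsm$edf[-1])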

We can see what the estimated Tweedie power parameter is:

simple_dsm$family$getTheta(trans=TRUE)
## [1] 1.401401

(note that we need to specify trans=TRUE to report the power parameter on the scale we have been using above).

Now we’ve looked at our model the hard way, we can also look at a summary:

summary(simple_dsm)
## 
## Family: Tweedie(p=1.401) 
## Link function: log 
## 
## Formula:
## count ~ s(y) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.8625     0.2552  -66.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value  
## s(y) 3.047   3.83 2.835  0.0268 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.013   Deviance explained = 13.4%
## -REML = 305.18  Scale est. = 66.702    n = 387

We have the information we just extracted plus more, including the Tweedie power parameter, the EDF of each term in the model, as well as some information on significance, deviance explained and adjusted R211. Note that the model formula includes a term offset(off.set); this is the $a_j\hat{p}_j$ in the formula (R needs to know it’s an offset, and therefore has no coefficient to estimate, so it gets wrapped in offset()). We’ll explore more of the summary() output as we continue looking at GAMs (though the output is very similar to the summary() output for an lm or glm object). summary() is usually the first step in investigating the model results.
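The pieces reported by summary() can also be extracted programmatically from the object it returns, which is handy when comparing models later on; a sketch:

summ <- summary(simple_dsm)
summ$dev.expl  # proportion of deviance explained
summ$s.table   # the smooth term table (EDF, reference df, F, p-value)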

Adding a term

To add another term into our model, we just use + to include it:

xy_dsm <- dsm(formula=count~s(y) + s(x),
              family=tw(), ddf.obj=dolphin_df,
              segment.data=segdata, observation.data=obsdata)

Looking at the corresponding summary() output, we see an additional line in the smooth terms table:

summary(xy_dsm)
## 
## Family: Tweedie(p=1.377) 
## Link function: log 
## 
## Formula:
## count ~ s(y) + s(x) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.0909     0.2714  -62.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##        edf Ref.df     F p-value  
## s(y) 3.243  4.072 2.880  0.0228 *
## s(x) 3.011  3.809 2.252  0.0642 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0388   Deviance explained = 22.2%
## -REML = 302.67  Scale est. = 64.054    n = 387

In the next chapter on practical DSMs, we’ll look at whether that was a good idea (in the sense of model selection), but for now, let’s just experiment with the model structure. We can plot that model as before, and see the two plots for the two terms in the model.

par(mfrow=c(1, 2))
plot(xy_dsm, shade=TRUE)
Figure 8: Plot of the smooths of Northing and Easting on the Gulf of Mexico dolphins. Note that setting shade=TRUE means our +/- 2 standard error band is shaded gray rather than delimited with dotted lines; this can be easier to see on plots with many elements.

The model xy_dsm treats the effects of Northing and Easting as additive, but this ignores their joint effect (the model only captures their marginal effects). We can include bivariate terms in our model instead. How we do this depends on the exact form of the basis functions we use, but for the default basis in mgcv (and hence dsm), thin plate regression splines, we can simply add another variable to our s() term and again look at the summary:

xy_bi_dsm <- dsm(formula=count~s(x, y), family=tw(), ddf.obj=dolphin_df,
                    segment.data=segdata, observation.data=obsdata)
summary(xy_bi_dsm)
## 
## Family: Tweedie(p=1.347) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y) + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6386     0.3095  -56.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value   
## s(x,y) 16.19  20.78 2.184 0.00214 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.107   Deviance explained = 41.6%
## -REML = 301.05  Scale est. = 56.808    n = 387

We now see that there is only one smooth term, s(x,y). We can plot this effect in two dimensions using a heatmap (we use scheme=1 to produce the heatmap rather than the other, harder-to-read options; asp=1 sets the x-y aspect ratio to one). We can also overplot the locations of the non-zero counts as circles.

plot(xy_bi_dsm, scheme=1, asp=1)
points(xy_bi_dsm$data[, c("x","y")][xy_bi_dsm$data$count>0, ])
Figure 9: Plot of the bivariate smooth of Northing and Easting on the Gulf of Mexico dolphins. In the heatmap, red indicates lower values and white higher values. Dots indicate observation locations (segments) and circles the locations of non-zero counts. Again, this plot is on the linear predictor (log) scale.

This model doesn’t seem too satisfying, but we’ll look into that more in the next chapter. For now let us be content with the increase in deviance explained and R2.
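As a quick, informal comparison (model selection proper comes in the next chapter), we can line up the deviance explained by the three models we have fitted so far; a sketch:

# deviance explained for each of the three models fitted above
unlist(lapply(list(simple=simple_dsm, xy=xy_dsm, xy_bi=xy_bi_dsm),
              function(m) summary(m)$dev.expl))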

Recap

In this chapter you’ve had a whirlwind tour of generalized additive models from the perspective of fitting spatial models to distance sampling data. We looked at what a generalized additive model does (fits a model to the mean of a distribution) and how it does it (by adding wiggly functions of the covariates). We also saw how we can measure wigglyness (by integrating squared second derivatives) and got a rough idea of how to trade off flexibility against overfitting. Finally, we tried this out in practice using the dsm package, which behaves very similarly to mgcv.

Next chapter we’ll dive into this further and start building and selecting between more complex models.

Further reading

References

Candy, S.G. (2004) Modelling catch and effort data using generalised linear models, the Tweedie distribution, random vessel effects and random stratum-by-year effects. CCAMLR Science.

Ruppert, D., Wand, M.P. and Carroll, R.J. (2003) Semiparametric Regression. Cambridge University Press.

Shono, H. (2008) Application of the Tweedie distribution to zero-catch data in CPUE analysis. Fisheries Research, 93, 154–162.

Tweedie, M.C.K. (1984) An index which distinguishes between some important exponential families. pp. 579–604.

Ver Hoef, J.M. and Boveng, P.L. (2007) Quasi-Poisson vs. negative binomial regression: how should we model overdispersed count data? Ecology, 88, 2766–2772.


  1. Natalie Kelly (Australian Antarctic Division) pointed me to another definition of “gam”, as seen in Moby Dick. 1. Collective noun used to refer to a group of whales, or rarely also of porpoises; a pod or, 2. (by extension) A social gathering of whalers.

  2. There are other options for “overdispersed” data, including so-called “zero-inflated” models. We don’t discuss them here as they don’t currently fit into the GAM framework very easily. It’s also worth noting that “zero-inflation” is a complex concept and really refers to a rather specific situation (excess zeros conditional on the model) rather than just “a lot of zeros in the data”.

  3. Strictly speaking, by “quasi-Poisson” we mean “quasi-likelihood methods with Poisson assumptions” (Ver Hoef and Boveng, 2007).

  4. Historically, this has not been the case and various strategies have been suggested to select q as Tweedie is only an exponential family distribution for fixed values of q. One popular strategy was to refit the model for q = 1.2, 1.3, …, 1.9 and select the best model by AIC. As we’ll see below, this is no longer necessary.

  5. To get around this we could use a different set of polynomials, like the Hermite polynomials we saw as an adjustment term for detection functions, but they will still act globally over the whole covariate range.

  6. For example, for a linear regression we might actually use something like ordinary least squares (OLS). Though thanks to the Gauss-Markov theorem, we know that estimates from likelihood and OLS are the same for linear models (if some reasonable conditions apply).

  7. Note that the penalty-basis function story is more complicated than this and generally speaking, we start with a penalty and the penalty implies a set of basis functions. We’ll get into this in more detail later, for now bear in mind that this is not a general statement of the penalty.

  8. Try this out for yourself! Start by expanding out the square and simplifying. If you get stuck, try from the other direction and multiply out the matrices.

  9. Note that log is the link function, so exp is referred to as the inverse link.

  10. We can exponentiate this plot to get things on the response scale, but this gets more tricky with more complicated models with multiple terms, so it’s generally recommended that one sticks to looking at plots of terms on the link scale.

  11. Cosma Shalizi has some useful thoughts on the topic of R2 at https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf. One of the most useful to us here is that R2 can be arbitrarily high when the model is completely wrong and arbitrarily low when the model is correct.