In this first chapter I want to give a brief refresher on distance sampling, building up from simpler sampling techniques to show how they relate to distance sampling. I will show this via some simulations in R. (Understanding the code is not as important as understanding the concepts, but it will help; the source for this document can be found here, showing how the plots were made).

Before diving into the sampling we should first think a little about why we’re doing distance sampling. Really what we are interested in is estimating the abundance of some biological population. Knowing how many animals (or plants) there are in a given area can tell us important management information as well as help to answer ecological questions.

Since we are interested in abundance, it is important to ensure that the estimates of abundance are as accurate as possible. Our estimates of abundance will always be based on the observed counts – how many animals we saw. We may also want to think about what other factors affected that count, and model the *observation process* (contrast this to the paragraph about, where we were thinking about modelling the abundance as a function of environmental factors). There are many factors that can affect whether we see an animal, for example: how far away animals are, how brightly coloured they are, whether the observers are fresh grad students or older, seasoned field biologists. We might also like to think about whether animals are there to be seen at all – have cetaceans or seabirds dived and are now too deep to be detected?

The rest of this chapter will show how common survey techniques (quadrat and strip sampling) generalise into distance sampling methodology, by accounting for imperfect detectability, using a simple simulation in R.

Let’s first consider a simple example of a simulated population. Each point in the plot below represents an individual in our population – for simplicity’s sake let’s call them animals. One simple option to estimate how many animals there are in the study area (in our case the square from 0 to 1 in *x* and *y* directions) would be to survey quadrats (boxes) that were randomly placed within the region, count all the animals in each box.

By dividing the number of animals we saw by the total area surveys (sum of the areas of all the quadrats), we can obtain an estimate of the abundance of the animals in the study area by multiplying this estimated density by the area of the study region.

```
# covered area (10 quadrats, each a 0.1x0.1 square)
covered <- 10*0.1^2
# estimate density
D <- n/covered
# area of the survey region
A <- 1
# estimate abundance
Nhat <- D*A
```

This estimator gives us an estimated abundance (*N̂*) of 520, relatively close to the true value of 500 (`N`

in the above code).

To simplify data collection from a logistical point of view, it seems simpler to have observers traverse lines and observe strips rather than have them walk between randomly placed squares. Using a regular grid of strips (offset randomly) allows observers to move from one strip to the next most efficiently.

Again, it’s straightforward to estimate abundance, via first finding the density:

```
# covered area -- same area as for the quadrats
covered <- 5*1*0.025
# estimate density
D <- n/covered
# area of the survey region
A <- 1
# estimate abundance
Nhat <- D*A
```

So the estimate of abundance, *N̂*= 480, again quite close to the true population size of 500.

Both methodologies assume that all animals within the *covered area* (the area inside the blue dashed lines in the figures above) are detected. Certain detection is possible, for example if video recording/digital photography are used (Buckland et al., 2012), but not likely when human observers conduct surveying. Many factors will affect whether or not an animal is detected by an observer. For example: time of day, weather, the observer’s previous experience, environmental conditions and, of course, the distance to the animal from the observer. These factors and (importantly) combinations of these factors surely influence the number of animals seen and hence the final abundance estimate of the species of interest.

As is usually the case in statistics, we choose to model this problem. The price we must pay to model detectability is that we must collect more data. In the simplest distance sampling models we require not only the number of animals we saw, but also the distance from the observations to the sampler. We will go on to supplement this with data on other factors which affect detectability.

Returning to our example population, rather than think of boxes where we see everything in a given area, instead consider walking down a line (*transect*) and recording distances to the animals that we see. Let’s assume that detectability changes as a function of distance from the transect line^{1} and that relationship is governed by a *half-normal* relationship, that is that the probability of detecting an animal, if it’s at a distance *x* from the transect line is:

$$
\exp\left(-\frac{x^2}{2\sigma^2}\right),
$$

where *x* is the *perpendicular distance* from the transect line to the animal and *σ* controls how likely we are to see an animal at a given distance. Graphically the detection function used in the example below looks like this:

The larger the *σ* the more likely we are to see animals at large distances. We’ll look at this relationship in more detail in the next chapter, Models for detectability.

Note that we miss some closer animals (some grey dots very near the transect lines) but also detect some animals that are further away (red dots much further away than they were in the strip transect case). In an actual survey this would translate itself into acknowledging the fact that we miss some animals due to cover, or weather conditions etc. However, we are not ignoring animals that we can plainly see but are outside of the current strip or quadrat we are surveying. We wish to record every animal we saw and include this data in our model without biasing our results.

What is also interesting is the pattern we see in the histogram of the observed perpendicular distances: large numbers of animals are seen at small distances, with numbers detected decreasing with increasing distance from the line. Again, this makes physical sense – we should see those individuals that are near us more easily. Although this data is simulated and the pattern is a feature of the simulation setup, we will see throughout the book that this pattern is replicated in survey data.

`hist(detected_distances,xlab="Distance",main="")`

In the previous examples, we were able to estimate the abundance in the study area by calculating density in the covered area and then multiplying density by the size study area. Mathematically: *D̂* = *n*/*a*,

where *D* is density (the hat indicates it’s an estimate of density *TAM this is an estimator, not an estimate, of density*), *n* is the number of animals we saw and *a* is the area we covered during the survey. We can think of *a* as the sum of the areas of the strips/quadrats in the above examples, and that each strip or quadrat has area equal to its width multiplied by its height, i.e. *a* = 2*w**K**l* where *K* is the number of transects, 2*w* is the width (why we say 2*w* will become apparent below) and *l* is the length. What is unclear in the *line transect* case is what *a* is – not only the quantity we should use, but also its meaning.

Looking at each component of *a*, the part that we are uncertain about is 2*w*, what should this width be? One could make the argument that the largest observed distance is a good candidate for the width of the transect. This can easily be too large if we observe a few outlying distances which are much much further than others (we will revisit this in a more rigorous way later). Since we may want to discard these outlying distances we refer to *w* as the *truncation* (or *truncation distance*). For now let us choose a truncation that include almost all of our observations, throwing out only a few outliers (say *w* = 0.02). We now have a value for the area covered, but it’s not quite right: the estimator we’ve constructed so far is the same as the quadrat/strip estimator, not taking into account the probabilistic nature of the detections we made.

If we knew the probability of detecting an animal we could divide the estimate of density by this, thus inflating the according to this probability of detection. We would then have the following estimator:

$$
\hat{D} = \frac{n}{2wKl\hat{p}},
$$

where *p̂* is the probability of detecting an animal at any distance from the transect. For simplicity’s sake let us say we have estimated that *p̂*= 0.5981, we can then use the above estimator to obtain *N̂*= 512 animals in the area. The rest of this chapter (and the distance sampling literature) focuses on models for *p̂*.

Two important pieces of information that must be collected have been mentioned so far: number of animals observed (counts) and distances to the animals. A third and equally important piece of information is the *effort* expended – how large an area was covered by the observers during the survey. It is essential that effort is collected and recorded accurately, otherwise abundance will be estimated incorrectly (as the denominator in our equations will not be correct). When thinking about line transects we use the term “effort” to refer to the total length of lines that we survey, as this is fixed once the survey has been carried out (*w* may be changed during the analysis).

So far we have only talked about line transects. Rather than walking along lines we might prefer to stand at a point for some period of time and count how many animals we see or hear in that period of time. Such counts are common for avian studies as it’s often easier to hear small birds than see them in dense forest and hearing can be degraded while observers are walking. It is also the case that multiple species are detected. Therefore the observer must be able to make a determination of species, measure the distance from the point to the bird, and record information about species and distance in a short amount of time. Consequently, relieving the observer of the process of navigation during all of this leads to the idea of the observer remaining roughly stationary at a point while collecting data.

Going back to our simulated example, we will pick out 6 points to sample. At each point we again assume that detectability decreases with distance.

Looking at the histogram of distances:

`hist(detected_distances,xlab="Distance",main="")`

As we saw with the line transects, the point transect distances see a similar drop-off in frequency with increasing distance. However, unlike the line transects we also see an increase as distance increases up to around 0.03. The shape of the histogram is down to two competing processes during sampling:

- as the radius increases, the area surveyed increases – so there are more animals available to be detected, this increases quadratically (since the area of the circle is
*π**r*^{2}). - At the same time, we are increasing the distance from the point to the animals in question, this is the same problem as described for line transects: as animals are further away, they are harder to detect.

So, up to a certain distance we see an increase in sightings, as we have an increasing numbers of animals that we can detect but after some distance the area becomes too big and the effect of detectability outweighs the effect of increasing area.

Now we can again calculate abundance, in a similar manner as above, but note that the covered area is *k**π**w*^{2} where there are *k* points, each with radius *w* (again, we truncate at a distance *w*). If we again say that we have estimated *p̂* (somehow), and that *p̂*= 0.25, we can multiply the survey area by the probability of detection and calculate the abundance in the study area to be 519.61.

In this chapter I showed how we can think about distance sampling line and point transect surveys as extensions of strip and quadrat surveys where we model the probability of detection as a function of distance. Both line and point transect examples gave estimated abundances that were close to the true abundance, without assuming that we saw all of the animals in the covered area. In the next chapter we’ll see how to formulate these models mathematically and fit them to data using R.

Buckland, S.T., Burt, M.L., Rexstad, E.A., Mellor, M., Williams, A.E. and Woodward, R. (2012) Aerial surveys of seabirds: the advent of digital methods. *Journal of Applied Ecology*, **49**, 960–967.

Note that the distance from the

*transect*is important here. The detection function models probability of detection (given observation) as a function of distance between the line or point transect centreline or point (respectively) to the animal. We may record the distance from the observer,*r*, and convert this to distance from the transect using trigonometry, of course. For that we also need to record the angle between the transect line and the observer-to-animal line,*θ*. Then,*x*=*r*sin*θ*.↩