Multilevel Regression for Sockeye Salmon Spawner Abundance Across Years and Rivers
Introduction
Let the data speak for itself... but all the data must speak. - John P. Cunningham (2012)
Estimates of the number of sockeye salmon spawners that occupy a river (or a similar geographic category of spawning) involve repeated surveys, typically from September to November. Techniques vary: counting spawners while walking beside a river, from a helicopter or small airplane, while snorkeling down the river, and by analyzing video from drones. There are many other methods to estimate abundance: mark-recapture, conductivity sensors, counting fences, and eDNA. Other salmon species may be counted in a survey for sockeye, and each salmon species has a diversity of enumeration methods.
The analysis of repeated surveys of sockeye spawners is conventionally an area under the curve (AUC) estimate, where the "curve" is sometimes just straight lines joining the observations. The data may be supplemented with prior knowledge or subjective judgements about first, peak, and last presence of spawners. One improvement, with liberal assumptions about salmon behaviour, fits a Gaussian curve to the observations.
The intention of this exercise is to show, using simulated data, that AUC estimates can be improved by considering multiple rivers and multiple years in a multilevel regression. The ideas are:
- there will be a year effect, common to a group of rivers, on the temporal pattern of abundance. Marine migration might be early or late; rainfall sufficient to raise water levels for salmon to enter the rivers may be early or late.
- a river can have a consistently different temporal pattern from other rivers. Spawners that home to that river can arrive earlier or later, and be more or less spread in time.
- a river can have consistently more or fewer salmon compared to other rivers. Factors such as the length of accessible river and the abundance of gravel suitable for spawning (for redds) affect salmon abundance; and
- the abundance of spawners across all rivers varies by year. Implicit is that these spawning groups in rivers (salmon tribes?) are related within a population or conservation unit (metapopulation).
These ideas can be assembled into an equation for the expected abundance on day $d$ in river $r$ in year $y$:

$$N_{d,r,y} = \mathrm{run}_y \cdot p_r \cdot \mathrm{Normal}\!\left(d \mid \mu + a_y + a_r,\; \sigma + b_y + b_r\right)$$

where $\mathrm{run}_y$ is the total abundance in year $y$, $p_r$ is the proportion of the run that enters river $r$, $\mu$ and $\sigma$ are the background timing and spread, $a_y$ and $a_r$ are year and river effects on timing, and $b_y$ and $b_r$ are year and river effects on spread.
A Gaussian might be a sufficient description for the temporal abundance of spawners despite being symmetrical and unbounded. Reality might be better described by a truncated and skewed Gaussian distribution, or by a distribution that is inherently bounded. The beta distribution, defined on a bounded interval and flexibly skewed, is one such candidate.
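As a quick illustration (a sketch with hypothetical bounds and shape parameters, not fitted to anything), a beta density rescaled from (0, 1) to a window of days is bounded and can be skewed, unlike the Gaussian:

# sketch: a scaled beta density as a bounded, skewable alternative to
# the Gaussian. first and last are hypothetical limits of spawner presence.
first <- 250; last <- 310
day <- first:last
u <- (day - first)/(last - first)                   # map days onto (0, 1)
rel <- dbeta(u, shape1=3, shape2=5)/(last - first)  # density on the day scale
plot(day, rel, type="l", xlab="Day of Year", ylab="Relative abundance")
lines(day, dnorm(day, 273, 10), lty="dotted")       # Gaussian for comparison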
# Local Functions

# add dimension names to an array. There has to be a better way.
ArrayDimNames <- function(a, names){
  # a is an array
  # names has length(dim(a)), one name for each dimension
  dims <- dim(a); ndims <- length(dims)
  if (ndims > 4) stop("ArrayDimNames: only defined for 3 or 4 dimensions")
  result <- list(rep(names[1], dims[1]), rep(names[2], dims[2]),
                 rep(names[3], dims[3]))
  if (ndims == 4){
    result <- list(rep(names[1], dims[1]), rep(names[2], dims[2]),
                   rep(names[3], dims[3]), rep(names[4], dims[4]))
  }
  return(result)
}
# example:
# ArrayDimNames(a=array(dim=c(3,3,3)), names=c("dog","fox","bat"))
# [[1]] [1] "dog" "dog" "dog"
# [[2]] [1] "fox" "fox" "fox"
# [[3]] [1] "bat" "bat" "bat"

# day of the year at the end of each month, if not a leap year
doy <- cumsum(c(0,31,28,31,30,31,30,31,31,30,31,30,31))
# 0 31 59 90 120 151 181 212 243 273 304 334 364
# used as doy[month] where month is 1:12.

# truncated Normal distribution (tails trimmed). This begs to be recursive.
RtruncNorm <- function(n=1, m=0, s=1, t=2){
  # t is truncation: how many sd tolerable? default is 2
  a <- rnorm(n)
  r <- max(abs(a))
  while (r > t){
    j <- abs(a) > t        # where either tail exceeds truncation
    a[j] <- rnorm(sum(j))  # try again where exceeded
    r <- max(abs(a))       # OK yet?
  }
  result <- a*s + m
  return(result)
}
Examples of Samples
Temporal patterns from year and river effects.
The simulated data applies the normal distribution for the temporal abundance of spawners, with timing (mean, $\mu$) and spread (standard deviation, $\sigma$) shifted by year and river effects.
The following code creates a three-dimensional array of nday observations in each of nriver rivers in each of nyear years.
# mean and SD of salmon abundance after removing year
# and river effects. Day 273 is September 30 (not leap year)
background <- c(273, 7)
# storage for simulated data
nday <- nriver <- nyear <- 5
# remember: arrays are stored by first index
trueAbundance <- array(dim=c(nday, nriver, nyear))
dimnames(trueAbundance) <- list(day=1:nday, river=LETTERS[1:nriver],
                                year=1991:(1990+nyear))
# river size effect: proportion, across all years, of spawners in that river
river_size <- runif(nriver, 1, 10)        # random uniform
river_size <- river_size/sum(river_size)  # proportion
# temporal effects
yearSD  <- c(5, 1.5)  # effect on timing and spread
riverSD <- c(5, 1.5)
yearEffect  <- cbind(rnorm(nyear, 0, yearSD[1]),
                     rnorm(nyear, 0, yearSD[2]))   # 2 columns
riverEffect <- cbind(rnorm(nriver, 0, riverSD[1]),
                     rnorm(nriver, 0, riverSD[2]),
                     river_size)                   # 3 columns
dimnames(yearEffect)  <- list(year=1991:(1990+nyear),
                              effect=c("timing", "spread"))
dimnames(riverEffect) <- list(river=LETTERS[1:nriver],
                              effect=c("timing", "spread", "size"))
cat(' Year Effect \n');    print(yearEffect)
cat('\n River Effect \n'); print(riverEffect)
Examine total effect.
The random, additive effects of rivers (A, B, C, ...) and years (1991, 1992, ...) are calculated for timing and spread.
b1 <- matrix(nrow=nriver, ncol=nyear)
dimnames(b1) <- list(river=LETTERS[1:nriver], year=1991:(1990+nyear))
b4 <- b3 <- b2 <- b1
cat("riverEffect+yearEffect. Timing (mean) \n")
for (year in 1:nyear){
  for (river in 1:nriver){
    b1[river, year] <- riverEffect[river,1] + yearEffect[year,1]
  }
}
print(b1)
cat("\n riverEffect+yearEffect. Spread (SD) \n")
for (river in 1:nriver){
  for (year in 1:nyear){
    b2[river, year] <- riverEffect[river,2] + yearEffect[year,2]
  }
}
print(b2)
# earliest and latest abundance across all rivers and years,
# needed to determine sampling days
b3 <- b1 - 1.5*(b2 + background[2])  # earliest case revealed
cat("\n Early due to river and year effects \n"); print(b3)
b4 <- b1 + 1.5*(b2 + background[2])  # latest case revealed
cat("\n Late due to river and year effects \n"); print(b4)
Generate sampling days
As a starting point, sampling days are the same for every river in every year. The range of sampling days has to cover the range of days when salmon are abundant in any river. That is determined from -1.5 sd of the abundance distribution for the earliest combination of river and year effects, and +1.5 sd for the most delayed combination.
# sampling days. Cover expected range across all years and rivers.
j  <- which.min(b3)          # earliest case
a1 <- b3[j] + background[1]  # value for this case
cat('earliest case', a1, 'timing effect', b1[j], 'spread effect', b2[j], '\n')
j  <- which.max(b4)          # most delayed case
a2 <- b4[j] + background[1]
cat('last case', a2, 'timing effect', b1[j], 'spread effect', b2[j], '\n')
days <- seq(a1, a2, length.out=nday)
days <- floor(days)          # integer days
cat("\n Days of the year for observations:", days, "\n")
Abundance by day, one river, many years
The relative abundance of salmon in one river is taken as fixed and independent of years. This is true only when considering the potential abundance of spawners ("capacity" refers to some measure of the output of salmon, not the input of spawners). The spawning populations in a river may thrive or dwindle for many reasons: fishing, predators, disease, and so on. These random effects within rivers across years will fall into the unexplained variance.
# total abundance by year
totalYearly <- rnorm(nyear, 1e5, 0.25e5)  # abundance by year
# abundance by day by river by year
for (river in 1:nriver){
  for (year in 1:nyear){
    trueAbundance[, river, year] <- totalYearly[year] * riverEffect[river,3] *
      dnorm(days,
            background[1] + riverEffect[river,1] + yearEffect[year,1],  # mean
            background[2] + riverEffect[river,2] + yearEffect[year,2])  # SD
  }
}
j <- is.na(trueAbundance) | trueAbundance < 0.1
trueAbundance[j] <- 0
par(tck=0.02, las=1, yaxs="i")
color <- c("black","blue","green","orange","red")
plot(1, 1, type="n", xlim=c(-5,5)+range(days),
     ylim=c(0.9,1.1)*range(trueAbundance),
     xlab="Day of Year", ylab="Abundance")
axis(3, labels=F); axis(4, labels=F)
for (river in 1:nriver){
  for (year in 1:nyear){
    lines(days, trueAbundance[, river, year], col=color[river])
  }
}
abline(v=days, lty="dotted", col="grey50")
Area under the curve: conventional
To establish a contrast to AUC by river by year, we calculate the integral of total salmon abundance by the linear interpolation method and by fitting a normal distribution. Adjusting the linear interpolation with considerations such as a date for unobserved peak spawning is not included (yet).
A spawning salmon may be observed in more than one survey, so a key parameter for translating AUC into salmon abundance is the duration of a spawner in a river. While that duration may be a haphazard estimate, it directly affects abundance estimates and needs to be included in these models. Potentially all of the spawners observed after peak abundance were also observed at peak abundance; this has led to simplifying survey analyses to the maximum number of spawners observed. That value is included in the results because a relative abundance of spawners from some uncalibrated index, such as maximum spawners observed across multiple surveys in one river and one year, may be sufficient for fisheries management.
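For instance (hypothetical numbers, assuming a single known residence time), dividing the AUC in fish-days by the residence time of a spawner gives the number of spawners:

# sketch: translate AUC (fish-days) into spawners via residence time
auc_fish_days  <- 120000  # hypothetical AUC: sum of daily abundances
residence_days <- 12      # assumed duration of a spawner in the river
spawners <- auc_fish_days/residence_days
spawners                  # 10000 spawners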
Calculations without observation error
Knowing how the precision of estimates of the total changes with increasing observation error seems valuable. The baseline is zero observation error. The following code provides functions for exploring the effects of observation error, which I have not yet done (perhaps Gaussian Process regression?).
function for linear AUC
# linear interpolation with irregular intervals.
# each observation applies to half the preceding interval and half the following.
AUC <- function(days, dat){  # x, y vectors
  n <- length(dat)
  if (length(days) != n) stop("AUC: lengths differ")
  interval <- days[-1] - days[-n]        # second minus first, etc. length n-1
  weight <- interval/2                   # n-1 of these
  weight <- c(0, weight) + c(weight, 0)  # for non-uniform intervals
  auc <- sum(weight * dat)
  # cat('weight', weight, '\n')
  return(auc)
}
test_dat <- c(1, 3, 5, 2, 1)   # not symmetrical
test_day <- c(2, 5, 8, 11, 14)
AUC(test_day, test_dat)        # 33
linear function to fit Gaussian distribution
The log of a Gaussian distribution scaled by area $A$ is a parabola:

$$\log\!\big(A\,\mathrm{Normal}(x \mid \mu, \sigma)\big) = a + b\,x + c\,x^2$$

where

$$c = -\frac{1}{2\sigma^2}, \qquad b = \frac{\mu}{\sigma^2}, \qquad a = \log A - \log(\sigma\sqrt{2\pi}) - \frac{\mu^2}{2\sigma^2}.$$

After linear regression of log abundance on day and day squared, the Gaussian parameters are recovered as $\mu = -b/(2c)$ and $\sigma = \sqrt{-1/(2c)}$, with $c < 0$ required.
The Gaussian integral (Laplace 1782, Gauss 1809) is the area under the curve of the function $e^{-x^2}$:

$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}.$$

There is an extension applicable to the parabola obtained by regression with the log of abundance data. The antilog of the preceding fitted parabola (parameters $a$, $b$, $c$ with $c < 0$) has the integral

$$\int_{-\infty}^{\infty} e^{\,a + bx + cx^2}\,dx = \sqrt{\frac{\pi}{-c}}\; \exp\!\left(\frac{b^2}{4(-c)} + a\right).$$

That integral is the total abundance, aka area under the curve, aka the Gaussian scaling factor.
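A quick numerical check of that closed form, with arbitrary coefficients (c0 < 0):

# verify the closed-form integral against numerical integration
a0 <- 1.0; b0 <- 0.5; c0 <- -0.1
closed <- sqrt(pi/(-c0)) * exp(b0^2/(4*(-c0)) + a0)
numerical <- integrate(function(x) exp(a0 + b0*x + c0*x^2), -Inf, Inf)$value
c(closed=closed, numerical=numerical)  # agree to numerical tolerance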
Fitting to log abundance emphasizes the contribution to the sum of squared residuals from small abundance estimates compared to large ones. Are small abundance estimates more or less precise than large ones? Large estimates are less precise when there are too many salmon to count individually. If the precision of each abundance estimate can be quantified, then a weighted regression is a trivial extension, as sketched below. This is deferred, because it is more likely that relative precision is known (a ranking, ordinal), which means fitting weights instead of knowing weights (Akenhead et al. 2016).
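A minimal sketch of that extension, with hypothetical weights (larger = more precise) applied to the toy data used below:

# sketch: weighted regression of the log-abundance parabola.
# w is hypothetical relative precision, not from any survey.
test_dat <- c(1, 3, 5, 2, 1)
test_day <- c(2, 5, 8, 11, 14)
w <- c(4, 2, 1, 2, 4)  # e.g. small counts more precise
a_w <- lm(log(test_dat) ~ test_day + I(test_day^2), weights=w)
coefficients(a_w)      # compare to the unweighted fit below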
# the log of a normal distribution is a parabola
AUCnormal <- function(dat, day){
  day2 <- day^2
  y <- log(dat)
  a <- lm(y ~ day + day2)
  return(a)
}
test_dat <- c(1, 3, 5, 2, 1)   # not symmetrical
test_day <- c(2, 5, 8, 11, 14)
a <- AUCnormal(test_dat, test_day)
p <- coefficients(a)
print(summary(a)); print(p)
y <- predict(a)  # at test_day by default, predicted by regression
par(tcl=0.2)
plot(log(test_dat) ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=range(y)*c(0.8, 1.2))
points(test_day, y)
x <- seq(min(test_day), max(test_day), length.out=20)
y <- p[1] + p[2]*x + p[3]*x^2
lines(x, y)
# extract Gaussian parameters from regression parameter estimates
mu <- -0.5*p[2]/p[3]
sigma <- sqrt(-0.5/p[3])
# be careful: the regression y = a + bx + cx^2 fits a negative c,
# so the formulas use -c. So,
auc <- sqrt(pi/(-p[3])) * exp((p[2]^2)/(4*(-p[3])) + p[1])
cat('\n mu =', mu, 'sigma =', sigma, 'auc =', auc, '\n')
abline(v=mu + sigma*c(-2,-1,0,1,2), lty='dotted', col='blue')
# not logged
plot(test_dat ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=c(0, 1.2*max(test_dat)))
y1 <- exp(y)  # antilog
lines(x, y1)
mean0 <- sum(test_dat*test_day)/sum(test_dat)
sd0 <- sqrt(sum(test_dat*(test_day - mean0)^2)/(sum(test_dat) - 1))
# cat('\n simple weighted mean =', mean0, ' with sd =', sd0, '\n')
y2 <- dnorm(x, mean0, sd0) * auc  # crude normal scaled to total abundance
lines(x, y2, col="blue")
abline(v=mean0 + sd0*c(-2,-1,0,1,2), lty='dotted', col='blue')
Nonlinear regression (search) for normal distribution
This model minimizes the sum of squared residuals over trial values of the parameters, estimating timing (mean), spread (standard deviation), and total abundance (the Gaussian scale factor) directly.
# the model and the optimization in one function
fitGaussian <- function(x, y, mu, sigma, scale){
  f <- function(p){  # p is c(mu, sigma, scale)
    d <- p[3]*dnorm(x, mean=p[1], sd=p[2])  # predicted from trial values in p
    sum((d - y)^2)  # minimize sum of squared deviations (or another distance)
  }
  # search for minimum, starting from guesses. hessian=T returns the
  # curvature of the SSD manifold at the best fit for parameters
  optim(c(mu, sigma, scale), f, hessian=T)
}
a <- fitGaussian(test_day, test_dat, mu=7.8, sigma=3.5, scale=36)
mean1 <- a$par[1]; sd1 <- a$par[2]; auc1 <- a$par[3]
SSres <- a$value
se <- sqrt(diag(solve(a$hessian)))
r2 <- 1 - SSres/sum((test_dat - mean(test_dat))^2)  # 1 - SSres/SStotal
cat(sep='', 'mean=', round(mean1,2), '(', round(se[1],2),
    '), sd=', round(sd1,2), '(', round(se[2],2),
    '), auc=', round(auc1,2), '(', round(se[3],2),
    '), r2=', round(100*r2,0), '% \n')
par(tcl=0.2)
plot(test_dat ~ test_day, pch=20, yaxs="i",
     xlim=c(0, 1.2*max(test_day)), ylim=c(0, 1.2*max(test_dat)))
lines(x, auc1*dnorm(x, mean1, sd1))
abline(v=mean1 + sd1*c(-2,-1,0,1,2), lty='dotted', col='blue')
Monte Carlo fit
As the first step toward a Bayes-Laplace approach to fitting a multilevel regression, we fit the conventional one-river, one-year model with Hamiltonian Monte Carlo. This uses the Stan language for Bayesian statistics via the R package rstan.
Overfitting is a problem for prediction in ecological systems with a superabundance of habitat indicators. Leave-one-out statistics, where each data point is predicted from a model fitted with that point omitted, are a safeguard against overfitting. The latest development of this is "PSIS-LOO" (Vehtari et al., 2017), available via the R package loo.
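A sketch of that workflow with the loo package, assuming a Stan fit whose generated quantities block saves a log_lik vector (not run here):

# library(loo)
# log_lik <- extract_log_lik(fit, parameter_name="log_lik")  # draws x data
# psis <- loo(log_lik)  # PSIS-LOO estimate of out-of-sample prediction
# print(psis)           # elpd_loo, p_loo, and Pareto-k diagnostics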
Experience: scheduling, downloading, and booting rstan took ~70 seconds. The following test of rstan completed in 76 seconds, mainly compiling. Sampling took less than 1/10 second.
# library(rstan)
# stancode <- '
#   data {real y_mean;}
#   parameters {real y;}
#   model {y ~ normal(y_mean,1);}'
# from R help: stan_model
# mod <- stan_model(model_code = stancode)
# fit <- sampling(mod, data = list(y_mean = 0))
# print(fit)
# fit2 <- sampling(mod, data = list(y_mean = 5))
# print(fit2)
library(rstan)
stan_Gauss_code <- '
functions {
  // Gaussian to predict temporal abundance of salmon.
  // faster than (1/(sigma*sqrt(2*pi)))* exp(-(1/2)*((x-mu)/sigma)^2)
  // sqrt(2) = 1.4142135623731, sqrt(2*pi) = 2.506628274631
  vector Gauss(vector x, real mu, real sigma) {
    return (1/(sigma*2.506628274631)) *
           exp(-square((x-mu)/(1.4142135623731*sigma)));
  }
}
data {
  int<lower=0> np;    // count of priors (8)
  vector [np] prior;  // timing: m,s; spread: m,s; auc: m,s; sigma: shape,scale
  int<lower=0> n;     // count of data
  vector [n] x;       // day of year for observation
  vector [n] y;       // estimate of abundance for that day
}
parameters {
  real <lower=0> timing;  // mean of Gaussian
  real <lower=0> spread;  // standard deviation of Gaussian
  real <lower=0> auc;     // area under curve, total abundance
  real <lower=0> sigma;   // sd of residuals
}
model {
  vector [n] ypred;
  // prior knowledge
  timing ~ normal(prior[1], prior[2]);
  spread ~ normal(prior[3], prior[4]);
  auc    ~ normal(prior[5], prior[6]);
  sigma  ~ gamma(prior[7], prior[8]);
  // prediction from trial parameters
  ypred = auc * Gauss(x, timing, spread);  // Gauss returns a vector
  // likelihood of data given prediction
  y ~ normal(ypred, sigma);
}
'
stanc_ret <- stanc(model_code=stan_Gauss_code)  # check fussy syntax, compile
stan_Gauss <- stan_model(stanc_ret=stanc_ret)   # create model
# create test data and initial guesses
# background temporal distribution
timing <- 280.0; spread <- 15.0; auc <- 2000.0
n <- 8  # number of days observed
# x is day of observation (day of Julian year)
x <- floor(seq(from=timing - 2*spread, to=timing + 2*spread, length.out=n))
# y is estimates of abundance
y0 <- auc*dnorm(x, timing, spread)   # without observation error
y  <- round(y0*rnorm(n, 1, 0.2), 0)  # add 20% error, proportional to y0
cat(' x =', x, '\n y =', y, '\n')
# ----- priors
# guess for auc is linear AUC
auc <- round(AUC(x, y), 0)
# guess for timing is day weighted by count
ptm <- round(sum(x*y)/sum(y), 0)  # ptm abbreviates prior timing mean
# guess for spread is weighted variance
psm <- round(sqrt(sum(y*(x - ptm)^2)/sum(y)), 0)  # ignore supposed bias
plot(x, y0, pch=20, ylim=c(0, 1.2*max(y)))
points(x, y)
lines(x, auc*dnorm(x, ptm, psm))
cat(sep='', 'calculated priors: timing=', ptm, ', spread=', psm,
    ', AUC =', auc, '\n')
# sigma (sd of normal residuals) has prior gamma(shape, scale)
k <- exp(seq(-4, 3, 0.25))  # interested in behaviour near zero
plot(k, dgamma(k, shape=5, rate=1), type="l", ylim=c(0, 0.25))
lines(k, dgamma(k, shape=10, rate=1), col="red")
prior <- c(
  timing_m    = ptm,       # 1. about 280 (September 30 is day 273)
  timing_s    = 7.0,       # 2. range, 4 sd is 1 moon
  spread_m    = psm,       # 3. about 2 moons
  spread_s    = 3,         # 4. range 1 to 4 moons. Avoid negative!
  auc_m       = auc,       # 5. simplest estimate
  auc_s       = 0.25*auc,  # 6. sd is 1/4 of the estimate
  sigma_shape = 10.0,      # 7. high and wide. Y should be scaled: y/max(y)
  sigma_scale = 1.0)       # 8.
print(prior)  # shows names
dat <- list(np=8, prior=prior, n=8, x=x, y=y)
# might need to initialize HMC
# one chain for testing
# init1 <- list(
#   chain1=list(timing=ptm, spread=psm, auc=auc, sigma=10.0))
# print(as.data.frame(init1))
# four chains for fitting
# init4 <- list(
#   chain1=list(timing=ptm*1.10, spread=psm*1.10, auc=auc*1.10, sigma=12.0),
#   chain2=list(timing=ptm*1.05, spread=psm*1.05, auc=auc*1.05, sigma=10.0),
#   chain3=list(timing=ptm*0.95, spread=psm*0.95, auc=auc*0.95, sigma=8.0),
#   chain4=list(timing=ptm*0.90, spread=psm*0.90, auc=auc*0.90, sigma=6.0))
# print(as.data.frame(init4))
# I used these print statements in the model block to debug Stan code
# with (iter=1, chains=1):
# // print("timing =",timing);  // to check Stan code
# // print("spread =",spread);
# // print("auc =",auc);
# // print("sigma =",sigma);
# // print("x =",x);
# // print("y =",y);
# // print("ypred =",ypred);
## test my function in the Stan function block
# expose_stan_functions(stan_Gauss)
# Gauss(c(2,3,4), 0, 1)
# [1] 0.0539909665 0.0044318484 0.0001338302
# dnorm(c(2,3,4), 0, 1)
# [1] 0.0539909665 0.0044318484 0.0001338302
fit <- sampling(stan_Gauss, data=dat, chains=4, iter=2000)  # init=init1
print(fit)
pairs(fit, condition=0.2)
Experience
Stan is "strongly typed" to enforce rigor in coding, an attempt to avoid subtle bugs. Which takes some getting used to, a contrast to R. I found vector [n] x confusing compared to real x [n], the latter being an array, so you could have real [m] x [n] which would be n arrays of m values. And then row_vectors which are not vectors or arrays, nor rows of a matrix. I see the point: protect domain experts (salmon ecologists) from their vague notions of linear algebra.
This model does not work with the default improper priors; their use causes the sampling to get stuck, so explicit priors are supplied for all four parameters.
Parameters were recovered accurately, but that is a trivial result because the true model was known. In practice, the true model is unknown (infinitely complicated) and a serviceable while simplistic description must be discovered by exploring real data.
Many Rivers, Many Years
Multilevel Regression
A simple, linear, multivariate, multilevel regression model would be

$$y_i = X_i\,\beta_{j[i]} + \epsilon_i$$

where $X$ is the matrix of independent variables and the fitted parameters $\beta_j$ vary by group $j$ (here, river or year), with the group-level coefficients drawn from a common distribution.
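A sketch of the three pooling schemes, written out for contrast, for observations $i$ in groups $j[i]$:

$$\text{complete pooling:}\quad y_i = X_i\beta + \epsilon_i$$
$$\text{no pooling:}\quad y_i = X_i\beta_{j[i]} + \epsilon_i,\quad \beta_j \text{ estimated independently}$$
$$\text{partial pooling:}\quad y_i = X_i\beta_{j[i]} + \epsilon_i,\quad \beta_j \sim \mathrm{Normal}(\mu_\beta, \Sigma_\beta)$$

Partial pooling estimates the group-level distribution ($\mu_\beta$, $\Sigma_\beta$) and the group coefficients simultaneously, which shrinks each $\beta_j$ toward the common mean.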
Some quotes, about how "partial pooling" shares information among groups, might be helpful at this point:
"An extreme approach would be to completely pool all the data and estimate a common vector of regression coefficients β. At the other extreme, an approach with no pooling assigns each group its own coefficient vector
"While complete pooling or no pooling of data across groups is sometimes called for, models that ignore the grouping structures in the data tend to underfit or overfit" (Gelman et al., 2013). Hierarchical [multilevel] modeling provides a compromise by allowing parameters to vary by group at lower levels of the hierarchy while estimating common parameters at higher levels. Inference for each group-level parameter is informed not only by the group-specific information contained in the data but also by the data for other groups as well. This is commonly referred to as borrowing strength or shrinkage." - http://mc-stan.org/rstanarm/articles/glmer.html
River and Year Effects on Timing and Spread
The background model for temporal distribution of salmon spawner abundance is a distribution (not yet identified, Gaussian by default) with two parameters: timing and spread. This is applied to all nyear years and all nriver rivers. The other part of the background is the overall abundance of salmon for each year ("run size"). The extent to which the rivers share a single run of salmon is a matter of subsequent ecological refinement; this analysis starts with the assumption of one run each year shared unevenly across rivers.
The departure from that background for any one river, the river effect on timing and spread, is two parameters. If the data for all rivers were pooled, there would be only the two background parameters; with no pooling, each river would have its own independent pair. Partial pooling instead treats each river's departures as draws from a shared distribution centred on zero, here a double exponential (Laplace) distribution, illustrated below.
# example of the double exponential (Laplace) probability density
LaPlace <- function(x, mu, sigma){
  (1/(2*sigma)) * exp(-abs(x - mu)/sigma)
}
x <- seq(-5, 5, 0.5)
mu <- 0.0
sigma <- 2.0
y <- LaPlace(x, mu, sigma)
par(las=1, tcl=0.2, pch=20, yaxs="i")
plot(y ~ x, type='o', ylim=c(0.0, 1.1/(2*sigma)), ylab='Pr(x)')
River effects are modeled the same way. This first model ignores any year × river effects (interactions between year and river).
Those examples indicate there will be habitat indicators that are covariates (predictors) for the stream and year effects on temporal distributions. It is very clear that spawner behaviour on the West Coast of Vancouver Island is affected by stream flows: spawners assemble in the ocean near a river (estuary) and enter after a rainfall event raises water levels in the river; spawners wait in pools in the river before spawning in response to a rainfall/flow event. This model does not include such behaviour and the attendant habitat indicators, but it lays the foundation for such extensions. In fact, the objective of this exercise is to develop a framework that uses what we know about salmon ecology to improve estimates of salmon abundance.
River and Year Effects on Abundance
The area under the curve of spawner abundance is not the same as the number of spawners because a spawner will reside in the spawning ground for many days. That residence time may be longer for females guarding redds than for males that drift downriver after spawning. A model that accumulated identified spawners would involve a cumulative distribution function, possibly preferable. In this case, the total annual abundance of spawners over all rivers in a year is not the run size considered as the number of salmon independent of multiple observations of the same fish. Determining the number of fish from the sum of daily abundances requires knowing the duration of a fish in a river. That will be an important calibration parameter, but probably complicated by the need for habitat indicators: perhaps warmer river temperatures mean shorter spawning-ground residence.
Abundance is primarily a year effect. There is little prior knowledge in this model, other than non-negativity, but that will not be typical. Autocorrelations apply, such as known long trends in salmon abundance and four-year generation cycles. The application of this model is likely to involve known abundances for all but the last year.
Independent of years (or averaged over all years) there is a river effect for abundance: the proportion of the total that goes to each river. Perhaps proportions vary due to spawning ground area and/or quality, or river length. The existence of sub-populations (or populations within a metapopulation) of salmon that home to one stream and not to others is an important complication. This model assumes that any such sub-populations vary identically among years. The proportions by river sum to the entire abundance of salmon, so the fitted values for that vector of parameters (with length nriver) are not independent; they must sum to unity. Such a vector is a simplex, and a specific probability model is required to meet that constraint: the Dirichlet distribution (Gelman et al., 2013).
After considering temporal distributions and abundance by year and river proportion, the abundance on day $d$ in year $y$ in river $r$ is estimated as

$$N_{d,r,y} = \mathrm{run}_y \cdot p_r \cdot \mathrm{Normal}\!\left(d \mid \mu + a_y + a_r,\; \sigma + b_y + b_r\right)$$

where the timing $\mu$ and spread $\sigma$ are background parameters shared by all rivers and years, $a_y$ and $a_r$ are the year and river effects on timing, $b_y$ and $b_r$ are the effects on spread, $\mathrm{run}_y$ is the annual run size, and $p_r$ is the proportion of the run entering river $r$. The AUC for river $r$ in year $y$ is then $\mathrm{run}_y \cdot p_r$.
River proportion
Describing the relative abundance of spawners between rivers as the proportion of salmon that enter each river means the fitted estimates for those proportions across all rivers must sum to 1. That vector of non-negative parameters is a unit simplex; the Dirichlet distribution (Stan Functions Reference v2.19 §23.1 p126) describes the multinomial probabilities for values within a simplex. The Dirichlet parameter describes how even the probabilities are, with large values (> 100) creating an even distribution and small values (< 1/100) creating extremes (one proportion near 1 and all others near zero). Using 1 as the default looks uniformly random.
data {
  int <lower=1> nriver;
  real <lower=0> alpha;  // large alpha means even proportions
}
parameters {
  simplex [nriver] river_prop;  // constrained 0 < x < 1 and sum = 1
}
model {
  river_prop ~ dirichlet(rep_vector(alpha, nriver));  // sampling run proportions
}
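To see the effect of the concentration parameter, a minimal R sketch using the standard gamma-draw construction of a Dirichlet sample (base R has no rdirichlet):

# one Dirichlet draw of length n via normalized gamma draws
rdirichlet1 <- function(n, alpha){
  g <- rgamma(n, shape=alpha, rate=1)
  g/sum(g)
}
set.seed(1)
round(rdirichlet1(5, 100), 3)   # large alpha: nearly even proportions
round(rdirichlet1(5, 1), 3)     # alpha = 1: uniform on the simplex
round(rdirichlet1(5, 0.01), 3)  # tiny alpha: one proportion near 1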
Model Overview
"In theory, a Bayesian model should include all relevant substantive knowledge and subsume all possible theories. In practice, it won’t." - Jeffrey B. Arnold https://jrnold.github.io/bayesian_notes/model-checking.html
Any model is one of many possible variants, so we need a basis for comparing models. An indicator of model fit based on predictive ability uses cross-validation: leave one observation out (LOO) when fitting, then predict that excluded observation. Conceptually, the fitting is repeated for every data point, but there is a mathematical shortcut. The required information is the log likelihood of each data point, produced and saved in the generated quantities block of a Stan model. http://mc-stan.org/loo/reference/loo-package.html
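Until the generated quantities block is sorted out (see the note in the Stan code below), the pointwise log likelihood can be computed in R from the posterior draws; a sketch for the single-river fit above (not run here), recomputing ypred per draw:

# post <- rstan::extract(fit)  # draws of timing, spread, auc, sigma
# ndraw <- length(post$sigma)
# log_lik <- matrix(NA, ndraw, length(x))
# for (s in 1:ndraw){
#   ypred <- post$auc[s] * dnorm(x, post$timing[s], post$spread[s])
#   log_lik[s, ] <- dnorm(y, ypred, post$sigma[s], log=TRUE)
# }
# loo::loo(log_lik)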
Stan code
This code uses an overly simplified version of salmon abundance patterns, but tries to get the surrounding statistical model right.
library(rstan)
stan_MultiGauss_code <- '
functions {
  // Gaussian to predict temporal abundance of salmon.
  // faster than (1/(sigma*sqrt(2*pi)))* exp(-(1/2)*((x-mu)/sigma)^2)
  // sqrt(2) = 1.4142135623731, sqrt(2*pi) = 2.506628274631
  // NOT the vector version
  real Gauss(real x, real mu, real sigma) {
    return (1/(sigma*2.506628274631)) *
           exp(-square((x-mu)/(1.4142135623731*sigma)));
  }
}
data {
  int<lower=0> np;       // length of vector of priors,
  vector [np] prior;     // does not include river_prop_prior
  int<lower=0> nriver;   // how many rivers
  int<lower=0> nyear;    // how many years
  int<lower=0> ndata;    // how many data
  vector [ndata] x;      // day of year for each observation
  vector [ndata] y;      // estimate of abundance for each day in x
  int river_id [ndata];  // id of river in data. integer array
  int year_id [ndata];   // id of year
  vector [nriver] river_prop_prior;  // Dirichlet, vector of ones, even
}
parameters {
  vector <lower=0> [nyear] run;  // annual spawner abundance
  real <lower=0> timing;         // all years and rivers
  real <lower=0> spread;         // all years and rivers
  vector [nyear] year_timing;    // by year, all rivers
  vector [nyear] year_spread;    // by year, all rivers
  vector [nriver] river_timing;  // by river, all years
  vector [nriver] river_spread;  // by river, all years
  simplex [nriver] river_prop;   // fraction of run
  real <lower=0> sigma;          // stdev of Gaussian residuals
}
model {
  // local variables
  vector [ndata] ypred;  // predicted
  int year;              // index
  int river;             // index
  real timing_pred;      // background+yearEffect+riverEffect
  real spread_pred;      // background+yearEffect+riverEffect
  // prior knowledge of probability distributions of parameters
  timing ~ normal(prior[1], prior[2]);  // background
  spread ~ normal(prior[3], prior[4]);  // background
  run ~ normal(prior[5], prior[6]);
  year_timing ~ double_exponential(0, prior[7]);   // shared, all years
  year_spread ~ double_exponential(0, prior[8]);
  river_timing ~ double_exponential(0, prior[9]);  // shared, all rivers
  river_spread ~ double_exponential(0, prior[10]);
  river_prop ~ dirichlet(river_prop_prior);        // separate data vector
  sigma ~ gamma(prior[11], prior[12]);             // wide
  // prediction from a sampled point in parameter space
  for (j in 1:ndata){
    year = year_id[j];
    river = river_id[j];
    timing_pred = timing + year_timing[year] + river_timing[river];
    spread_pred = spread + year_spread[year] + river_spread[river];
    ypred[j] = run[year] * river_prop[river] *
               Gauss(x[j], timing_pred, spread_pred);
  }
  // accumulate log likelihood of data given this prediction
  y ~ normal(ypred, sigma);
}
'
# generated quantities {
#   vector[ndata] log_lik;
#   log_lik = normal_lpdf(ypred | y, sigma);
# }
# YPRED DOES NOT EXIST in generated quantities; it would have to be
# recomputed there (or the log likelihood computed in R, as sketched above).
stanc_ret <- stanc(model_code=stan_MultiGauss_code)  # parse
stan_MultiGauss <- stan_model(stanc_ret=stanc_ret)   # create model
Test Data
If the statistical model cannot recover the parameters (within sampling variance) behind simulated data that exactly matches the model, then something is wrong. I started with 5 rivers and 5 years but moved to 10 of both; the application will likely be more like 20 of both. In reality, timing and spread will be responses to river flow events, clearly so for WCVI, less so for the Fraser River with its protracted freshwater migration. Big rivers still have habitat events that control salmon migrations, such as high water temperature blocking the Okanogan River for Osoyoos Lake sockeye waiting in the slightly cooler Columbia River.
The most worrisome bit is the proportion of the run that goes to each river, averaged across all years. That involves Dirichlet priors for proportions (a simplex, summing to 1). And it seems unrealistic: the proportions will not be constant, and it assumes one population spread across rivers rather than one population per river, in which case the number of spawners in each river would be independent of the other rivers.
# data and initial guesses for testing stan_MultiGauss
# dat[] will have 10 items:
#   int<lower=0> np;       // length of vector of priors,
#   vector [np] prior;     // does not include river_prop_prior
#   int<lower=0> nriver;   // how many rivers
#   int<lower=0> nyear;    // how many years
#   int<lower=0> ndata;    // how many data
#   vector [ndata] x;      // day of year for each observation
#   vector [ndata] y;      // estimate of abundance for each day in x
#   int river_id [ndata];  // id of river in data. integer array
#   int year_id [ndata];   // id of year
#   vector [nriver] river_prop_prior;  // Dirichlet, even when vector of ones

# setup to simulate salmon abundance (before observing)
nriver <- 10
nyear <- 10
timing <- 280.0  # day, background timing (mean of Gaussian)
spread <- 10.0   # days, background spread (SD of Gaussian)
year_timing_sd <- 4
year_spread_sd <- 2
river_timing_sd <- 4
river_spread_sd <- 2
year_timing <- sort(rnorm(nyear, 0, year_timing_sd))
year_spread <- rnorm(nyear, 0, year_spread_sd)
river_timing <- sort(rnorm(nriver, 0, river_timing_sd), decreasing=T)
river_spread <- rnorm(nriver, 0, river_spread_sd)
# run size: 3k to 7k, so observations are > 1
run <- 1000.0 * seq(3, 7, length.out=nyear)  # normal-ish, mean 5k
# river proportion of run (over all years)
river_prop <- c(1:nriver) + 3
river_prop <- river_prop/sum(river_prop)  # 10 rivers: 5% to 15%
river_prop_prior <- rep(1, nriver)
cat('background timing =', timing, 'background spread =', spread,
    '\n year_timing = \n ', round(year_timing, 2),
    '\n year_spread = \n ', round(year_spread, 2),
    '\n river_timing = \n ', round(river_timing, 2),
    '\n river_spread = \n ', round(river_spread, 2),
    '\n run = \n ', run,
    '\n river_prop = \n ', round(river_prop, 3), '\n')
# set days of observation in each river, always the same.
nday <- 6  # surveys in one river in one year
from <- timing - 2*spread; to <- timing + 2*spread
x1 <- floor(seq(from, to, length.out=nday))  # days of year for obs.
x <- rep(x1, nyear*nriver)  # in every river every year
# id: index into x and y for river and year
ndata <- nday*nriver*nyear
year_id <- rep(1:nyear, each=nriver*nday)  # 60 of 1, 60 of 2, ... length 600
river_id <- rep(rep(1:nriver, each=nday), nyear)
# 1,1,1,1,1,1, 2,2,2, ... 10,10,10, 1,1,1, ... for 600
# y is simulated observations. order is year, river, day. an array may be better.
y <- numeric(nyear*nriver*nday)
jn <- 0  # index into data
for (jy in 1:nyear){
  for (jr in 1:nriver){
    t <- timing + year_timing[jy] + river_timing[jr]
    s <- spread + year_spread[jy] + river_spread[jr]  # was year_spread[jr], a bug
    r <- run[jy] * river_prop[jr]
    for (jd in 1:nday){
      jn <- jn + 1
      y[jn] <- r * dnorm(x1[jd], t, s)  # true abundance
    }
  }
}
yobs <- y * rnorm(length(y), 1, 0.1)  # 10% observation error, proportional
a <- cbind(year_id, river_id, x, round(y,2), round(yobs,2))
print(a[c(1:35, 565:600), ])
plot some simulated years
par(las=1, tcl=0.2, pch=20, yaxs="i", mgp=c(0,.5,0), mar=c(0,0,.2,0),
    mfrow=c(5,1))
colr <- rep(c("black","blue","green","orange","red"), 2)  # recycled, 10 rivers
years <- c(1, 4, 6, 8, 10)
for (year in years){
  b <- a[(a[,1] == year), ]
  ymax <- max(b[,4:5])
  plot(y ~ x, type='n', ylim=c(0, ymax), xaxt="n", ylab='', xlab='')
  axis(1, labels=F); axis(3, labels=F); axis(4, labels=F)
  k <- 0
  for (j in seq(0, nriver*nday - 1, nday)){  # 0, 6, 12, ..., 54
    k <- k + 1
    lines( b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 4], col=colr[k])  # nday obs.
    points(b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 5], col=colr[k])
  }
}
plot some simulated rivers
par(las=1, tcl=0.2, pch=20, yaxs="i", mgp=c(0,.5,0), mar=c(0,0,.2,0),
    mfrow=c(5,1))
colr <- rep(c("black","blue","green","orange","red"), 2)  # recycled, 10 years
for (river in c(1, 4, 6, 8, 10)){
  b <- a[(a[,2] == river), ]
  plot(y ~ x, type='n', ylim=c(0.0, max(b[,4:5])), xaxt="n", ylab='', xlab='')
  axis(1, labels=F); axis(3, labels=F); axis(4, labels=F)
  k <- 0
  for (j in seq(0, nyear*nday - 1, nday)){  # 0, 6, 12, ..., 54
    k <- k + 1
    lines( b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 4], col=colr[k])  # nday obs.
    points(b[(j+1):(j+nday), 3], b[(j+1):(j+nday), 5], col=colr[k])
  }
}
Define prior knowledge
# ----- priors --------
# prior knowledge of probability distributions of parameters
prior <- numeric(12)
# timing ~ normal(prior[1], prior[2]);  // background
prior[1:2] <- c(timing, 7)   # too tight, loosen later
# spread ~ normal(prior[3], prior[4]);  // background
prior[3:4] <- c(spread, 3)   # too tight, loosen later
# run ~ normal(prior[5], prior[6]);
prior[5:6] <- c(5000, 1000)  # too tight, loosen later
# year_timing ~ double_exponential(0, prior[7]);   // shared, all years
# year_spread ~ double_exponential(0, prior[8]);
# river_timing ~ double_exponential(0, prior[9]);  // shared, all rivers
# river_spread ~ double_exponential(0, prior[10]);
prior[7:10] <- c(year_timing_sd, year_spread_sd,
                 river_timing_sd, river_spread_sd)
# sigma ~ gamma(prior[11], prior[12]);  // wide
prior[11:12] <- c(10, 1)
# river_prop ~ dirichlet(river_prop_prior);  // separate data vector
river_prop_prior <- rep(1.0, nriver)
# see the preceding MC fit for priors of parameters' probability distributions
names(prior) <- c('timing_m', 'timing_sd', 'spread_m', 'spread_sd',
                  'run_m', 'run_sd',
                  'year_timing_exp', 'year_spread_exp',
                  'river_timing_exp', 'river_spread_exp',
                  'sigma_gamma_shape', 'sigma_gamma_scale')
print("prior"); print(prior)
print("river_prop_prior"); print(river_prop_prior)
# sigma (sd of normal residuals) has prior gamma(shape=10, scale=1)
k <- exp(seq(-4, 3, 0.25))  # interested in behaviour near zero
plot(k, dgamma(k, shape=5, rate=1), type="l", ylim=c(0, 0.25),
     xlab='', ylab='Probability Density', main='shape = 5, 10; rate = 1')
lines(k, dgamma(k, shape=10, rate=1), col="red")
dat <- list(np=12, prior=prior, nyear=nyear, nriver=nriver, ndata=ndata,
            x=x, y=yobs, year_id=year_id, river_id=river_id,
            river_prop_prior=river_prop_prior)
print("This is dat:"); print(dat)
# test run
library(rstan)
s1 <- sampling(stan_MultiGauss, data=dat, chains=4, iter=4000,
               cores=4, verbose=F)
print(s1)
plot(s1, pars=c('year_timing','year_spread','river_timing','river_spread'))
References
Akenhead SA, Irvine JR, Hyatt KD, Johnson SC, Grant SCH. Stock-recruit analyses of Fraser River sockeye salmon. N. Pac. Anadr. Fish Comm. Bull. 2016;6: 363–390. doi: 10.23849/npafcb6/363.390.
Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv:1701.02434 [Preprint]. 2018.
Cunningham JP. Gaussian Processes for machine learning. Invited lecture at the Machine Learning Summer School 2012, La Palma, Spain, Apr. 18, 2012. Video: https://www.youtube.com/watch?v=BS4Wd5rwNwE orcid.org/0000-0002-1805-9743
Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press; 2007.
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. 3rd edition. London: Chapman & Hall/CRC Press; 2013.
Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A. Stan: A Probabilistic Programming Language. Journal of Statistical Software. 2017;76(1): 1–32. doi: 10.18637/jss.v076.i01. ISSN 1548-7660.
Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. 2017;27(5): 1413–1432. doi: 10.1007/s11222-016-9696-4. arXiv:1507.04544 [Preprint].