6. Building a data object in R

Rceattle normally reads input data from an Excel workbook via read_data(). The same data object can be built entirely in R as a named list — useful when:

Simulating data for testing or MSE operating models
Generating inputs programmatically (e.g., multi-scenario loops or sensitivity analyses)
Integrating Rceattle into pipelines that don’t use Excel

Everything passed to fit_mod() lives in this list, so understanding its structure also helps you modify existing data objects read from Excel. This vignette walks through each component. The result can be validated with data_check() and fitted with fit_mod(), or exported back to Excel with write_data().

Required vs. optional fields

The list are optional and only needed when a particular feature is turned on. Rceattle fills missing optional fields with safe defaults and stops model runs under the conditions that actually require them:

Field	Required when
`comp_data`	(always optional)
`caal_data`	`any(growth_model > 0)` (estimating growth)
`emp_sel`	any fleet has `Selectivity = "Fixed"`
`NByageFixed`	`any(estDynamics > 0)`
`env_data`	env linkages
`diet_data`	`msmMode > 0`
`ration_data`	`msmMode > 0`
Bioenergetics scalars (`Ceq`, `CA`, `CB`, `Tco`, …)	`msmMode > 0`

In default single-species mode (msmMode = 0) you can omit any of these by simply not setting the field on the list. The single-species example below shows the full machinery, but feel free to skip sections you don’t need.

Setup

library(Rceattle)

# Define core dimensions up front — these drive the size of all downstream arrays.
nspp     <- 2   # number of species (or stocks)
nages    <- 10  # maximum age classes (same for all species here)
nlengths <- 20  # number of length bins (for age-length keys / CAAL data)
nyrs     <- 30  # number of hindcast years
years    <- 1:nyrs

simData <- list()

1. Species / model controls

These scalars and vectors describe the biological structure of each species and control which population model is used.

simData$nspp    <- nspp
simData$styr    <- 1       # first year of hindcast
simData$endyr   <- nyrs    # last year of hindcast
simData$projyr  <- nyrs + 10  # last year of the projection period

# Character labels shown in plots and output tables
simData$spnames <- paste0("Species", 1:nspp)

# nsex: 1 = combined-sex model (male and female treated as one group),
#        2 = sex-structured model (separate selectivity and mortality by sex)
simData$nsex    <- rep(1, nspp)

# Calendar month of spawning — used to compute the fraction-of-year weight
# applied to spawning stock biomass (e.g., 6 = mid-year spawning)
simData$spawn_month <- rep(6, nspp)

simData$nages   <- rep(nages, nspp)   # number of age classes per species
simData$minage  <- rep(1, nspp)       # minimum age (usually 1)
simData$nlengths <- rep(nlengths, nspp)

# estDynamics controls how population abundance is estimated:
#   0 = estimate population dynamics (standard statistical catch-at-age)
#   1 = fix N-at-age exactly to values in NByageFixed (e.g., external estimates)
#   2 = scale NByageFixed by a single estimated scalar per species
simData$estDynamics <- rep(0, nspp)

# Indices linking each species to its weight and growth datasets.
# Multiple weight schedules can exist (e.g., fleet-specific); these point to
# which one is used for total biomass (pop) and spawning biomass (ssb).
simData$pop_wt_index <- 1:nspp
simData$ssb_wt_index <- 1:nspp
simData$pop_age_transition_index <- rep(1, nspp)

# Length-weight relationship coefficients: W = alpha * L ^ beta.
# Used internally to convert length-based selectivity to weight when estimating growth.
simData$alpha_wt_len <- rep(0.00001, nspp)
simData$beta_wt_len  <- rep(3, nspp)

# Prior or initial standard deviation for log-recruitment deviations. Controls the
# strength of the recruitment penalty in penalized-likelihood mode.
simData$sigma_rec_prior <- rep(1, nspp)

# Non-modelled ("other") prey biomass available to predators. Used in
# multi-species mode to prevent unrealistically high predation mortality when
# modelled prey are scarce. Setting this too low forces predators to over-eat
# modelled prey; too high dilutes the predation signal.
simData$other_food <- rep(1e5, nspp)

2. Fleet control table

This is the most important configuration table. Each row defines one fleet (fishery or survey) and controls how it is modelled. Multiple fleets can target the same species.

# Fleet_type:
#   "Fishery" -- contributes catch and catch-at-age/length composition data
#   "Survey"  -- contributes an abundance index and composition data
#   "Off"     -- fleet is present in the data but excluded from the likelihood

# Selectivity options (Selectivity column) — use the string name or integer code:
#   "Fixed"              (0) -- fixed at values in emp_sel; not estimated
#   "Logistic"           (1) -- two-parameter ascending logistic (a50, slope)
#   "NonParametric"      (2) -- freely estimated selectivity-at-age/length with smoothness penalty
#   "DoubleLogistic"     (3) -- ascending + descending dome-shaped logistic
#   "DescendingLogistic" (4) -- descending logistic only (dome-shaped for oldest ages)
#   "Hake"               (5) -- age-specific logistic with spawning migration pattern
#   "2DAR1"              (6) -- 2D autoregressive random effects (age × year)
#   "3DAR1"              (7) -- 3D autoregressive random effects (age × year × fleet)
#
# Selectivity_dimension: "Age" or "Length"
#   Length-based selectivity requires an age-to-length transition matrix (see below).

# Catchability options (Catchability column) — use the string name or integer code:
#   "Fixed"                (0) -- Q fixed at Q_prior; not estimated
#   "Estimated"            (1) -- freely estimated on log scale
#   "Estimated-with-prior" (2) -- estimated with a lognormal prior (Q_prior, Q_sd_prior)
#   "Analytical"           (3) -- Ludwig & Walters / Martell (1994) analytical Q
#   "PowerEquation"        (4) -- power-equation Q (habitat area scaling)
#   "Environmental"        (5) -- mu_q + X * beta, where X is a covariate in env_data
#   "AR1"                  (6) -- auto-regressive catchability (Rogers et al. 2024)
#   NA                         -- not applicable (use for fisheries without CPUE)

# Comp_loglike / CAAL_loglike options:
#   "Multinomial"           -- standard multinomial
#   "DirichletMultinomial"  -- overdispersed; estimates a concentration parameter
#   "MultinomialAFSC"       -- AFSC variant of multinomial (marginal age + length)

simData$fleet_control <- data.frame(
  Fleet_name  = paste0(c("Survey", "Fishery"),
                       rep(paste(" Species", 1:nspp), each = 2)),
  Fleet_code  = 1:(nspp * 2),       # unique integer ID per fleet
  Fleet_type  = rep(c("Survey", "Fishery"), nspp),
  Species     = rep(1:nspp, each = 2),
  Month       = 0,                   # 0 = mid-year; calendar month otherwise

  # Selectivity_index links fleets that share the same selectivity curve.
  # Fleets with the same index share all selectivity parameters.
  Selectivity_index         = 1:(nspp * 2),
  Selectivity               = "Logistic",
  Selectivity_dimension     = "Age",
  N_sel_bins                = NA,    # used for NonParametric only
  Sel_curve_pen1            = NA,    # smoothness penalty (NonParametric)
  Sel_curve_pen2            = NA,
  # Time_varying_sel:
  #   0 / "Off" = time-invariant (default)
  #   1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_sel_sd_prior
  #       unless random_sel = TRUE in fit_mod)
  #   2 / "AR1" = AR(1) (not yet implemented)
  #   3 / "Block" = time blocks (specify via R-side mapping)
  #   4 / "RandomWalk" = random walk
  #   5 / "RandomWalkAscending" = random walk on ascending portion of double logistic
  Time_varying_sel          = 0,
  Time_varying_sel_sd_prior = 1,
  Bin_first_selected        = 1,     # youngest fully-selected age or length bin
  Sel_norm_bin1             = NA,    # bin to normalise selectivity from (optional)
  Sel_norm_bin2             = NA,

  # Composition likelihoods
  Comp_loglike = "Multinomial",
  Comp_weights = 1,                  # input effective sample size multiplier
  CAAL_loglike = "Multinomial",
  CAAL_weights = 1,

  # Index units and weight/growth linkage
  Weight1_Numbers2     = 1,          # 1 = weight (mt), 2 = numbers
  Weight_index         = rep(1:nspp, each = 2),
  Age_transition_index = 1,

  # Q_index links fleets that share the same catchability parameter.
  Q_index              = 1:(nspp * 2),
  Catchability         = rep(c("Estimated", NA), nspp),
  Q_prior              = rep(c(1,   NA), nspp),
  Q_sd_prior           = rep(c(0.2, NA), nspp),
  # Time_varying_q:
  #   0 / "Off" = time-invariant
  #   1 / "IID" = penalized log-normal deviates (SD fixed at Time_varying_q_sd_prior
  #       unless random_q = TRUE in fit_mod)
  #   2 / "AR1" = AR(1) (not yet implemented)
  #   3 / "Block" = time blocks
  #   4 / "RandomWalk" = random walk
  Time_varying_q          = rep(c(0, NA), nspp),
  Time_varying_q_sd_prior = rep(c(1, NA), nspp),

  # Estimate_index_sd:
  #   0 = fix at Log_sd from index_data
  #   1 = estimate a single SD
  #   2 = analytical sigma (Ludwig & Walters 1994)
  Estimate_index_sd  = rep(c(0,  NA), nspp),
  Index_sd_prior     = rep(c(1,  NA), nspp),

  # Estimate_catch_sd:
  #   0 = fix at Log_sd from catch_data (treats catch as known)
  #   1 = estimate
  Estimate_catch_sd  = rep(c(NA, 0),  nspp),
  Catch_sd_prior     = rep(c(NA, 1),  nspp),

  # proj_F_prop: how future F is distributed across fishery fleets per species.
  # Must sum to 1 across fishery fleets for each species. NA for surveys.
  proj_F_prop = rep(c(NA, 1), nspp)
)

3. Survey index data

One row per fleet-year combination for survey fleets. Observation is the log-scale biomass index (or numbers if Weight1_Numbers2 = 2). A negative Year value tells Rceattle to predict the index without including it in the likelihood, which is useful for model checking.

simData$index_data <- data.frame(
  Fleet_name        = rep(paste("Survey Species", 1:nspp), each = nyrs),
  Fleet_code        = rep(seq(1, nspp * 2, by = 2), each = nyrs),
  Species           = rep(1:nspp, each = nyrs),
  Year              = rep(years, nspp),
  Month             = 0,
  # Selectivity_block: integer index into the selectivity/Q parameter block.
  # Used when Time_varying_sel or Time_varying_q = 3 (time blocks).
  Selectivity_block = 1,
  Observation       = rnorm(nyrs * nspp),   # replace with real log-index values
  Log_sd            = 0.1                   # observation SE on the log scale
)

4. Catch data

One row per fishery fleet-year combination. Log_sd is typically set small (0.01–0.05) when catch is treated as known; increase it if catch is highly uncertain.

simData$catch_data <- data.frame(
  Fleet_name        = rep(paste("Fishery Species", 1:nspp), each = nyrs),
  Fleet_code        = rep(seq(2, nspp * 2, by = 2), each = nyrs),
  Species           = rep(1:nspp, each = nyrs),
  Year              = rep(years, nspp),
  Month             = 0,
  Selectivity_block = 1,
  Catch             = abs(rnorm(nyrs * nspp)),  # replace with real catch (mt)
  Log_sd            = 0.05
)

5. Composition data (optional)

Age or length composition data. Age0_Length1 = 0 for age compositions, 1 for length compositions. Sex = 0 for combined sex, 1 for females, 2 for males, 3 for joint female+male. Composition columns (Comp_1, Comp_2, …) can be raw counts or proportions — Rceattle normalises internally.

Skip this section entirely (don’t set simData$comp_data) if no composition data are available — clean_data() will fill it with an empty data.frame.

comp_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2)),
                      nrow = nyrs * nspp * 2, ncol = nages)
colnames(comp_matrix) <- paste0("Comp_", 1:nages)

simData$comp_data <- cbind(
  data.frame(
    Fleet_name   = c(rep(paste("Survey Species",  1:nspp), each = nyrs),
                     rep(paste("Fishery Species", 1:nspp), each = nyrs)),
    Fleet_code   = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)), each = nyrs),
    Species      = c(rep(1:nspp, each = nyrs), rep(1:nspp, each = nyrs)),
    Sex          = 0,
    Age0_Length1 = 0,
    Year         = rep(years, 2 * nspp),
    Month        = 0,
    Sample_size  = 200
  ),
  comp_matrix
)

6. Conditional age-at-length (CAAL) data (optional)

CAAL data provide observed age frequencies within each length bin and allow the model to jointly fit length and age compositions, and optionally estimate the growth curve internally (see vignette("model-parameterizations") and ?build_growth). One row per fleet-sex-year-length-bin combination.

caal_data becomes required when growth is being estimated internally (any(growth_model > 0)); otherwise it can be omitted.

caal_matrix <- matrix(abs(rnorm(nyrs * nspp * nages * 2 * nlengths)),
                      nrow = nyrs * nspp * 2 * nlengths, ncol = nages)
colnames(caal_matrix) <- paste0("CAAL_", 1:nages)

simData$caal_data <- cbind(
  data.frame(
    Fleet_name  = c(rep(paste("Survey Species",  1:nspp), each = nyrs * nlengths),
                    rep(paste("Fishery Species", 1:nspp), each = nyrs * nlengths)),
    Fleet_code  = rep(c(seq(1, nspp * 2, 2), seq(2, nspp * 2, 2)),
                      each = nyrs * nlengths),
    Species     = c(rep(1:nspp, each = nyrs * nlengths),
                    rep(1:nspp, each = nyrs * nlengths)),
    Sex         = 0,
    Year        = rep(years, 2 * nspp * nlengths),
    Length      = rep(1:nlengths, 2 * nspp * nyrs),
    Sample_size = 200
  ),
  caal_matrix
)

7. Biological inputs

Age-to-length transition matrix

The probability that a fish of age a is observed in length bin l. Required when fitting length compositions or estimating growth internally. Each row must sum to 1. The placeholder below uses an identity matrix (age a maps exactly to length bin a); replace with empirical or model-based growth matrices.

age_trans_list <- lapply(1:nspp, function(sp) {
  tmp <- matrix(0, nrow = nages, ncol = nlengths)
  diag(tmp[1:min(nages, nlengths), 1:min(nages, nlengths)]) <- 1
  colnames(tmp) <- paste0("Length_", 1:nlengths)
  cbind(data.frame(Age_transition_name  = paste("Species", sp),
                   Age_transition_index = sp,
                   Species              = sp,
                   Sex                  = 0,
                   Age                  = 1:nages), tmp)
})
simData$age_trans_matrix <- do.call(rbind, age_trans_list)

Ageing error matrix

The probability that a fish of true age t is read as observed age o. An identity matrix (no error) is a conservative starting point. Ageing error matrices can be generated with standard tools (e.g., the AgeingError package) and substituted here.

age_error_list <- lapply(1:nspp, function(sp) {
  tmp <- as.data.frame(diag(1, nages))
  colnames(tmp) <- paste0("Obs_age", 1:nages)
  cbind(data.frame(Species = sp, True_age = 1:nages), tmp)
})
simData$age_error <- do.call(rbind, age_error_list)

Weight-at-age

Mean weight (metric tonnes) at each age. Setting Year = 0 applies the same schedule to all years (time-invariant). Providing multiple rows with different years enables time-varying weight-at-age, which is used in empirical weight-at-age models.

weight_matrix <- matrix((1:nages / nages)^3 * 0.01,
                        nrow = nspp, ncol = nages, byrow = TRUE)
colnames(weight_matrix) <- paste0("Age", 1:nages)

simData$weight <- cbind(
  data.frame(Wt_name  = paste("Species", 1:nspp),
             Wt_index = 1:nspp,
             Species  = 1:nspp,
             Sex      = 0,
             Year     = 0),   # Year = 0 → time-invariant
  weight_matrix
)

Maturity-at-age

Proportion mature at each age, used to compute spawning stock biomass. Values should range from 0 to 1. For sex-structured models (nsex = 2) this represents the proportion of females that are mature.

mat_matrix <- matrix(1, nrow = nspp, ncol = nages)
colnames(mat_matrix) <- paste0("Age", 1:nages)
simData$maturity <- cbind(data.frame(Species = 1:nspp), mat_matrix)

Sex ratio at age

Proportion female at each age. Used in combined-sex models (nsex = 1) to weight spawning biomass contributions. In two-sex models (nsex = 2) this is the proportion female at recruitment.

sex_matrix <- matrix(0.5, nrow = nspp, ncol = nages)
colnames(sex_matrix) <- paste0("Age", 1:nages)
simData$sex_ratio <- cbind(data.frame(Species = 1:nspp), sex_matrix)

Fixed selectivity (optional)

Pre-specified selectivity-at-age values for fleets whose selectivity is fixed (not estimated). Required when any fleet has Selectivity = "Fixed". Skip if all fleets estimate their own selectivity (as in this example).

When you do need it, the table looks like:

# Example for one fleet whose selectivity is fixed at the supplied row of values
emp_sel <- data.frame(
  Fleet_name = "Survey Species 1",
  Fleet_code = 1,
  Species    = 1,
  Sex        = 0,
  Year       = 0,                       # 0 = applied to every year
  t(matrix(c(0.1, 0.5, 1, rep(1, nages - 3)), ncol = 1,
           dimnames = list(paste0("Comp_", 1:nages), NULL)))
)
simData$emp_sel <- NULL

Fixed numbers-at-age (optional)

Used when estDynamics > 0:

1 = numbers-at-age fixed exactly to these values
2 = numbers-at-age scaled by a single estimated scalar per species
3 = numbers-at-age scaled by age-specific estimated scalars per species

Required when any(estDynamics > 0). Skip if you’re estimating population dynamics from scratch (the default estDynamics = 0).

When you do need it, the table has columns Species_name, Species, Sex, Year, then Age1 ... Age<nages> for the fixed numbers-at-age.

Natural mortality (M1 base)

Residual natural mortality-at-age before any predation component. In single-species mode, total mortality M = M1. In multi-species mode, M = M1 + M2 where M2 is predation mortality estimated from the bioenergetics model. When switching to multi-species mode it is important to reduce M1 accordingly (see vignette("single-vs-multispecies")).

m_matrix <- matrix(0.2, nrow = nspp, ncol = nages)
colnames(m_matrix) <- paste0("Age", 1:nages)
simData$M1_base <- cbind(data.frame(Species = 1:nspp, Sex = 0), m_matrix)

M1 can also be estimated rather than fixed. See ?build_M1 for options including estimating a single scalar, an age-specific vector, or a temperature-dependent function.

Weight-at-age

Used for biomass calculations. Linked to species or fleets. Set Year = 0 to apply the same weight-at-age to all years (time-invariant). Age_1 … Age_nages columns hold mean weight (usually kg) at each age.

weight_matrix <- matrix(
  (1:nages / nages)^3 * 0.01,   # Simple power curve placeholder
  nrow = nspp, ncol = nages, byrow = TRUE
)
colnames(weight_matrix) <- paste0("Age", 1:nages)

simData$weight <- cbind(
  data.frame(
    Wt_name  = paste("Species", 1:nspp),
    Wt_index = 1:nspp,
    Species  = 1:nspp,
    Sex      = 0,
    Year     = 0    # Year = 0 applies to all years (time-invariant)
  ),
  weight_matrix
)

Environmental covariate data (optional)

An optional time series used in environment-recruitment, -M1, -catchability, or -growth relationships. Missing years are filled with the column mean. Column names after Year can be anything and are referenced by column order.

Skip this if no environmental covariates are used. clean_data() will fall back to a Year-only data.frame with no indices, and data_check() will only complain when a feature actually needs an index (env-q catchability, Ceq[sp] > 1, or any env linkage).

simData$env_data <- data.frame(
  Year    = 1:nyrs,
  EnvData = 1       # constant = no environmental effect (placeholder)
)

8. Multi-species bioenergetics parameters (optional in single-species mode)

These parameters govern the Wisconsin bioenergetics model used to compute predation mortality (M2) in multi-species mode. They are unused in single-species mode and can be omitted entirely — switch_check() fills any missing scalars with safe sentinels (Ceq = 1, Pvalue = 1, fday = 365, others 0) so TMB’s length-nspp DATA_VECTOR requirements are satisfied.

In multi-species mode (msmMode > 0) all of Ceq, Cindex, Pvalue, fday, CA, CB, Qc, Tco, Tcm, Tcl, CK1, CK4 must be supplied as length-nspp vectors; data_check() will list any that are missing or wrong-length. Likewise diet_data and ration_data are required.

See ?build_M1 and the CEATTLE technical documentation for full details on each parameter.

# Ceq: consumption equation form following Hanson et al. (1997)
#   1 = Kitchell et al. (1977)
#   2 = Stewart et al. (1983) -- most common for gadids
#   3 = Kitchell et al. (1977) with temperature dependence on both limbs
#   4 = Thornton and Lessman (1978)
simData$Ceq    <- rep(4,  nspp)

simData$Cindex <- rep(1,  nspp)   # temperature index type
simData$Pvalue <- rep(1,  nspp)   # proportion of maximum consumption realised
simData$fday   <- rep(1,  nspp)   # foraging days per year

# Wisconsin model coefficients (CA, CB) and temperature dependence parameters
simData$CA  <- rep(1,  nspp)
simData$CB  <- rep(-1, nspp)
simData$Qc  <- rep(1,  nspp)
simData$Tco <- rep(1,  nspp)
simData$Tcm <- rep(1,  nspp)
simData$Tcl <- rep(1,  nspp)
simData$CK1 <- rep(1,  nspp)
simData$CK4 <- rep(1,  nspp)

# Diet composition likelihood for each predator species
# 0 = Multinomial, 1 = Dirichlet-multinomial
simData$Diet_loglike      <- rep(0, nspp)
simData$Diet_comp_weights <- rep(1, nspp)

# Ration data (optional in single-species mode): observed mean ration-at-age from
# bioenergetics studies. Required when msmMode > 0. Skip in single-species mode.
#
# When provided, columns are: Species, Sex, Year, Age1, ..., Age<nages>.

# Diet composition data: observed stomach content proportions (predator-prey pairs).
# Required when msmMode > 0; skip in single-species mode.
#
# When provided, columns are: Pred, Prey, Pred_sex, Prey_sex, Pred_age,
# Prey_age, Year, Sample_size, Stomach_proportion_by_weight.

9. Validate and fit

plot_data() prints a summary of the data list.

plot_data(simData)

# Key fit_mod() arguments:
#
# estimateMode: 0 = full fit (hindcast + projection)
#               1 = hindcast only
#               2 = projection only (fix parameters, run HCR)
#               3 = evaluate only (MakeADFun, no optimisation) -- useful for debugging
#
# msmMode:      0 = single-species, 1 = Type II MSVPA, 2 = Type III MSVPA
#
# initMode:     0 = freely estimated initial N-at-age
#               1 = equilibrium, F_init = 0, no init deviations
#               2 = equilibrium, F_init = 0, with init deviations (default)
#               3 = non-equilibrium: F_init estimated, init devs included
#               4 = non-equilibrium: F_init scales R0
#
# phase:        TRUE = use parameter phasing for improved convergence (recommended)

sim_run <- fit_mod(
  data_list    = simData,
  estimateMode = 3,    # change to 0 for a real fit
  msmMode      = 0,
  initMode     = 2,
  random_rec   = FALSE,
  fit_control = fit_control(
    phase        = TRUE,
    verbose      = 1))

10. Export back to Excel (optional)

write_data() exports the data list to an Excel workbook in the standard Rceattle template format, which can then be edited and re-read with read_data(). This makes it easy to share data objects or move between R-based and spreadsheet-based workflows.

write_data(simData, file = tempfile(fileext = ".xlsx"))

2026-05-11

Required vs. optional fields

Setup

1. Species / model controls

2. Fleet control table

3. Survey index data

4. Catch data

5. Composition data (optional)

6. Conditional age-at-length (CAAL) data (optional)

7. Biological inputs

Age-to-length transition matrix

Ageing error matrix

Weight-at-age

Maturity-at-age

Sex ratio at age

Fixed selectivity (optional)

Fixed numbers-at-age (optional)

Natural mortality (M1 base)

Weight-at-age

Environmental covariate data (optional)

8. Multi-species bioenergetics parameters (optional in single-species mode)

9. Validate and fit

10. Export back to Excel (optional)