AmP estimation procedure

AmP estimation
Concepts

Data and completeness
Parameter estimation
Goodness-of-fit: SMSE / MRE
AmP Literature

Practice - essentials

Starting an estimation for a new species
Setting initial parameter values
Setting weight coefficients
Computing implied properties (to create)
Submitting to the collection (to create)

Practice - extra modules
Code specification
User-defined files: run, mydata, pars_init, predict
Data: Zero-variate, Univariate, Pseudo-data

Typified models
Estimation options

The purpose of this Wiki-style AmP (Add-my-pet) estimation procedure manual is to explain:

  • The concepts (this page): description of the methodology used in the AmP project and for parameter estimation;
  • Practice and getting started (see table on the right for quick-access links): technical aspects (code specifications), and how to use the scripts to arrive at parameter estimates.

AmP is a project based on Dynamic Energy Budget theory for metabolic organization. For a short intro to DEB theory visit the DEB Wikipedia page, and for more about DEB visit our main DEBwiki portal. An introduction to modelling and statistics is given in the document Basic methods for Theoretical Biology.

Notation

This manual and the DEBtool software (download from GitHub) follow the DEB notation.

Data and data types

The data consists of:

  • a set of zero-variate data (i.e. a set of numbers)
  • and, possibly, one or more sets of uni-variate data (each consisting of a list of values for the independent variable and the associated dependent variable).

Data sources are referenced in the mydata file.

Zero-variate data comprise real and pseudo-data points. Real data relate to actual observations on the species of interest at specified temperatures and food conditions. Pseudo-data relate to the generalised animal at the reference temperature. Increasing the number of types of real data (and thus the information they contain) decreases the role of the pseudo-data in the parameter estimation. The impact of pseudo-data on the resulting parameter estimates is controlled by the weight coefficients.

The real data should at least contain the maximum adult weight. However, it is preferable to also include weight and age at birth and puberty as well as the maximum reproduction rate. This combination already fixes the growth curve in a crude way, and specifies kap and the maturity thresholds at birth and puberty. Notice that times and rates without temperature are meaningless. The weights can be dry, ash-free dry or wet weights, but the type of weight relates to the specific densities and chemical indices. Assuming that the specific density of wet mass is close to 1 g/cm^3, check the values for d_V and d_E that refer to dry weight.

Pseudo-data are parameter values corresponding to a generalized animal, i.e. typical values for a wide variety of animals (Lika et al 2011). These values may change as the AmP collection grows. Pseudo-data serve to fill possible gaps in the information contained in the real data. Only intensive parameters can play the role of pseudo-data points. Species-specific parameters should not be included in the pseudo-data, especially the zoom factor, the shape coefficient and the maturity levels at birth and puberty. Since the value for the specific cost for structure (E_G) is sensitive to the water content of tissue, which differs between jellyfish and vertebrates, it is replaced by the growth efficiency kap_G. Pseudo-data, if used properly, can play several roles. They increase the identifiability of parameters and thus prevent the ambiguous determination of parameter values. Pseudo-data can also be used to define the area of the parameter space where the parameter values are reasonable.

Generally the use of statistics derived from observations, such as the von Bertalanffy growth rate or the half saturation coefficient, as data from which DEB parameters are estimated, is discouraged. It is far better to base the parameter estimation directly on the measurements, avoiding manipulation or interpretation. For instance, if wet weights were measured, use wet weights as data and do not convert them first to dry weights (or vice versa).

Data quality and availability

The quality and availability of data vary enormously among species, which has consequences for the entries. For comparative purposes, it helps to judge the completeness of the data using a marking system from 0 (low) to 10 (high) (see Table), published in Lika et al. 2011.

Data from field conditions suffer from the problem that temperature and feeding profiles are generally unknown. To a lesser extent, this also applies to laboratory conditions. Only a few species can be cultured successfully, and detailed (chemical) knowledge about nutritional requirements hardly exists for any species. The idea that "some prediction is better than no prediction" fueled the collection (e.g. for management purposes), but where data are guessed, this is clearly indicated in the mydata-files. The hope is that such weak entries will improve over time by supplementing data and re-estimating parameters. Predictions might help to prioritize further research.

Another motivation to include weak entries is that predictions for situations that have not yet been studied empirically can be used to test the theory rigorously. It is encouraging to see how few data already allow for an estimation of parameters. That results are not fully random is supported by the observation that similar species (in terms of body size, habitat and taxonomy) have similar parameter values, despite the lack of advanced data. See, for instance, the different species of tardigrades. The reliability of the resulting estimates and predictions should always be evaluated in the context of the data on which they are based. Generally, the more types of data, the more reliable the results.

Where many different data sources are used, however, conditions can vary to the extent that the variation cannot be ignored. In some mydata-files this is taken into account by assigning different feeding conditions to different data sets. Notice that the scaled functional response only takes differences in food density into account, not differences in food quality. If food qualities differ, the scaled functional response is no longer less than or equal to 1, but might be larger. If food densities and qualities are not specified with the data, however, this "repair" is far from ideal.
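As a minimal sketch of why the scaled functional response stays at or below 1 when only food density varies, the snippet below assumes the usual hyperbolic (Holling type II) form f = X / (K + X), with X the food density and K the half saturation coefficient. The function name and the numbers are illustrative only and are not taken from DEBtool.

```python
def scaled_functional_response(food_density, half_saturation):
    """Assumed hyperbolic form f = X / (K + X); f lies in [0, 1)."""
    if food_density < 0 or half_saturation <= 0:
        raise ValueError("need food_density >= 0 and half_saturation > 0")
    return food_density / (half_saturation + food_density)


# Hypothetical food densities, in the same units as the half saturation coefficient.
for x in (0.0, 1.0, 10.0, 1000.0):
    print(x, scaled_functional_response(x, half_saturation=1.0))
```

Under this form, a value of f above 1 in a data set cannot be explained by food density alone, which is why it is read as a difference in food quality.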

The variation not only concerns environmental conditions, but also differences in parameter values among individuals that have been used. Parameter values tend to vary across the geographical range of a species, a problem that applies to many fish entries. Although parameter values are better fixed with a growing number of data types, the inherent variability works in the opposite direction. This is why marks have been given for both completeness of data and goodness of fit.

Although DEB theory concerns all organisms, the collection is only about animals. The reason is that animals can live off a single (chemically complex) resource, so they can be modeled with a single reserve, and resource availability is relatively simple to characterize. Within the animals, we made an effort to maximize coverage, given limitations imposed by data availability.

Typified models

Different models of DEB theory have been applied to different organisms. Some of the most used models have been formalized and are called typified models. There is a set of instructions to go from model std to abj.

Species specific details which are not included in the computation of implied properties:

  • Acanthocephalans live in the micro-aerobic environment of the gut of their host. They don't use dioxygen, but ferment. It is possible to model this (see Section 4.9.1, Kooijman 2010), but this is not yet implemented in the code behind the calculation of the statistics. These particular respiration predictions should, therefore, be ignored.
  • Cephalopods are typically semelparous (death at first spawning) and die well before approaching ultimate body size. For practical purposes, this early death is included as an effect of ageing, although ageing probably has nothing to do with it. The asymptotic size is calculated in the pars-file and, as a consequence, some of the listed properties are not realistic.
  • The toadlets Crinia lower their allocation fraction to soma between hatch and birth (Mueller et al 2012).
  • Mammals take milk during their baby-stage. Weaning is included in all stx models for mammals as a maturity threshold, but the change in diet is not taken into account.
  • Many birds first reproduce in their second year under (seasonal) field conditions. They apparently have a relatively long juvenile period during most of which they are fully grown. This trait leads to high values for maturity at puberty and low values for maturity maintenance. Husbandry data indicate that birds can potentially reproduce much earlier, which calls the realism of these two parameters into question.


Parameter estimation

Methodology of parameter estimation

We here discuss the estimation of all DEB parameters in context: the AmP method; for details see Marques et al, 2018a and 2018b. Van der Meer 2006 and Kooijman et al, 2008 show which particular compound parameters can be estimated from a few simple observations and how an increasing number of parameters can be estimated if more quantities are observed at several food densities. A natural sequence exists in which parameters can be known in principle. The methodology evolved from the covariation method (Lika et al 2011).

In the AmP collection, parameter values are estimated from a set of data sets by minimizing a parameter-free loss function (see Marques et al 2018a and 2018b). This loss function takes the different dimensions of the various data sets into account and penalizes over-estimation as hard as under-estimation, using all data sets simultaneously. The minimum is found using a Nelder-Mead simplex method. A simplex is a set of parameter-sets with a number of elements that is one more than the number of free parameters. One of the elements in the set is the specified initial parameter set, the seed; the others are generated automatically in its "neighbourhood". The simplex method tries to replace the worst parameter set by one that is better than the best one, i.e. gives a smaller value of the loss function. During the procedure the parameters are (optionally, but by default) filtered to avoid combinations of values that are outside their logical domain (Lika et al 2014).
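The simplex idea can be sketched in a few lines. The code below is a simplified textbook Nelder-Mead variant in plain Python, not the DEBtool implementation: it builds a simplex of n + 1 parameter sets around the seed and repeatedly tries to replace the worst set by reflection, expansion, contraction or shrinking. The function name, the toy linear model, the toy loss and all numbers are assumptions made for illustration, and the parameter filter is left out.

```python
def nelder_mead(loss, seed, step=0.1, max_iter=2000, tol=1e-12):
    """Minimise loss(params) with a simplified Nelder-Mead simplex search."""
    n = len(seed)
    # The simplex holds n + 1 parameter sets: the seed plus n neighbours.
    simplex = [list(seed)]
    for k in range(n):
        neighbour = list(seed)
        neighbour[k] = neighbour[k] * (1 + step) if neighbour[k] != 0 else step
        simplex.append(neighbour)

    for _ in range(max_iter):
        simplex.sort(key=loss)  # best parameter set first, worst last
        best, second_worst, worst = simplex[0], simplex[-2], simplex[-1]
        if loss(worst) - loss(best) < tol:
            break
        # Centroid of all parameter sets except the worst one.
        centroid = [sum(p[k] for p in simplex[:-1]) / n for k in range(n)]

        def move(t):
            # Point on the line through the worst set and the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = move(1.0)
        if loss(reflected) < loss(best):
            # Better than the best set: try to go even further.
            expanded = move(2.0)
            simplex[-1] = expanded if loss(expanded) < loss(reflected) else reflected
        elif loss(reflected) < loss(second_worst):
            simplex[-1] = reflected
        else:
            contracted = move(-0.5)
            if loss(contracted) < loss(worst):
                simplex[-1] = contracted
            else:
                # Shrink every parameter set halfway towards the best one.
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, p)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=loss)


# Toy example (hypothetical): data generated from p = a * t + b with a = 2, b = 1.
times = [1.0, 2.0, 3.0, 4.0]
data = [3.0, 5.0, 7.0, 9.0]


def toy_loss(params):
    a, b = params
    preds = [a * t + b for t in times]
    # Symmetric in data and prediction, so over- and under-estimation count alike.
    return sum((d - p) ** 2 / (d ** 2 + p ** 2) for d, p in zip(data, preds))


seed = [1.0, 0.5]
estimate = nelder_mead(toy_loss, seed)
print(estimate, toy_loss(estimate))
```

The search only explores the region it can reach from the seed, which is one reason why the choice of initial parameter values matters in practice.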

The procedure starts from a set of initial values. Provided that a global minimum has been found, the result does not depend on the initial value.

Obtaining parameter estimates

Estimation of some 15 parameters simultaneously from a variety of data cannot be routine work. You can only expect useful results if your initial estimates are not too far from the resulting estimates. It is best to either use a time-length-energy framework (as done here) or a time-length-mass framework in the selection of primary parameters and not mix them. Both frameworks can be used to predict energies and masses, using conversion factors.

To obtain the estimates, you have to prepare a script-file run_my_pet and three function files: mydata_my_pet, pars_init_my_pet and predict_my_pet.

You can follow the instructions to start an Add-my-pet estimation for a single species. The DEBtool also enables you to estimate parameters for two or more species simultaneously. This can be interesting in the case that different species share particular parameter values, and/or parameter values have particular assumed relationships. The general idea is that the total number of parameters to be estimated for the group is (considerably) smaller than the sum of the parameters to be estimated for each species.

Weight coefficients

The weight coefficients serve to (subjectively) quantify the confidence of the user in the data sets as well as in specific data points. The AmP procedure distinguishes between real and pseudo-data. The weight coefficients are automatically set to w_ij = 1/n_i, where i designates the data set, j the point in data set i, and n_i the number of points in data set i. The motivation is to ensure that each data set contributes equally to the loss function (instead of each data point contributing equally). The default weight coefficients for pseudo-data are handled differently.

The user can overwrite the default weight values, either for a whole data set or for particular values. This is done in the mydata file. The overwriting of the weight coefficient is done by multiplying the default value by a dimensionless factor. See Setting weight coefficients.
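The default weighting and its overwriting can be illustrated with a small sketch. It assumes default weights of 1/n_i for every point of data set i, in line with the motivation that each data set contributes equally. The function names and the data set labels "Wwi" and "tL" are hypothetical and do not reproduce the DEBtool interface.

```python
def default_weights(data_sets):
    """Assumed default: every point of data set i gets weight 1 / n_i."""
    return {
        name: [1.0 / len(values)] * len(values)
        for name, values in data_sets.items()
    }


def scale_weights(weights, name, factor, index=None):
    """Multiply the weights of one data set, or of one point, by a factor."""
    scaled = {key: list(w) for key, w in weights.items()}
    if index is None:
        scaled[name] = [w * factor for w in scaled[name]]
    else:
        scaled[name][index] *= factor
    return scaled


# Hypothetical data: one zero-variate value and one univariate set of 4 points.
data_sets = {"Wwi": [12.5], "tL": [0.8, 1.9, 3.1, 3.8]}

weights = default_weights(data_sets)
weights = scale_weights(weights, "tL", 2.0)           # more confidence in the whole set
weights = scale_weights(weights, "tL", 0.0, index=0)  # exclude the first point
print(weights)
```

With the defaults, the weights of each data set sum to 1 regardless of its number of points, so a long univariate data set does not dominate a single zero-variate value.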

Estimation options

The AmP estimation procedure includes several loss functions. The user defines which loss function to use in the estimation options; the default weight coefficients for pseudo-data depend on which loss function is being used. 'sb' stands for the symmetric bounded loss function and 'su' stands for the symmetric unbounded loss function. Please refer to the Estimation options page to check the default options.
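To make the difference between "bounded" and "unbounded" concrete, the sketch below writes out one possible form of each loss function. These forms are an assumption based on the description in Marques et al 2018 (squared deviations scaled by the squared means of data and predictions per data set); the function names are illustrative, and DEBtool should be consulted for the authoritative definitions.

```python
def mean(values):
    return sum(values) / len(values)


def loss_sb(data, pred, weights):
    """Assumed symmetric bounded form: w (d - p)^2 / (mean(d)^2 + mean(p)^2)."""
    total = 0.0
    for d_set, p_set, w_set in zip(data, pred, weights):  # loop over data sets
        scale = mean(d_set) ** 2 + mean(p_set) ** 2
        total += sum(w * (d - p) ** 2 / scale for d, p, w in zip(d_set, p_set, w_set))
    return total


def loss_su(data, pred, weights):
    """Assumed symmetric unbounded form: w (d - p)^2 (1/mean(d)^2 + 1/mean(p)^2)."""
    total = 0.0
    for d_set, p_set, w_set in zip(data, pred, weights):
        scale = 1.0 / mean(d_set) ** 2 + 1.0 / mean(p_set) ** 2
        total += sum(w * (d - p) ** 2 * scale for d, p, w in zip(d_set, p_set, w_set))
    return total


# Hypothetical data: one zero-variate point and one univariate set of 3 points.
data = [[10.0], [1.0, 2.0, 3.0]]
pred = [[12.0], [1.5, 2.5, 2.0]]
weights = [[1.0], [1.0 / 3.0] * 3]

print(loss_sb(data, pred, weights), loss_su(data, pred, weights))
# A wildly wrong prediction: sb stays below 1 for this point, su keeps growing.
print(loss_sb([[1.0]], [[1e6]], [[1.0]]), loss_su([[1.0]], [[1e6]], [[1.0]]))
```

Both forms are symmetric: swapping data and predictions gives the same value, which is what penalizing over-estimation as hard as under-estimation amounts to here.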

Goodness of fit criterion

For comparative purposes, e.g. to find patterns in parameter values among species, it helps to judge the goodness of fit using the mean relative error (MRE) and the symmetric mean squared error (SMSE). MRE can have values from 0 to infinity, while SMSE has values from 0 to 1. In both cases, 0 means that predictions match data exactly. MRE assesses the differences between data and predictions additively, judging an overestimation and an underestimation of the same relative size equally (e.g. +20% and -20% give the same contribution), while SMSE assesses the difference multiplicatively, judging overestimation and underestimation by the same factor equally (e.g. x2 and x/2 give the same contribution). Notice that the result of the minimization of loss functions does not, generally, correspond with the minimum of MRE or SMSE (unless the fit is perfect).

Relative errors in a univariate data set are summarized to that of a single data-point by taking the MRE for all data-points. Only real data, not pseudo-data, are included in the assessment. If all weight coefficients of a data set are zero, it is not included in the computation of the MRE. The best situation is, of course, that of a small MRE. It is likely that the marks for completeness and goodness of fit will be negatively correlated.
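The two criteria can be sketched as follows. The per-point forms used here, |p - d| / |d| for the relative error and (p - d)^2 / (d^2 + p^2) for the symmetric squared error, are assumptions chosen to match the properties described above; the DEBtool definitions may differ in detail. Function names and numbers are illustrative, and data values are assumed to be non-zero.

```python
def data_set_errors(data, pred, weights):
    """Return (relative error, symmetric squared error) for one data set.

    Returns None if all weights are zero, so that the set can be skipped.
    """
    total = sum(weights)
    if total == 0:
        return None
    points = list(zip(data, pred, weights))
    rel = sum(w * abs(p - d) / abs(d) for d, p, w in points) / total
    sym = sum(w * (p - d) ** 2 / (d ** 2 + p ** 2) for d, p, w in points) / total
    return rel, sym


def goodness_of_fit(data_sets):
    """Average both errors over all real data sets with non-zero weights."""
    errors = [data_set_errors(*ds) for ds in data_sets]
    errors = [e for e in errors if e is not None]
    mre = sum(e[0] for e in errors) / len(errors)
    smse = sum(e[1] for e in errors) / len(errors)
    return mre, smse


# +20% and -20% give the same relative error.
print(data_set_errors([10.0], [12.0], [1.0])[0], data_set_errors([10.0], [8.0], [1.0])[0])
# x2 and x/2 give the same symmetric squared error.
print(data_set_errors([10.0], [20.0], [1.0])[1], data_set_errors([10.0], [5.0], [1.0])[1])

# Hypothetical entry: (data, predictions, weights) per data set; the last set has
# zero weights and is therefore left out of both averages.
entry = [
    ([10.0], [12.0], [1.0]),
    ([1.0, 2.0, 4.0], [1.1, 1.8, 4.4], [1.0, 1.0, 1.0]),
    ([5.0], [50.0], [0.0]),
]
print(goodness_of_fit(entry))
```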

References: Add-my-pet papers