Multi-species parameter estimation

From DEBwiki
Jump to: navigation, search
AmP estimation
Concepts

Data and completeness
Parameter estimation
Goodness-of-fit: SMSE / MRE
AmP Literature

Practice - essentials

Starting an estimation for a new species
Setting initial parameter values
Setting weight coefficients
Computing implied properties
Submitting to the collection

Practice - extra modules
Code specification
User-defined files: run, mydata, pars_init, predict
Data: Zero-variate, Univariate, Pseudo-data

Typified models
Estimation options

DEBtool (download from GitHub) enables you to estimate parameters for two or more species simultaneously. This can be interesting in the case that different species share particular parameter values, and/or parameter values have particular assumed relationships. The general idea is that the total number of parameters to be estimated for the group is (considerably) smaller than the sum of the parameters to be estimated for each species.

The specification of the group is done in the run-file (see below). A simple example of the required multi-species files is provided in DEBtool_M/lib/pet/example_group_estimation.zip.

Required files

The general setup of the multi-species estimation is the same as for single species. There must be 4 types of files:

mydata-files

Each species must have a mydata-file, identical to the single-species situation; the types of data are not restricted and independent for each species.

predict-files

Each species must have a predict-file (which computes predictions for all data, given parameter values), identical to the single-species situation. You can, therefore, copy any mydata-and-predict pairs from the AmP collection; no changes required, no restrictions apply.

pars_init_group-file

A single (combined) file to initiate all parameters for estimation, named pars_init_group. It specifies which parameters are the same, and which are different for each species. This choice is made by the specification of parameter values, which can be scalar (for parameters that are the same for all species), or vector-valued. The length of the vectors must be equal to the number of species (see the run-file below) and the values must also be in the corresponding sequence. The free-setting (zeros or ones) should correspond with the value setting, so also either scalar or vector-valued.

The general idea is to copy-modify an existing single-species pars_init-file and use it as a template with an eye on the provided example. The example shows that the call of addchem to add chemical parameters is slightly different from the single-species setting, since the method allows that the species in the group differ in classification (phylum and class, as specified in the mydata-files, see above). This might have consequences for the specific densities of biomass (i.e. water content) and the nitrogen-wastes. If the phyla and classes of all species in the group are the same, the default values for these two parameters are again scalars, like in the single-species case. Notice that the default-settings for the chemical parameters can be overwritten in pars_init_group, also in the multi-species situation (using scalars or vectors). If the species do differ in water content, it is best to choose the specific cost for structure, [E_G], vector-valued and free.

The method also allows that the model types for the different species are different. In that case, the model-specification in pars_init_group must be a string of cells, e.g. metaPar.model = {'std','abj'}; for two species. Again, the length of the cell string should equal the number of species. If all models are the same a simple character string will do, as for the single-species situation, e.g. metaPar.model = 'std';, but a cell string of length one also works. The specification of the predict-files (see above) should be consistent with that of the model type. If the predict-files were copied from the AmP collection, please observe the model type that has been used in the corresponding pars_init files. These model-types are also given in the species-list of the AmP website. The model types are of relevance if filters are used (which is the default); these filters avoid that the parameters walk outside their logical domain during the estimation procedure, which substantially contribute to the stability of the estimation process.

The co-variation rules are specified by metaPar.covRules = 'no'; or metaPar.covRules = 'maturities';.

  • 'no': no relationships between different parameters are assumed. This is the default option, which is selected in the case that an explicit option setting of covRules is missing.
  • 'maturities': maturity levels are assumed to be proportional to the cubed zoom factors, using pets{1} as reference (pets are defined in the run_file, see below). This only makes sense if the zoom factor in the pars_init file is specified as a vector. For example: maturity at birth for pets{2} is that for pets{1} times the ratio of the cubed zoom factors for pets{2} and pets{1}. The free-setting of the maturity levels of all species depends on that for pets{1}, irrespective of the actual settings. Also the maturity settings themselves for species pets{2} and higher are ignored with this estim_options setting. The code first executes the 'no'-rule, then any other rule using overwrite. In case of conflict, the other rules take, therefore, priority. Even if the maturity levels in the pars_init file are scalar-valued, they will (generally) not stay the same for all species with 'maturities' as co-variation rule.

User-defined co-variation options can be added in function parGrp2Pets, which maps the parameter setting for the group, like in pars_init_group, to that for each species, where all parameters are scalar; just define a new case, using that for 'maturities' as example. The co-variation rules, like 'maturities', are always on top of the rules for 'no', and take priority in the case of any conflict.

Parameters can be specific for a particular species, as used in their predict-files. If a single species uses this parameter, for instance, the specification in pars_init_group can be either scalar (to be preferred), or vector-valued. In that latter case, the free-setting for species that do not use this parameter must be zero, i.e. these parameters are fixed, else they make a random walk during estimation that hampers conversion to the global minimum of the loss-function and takes unnecessary computation time. To avoid misunderstandings in the interpretation of the results, the values of parameters for species that don't use them can best be set to NaN (Not-a-Number).

Notice that, with the simplex method, it is not that easy to see the difference between the results of a random walk and a proper estimation for any particular parameter; if a changed parameter value results in the same MRE, the result was that of a random walk.

augmented loss function

If you specify one or more metaPar.weights fields in the pars_init_group file for particular parameters, these weights are applied to the variance as fraction of the squared mean for these parameters and the result is added the loss function that is minimized to arrive at parameter estimates. This construct allows to smoothly vary between no relationship between the parameter values of the group members (weights zero, which is the default weight value) and all group members have the same parameter value (a high weight). An example of such a setting is metaPar.weights.p_M = 5; metaPar.weights.v = 5;.

run-file

A single run-file (the name is free), which sets the global variable "pets" as cell-string, defining the group of species, e.g. pets = {'Daphnia_magna','Daphnia_pulex'};. The species names should exactly match the species-names in the mydata-files (see above). The general idea is again to copy-modify an existing single-species run-file and use it as a template with an eye on the provided example.

If you start multi-species estimation for the first time, use estim_options('pars_init_method', 2); and estim_options('results_output', 1);. This has the effect that pars_init_group is used to specify parameter values, file results_group.mat (see below) is written, and the result is printed on screen. After this start you might select estim_options('pars_init_method', 1); for continuation, meaning that you restart from the previous result. See all estim_options.

Results

Results of the estimation are stored in the file results_group.mat, while mat2par_init converts the contents to pars_init_group.m, like in the single-species situation. These resulting files also specify the covRules setting, even if this has not been done explicitly by the user. If the model specification as specified in pars_init_group is a cell-string (e.g., {'std', 'abj'}) rather then a character string (e.g., 'std'), the sequence of the parameters in the file that results from mat2par_init might differ from the original file. This is due to the fact that the models might be different and each model has particular core parameters.

The names pars_init_group and results_group.mat are fixed, so you need to park different group-estimations in different directories if you are working on more than one multi-species estimation.

General remarks

If you found the global minimum, you can copy the result, stored in results_group.mat to the pars_init_group.m file using mat2pars_init, just like the single-species case. Option setting 3 for results_output shows an html-page with all results for all species, which facilitates comparison. Option 4 combines the result with related species in the collection, just like the single-species case. The default plotting of figures can be overwritten with a custom_results_group file; an example is given in DEBtool_M/lib/pet.

The idea is that the multi-species estimation is done in a dedicated subdirectory, which has copies of all mydata, predict and pars_init files of the species in the group. Apart from the strategy to use a particular pars_init file as template for the pars_init_group file, it is possible to generate such a file, as well as the run_group file from the set of pars_init files with pars_init2mat; mat2pars_init. Function pars_init2mat uses all mydata and pars_init files in the subdirectory and if no pars_init_group files exists, it writes one. The resulting file still needs some editing, since all parameters are now vector-valued, while the crux of the exercise is that at least some parameters are the same for all species, to arrive at a reduction in the number of parameters that needs to be estimated.

The finding the global minimum of the loss-function rapidly becomes harder with the number of parameters to be estimated, since the number of local minima increases, and the rate of conversion generally decreases. Although no theoretical limitations apply to the maximum group size, this principle poses a soft ceiling. Generally, the worse is the best fit, the more rugged is the surface of the loss-function, the more local minima exist and the harder it becomes to find the global one. If you think you found a global minimum, you might try to restart from there to check that the parameters don't move again. This restart restores the size of the simplex, which generally shrinks during estimation, increasing the probability to arrive at a local minimum that is not global. (A simplex is a set of parameter-sets with a number of elements that is one more than the number of free parameters. One of the elements in the set is the specified initial parameter set, the seed, the others are generated automatically in its "neighbourhood". The simplex method tries to replace the worst parameter set by one that is better than the best one, i.e. gives a smaller value of the loss-function.) This is one of the reasons why a series of short runs with continuation, using estim_options('pars_init_method', 1); in the run-file (see below), is to be preferred above a single long run. We suggest to use estim_options('max_step_number',500);. With key `arrow-up' you can go back to your previous Matlab-command; pressing key `enter' you continue estimation. So it takes just two key-hits to continue estimation. Sometimes a small simplex size is required to reach the global minimum (e.g. to squeeze through the "key-hole"), motivating a larger maximum step number in the direct neighborhood of that minimum.

The MRE's and SMSE's of the species in the group must be higher than for each species separately, since fewer free parameters are used, giving further constraints on their values. The general idea of a successful multi-species estimation is that the decrease in number of estimated parameters more than compensates the decrease in goodness of fit, with the hope that the parameter values increase in realism (especially relevant for poorly-determined parameters).

Although multi-species parameter estimation can be scientifically important, the AmP data base has presently no system to accommodate the cross-linking between different entries. This means only results from single-species estimation can be submitted for publication in the AmP data base, since we want to maximize clarity about the step from data to parameter values.