Keywords

Agent-based models; Individual-based models; Estimation; Calibration; Approximate Bayesian Computation; Random Forest; Generalized Additive Model; Bootstrap;

1 Introduction

1.1 Comparing estimation algorithms provides no winner

A mathematical model is a set of causal mechanisms connecting numerical variables. These mechanisms may depend on one or more parameters; some can be readily observed but many cannot. Because unobserved parameters affect the model output, it may be possible to identify their values by comparing the model output against what actually occurred in the data.
Estimation is the process of identifying parameters by comparing model output to data. Many estimation algorithms for simulation models have emerged in the past ten years (for a general review see Hartig et al. 2011; for an agent-based review see Thiele, Kurth, and Grimm 2014; for agent-based models in economics see Fagiolo et al. 2019; Platt 2020).

There are three limitations to the current estimation literature. First, papers that introduce new estimation algorithms tend to showcase their performance on a few idiosyncratic examples, so comparisons across methods remain difficult. Second, reviews that compare estimation algorithms tend to be small, focusing on a single field and only a few models and estimation algorithms. Third, existing reviews tend to mix together two steps: the processing of model outputs into useful summary statistics (or distance functions) and the actual algorithm used for estimation. The processing phase is very specific to each discipline, which makes it hard to apply lessons from one paper to agent-based models in another field.

Here we build a more thorough comparison of nine estimation algorithms across 41 simulation models (both agent-based and not). Our original objective was to pick the best estimation algorithm so that authors could default to it without worrying about the rest of the estimation literature. We establish instead that there is no best algorithm: both the absolute and relative performance are context-dependent.

The best performing estimation algorithm changes not just between models but sometimes even between parameters of the same model. Worse, even though the best algorithm is context-dependent, choosing the right one matters more than quadrupling the number of simulations. Worse still, every estimation algorithm fails entirely, in at least one case, to estimate a parameter that at least one other algorithm identified.

This dooms the hope of finding a single “best” estimation algorithm. More practically, it prevents agent-based authors from delegating to a literature review such as this one the task of picking the estimation algorithm for them. The cross-validation testing that we implement here needs to be repeated for any new model.

Critically, the same cross-validation testing that ranks algorithms can also be used to diagnose identification failures: the inability to recover the parameters’ values from data (Lewbel 2019; Canova and Sala 2009). We show that about half of the agent-based models tested have at least one unidentifiable parameter. Identification failures are common in agent-based models, but detecting them provides clues on how to solve them or at least understand their causes and consequences.

1.2 The challenge of estimating agent-based models

Two factors complicate the estimation of agent-based models. First, agent-based models tend to simulate complex systems with many moving parts and parameters. Second, it is almost always impossible to identify a likelihood function for an agent-based model (the only exception we know of is Monti, De Francisci Morales, and Bonchi 2020). A likelihood function is an equation connecting parameters with data, and it is useful because it operationalizes the estimation problem as a numerical maximization (change the parameters to maximize the likelihood).
For simple models we can substitute the unknown likelihood with a quasi- or pseudo-likelihood object and maximize that instead (Hooten, Wikle, and Schwob 2020 provide a good overview of this topic in the context of agent-based models). This avenue, however, remains too computationally expensive for most agent-based models.

Without a likelihood, the only alternative is to condense both model output and data into a set of comparable summary statistics.
The challenge in agent-based models is that there is an immense number of potential summary statistics to generate from extremely heterogeneous model outputs, such as maps, histograms and time series (Lee et al. 2015 reviews the complexities of agent-based outputs). In principle, having many summary statistics ought to be an advantage, since we have many dimensions along which to measure the discrepancy between data and simulation. In practice, however, many summary statistics provide noisy, useless or duplicate information and degrade estimation performance.

Subject expertise can sometimes be used to select the most important summary statistics (the list of recent developments in Fagiolo et al. 2019, for example, deals almost exclusively with this task), but the choice of the best summary statistics will often be arbitrary. An alternative is to start from a large set of summary statistics and then use statistical methods to pick the summaries and weight their information (see Carrella, Bailey, and Madsen 2020 for an agent-based model application; Blum et al. 2013 for an approximate Bayesian computation review; Jiang et al. 2017 for a neural network approach to discover summary statistics from simulated data). Even after choosing which summary statistics to use, however, we still need to choose the right estimation algorithm: the procedure that maps summary statistics back to the parameters that generated them.

1.3 The qualities we look for in an estimation algorithm

The theoretical literature on simulation inference, inherited from economics and in particular the indirect inference tradition (Gourieroux, Monfort, and Renault 1993; Smith 2008; Grazzini and Richiardi 2015), is concerned with asymptotics and in particular with consistency. Consistency is achieved when the estimation algorithm converges to the real value as the data used to train it (both real raw data and number of simulation runs) grows to infinity. Zhao (2010) shows that two conditions are sufficient for consistency. First, no equifinality: two different parameter inputs cannot produce the same model output. Second, once we fix a parameter input and run the model for infinitely many time steps and replications, all summary statistics must converge (that is, there cannot be summary statistics that never “settle,” or that settle at different values for different runs).

This theoretical contribution remains largely ignored in the applied agent-based literature. There are three justifications for this. First, consistency conditions are probably violated by many agent-based models as equifinality is common (Poile and Safayeni 2016; Williams et al. 2020) and many summary statistics are either generated by non-stationary dynamics (Grazzini and Richiardi 2015) or by distributions whose sample moments do not converge, particularly power laws (LeBaron 2001; R. Axtell 1999).
Second, it is often impossible to test whether the consistency conditions are violated. Third, asymptotic results hold little appeal for applied work facing limited data and a finite computational budget. We are usually interested in the performance we can achieve for the problem at hand rather than guaranteed consistency for infinite data we will never collect.

Applied work looks instead for three qualities in an estimation algorithm. First, we want to achieve good accuracy: estimate parameters as close as possible to the ones that generated the data we observe. Second, we want to achieve good coverage: estimate the smallest range of values that includes the real parameters with the correct pre-specified probability (confidence level). Third, we want high estimation efficiency: achieve the previous two objectives with the fewest runs, since agent-based models are expensive to simulate.

The thesis of this paper is that a fourth practical quality should be prioritized instead: testing efficiency. We should prefer estimation algorithms that can cheaply measure their own accuracy and coverage in any given context. We can test any estimation algorithm by running a simulation with known parameters, treating its output as if it were the real data, and then asking the estimation algorithm to recover the parameters we started with. While the testing technique is universal, its computational costs differ between algorithms.
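
As a minimal sketch of this self-test, consider a toy one-parameter model whose only summary statistic is a sample mean; the `run_model` function below is a hypothetical stand-in for any simulation model:

```r
# Self-test sketch: a toy "model" where data ~ Normal(theta, 1) and the summary
# statistic is the sample mean; run_model stands in for any simulation model.
set.seed(1)
run_model   <- function(theta) mean(rnorm(100, mean = theta))
true_theta  <- runif(1, 0, 10)        # parameter we pretend not to know
pseudo_data <- run_model(true_theta)  # treat this output as if it were the real data
theta_hat   <- pseudo_data            # trivial "estimation algorithm": the sample mean itself
abs(theta_hat - true_theta)           # how close did the algorithm get?
```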

1.4 Why this paper focuses exclusively on reference table algorithms

There are two main families of estimation algorithms: rejection-based and search-based algorithms. Rejection algorithms repeatedly run a model with random parameters until a stopping condition is reached. Reference table algorithms (Cornuet et al. 2008) are the subset of rejection algorithms where the stopping condition is simply having run the model a fixed number of times.

Rejection algorithms are inefficient because many runs will produce output far from the real data. Search-based algorithms then replace random sampling with minimizing a distance function between model output and data (Calvez and Hutzler 2006; Sisson, Fan, and Tanaka 2007; Pietzsch et al. 2020). This approach increases estimation efficiency.

The estimation efficiency of search-based methods becomes a liability when testing, however. Search algorithms explore only the part of the parameter space closest to the original data-set, while testing requires them to minimize the distance to many new targets. Search-based algorithms have no alternative but to restart their minimization for every new test. The computational costs of testing search-based algorithms quickly become astronomical (a point well demonstrated in Platt 2020).

Testing reference table algorithms, by contrast, requires no new simulation runs. The original simulation runs were already spread out across the parameter space, and the same runs used to estimate the parameters of the real data-set can be used to estimate the parameters in any new test. The advantage is even greater when we want to rank the performance of many estimation algorithms, because we can recycle the simulation output used to train one reference table algorithm to perform the same estimation with any other. In contrast, because search-based algorithms control the trajectory of parameters fed into the model, we need to re-run the model for each search-based algorithm we want to rank.

A numerical comparison may help. Imagine producing a testing set of 500 runs with known input parameters. We want to rank five alternative algorithms by their ability to re-discover the known parameters given a computational budget of 1,000 simulations per estimation. Testing five search-based algorithms requires running the simulation model 2,500,000 times: 1,000 runs for each of the five search algorithms for each of the 500 items in the testing set.
Comparing five reference table methods requires only 1,000 runs in total, producing a single training data-set that is shared across algorithms and re-used for every item of the testing set.

2 Materials and Methods

We define a simulation model here as any function that depends on a set of parameters \(\theta\) to generate a set of summary statistics \(S(\theta)\). We are interested in the estimation problem where we observe summary statistics \(S^*\) and we want to know which parameters \(\theta^*\) most likely generated them.

We estimate the parameters of 41 simulation models (described in section 2.1 and the appendix) using nine estimation algorithms (described in section 2.2). All are “reference table” algorithms: algorithms whose only input for estimation is a table of simulation parameters \(\theta\), selected by random sampling, and the summary statistics \(S(\theta)\) they generate. We split this reference table into training and testing sets and ask each algorithm to estimate the parameters of the testing set while observing only the training runs.

We want to measure the quality of an estimation algorithm along two dimensions: point-prediction “performance” and confidence interval “coverage.” We measure point-prediction performance as:

\[\begin{equation} \text{Performance} = 1 - \frac{ \sqrt{\sum_j \left( \theta^*_j - \hat \theta_j \right)^2}}{ \sqrt{\sum_j \left( \theta^*_j - \bar \theta \right)^2}} \end{equation}\]

where \(\hat \theta\) is the estimated parameter, \(\theta^*\) is the real hidden parameter, \(\bar \theta\) is the average parameter value in the training data and \(j\) indexes the rows of the testing data-set we are estimating. In other words, performance measures how much more accurate (in root mean square error terms) the estimation is compared to just guessing the average parameter value without performing any estimation. Performance ranges from 1 (perfectly estimated) through 0 (unidentified) to negative values (mis-identified). Without square roots this is equal to predictivity (Salle and Yıldızoğlu 2014) and modelling efficiency (Stow et al. 2009).
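
As a sketch, the metric translates directly into R; here `theta_true` and `theta_hat` are vectors of real and estimated values over the testing rows, and `theta_bar` is the training-set average:

```r
# Point-prediction performance of an algorithm for one parameter.
performance <- function(theta_true, theta_hat, theta_bar) {
  rmse_model <- sqrt(sum((theta_true - theta_hat)^2))  # error of the estimates
  rmse_naive <- sqrt(sum((theta_true - theta_bar)^2))  # error of guessing the average
  1 - rmse_model / rmse_naive
}
```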

We define coverage as in Raynal et al. (2018): the percentage of times the real parameter falls within the 95% prediction interval suggested by the estimating algorithm. The best coverage is 95%: lower coverage means intervals that are too narrow (type I errors), higher coverage means intervals that are too wide (type II errors).
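
Coverage can be computed in the same spirit (a sketch; `lower` and `upper` are the bounds of the 95% prediction interval produced for each testing row):

```r
# Share of testing rows whose 95% prediction interval contains the real parameter.
coverage <- function(theta_true, lower, upper) {
  mean(theta_true >= lower & theta_true <= upper)
}
```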

2.1 Models

We estimate the parameters of 41 separate simulation models. We selected them either because they appeared as examples in at least one other estimation paper (20 models) or because they were open-source agent-based models (21 models) available in the COMSES model library (Rollins et al. 2013). We can roughly categorize these models into four groups: simple, ill-posed, complicated and agent-based models. Table 2.1 lists them all; the appendix provides a brief description of each.

Table 2.1: List of all estimated models

| Experiment | No. of parameters | No. of summary statistics | No. of simulations | Testing |
|---|---|---|---|---|
| \(\alpha\)-stable | 3 | 11 | 1,250 or 5,000 | 5-fold CV |
| Anasazi ABM | 4 | 28 | 1,250 or 5,000 | 5-fold CV |
| Birds ABM | 2 | 2 or 105 | 5,000 | 5-fold CV |
| Bottom-up Adaptive Macroeconomics ABM | 8 | 180 | 1,250 or 5,000 | 5-fold CV |
| Broken Line | 1 | 10 | 1,250 or 5,000 | 5-fold CV |
| Coalescence | 2 | 7 | 100,000 | Single testing set |
| COVID-19 US Masks ABM | 4 | 51 | 1,250 or 5,000 | 5-fold CV |
| Earthworm | 11 | 160 | 100,000 | Single testing set |
| Ecological Traits | 4 | 4 | 1,250 or 5,000 | 5-fold CV |
| Ebola Policy ABM | 3 | 31 | 1,250 or 5,000 | 5-fold CV |
| FishMob ABM | 5 | 104 | 1,250 or 5,000 | 5-fold CV |
| Food Supply Chain ABM | 5 | 99 | 1,250 or 5,000 | 5-fold CV |
| Ger Grouper ABM | 4 | 41 | 1,250 or 5,000 | 5-fold CV |
| \(g\)-and-\(k\) distribution | 4 | 11 | 1,250 or 5,000 | 5-fold CV |
| Governing the Commons ABM | 4 | 44 | 1,250 or 5,000 | 5-fold CV |
| Hierarchical Normal Mean | 2 | 61 | 1,250 or 5,000 | 5-fold CV |
| Intra-Organizational Bandwagon ABM | 2 | 43 | 1,250 or 5,000 | 5-fold CV |
| Insulation Activity ABM | 3 | 45 | 1,250 or 5,000 | 5-fold CV |
| Lotka-Volterra | 2 | 16 (noisy or non-noisy) | 100,000 | Single testing set |
| Locally Identifiable | 2 | 2 | 1,250 or 5,000 | 5-fold CV |
| Moving Average (2) | 2 | 2 | 1,250 or 5,000 | 5-fold CV |
| Median and MAD | 2 | 2 or 4 | 1,250 or 5,000 | 5-fold CV |
| Multilevel Selection ABM | 4 | 44 | 1,250 or 5,000 | 5-fold CV |
| \(\mu\)-\(\sigma^2\) | 2 | 2 | 10,000 | 5-fold CV |
| NIER ABM | 4 | 60 | 1,250 or 5,000 | 5-fold CV |
| Normal 25 | 2 | 25 | 1,250 or 5,000 | 5-fold CV |
| Pathogen | 4 | 11 | 200,000 | Single testing set |
| Partially Identifiable | 2 | 2 | 1,250 or 5,000 | 5-fold CV |
| Peer Review Game ABM | 5 | 77 | 1,250 or 5,000 | 5-fold CV |
| OfficeMoves ABM | 3 | 36 | 1,250 or 5,000 | 5-fold CV |
| RiskNet ABM | 4 | 40 | 1,250 or 5,000 | 5-fold CV |
| Real Business Cycle | 6 | 44 or 48 | 2,944 or 2,961 | 5-fold CV |
| Scale | 2 | 1 | 1,250 or 5,000 | 5-fold CV |
| Schelling-Sakoda Extended ABM | 3 | 77 | 1,250 or 5,000 | 5-fold CV |
| Standing Ovation ABM | 3 | 20 | 1,250 or 5,000 | 5-fold CV |
| Sugarscape ABM | 5 | 146 | 1,250 or 5,000 | 5-fold CV |
| Unidentifiable | 2 | 1 | 1,250 or 5,000 | 5-fold CV |
| Toy Model | 2 | 2 | 1,250 or 5,000 | 5-fold CV |
| Two-factor Theory ABM | 4 | 41 | 1,250 or 5,000 | 5-fold CV |
| Wilkinson | 1 | 1 | 1,250 or 5,000 | 5-fold CV |
| Wolf Sheep Predation ABM | 7 | 33 | 1,250 or 5,000 | 5-fold CV |

Simple simulation models have few parameters and summary statistics. They feature prominently in the ABC literature both as a teaching tool and to compare different algorithms. They are useful because they run quickly but they may bias comparisons towards simpler estimation algorithms.

Ill-posed simulation models face clear identification issues: the inability to recover parameters given the information we have. There are many sources of identification issues and each ill-posed model highlights one particular form. A good estimation algorithm facing an ill-posed problem should display two features. First, we would like it to maximize the quality of the estimated parameters when the information is available but noisy (the lesser problem of “weak” identification). Second, we would like it to recognize when the model cannot be identified and return wide confidence intervals, signalling estimation uncertainty.

We split the complicated simulation models into two sets: agent-based models and other complicated simulations. Both face similar problems: they tend to be large and involve many input parameters and summary statistics. From an estimation point of view there is no qualitative difference between the two, but in practice agent-based models tend to be slower and produce more summary statistics.

2.2 Algorithms

We test nine reference table algorithms to parametrize simulations: five are ABC (Approximate Bayesian Computation) algorithms and four are regression-only. We ignored search-based algorithms, such as synthetic likelihood (Wood 2010; Fasiolo and Wood 2014), ABC-MCMC (Hartig et al. 2011) and Bayesian optimization (Snoek, Larochelle, and Adams 2012). We also ignored regression-only algorithms that do not generate prediction intervals, such as the deep neural networks proposed in Creel (2017) and the elastic nets proposed in Carrella, Bailey, and Madsen (2018).

All reference table algorithms share a common estimation procedure. First, we run the model “many” times and collect the random input parameters and the summary statistics the model outputs into a “reference table.” The estimation algorithm then produces a rule that generalizes the information in the reference table to go from summary statistics back to the parameters that generated them. Finally, we plug the real summary statistics vector \(S^*\) we observe from the data into this rule and obtain the estimated parameters.
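
A sketch of the reference-table construction, assuming a hypothetical `run_model` function that maps a parameter vector to a vector of summary statistics for a two-parameter model:

```r
# Reference table: random input parameters plus the summary statistics they generate.
n_runs <- 5000
theta  <- data.frame(theta1 = runif(n_runs, 0, 1),    # parameters drawn from their priors
                     theta2 = runif(n_runs, 0, 10))
S      <- t(apply(theta, 1, run_model))               # one row of summary statistics per run
reference_table <- data.frame(theta, S)
# An algorithm trained on reference_table is then queried at the observed S* to
# return the estimated parameters.
```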

2.2.1 ABC

The first algorithm we use is simple rejection ABC (Pritchard et al. 1999; Beaumont, Zhang, and Balding 2002). Start by ranking all training observations by their Euclidean distance to the testing summary statistics, \(\sqrt{\sum_i\left( S_i(\theta) - S^*_i \right)^2}\). Ignore all training observations except the closest 10%. The distribution of \(\theta\) parameters from the closest training observations becomes the posterior of the estimate \(\theta^*\).
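
In base R, the rejection step can be sketched as follows (assuming `theta` and `S` are the training parameters and summary statistics from the reference table above, and `S_star` is the observed vector \(S^*\)):

```r
# Simple rejection ABC: keep the 10% of training runs closest to S_star.
distances <- sqrt(colSums((t(S) - S_star)^2))        # Euclidean distance of each run
keep      <- distances <= quantile(distances, 0.10)  # accept the closest 10%
posterior <- theta[keep, , drop = FALSE]             # accepted parameters ~ posterior of theta*
```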

The second algorithm is local-linear regression adjusted ABC (Beaumont, Zhang, and Balding 2002). Weight all training observations with an Epanechnikov kernel whose bandwidth equals the Euclidean distance between the testing summary statistics \(S^*\) and the furthest \(S(\theta)\) we would have accepted using simple rejection. Then run a local-linear regression on the weighted training set to estimate \(\theta^*\) as the predicted \(E[\theta |S(\theta)]\), using the residuals of that regression to estimate its posterior distribution.

The third algorithm, neural network ABC, feeds the same weighted training set into a feed-forward neural network (Blum and Francois 2010). The approach is similar to the local-linear regression above, but the residuals are also weighted by a second regression (on the log squared residuals) to correct for heteroskedasticity.

These three algorithms are implemented in the abc package (Csilléry, François, and Blum 2012) in R. We used the package default settings for its neural networks (10 networks, 5 units in the hidden layer, and weight decay randomly chosen for each network from \(0.0001\), \(0.001\) and \(0.01\)).
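
For reference, a sketch of how these three variants are typically invoked through the abc package (using the same `theta`, `S` and `S_star` as above; `tol = 0.1` keeps the closest 10% of runs):

```r
library(abc)
fit_rejection <- abc(target = S_star, param = theta, sumstat = S,
                     tol = 0.1, method = "rejection")
fit_loclinear <- abc(target = S_star, param = theta, sumstat = S,
                     tol = 0.1, method = "loclinear")
fit_neuralnet <- abc(target = S_star, param = theta, sumstat = S,
                     tol = 0.1, method = "neuralnet")
```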

The fourth and fifth algorithms are semi-automatic ABC methods, which “pre-process” summary statistics before applying rejection ABC (Prangle et al. 2014). More precisely, the original summary statistics \(S(\theta)\) are fed into a set of linear regressions estimating \(r_i=E[\theta_i|S(\theta)]\) (one for each parameter of the model) and the fitted values are used as summary statistics for the simple rejection ABC. The rationale is that these regressions project the summary statistics into a space where rejection ABC performs better. We do this in two different ways here: by running first- or fourth-degree linear regressions in the pre-processing phase. This is done using the R package abctools (Nunes and Prangle 2015) and its default parameters: using half of the training set to run the regression and the other half to run the rejection ABC.
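
The pre-processing idea can be sketched manually for the first-degree case (a sketch only; abctools wraps this and, by default, splits the training set in half as described above):

```r
# Semi-automatic ABC sketch: regress each parameter on the original summary
# statistics and use the fitted values as the new summary statistics.
train      <- data.frame(theta1 = theta$theta1, as.data.frame(S))
fit        <- lm(theta1 ~ ., data = train)                # r_1 = E[theta_1 | S(theta)]
S_new      <- fitted(fit)                                 # projected training summaries
obs        <- setNames(as.data.frame(t(S_star)), colnames(as.data.frame(S)))
S_star_new <- predict(fit, newdata = obs)                 # projected observed summaries
# S_new and S_star_new (one per parameter) then replace S and S_star in the
# simple rejection ABC above.
```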

A feature of all ABC methods is that they are local: they remove or weight training observations differently depending on \(S^*\) (the “real” summary statistics). This means that during cross-validation we need to retrain each ABC for each row of the testing set.

2.2.2 Regression Only

Estimating parameters by regression is a straightforward process. We build a separate regression \(r\) for each \(\theta\) in the reference table as the dependent variable, using the summary statistics \(S(\theta)\) as the independent variables. We plug the real summary statistic \(S^*\) into each regression and the predicted value is the estimated parameter \(\theta^*\).

The simplest algorithm of this class is a linear regression of degree one. It is interpretable and fast to compute. This speed allows us to estimate the prediction interval of \(\theta^*\) by a resampling bootstrap (Davison and Hinkley 1997): we produce 200 bootstrap data sets and run the same linear regression on each one. From each regression \(i\) we collect its prediction \(\beta_i S(\theta^*)\) and sample one standardized residual \(e_i\) (a residual divided by the square root of one minus the hat value associated with that residual). This produces a set of 200 values \(\beta_i S(\theta^*) + e_i\). The 95% prediction interval is then defined by the 2.5th and 97.5th percentiles of this set.

In practice, then, predictions are distributed according to the formula: \[ r(S)+A+B \] where \(r(S)\) is the regression prediction, \(A\) is a standard error adjustment due to uncertainty about the estimated coefficients (in this case \(S(\hat \beta - \beta_i)\), where \(\hat \beta\) are the original OLS estimated coefficients and \(\beta_i\) is a bootstrap estimate of the same) and \(B\) is an adjustment due to irreducible noise (in this case, a random sample of standardized residuals).
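
A base-R sketch of this bootstrap interval for one parameter, reusing the `train` and `obs` objects from the sketches above:

```r
# Resampling-bootstrap 95% prediction interval for a first-degree linear regression.
fit     <- lm(theta1 ~ ., data = train)
mod_res <- residuals(fit) / sqrt(1 - hatvalues(fit))     # residual / sqrt(1 - hat value)
boot_preds <- replicate(200, {
  rows  <- sample(nrow(train), replace = TRUE)           # one bootstrap data set
  fit_b <- lm(theta1 ~ ., data = train[rows, ])
  predict(fit_b, newdata = obs) + sample(mod_res, 1)     # beta_i S(theta*) + e_i
})
quantile(boot_preds, c(0.025, 0.975))                    # 2.5th and 97.5th percentiles
```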

A more complex algorithm that is not linear but still additive is the generalized additive model (GAM), where we regress: \[ \hat \theta = \sum_i s_i(S_i(\theta)) \] where \(s_i\) is a smooth spline transformation (see chapter 9 in Hastie, Tibshirani, and Friedman 2009; also Wood and Augustin 2002). We use the mgcv R package (Wood 2017, 2004). The bootstrap prediction interval we built for the linear regression is too computationally expensive to replicate with GAMs. Instead we produce prediction intervals by assuming normal standard errors (generated by the regression itself) and by resampling residuals directly: we generate 10,000 draws of \(z(S(\theta))+\epsilon\), where \(z\) is normally distributed with standard deviation equal to the regression’s standard error at \(S(\theta)\) and \(\epsilon\) is a randomly drawn residual of the original regression.
The 95% prediction interval for \(\theta^*\) is then defined by the 2.5th and 97.5th percentiles of the generated \(z(S(\theta^*))+\epsilon\) set.
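
A sketch with mgcv, assuming for concreteness that the reference table has two summary statistics named `S1` and `S2`:

```r
library(mgcv)
fit   <- gam(theta1 ~ s(S1) + s(S2), data = train)       # smooth spline terms s_i
pred  <- predict(fit, newdata = obs, se.fit = TRUE)      # estimate and its standard error
draws <- rnorm(10000, mean = pred$fit, sd = pred$se.fit) +   # z(S(theta*))
  sample(residuals(fit), 10000, replace = TRUE)              # + resampled residual
quantile(draws, c(0.025, 0.975))                         # 95% prediction interval
```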

A completely non-parametric regression, advocated in Raynal et al. (2018), is the random forest (Breiman 2001). We implement it in two ways. First, as a quantile random forest (Meinshausen 2006), using the quantregForest R package (Meinshausen 2017); prediction intervals for any simulation parameter \(\theta^*\) are the predicted 2.5th and 97.5th quantiles at \(S(\theta^*)\). Second, as a regression random forest using the ranger and caret packages in R (Wright and Ziegler 2015; Kuhn 2008). For this method we generate prediction intervals as in the GAM regressions: we generate 10,000 draws of \(z(S(\theta))+\epsilon\), where \(z\) is normally distributed with standard deviation equal to the infinitesimal jackknife standard error (Wager, Hastie, and Efron 2014) at \(S(\theta)\) and \(\epsilon\) is a resampled residual; we then take the 2.5th and 97.5th percentiles of the \(z(S(\theta^*))+\epsilon\) set as our prediction interval.
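
A sketch of both random-forest variants (the infinitesimal-jackknife standard errors come from ranger's `type = "se"`, `se.method = "infjack"` prediction option; the residual resampling mirrors the GAM recipe above and is our own illustrative choice):

```r
library(quantregForest)
library(ranger)

# Quantile random forest: the interval is read off the 2.5% and 97.5% quantiles.
qrf <- quantregForest(x = train[, -1], y = train$theta1)   # assumes theta1 is column 1
predict(qrf, newdata = obs, what = c(0.025, 0.975))

# Regression random forest: normal draws around the prediction, with sd equal to the
# infinitesimal-jackknife standard error, plus a resampled out-of-bag residual.
rf    <- ranger(theta1 ~ ., data = train, keep.inbag = TRUE)
p     <- predict(rf, data = obs, type = "se", se.method = "infjack")
res   <- train$theta1 - rf$predictions                     # out-of-bag residuals
draws <- rnorm(10000, mean = p$predictions, sd = p$se) +
  sample(res, 10000, replace = TRUE)
quantile(draws, c(0.025, 0.975))
```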

3 Results

3.1 Point performance

Table 3.1 summarises the performance of each algorithm across all identifiable estimation problems (here defined as those where at least one algorithm achieves a performance of 0.1 or above). Estimation by random forests achieves the highest average performance and the lowest regret (the average distance between its performance and the highest performance in each simulation). Even so, regression random forests produce the best point predictions for only 77 out of 226 identifiable parameters. GAM regressions account for another 69 best point predictions.

Local-linear and neural-network ABC are also useful, achieving the highest performance for 47 parameters. Local-linear regressions face numerical issues when the number of summary statistics increases and were often unable to produce any estimate. The performance of neural network ABC could be improved further by tuning its hyper-parameters, but this would quickly accrue high computational costs.
The appendix contains a table with the performance for all parameters generated by each algorithm.

Table 3.1: For each algorithm: the number of parameters it estimated best; regret, defined as the median % loss between the algorithm's performance and the best performance in each estimation; and the overall median performance. Only estimations for which at least one method achieved performance above 0.05 were considered.

| Algorithm | # of times highest performance | Regret | Median performance |
|---|---|---|---|
| Rejection ABC | 2 | -0.518 | 0.240 |
| Semi-automatic ABC 4D | 1 | -0.256 | 0.437 |
| Semi-automatic ABC 1D | 4 | -0.310 | 0.388 |
| Local-linear ABC | 16 | -0.083 | 0.560 |
| Neural Network ABC | 33 | -0.103 | 0.502 |
| Linear Regression | 11 | -0.232 | 0.399 |
| GAM | 69 | -0.034 | 0.518 |
| Quantile Random Forest | 14 | -0.039 | 0.556 |
| Regression Random Forest | 77 | -0.009 | 0.566 |

It is important to note that the advantages of random forests become apparent only when estimating agent-based models. Simpler estimation algorithms perform just as well or better on smaller problems, as table 3.2 shows.
This implies that interactions between summary statistics, as well as the need to weight them carefully, matter more when estimating agent-based models than simpler simulations, justifying the additional algorithmic complexity.

Table 3.2: Number of times each algorithm had the highest performance when estimating a parameter, arranged by type of simulation problem

| Algorithm | ABM | Complicated | Ill-posed | Simple |
|---|---|---|---|---|
| Rejection ABC | 0 | 0 | 1 | 1 |
| Semi-automatic ABC 4D | 0 | 0 | 0 | 1 |
| Semi-automatic ABC 1D | 3 | 0 | 0 | 1 |
| Local-linear ABC | 1 | 3 | 1 | 11 |
| Neural Network ABC | 22 | 7 | 0 | 4 |
| Linear Regression | 3 | 0 | 4 | 4 |
| GAM | 39 | 10 | 9 | 11 |
| Quantile Random Forest | 11 | 2 | 0 | 1 |
| Regression Random Forest | 58 | 8 | 1 | 10 |

Another important qualification is that random forests tend to perform better for parameters whose maximum performance is below 0.3. This is intuitive, because we expect non-linear regressions to function even when summary statistics are noisy and uninformative, but it also means that many of the parameters best estimated by random forests remain only barely identified. We show this in table 3.3.

Table 3.3: Number of times each algorithm had the highest performance when estimating a parameter, arranged by the best performance achieved by any algorithm for that parameter

| Algorithm | 0.1-0.3 | 0.3-0.5 | 0.5-0.7 | 0.7-1 |
|---|---|---|---|---|
| Linear Regression | 2 | 1 | 2 | 6 |
| Local-linear ABC | 3 | 3 | 3 | 7 |
| GAM | 10 | 7 | 20 | 32 |
| Neural Network ABC | 2 | 9 | 9 | 13 |
| Rejection ABC | 1 | 0 | 1 | 0 |
| Quantile Random Forest | 0 | 1 | 7 | 6 |
| Regression Random Forest | 23 | 9 | 26 | 19 |
| Semi-automatic ABC 1D | 1 | 3 | 0 | 0 |
| Semi-automatic ABC 4D | 0 | 0 | 1 | 0 |

Table 3.4 lists the number of identification failures: parameters for which an algorithm's performance was below 0.1 while at least one competing algorithm achieved a performance of 0.3 or above. In other words, we tabulate cases where the parameter can be identified but an estimation algorithm fails to do so. Local-linear regression struggled with the hierarchical normal mean simulation. Linear regression failed to estimate the \(b\) parameter of the Lotka-Volterra models, the \(\sigma\) parameter of the normal distribution and the \(A\) parameter of the ecological traits model. Random forests failed to identify \(\mu\) and \(\delta\) in the RBC macroeconomic model. GAM regressions, despite often being the best estimation algorithm, failed to identify 10 parameters, all in agent-based models (particularly in “Sugarscape” and “Wolf Sheep Predation”).

Table 3.4 also shows mis-identifications: cases where an algorithm estimates significantly worse than using no algorithm at all (performance below -0.2). These are particularly dangerous failures because the estimation algorithm “thinks” it has found a pattern that proves to be worse than assuming white noise. Mis-identification applies asymmetrically: it is more common for the more complicated approximate Bayesian computations and for simple regressions.

Table 3.4: Number of identification failures and mis-identifications for each algorithm. Identification failures are parameters where the algorithm had performance below 0.1 but at least one other algorithm had performance above 0.3. Mis-identification occurs whenever an algorithm achieves negative performance below -0.2, i.e. it estimates significantly worse than just predicting the average.

| Algorithm | Identification failures | Mis-identifications |
|---|---|---|
| Rejection ABC | 32 | 2 |
| Semi-automatic ABC 4D | 2 | 0 |
| Semi-automatic ABC 1D | 2 | 0 |
| Local-linear ABC | 9 | 8 |
| Neural Network ABC | 1 | 5 |
| Linear Regression | 13 | 17 |
| GAM | 13 | 3 |
| Quantile Random Forest | 3 | 0 |
| Regression Random Forest | 3 | 0 |

Figure 3.1 compares algorithms pairwise with respect to their performance. Even the “best” algorithm, the random forest, has only a 52% chance of doing better than GAMs and performs worse than neural-network ABC for 30% of the parameters. The humble first-degree linear regression (even after accounting for its identification failures and mis-identifications) wins more than half of the comparisons against any ABC except neural networks.