Model selection criteria for dynamic brain PET studies

Background Several criteria exist to identify the optimal model for quantification of tracer kinetics. The purpose of this study was to evaluate the correspondence in kinetic model preference identification for brain PET studies among five model selection criteria: Akaike Information Criterion (AIC), AIC unbiased (AICC), model selection criterion (MSC), Schwartz Criterion (SC), and F-test. Materials and Methods Six tracers were evaluated: [11C]FMZ, [11C]GMOM, [11C]PK11195, [11C]Raclopride, [18F]FDG, and [11C]PHT, including data from five subjects per tracer. Time activity curves (TACs) were analysed using six plasma input models: reversible single-tissue model (1T2k), irreversible two-tissue model (2T3k), and reversible two-tissue model (2T4k), all with and without blood volume fraction parameter (V B). For each tracer and criterion, the percentage of TACs preferring a certain model was calculated. Results For all radiotracers, strong agreement was seen across the model selection criteria. The F-test was considered as the reference, as it is a frequently used hypothesis test. The F-test confirmed the AIC preferred model in 87% of all cases. The strongest (but minimal) disagreement across regional TACs was found when comparing AIC with AICC. Despite these regional discrepancies, same preferred kinetic model was obtained using all criteria, with an exception of one FMZ subject. Conclusion In conclusion, all five model selection criteria resulted in similar conclusions with only minor differences that did not affect overall model selection. Electronic supplementary material The online version of this article (10.1186/s40658-017-0197-0) contains supplementary material, which is available to authorized users.


Background
The full potential of a novel PET tracer can only be achieved when the uptake or clearance of the tracer can be extracted from the normal activity distribution of the dynamic PET data. To evaluate this, a pharmacokinetic model describing the tracer in vivo kinetics is necessary. With a kinetic model, the in vivo physiological behaviour of the tracer both in tissue and blood can be described. The model that best describes the kinetic behaviour of the tracer may be used to reproducibly and reliably quantify tracer distribution or binding. Finding the optimal model can, however, be a challenge.
Several criteria exist to identify the preferred kinetic model that provides the best fit to the time course of the tissue tracer concentration [1] , [2] , [3] , [4] , [5]. Each of these criteria ranks the quality of the model fit and thus provides a means for model selection. These criteria are all based on the goodness of fit (sum of squared differences between measured and estimated data, also termed residuals) and the complexity of the model (number of fit parameters). In addition, both number of frames and weighting factors applied to residuals are taken into account. The various criteria have substantial similarities in their equations and parameters used (Eqs. [1][2][3][4]. For example, Eq. 3 can be written as the difference between ln of nominator and denominator; this now is quite similar to other equations. The only difference is in the penalty parameter (added based on the complexity of model) in these equations. We therefore hypothesize that these criteria would result in (overall) comparable model preferences in actual clinical data. To the best our knowledge, a direct comparison of their performance in clinical PET brain studies has never been reported and the selection of the criterion used is frequently a matter of debate.
The purpose of the present study was to evaluate the similarity and/or difference in model preference for several commonly used criteria: Akaike Information Criterion (AIC) [6], [1], AIC unbiased (AIC C ) [7], [2], Model Selection Criterion (MSC) [8], [3], Schwartz Criterion (SC) [9], [4], and finally validate them against the F-test [10] , [5]. To cover a range of kinetics, these criteria were used to identify the preferred model for six tracers with known differences in kinetic (in vivo) properties.  Figure  S1 illustrates typical whole brain gray matter time activity curves (TACs) of the six tracers. A full description of the original studies has been reported previously for [ 11 C]FMZ [11], [ 11 C]PK11195 [12], [ 11 C]Raclopride [13], [ 18 F]FDG [14], [ 11 C]GMOM [15], and [ 11 C]PHT [1]. For each tracer, PET scan data from five subjects were included. Subjects were randomly selected from existing databases, and researchers had no information on group status (both healthy subjects and patients). Five subjects per tracer reflect the number of subjects often included in first-time studies examining tracer kinetics. Each of these subjects had a T1weighted magnetic resonance imaging (MRI) scan and a metabolite corrected arterial input function was also available. For scanner properties, attenuation correction, scan duration, and reconstruction methods, see Table 1. All scans were corrected for dead time, randoms, scatter, and decay. The original studies had all been approved by the Medical Ethics Review Committee of the VU University Medical Center. All subjects had provided written informed consent after complete explanation of the study procedures.

Methods
For anatomical delineation of volumes of interest (VOIs), MRI was used. Subjects' T1-weighted MRI scan was co-registered with the summed PET data using VINCI software (Cologne, Germany). The co-registered MRI scan was segmented into gray matter, white matter, and extra-cerebral fluid, and TACs were extracted using PVElab [16] in combination with the Hammers template [17]. Sixty-eight gray matter VOIs were delineated onto MRI, i.e. 68 TACs were extracted per subject. TACs were analysed using six plasma input models: a reversible single-tissue model (1T2k), an irreversible two-tissue model (2T3k), and a reversible two-tissue model (2T4k), all three both with and without additional blood volume parameter (V B ) [18].
For all six tracers, the preferred models across TACs were obtained using various where n represents the number of observations, p the number of parameters, w i the weight applied to residual of ith acquisition, y are the measured values (PET scan), ŷ are the predicted (fitted) values, y is the mean of the measured values, and t i time of ith acquisition. For the F-test, Q 1 represents the sum of squares for the model with p1 parameters, and Q 2 the sum of squares of model with p2 parameters, where p1 < p2 is required. w i (w i = 1/σ i 2 ) was estimated using the following equation [19]: where σ 2 represents the variances for each frame and is calculated based on the whole scanner trues counts (T); dcf is the decay correction factor, L represents frame length, and α is the proportionality constant signifying the variance level.
For, AIC, AIC C , and SC, a lower value implies a better fit. Therefore, the lowest value obtained from the model selection criterion indicates the preferred kinetic model, i.e. the model that provides the best fit to a TAC with the smallest number of model parameters. For MSC, a higher value implies a better fit and hence the highest value will be indicative of the most appropriate model. F-test directly compares two models and returns a F-statistic with p2-p1 and n-p2 degrees of freedom. If this F-statistic is larger than the tabulated value at a specified p value, the complex model has a better fit (with a significance of p < 0.05). Agreement of the model selection criteria was evaluated by calculating the percentage preference for a kinetic model across all TACs. This was performed for all subjects and for each model selection criteria and tracer. The F-test was considered as the reference for model comparison as it is a frequently used hypothesis test [20] , [21]. This F-test, however, is difficult to apply when comparing multiple models, subjects, TACs, and tracer studies. For example, in the present study, this would have resulted in 61,200 comparisons (30 comparisons per tracer per subject per VOI). Therefore, a more pragmatic approach was followed by first identifying the most likely plausible model based on known tracer kinetics, visual inspection of the fits, and AIC results, which can be calculated very efficiently. Next, the F-test was applied to determine if this model was the preferred one (per TAC) when compared to all other models. Finally, regional agreement/disagreement of AIC C , SC, and MSC to AIC and eventually F-test was evaluated. To examine if any existent disagreement between model selection criteria was a general effect or driven by a single subject, the frequency of model preferences per subject were also evaluated for each model selection criteria. Finally, in order to examine the effect of VOI size (assuming that smaller VOIs result in TACs with higher noise levels) on the model selection criteria, similar analyses were performed for VOIs with sizes smaller and larger than 5 mL separately.

Results
For all six radiotracers, strong agreement was observed between the different model selection criteria when examining the frequency of model preferences across all TACs (Fig. 1). Only one small deviation was observed for FMZ: AIC C preferred 2T3k_V B , whereas the other model selection criteria preferred 2T4k_V B . The reduced regional agreement for FMZ when comparing AIC with AIC C turned out to be an 18% difference in model preference. When examining within subject preference, it was shown that this difference was mainly driven by one subject. For this subject, 2T3k_V B was preferred according AIC C and F-test where AIC, MSC, and SC all preferred 2T4k_V B for this subject. For the other remaining four FMZ studies, all five criteria agreed on 2T4k_V B being the preferred model for FMZ. Even though all model selection criteria agreed on the preferred model across TACs (with exception of one FMZ study), slight disagreements between criteria did exist. The F-test confirmed the AIC preferred models within single subjects for 87% of all cases. AIC and MSC gave exactly the same model preferences for all the tracers studied. Regional disagreement of only 13 and 7% for AIC C and SC, respectively, with AIC was observed for [ 11 C]PK11195. In case of [ 11 C]Raclopride, regional disagreement of 22% when comparing AIC with AIC C and 10% between AIC and SC was observed. For [ 18 F]FDG, relatively low disagreement of 4 and 6% for AIC C and SC, respectively, was observed. A relatively low (4%) disagreement of AIC with both AIC C and SC (4%) was obtained for [ 11 C]GMOM. A regional disagreement of 16% for AIC C and 9% for SC with AIC was seen for [ 11 C]PHT. Despite regional differences, and between subject deviations, all model selection criteria showed strong agreement in model preference.
Effect of noise on the model preference can be observed in Figs. 2 and 3. For all the tracers, a variation in the model preference is observed with varying VOI size, which is primarily due to noise. However, the VOI size (either smaller or larger than 5 mL), in general, did not have a notable effect on the agreement of the model selection criteria (Figs. 2 and 3).

Discussion
The aim of the present study was to assess the overall agreement and/or disagreement in model preference of five commonly used model selection criteria. The model  [22], [23], [24].
In short, small differences between criteria exist with respect to the penalty applied for the number of parameters and observations (frames). AIC, AIC C , SC, and MSC all include an extra penalty term for the number of parameters, trying to account for the possibility of over-fitting the model. The penalty used in AIC C seems to be stronger than that used in SC and MSC when compared to AIC. AIC becomes strongly biased when the ratio of model parameters to observations (acquisition frames) increases, for which AIC C tries to correct. When the number of observations increases, AIC C converges to AIC. Due to the stronger penalty, AIC C seems to prefer the models with less parameters when the ratio is bigger. For [ 11 C]FMZ and [ 11 C]PHT, both with < 20 observations, AIC C indeed preferred the model with less parameters slightly more frequently compared with the other criteria. For [ 11 C]PHT, this was reflected in a Fig. 2 Model preference (percentage of all TACs ≥ 5 mL) per selection criterion for the six brain PET radiotracers regional disagreement between AIC and AIC C of 16% of the TACs. For FMZ, this resulted in AIC C identifying a different kinetic model when examining all VOIs across all subjects as illustrated in Fig. 1. However, this difference in model preference was mainly driven by a single subject where AIC C disagreed strongly with the other model selection criteria, which was confirmed by the F-test. For the remaining four subjects, AIC C and F-test preferred the same model for FMZ as did the other criteria.
For [ 11 C]PK11195, [ 11 C]raclopride, [ 18 F]FDG, and [ 11 C]FMZ, the F-test did not agree with AIC for only four individual (subject) cases (one subject from each tracer study). For [ 11 C]PHT and [ 11 C]GMOM, the F-test agreed with AIC in all cases. The overall agreement for all tracers and subjects was 87% between AIC and F-test. However, on a (tracer) study level, this did not affect the model preference for that tracer. The F-test is designed to directly compare two models and is especially suited for this purpose. AIC and MSC yielded identical results. Compared with AIC, MSC includes normalization and is therefore independent of scaling of the data points. The normalization incorporated in MSC, however, seems to have little effect for the data described in this study. Fig. 3 Model preference (percentage of all TACs < 5 mL) per selection criterion for the six brain PET radiotracers Even though the model selection criteria showed strong overlap, it is not always straightforward which model provides the best fit for the data. For example,~30% of all TACs clearly identified 2T4k_V B for [ 18 F]FDG, whereas the kinetics of all other TACs were described by 2T3k_V B .
A total of 68 gray matter VOIs were included per subject, implying that the analysis comprised of both receptor-rich and receptor-devoid VOIs (which varies depending on the target of interest/tracer) and thus the observed regional differences in model preferences (Figs. 1, 2, and 3). In addition to the receptor density, another reason for regional differences in the same subject for the same tracer could be the size of VOI. Smaller VOI tends to have lower counts and thereby higher noise, which makes it difficult to define the optimal pharmacokinetic model. Figure 2 illustrates the impact of noise on the model selection when using different model selection criteria. Noise could be a probable cause for the observed differences between Figs. 2 and 3.
Preferred model can be affected by several factors such as in vivo kinetics of the VOI (underlying receptor density), subject status, input function (particularly parent fraction and metabolites), tracer free fraction, motion, and even the scan duration. It should be noted that a description of tracer kinetics and the identification of the preferred kinetic model using model selection criteria is only one of the many evaluations needed to identify the optimal model. For example, in order to determine the optimal kinetic model, several other datasets and studies are required such as test-retest data, data for both healthy controls and diseased subjects, data correlating model results with pathology, and/or outcome (disease duration, cognitive scores, or survival). Also, the intended application, e.g. differential diagnosis or response assessment, may or may not allow for the use of simplified models or methods. Moreover, formulating a hypothesis on the preferred kinetic model based on physiological properties of the tracer, e.g. derived from preclinical studies, is advised.
The number of subjects included for each tracer might be a limitation for this study; however, typically first in man PET studies are limited in subject number and often restricted to healthy subjects. In these circumstances, use of model selection criteria provides a first indication of one or more suitable candidate kinetic models which should be further developed and evaluated. Even though both healthy controls and patients were included for the analysis, no impact on the conclusions is expected, as over a wide range of kinetics with quite consistent results in the performance of the various criteria was observed. Therefore, possible changes in tracer kinetics in healthy versus diseased subjects would likely have little impact on the performance of these criteria, but should be verified in individual cases.

Conclusions
All model selection criteria tested resulted in similar conclusions with only minor, non-relevant differences in overall observed model preference.

Additional file
Additional file 1: Figure S1. Typical whole brain gray matter TACs of all the six brain PET radiotracers. (TIFF 43 kb) Abbreviations