Evaluation of PET quantitation accuracy among multiple discovery IQ PET/CT systems via NEMA image quality test

Introduction Quantitative imaging biomarkers are becoming usual in oncology for assessing therapy response. The harmonization of image quantitation reporting has become of utmost importance due to the multi-center trials increase. The NEMA image quality test is often considered for the evaluation of quantitation and is more accurate with a radioactive solid phantom that reduces variability. The goal of this project is to determine the level of variability among imaging centers if acquisition and imaging protocol parameters are left to the center’s preference while all other parameters are fixed including the scanner type. Methods A NEMA-IQ phantom filled with radioactive 68Ge solid resin was imaged in five clinical sites throughout Europe. Sites reconstructed data with OSEM and BSREM algorithms applying the sites’ clinical parameters. Images were analyzed according with the NEMA-NU2-2012 standard using the manufacturer-provided NEMA tools to calculate contrast recovery (CR) and background variability (BV) for each sphere and the lung error (LE) estimation. In addition, a 18F-filled NEMA-IQ phantom was also evaluated to obtain a gauge for variability among centers when the sites were provided with identical specific instructions for acquisition and reconstruction protocol (the aggregate of data from 12 additional sites is presented). Results The data using the 68Ge solid phantom showed no statistical differences among different sites, proving a very good reproducibility among the PET center models even if dispersion of data is higher with OSEM compared to BSREM. Furthermore, BSREM shows better CR and comparable BV, while LE is slightly reduced. Two centers exhibit significant differences in CR and BV values for the 18F NEMA NU2-2012 experiments; these outlier results are explained. Conclusion The same PET system type from the various sites produced similar quantitative results, despite allowing each site to choose their clinical protocols with no restriction on data acquisition and reconstruction parameters. BSREM leads to lower dispersion of quantitative data among different sites. A solid radioactive phantom may be recommended to qualify the sites to perform quantitative imaging.


Introduction
The current clinical trend is to use PET/CT as a quantitative biomarker in oncology for assessing the reliable evaluation of lesion's uptake changes over time. The standardized uptake value (SUV) is still the standard quantitative index in positron emission tomography (PET), but new surrogates are now reported as references like response criteria in solid tumors (PERCIST) [1], metabolic active tumor volumes (MATV), or total lesion glycolysis (TLG) [2].
For all these new quantitative indexes, long-term accuracy and reproducibility are mandatory [3,4]. Nevertheless, these parameters are known to be strongly influenced by various factors such as injected activity, uptake and acquisition time, reconstruction parameters, and lesion's dimension and localization [5,6]. The technological advances and design improvements in PET systems lead to higher sensitivities and are now associated with advanced reconstruction algorithms including accurate corrections applied during the iterative process (e.g., point spread function (PSF), time of flight (TOF), block sequential regularized expectation maximization (BSREM) for noise regularization). As a result, this large number of parameters lead to variations observed in quantitative measurements [7]. The need for harmonization in quantitative reporting has become of the utmost importance due to the increasing number of multi-center clinical trials. Several protocols are available to limit the differences among the scans outcome, from the patient preparation to the final report: for instance, the Uniform Protocols for Imaging in Clinical Trials (UPICT), the ResEARrch 4 Life (EARL) accreditation protocol, or the EANM FDG guidelines [7][8][9]. These protocols are expected to increase SUV's absolute evaluation, but not all the sites in Europe are currently accredited by EANM.
Chauvie et al. [10] clearly demonstrated that the use of a radioactive phantom filled with long-lived decaying isotopes could reduce sources of error such as dose calibrator variability and site-specific physicist hands-on training and experiences.
Encouraged by the above evidence, we aimed in this research to evaluate the relevance of mixing quantitative results coming from multisite PET exams among European facilities involved in a multi-center clinical trial. The QUICK project (Quantitation Unified Intercomparison Control Kit) was launched with the purpose of evaluating the impact of clinical protocol differences over scanner introduced variability. We realize the variability introduced with different scanner types is the ultimate goal for such a comparison, but we focus on protocol variability impact first as the initial step to understand the larger picture.
With the purpose of understanding the impact of different scanner operators, the same acquisitions and reconstructions were done on each site using the QUICK phantom. This step was followed with a series of acquisitions and reconstructions using local setup preferences as well as the site's recommendations for clinical reconstruction.
The final statistical analysis of all data provided by the sites was expected to point out the impact of sites' clinical practice on absolute quantitation in terms of contrast recovery, background variability, and lung error.
Results obtained using optimized protocols should be compared within the different participating sites to evaluate the global achievable accuracy of the multi-center clinical trials.
In summary, the aim of this study was to estimate the global statistical multi-center quantitation accuracy taking into account the discrepancies between participating PET facilities, using a solid phantom with fixed and accurately known activities and volumes.
The auxiliary aim of this collaborative work was to evaluate the impact of BSREM reconstruction on quantitation reliability, which is a unique feature of GE's D-IQ PET-CT scanner selected for this multi-center study.

Materials and methods
All the participating sites were equipped with a GE Discovery IQ PET/CT system, which is a state-of-the-art bismuth germanate (BGO) PET. It offers very high sensitivity and, associated with a new regularization algorithm (BSREM) named Q.Clear [11], raises an opportunity to get more accurate quantitation. Moreover, its high sensitivity allows to redefine dose prescription and exam durations.
The acquisitions were performed by medical physicists on fully calibrated PET/CT scanners.
In the BSREM reconstruction algorithm, the noise penalty function strength named "Beta " (β) is directly correlated with CR and BV. Consequently, it is relevant to this study to keep track of the site's preference in β value selection. This penalty term provides level of regularization based on the difference between neighboring voxels as well as their sum.
A solid 68 Ge phantom (QUICK phantom) based on IEC61675/NEMA Body IQ phantom geometry and activity concentrations was manufactured for this study allowing a better identification of the eventual causes of quantitative discrepancies among the participating sites.
In order to avoid discrepancies related to data analysis in each participating site, it was decided to use the built-in software provided by GE Healthcare and available on operating console to perform NEMA NU-2 2012 image quality test. 18

F-FDG phantom acquisition
In the first part of the study, data from standard NEMA image quality test with an 18 F-FDG phantom from seventeen different sites across Europe were combined. Each of the participating sites followed the manufacturer-provided protocol conducting three acquisitions with the IEC61675/NEMA NU2-2012 PET Image Quality Body Phantom. This phantom is a torso-like cavity containing 6 hollow spheres (internal diameters of 10, 13, 17, 22, 28, and 37 mm) surrounding a lung insert. The phantom was filled on each site with an 18 F-FDG solution of approximately 53 MBq at the start of acquisition (Fig. 1a). The four smallest spheres were filled keeping sphere (21.2 kBq/ml) to background (5.3 kBq/mL) ratio of 4:1, while the largest two spheres were filled with water. The lung insert was filled with a non-radioactive mix of water and Styrofoam beads.
According to the NEMA NU2-2012 image quality test, there was also out-of-field activity (120 MBq) in a line source in the scatter-fraction phantom.
The acquisition and reconstruction parameters for the NEMA procedure, followed during the 18 F-FDG phantom tests, are listed in Table 1.

QUICK phantom acquisition
Five 1 out of the seventeen sites performed acquisitions with the QUICK solid radioactive phantom ( 68 Ge) (Fig. 1b) that was designed and manufactured by Eckert & Ziegler 2 to meet the IEC61675/NEMA NU2-2012 image quality body phantom features (6 spheres and a lung insert). To meet the requirements of this Fig. 1 a Scheme of the IEC61675/NEMA NU2-2012 Image Quality Body Phantom (in black the cold spheres and in gray the hot spheres) and b a picture of the QUICK phantom standard procedure, the total 68 Ge activity in the phantom was 63.94 MBq at calibration, allowing a relatively easy shipment from one site to another. The two biggest spheres (28 and 37 mm in diameter) were made of cold resin (non-radioactive) and the four smallest ones with 68 Ge resin at a concentration of 25.2 kBq/ mL. The background activity concentration was at 6.36 kBq/mL so that the ratio of spheres to background was 3.96:1. Contrary to the NEMA NU2-2012 image quality test, there was no out-of-field activity during the acquisitions to avoid any unexpected effect of external source positioning.
The five centers involved in the QUICK phantom acquisition are identified in this paper as QUICK sites 1-5.
The data acquisition involved 5 measurements per day for a span of 5 days to account for the effect of the Poisson statistics and variability in phantom positioning, respectively. Thus, a dataset of 25 images have been used for this analysis for each site. Table 2 summarizes the acquisition and reconstruction parameters for the different tests.
In test 1, the QUICK phantom was scanned with the acquisition and reconstruction parameters recommended by the manufacturer for the standard NEMA NU2-2012 procedure (same as in Table 1 for the 18 F-FDG phantom). The NEMA-IQ protocol recommended by the manufacturer does not include PSF because not all  users have access to PSF, whereas a user with access to BSREM automatically includes access to PSF.
In test 2, the RAW data was re-processed using the BSREM algorithm and a β value of 25 optimized to match the background noise level (BV) observed in the corresponding OSEM reconstruction (test 1). Increasing beta value results in smoother background (preferred by physicians) at the expense of reduced contrast recovery [16]. This avenue was not explored as the intent of the paper is to compare site to site variations, not the impact of different beta values in clinical images.
In test 3 and test 4, each QUICK site re-processed the same RAW data using their own clinical reconstruction parameters optimized for OSEM and BSREM algorithms, respectively. For OSEM, the number of iterations for clinical reconstruction was the site's choice as preferred by the site's reading physicians and is a compromise between signal recovery and background noise level. The number of iterations for BSREM (25 iterations) is selected by the manufacturer until convergence is reached as intrinsically established by the BSREM algorithm ( Table 2).
In test 5 and test 6, the QUICK phantom was re-scanned in each site with its own clinical acquisition protocol (in terms of FOV (field of view), scan time, and acquisition matrix) and re-processed with clinically optimized OSEM and BSREM algorithms already used in test 3 and test 4, respectively.
Detailed acquisition and reconstruction parameters for each test are shown in Table 3.
In each site, the acquisition time for the QUICK phantom was adjusted to compensate for the decay of 68 Ge activity over time to meet the NEMA NU2-2012

Data analysis
For a consistent analysis, all the datasets were analyzed by each site using the automated GE NEMA image quality analysis software available on the operating console: CR and BV for each sphere in the phantom were calculated using each reconstruction method. Similarly, the LE was also determined using each reconstructed method. These variables are defined in the NEMA standards publication NU2-2012 image quality test [17].

F-FDG phantom acquisition
The results of the NEMA NU2-2012 image quality test, in terms of CR, BV, and LE, are presented in Fig. 2, Fig. 3, and Fig. 4, respectively. As expected, the majority of participating centers obtained similar results, and only two centers (QUICK Site 4 and QUICK site 5) showed unexpected data.
Excluding these 2 centers, the CR values range from 30.8-43.6 to 63.7-77.2 from the smallest to the largest of the hot spheres, respectively. For the cold spheres (28 mm and 37 mm), the same parameter ranges from 65.7-75.4 to 73-79.1.

QUICK phantom acquisition
The results of the six QUICK phantom tests, as average and standard deviation of acquisitions among all QUICK sites, are shown in Table 4 ( Figure 5 for graphic plot) and Fig. 6 and Fig. 7.

Discussion
The most often recommended procedure to evaluate the quantitation variability among PET facilities is to perform acquisitions and reconstruction of a fillable image quality phantom using the local setting of parameters and to extract quantitative data from resulting images. Following suit, the first part of this project showed that the variability introduced in the filling of the 18 F-filled NEMA-IQ phantoms, scanned and imaged following same protocols across sites, is larger than the one from a 68 Ge filled phantom where each site was allowed  to select the acquisition and reconstruction protocols. The main goal of limiting to a single scanner type (D-IQ from GE) was to control all other parameters in this comparison except for the imaging protocol selection between sites. This conclusion is further supported by the small variability observed when same imaging protocol is used with the QUICK phantom.
In this study, we have tried to assess and demonstrate the possibility of performing quantitative multisite clinical trials in PET using QUICK phantom. In addition, we also made an effort to understand the impact and potential causes of discrepancies among different sites if any. In order to avoid additional possible causes of discrepancies, we decided to extract quantitative data using the automatic processing available on the scanner's console.
As mentioned above, two centers of the 17 sites (QUICK site 4 and QUICK site 5) showed unexpected data from the 18 FDG phantom, despite having detailed directions for filling and imaging of the phantom. Their results can be explained by placing a larger than prescribed ratio of activity concentration between spheres and background (QUICK site 5) and too high activity in the background above the NEMA prescribed value (QUICK site 4).
These results clearly demonstrate the possible issues when trying to compare systems using 18 F-FDG fillable phantom due to a different level of experience of the operators (e.g., handling of FDG, accurate filling of the spheres, careful alignment of the phantom). A solid radioactive phantom filled with 68 Ge can be used to avoid such errors [10], and the reproducibility of the measurements can thus be enhanced.
Results of the standard NEMA test using the QUICK phantom support this assessment by showing more consistent results. With equal acquisition conditions among sites, the BSREM algorithm gives better or similar results than the OSEM algorithm, i.e., higher CR, lower or similar BV, and a lower LE. The exceptions are due to the specific selection of β value (25) was intended to produce comparable background variability in the BSREM images with respect to OSEM images. This selection was not optimal, particularly for smaller ROIs (Fig. 5 top right), highlighting the trade between beta selection and CR. Larger clinical beta values (e.g., 350) produce smoother images at the expense of CR.
In general, all the tests performed using the 68-Ge QUICK phantom led to small variability of the CR which is the most relevant parameter in the perspective of quantitative PET studies. One aspect of the results presented here, that is "the 68 Ge NEMA-IQ standard NEMA-2012 test", provides a unique opportunity to observe the variability introduced from scanner to scanner when most other variables are confined to a very small change of their own.
Focusing on test 6 (the site's clinical acquisition and reconstruction parameters), BSREM algorithm reduces the CR variability among sites when compared with the OSEM algorithm. This result supports the use of BSREM as the preferred protocol in a multi-center clinical trial regardless of the site-specific beta value. This may reduce the variability introduced by patient size and body habitus [18].
Logistical challenges in transporting the QUICK phantom from site to site (e.g., custom approvals, local regulations with respect to transport and handling of radioactive material), resulted in delays of sites performing the measurements. Hence, each site performed the measurements with different total activity of the phantom. We compensated the decaying activity through longer acquisition times. However, it should be noticed that the acquisition conditions are not fully equivalent as the ratio of true events to random events does not remain constant with changing activity. The QUICK project was designed as a phantom study for comparing reconstruction protocols and selection of parameters. Hence, it does not take into account other parameters like dose calibrator, injected activity, serum glucose level, and post-injection time. To minimize variability of these parameters, multi-center participants must meet accreditation requirements that should include, for instance, the use of traceable sources for calibration of their dose calibrator, and similar patient preparation [7,19].

Conclusion
This study demonstrates the variability introduced by using different imaging protocols is smaller than the variability introduced by the operator when a NEMA-IQ phantom is handled and imaged by different sites following identical protocols. In turn, this demonstrates that the use of phantoms like the NEMA-IQ phantom for multi-center evolution of protocols is unreliable.
BSREM algorithm reduces the variability of contrast recovery in PET images compared to the OSEM algorithm in different clinical acquisition conditions and reconstruction parameters across multiple centers. In the future, we plan to study the impact of the patient's preparation on the variability of these measurements and to evaluate if harmonization of the injected dose is a necessary requirement for this kind of comparison studies.