Convolutional neural networks are usually trained with pairs of low- and high-quality images. In nuclear medicine, however, the high-quality images (e.g. obtained with a long scan time) can differ substantially from the true distributions, due to technical limitations such as the limited spatial and energy resolution of the gamma camera. To prevent the neural network from learning the errors that arise from these limitations, this work introduced an additional intermediate reconstruction and projection step. With this method, the neural network could potentially learn to approach the true distribution better than the iterative image reconstructor. The separate steps are discussed in more detail below.
Patient data
Our retrospective study was approved by the local ethics committee, which also waived the need for informed consent of the patients involved. Projections of 128 SPECT/CT scans from the pre-treatment 99mTc-MAA radioembolization procedure were available. All scans were performed on a dual-head Symbia T16 detector system (Siemens Healthineers, Erlangen, Germany). Projections were obtained in 20 min at 120 angles using a low-energy high-resolution (LEHR) parallel-hole collimator with a photopeak window between 129 and 150 keV and a scatter window between 108 and 129 keV. Of the 128 scans, 100 were used for network training, 20 for validation, and 8 for testing purposes.
Ground truth reconstruction
The patient projections were first reconstructed using the Utrecht Monte Carlo System (UMCS) [10]. This software package has been validated for several isotopes [11,12,13,14] and is considered state-of-the-art. UMCS accounts for attenuation with the μ-map obtained from the CT, resolution through point spread function modelling, and scatter using Monte Carlo simulation of the photon interactions in the body. A total of 10 iterations with 8 subsets were performed and no post-reconstruction filter was applied. The volumes had 128 × 128 × 128 voxels with 3.9 mm isotropic voxel size. The resulting reconstructions were set as ground truth distributions and were used for comparison of the reconstruction methods at a later stage.
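The iteration scheme (10 iterations with 8 subsets) can be illustrated with a generic ordered-subsets expectation maximization loop. The sketch below is not UMCS: it assumes a small, explicit system matrix `A` in place of the Monte Carlo forward model, assumes interleaved subsets, and uses hypothetical names (`osem`, `A`, `y`) throughout.

```python
import numpy as np

def osem(A, y, n_iter=10, n_subsets=8, eps=1e-12):
    """Generic OSEM update for y ~ Poisson(A @ x).

    A : (n_bins, n_voxels) non-negative system matrix (assumed, toy-sized)
    y : (n_bins,) measured projection counts
    """
    n_bins, n_voxels = A.shape
    x = np.ones(n_voxels)  # uniform, strictly positive start image
    # Interleaved subsets of projection bins (an assumed ordering).
    subsets = [np.arange(s, n_bins, n_subsets) for s in range(n_subsets)]
    for _ in range(n_iter):
        for idx in subsets:
            A_s = A[idx]
            expected = A_s @ x                      # forward projection
            ratio = y[idx] / np.maximum(expected, eps)
            sensitivity = A_s.sum(axis=0)           # back projection of ones
            x = x * (A_s.T @ ratio) / np.maximum(sensitivity, eps)
    return x
```

In a realistic setting the matrix products would be replaced by a forward and back projector that model attenuation, collimator resolution, and scatter.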
Synthetic volumes
The performance of the neural network should improve when more unique volumes are available to train on. The 100 ground truth distributions in the training set were hence used to create additional synthetic volumes. From a random patient, the liver mask with corresponding attenuation map was first selected. A sphere with a random diameter of 7 to 20 pixels was then positioned at a random location in the liver and filled with a random patch of the activity distribution from another patient. The process was repeated until the entire liver mask was filled. The generated synthetic volumes were thus a composition of patches from tens of separate reconstructions. In total, 900 synthetic volumes were created with this method.
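The sphere-filling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: sphere centres are drawn from liver voxels that are still unfilled (so that the loop is guaranteed to terminate), and the "random patch" is taken by applying a random circular shift to the donor volume. The function name and arguments are hypothetical.

```python
import numpy as np

def make_synthetic_volume(liver_mask, donor_volumes, rng, d_min=7, d_max=20):
    """Fill a liver mask with spherical patches taken from donor volumes."""
    liver_mask = liver_mask.astype(bool)
    shape = liver_mask.shape
    synthetic = np.zeros(shape, dtype=float)
    unfilled = liver_mask.copy()
    zz, yy, xx = np.indices(shape)
    while unfilled.any():
        # Assumption: centre the sphere on a random liver voxel not yet covered.
        candidates = np.argwhere(unfilled)
        cz, cy, cx = candidates[rng.integers(len(candidates))]
        radius = rng.integers(d_min, d_max + 1) / 2.0  # diameter of 7 to 20 voxels
        sphere = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        region = sphere & liver_mask
        # Assumption: a random circular shift of a random donor supplies the patch.
        donor = donor_volumes[rng.integers(len(donor_volumes))]
        shift = tuple(int(rng.integers(0, s)) for s in shape)
        patch = np.roll(donor, shift, axis=(0, 1, 2))
        synthetic[region] = patch[region]
        unfilled &= ~sphere
    return synthetic
```

Later spheres overwrite earlier ones where they overlap, so the final volume is a mosaic of patches from several donors.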
Projection generation
Projections of the 100 ground truth distributions and 900 synthetic volumes (with collimator and detector effects and up to ten orders of scatter) were generated using UMCS. The use of a high number of photon tracks combined with convolution-based forced detection yielded nearly noise-free projections. Poisson noise was then added so that the simulated projections became representative of real detector measurements. Projections were simulated for two scan times: 20 min, as is customary for a regular diagnostic SPECT/CT scan, and 5 min, which we envision for use in interventional radioembolization procedures. The total activity of the ground truth volumes was set to 150 MBq, as this is the average injected dose in the radioembolization pre-treatment procedure in our hospital. The detector was configured with a single head, in anticipation of the compact mobile system mentioned in the introduction.
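The noise step can be sketched as below. The sketch assumes that the nearly noise-free projections are expressed as expected counts for the 20 min scan, so that a shorter scan corresponds to a proportional scaling before Poisson sampling. The function and parameter names are hypothetical.

```python
import numpy as np

def simulate_noisy_projections(noise_free, scan_time_min,
                               reference_time_min=20.0, rng=None):
    """Scale expected counts to the scan time and draw Poisson noise."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: `noise_free` holds expected counts for `reference_time_min`.
    expected = np.asarray(noise_free, dtype=float) * (scan_time_min / reference_time_min)
    return rng.poisson(expected).astype(float)
```

Calling the function with `scan_time_min=20.0` and `scan_time_min=5.0` on the same noise-free input yields one noise realization for each of the two scan times.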
Reconstruction methods
The above projection sets were reconstructed using four different methods:
Filtered back projection (FBP). The ramp filter was used in combination with Chang’s correction [15] to compensate for attenuation (using the attenuation map from the CT scan). A post-reconstruction Gaussian filter of 5 mm full width at half maximum (FWHM) was applied to remove the most severe artefacts.
Monte Carlo-based reconstruction (MC). The projections were reconstructed using the same reconstructor as used in the initial reconstruction (UMCS). A total of 10 iterations with 8 subsets were performed and no post-reconstruction filter was applied.
Clinical reconstruction (CLINIC). An iterative reconstruction method, as can be found in state-of-the-art clinical methods (such as Flash3D in Siemens systems), was used. This reconstruction method included attenuation correction and resolution recovery and used dual-energy window scatter correction [16]. Scattered photons were smoothed with a Gaussian filter of 5 mm FWHM and added to the reconstruction loop at fraction k = 0.5. A post-reconstruction Gaussian filter of 5 mm FWHM was employed and a total of 10 iterations with 8 subsets were performed. These settings were chosen as they are the current clinical practice in our institute.
Convolutional neural network approach (CNN). The projections were first reconstructed using FBP as above and then fed to the trained network to increase the image quality.
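As an illustration of the scatter estimate used in the CLINIC method, the sketch below smooths a scatter-window projection with a Gaussian of 5 mm FWHM and scales it by k = 0.5. It only computes the estimate; adding it to the forward projection inside the iterative loop is not shown. The projection pixel size of 3.9 mm, the zero-padded separable filter, and all function names are assumptions.

```python
import numpy as np

def fwhm_to_sigma(fwhm_mm, pixel_mm):
    """Convert a Gaussian FWHM in mm to a standard deviation in pixels."""
    return fwhm_mm / (2.0 * np.sqrt(2.0 * np.log(2.0))) / pixel_mm

def gaussian_smooth_2d(image, sigma_px):
    """Separable Gaussian smoothing with zero padding at the borders."""
    radius = max(1, int(np.ceil(3 * sigma_px)))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma_px) ** 2)
    kernel /= kernel.sum()

    def smooth(v):
        return np.convolve(v, kernel, mode="same")

    out = np.apply_along_axis(smooth, 0, image)
    return np.apply_along_axis(smooth, 1, out)

def dew_scatter_estimate(scatter_window, k=0.5, fwhm_mm=5.0, pixel_mm=3.9):
    """Dual-energy window scatter estimate for one projection angle."""
    sigma_px = fwhm_to_sigma(fwhm_mm, pixel_mm)  # pixel size is an assumption
    return k * gaussian_smooth_2d(np.asarray(scatter_window, dtype=float), sigma_px)
```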
Network design
The neural network used a deep convolutional encoder-decoder structure (see Fig. 1), which is frequently used for denoising applications [17, 18]. The network was trained by minimizing the voxel-wise mean squared error between its output for the FBP reconstructions and the corresponding volumes from the combined set of 100 ground truth distributions and 900 synthetic volumes. The network consisted of layers at several resolutions, which were connected to each other via concatenation (to ensure that small objects were not lost in training). Five adjacent slices per sample were used as input so that resolution in all directions was preserved. By inserting all 128 slices from FBP into the network, the entire volume was reconstructed. A separate network was trained for each of the two simulated scan times.
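The five-slice input can be assembled as in the following sketch, which builds one sample per slice with the neighbouring slices stacked as channels. Repeating the edge slices at the volume boundary is an assumption, as the boundary handling is not specified here; the function name is hypothetical.

```python
import numpy as np

def make_slice_stacks(volume, n_adjacent=5):
    """Return an array of shape (n_slices, H, W, n_adjacent).

    Channel n_adjacent // 2 of sample z is slice z itself; the other
    channels are its neighbours along the first axis.
    """
    half = n_adjacent // 2
    # Assumption: repeat the first and last slice to pad the volume edges.
    padded = np.pad(volume, ((half, half), (0, 0), (0, 0)), mode="edge")
    n_slices = volume.shape[0]
    return np.stack([padded[i:i + n_slices] for i in range(n_adjacent)], axis=-1)
```

For a 128 × 128 × 128 FBP volume this gives 128 samples of shape 128 × 128 × 5, one for each output slice.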
For the encoding layers, every step consisted of two 3 × 3 convolutional layers with the ReLU activation function, followed by 2 × 2 max pooling. The decoding layers first upsampled the resolution and again used two 3 × 3 convolutional layers. Learning was performed using the Adam optimizer [19] with a learning rate of 1e−4. The data were fed to the network with a batch size of 128. Training continued until no further decrease in the loss function was observed for 20 epochs. The training was performed using TensorFlow 1.7.0 [20] with Keras 2.1.6 [21].
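The stopping criterion can be expressed as a simple patience rule. The sketch below is a generic illustration in plain Python rather than the training code itself; the function name and the choice to return the stopping epoch together with the best loss are assumptions.

```python
def stopping_epoch(losses, patience=20):
    """Return (epoch at which training stops, best loss seen so far).

    Training stops once the loss has not decreased for `patience`
    consecutive epochs; epochs are counted from zero.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best
    return len(losses) - 1, best
```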
Evaluation
Network performance
It was first studied whether the introduced reconstruction and projection step (i.e. setting the initial reconstructed images as ground truth) performed better than training directly on the Monte Carlo-based reconstructions. It was subsequently evaluated whether augmenting the training data with synthetic volumes aided network performance, by training separately with 0, 300, 600, and 900 synthetic volumes in addition to the 100 ground truth distributions. The minimum loss attained was used as a measure of network performance. Since the neural network is slightly sensitive to the random initial weights, five realizations were performed per setting.
Validation performance
The mean squared error between each of the four reconstructions (normalized to the total reconstructed activity) and the associated ground truth was calculated for the two scan times and used as a quantitative measure of reconstruction quality for the 20 patients in the validation set. The difference between the lung shunt fraction (LSF) of each reconstruction and that of the ground truth distribution was furthermore measured, because the LSF is often assessed in hepatic radioembolization.
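Both validation measures can be sketched in a few lines. The normalization is interpreted here as dividing each volume by its own total activity before comparison, which is an assumption, and the LSF follows the usual definition of lung counts divided by the sum of lung and liver counts. Function names are hypothetical.

```python
import numpy as np

def normalized_mse(reconstruction, ground_truth):
    """Mean squared error after scaling both volumes to unit total activity."""
    r = reconstruction / reconstruction.sum()   # assumed normalization
    t = ground_truth / ground_truth.sum()
    return float(np.mean((r - t) ** 2))

def lung_shunt_fraction(volume, lung_mask, liver_mask):
    """LSF = lung counts / (lung counts + liver counts)."""
    lung = volume[lung_mask].sum()
    liver = volume[liver_mask].sum()
    return float(lung / (lung + liver))
```

The LSF difference reported for a reconstruction is then `lung_shunt_fraction` of the reconstruction minus that of the ground truth, evaluated with the same masks.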
Phantom measurements
A phantom study was performed to evaluate to what extent the neural network approach could reconstruct true detector projections. An anthropomorphic phantom was adjusted from a commercially available phantom (Anthropomorphic Torso Phantom: ECT/TOR/P) by the inclusion of three extrahepatic spheres (with volumes of 2.0, 4.1, and 8.1 mL) and, inside the liver, one solid sphere (15.7 mL) and one sphere with a cold core (5.6 mL cold volume; 18.7 mL hot volume) (see Fig. 2). The extrahepatic spheres were filled with an uptake ratio of 2.7 relative to the liver background activity; for the spheres inside the liver, this ratio was 7.7. The lungs were filled to give an LSF of 5.2%. The phantom was filled with water and had 157 MBq total activity of 99mTc. The phantom was configured in this way to represent situations encountered in hepatic radioembolization [22, 23].
The anthropomorphic phantom was scanned for 20 min on the same scanner with the same acquisition settings as in the patient scans, but now with a single head. By uniform random subsampling of the obtained projections, projections with 5 min scan time were additionally obtained. The liver, sphere, and lung masks were generated from delineation on the low-dose CT. The uptake ratios of the solid spheres, the contrast-to-noise ratio (CNR), and the LSF were calculated from the reconstructions and compared with the values of the phantom. The CNR was calculated from the solid sphere inside the liver and the liver background.
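The subsampling and the CNR can be sketched as follows. Two assumptions are made: subsampling is modelled as binomial thinning, in which every detected count is kept independently with probability 5/20, and the CNR is taken as the difference between the mean sphere and mean background values divided by the background standard deviation. Neither definition is confirmed by the text above, and the function names are hypothetical.

```python
import numpy as np

def subsample_counts(projections, fraction, rng):
    """Keep each detected count independently with probability `fraction`."""
    counts = np.asarray(projections).astype(np.int64)
    return rng.binomial(counts, fraction)

def contrast_to_noise_ratio(volume, sphere_mask, background_mask):
    """CNR = (mean sphere - mean background) / std background (assumed form)."""
    sphere = volume[sphere_mask]
    background = volume[background_mask]
    return float((sphere.mean() - background.mean()) / background.std())
```

For the 5 min projections, `fraction` would be 0.25 when starting from the 20 min acquisition.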
Original detector projections
Finally, reconstructions were performed on the 8 patient projections in the testing set to give an indication of the reconstruction performance on true detector projections for patient distributions with varying levels of activity. Since no ground truth is available for these cases, the reconstructions were compared visually only.