Comparison of machine learning and semi-quantification algorithms for (I123)FP-CIT classification: the beginning of the end for semi-quantification?

Background Semi-quantification methods are well established in the clinic for assisted reporting of (I123) Ioflupane images. Arguably, these are limited diagnostic tools. Recent research has demonstrated the potential for improved classification performance offered by machine learning algorithms. A direct comparison between methods is required to establish whether a move towards widespread clinical adoption of machine learning algorithms is justified. This study compared three machine learning algorithms with a range of semi-quantification methods, using the Parkinson’s Progression Markers Initiative (PPMI) research database and a locally derived clinical database for validation. Machine learning algorithms were based on support vector machine classifiers with three different sets of features: (i) voxel intensities; (ii) principal components of image voxel intensities; and (iii) striatal binding ratios from the putamen and caudate. Semi-quantification methods were based on striatal binding ratios (SBRs) from both putamina, with and without consideration of the caudates. Normal limits for the SBRs were defined through four different methods: (i) the minimum of age-matched controls; (ii) the mean minus 1/1.5/2 standard deviations from age-matched controls; (iii) linear regression of normal patient data against age (minus 1/1.5/2 standard errors); and (iv) selection of the optimum operating point on the receiver operator characteristic curve from normal and abnormal training data. Each machine learning and semi-quantification technique was evaluated with stratified, nested 10-fold cross-validation, repeated 10 times.
Results The mean accuracy of the semi-quantitative methods for classification of local data into Parkinsonian and non-Parkinsonian groups varied from 0.78 to 0.87, contrasting with 0.89 to 0.95 for classifying PPMI data into healthy controls and Parkinson’s disease groups. The machine learning algorithms gave mean accuracies from 0.88 to 0.92 and from 0.95 to 0.97 for local and PPMI data respectively.
Conclusions Classification performance was lower for the local database than the research database for both semi-quantitative and machine learning algorithms. However, for both databases, the machine learning methods generated equal or higher mean accuracies (with lower variance) than any of the semi-quantification approaches. The gain in performance from using machine learning algorithms as compared to semi-quantification was relatively small and may be insufficient, when considered in isolation, to offer significant advantages in the clinical context.


Background
(I123) Ioflupane (FP-CIT) or DaTSCAN SPECT imaging is used routinely for evaluation of the function of the striatal dopaminergic pathway. Image interpretation enables differentiation between Parkinsonian and non-Parkinsonian diseases, which may present clinically with similar features. Pooled analysis of phase three and phase four trials showed that (I123)FP-CIT images, when interpreted visually by expert readers, achieved a sensitivity of 88.7% and specificity of 91.2% in the detection of different striatal dopaminergic deficit disorders [1].
In recent years, semi-quantification software, which is intended as an aid to visual reporting, has become commercially available for use in the clinic. In particular, it is recommended by European Association of Nuclear Medicine (EANM) guidelines [2]. Typically, such software provides striatal binding ratio (SBR) results, which describe the tracer density within small regions of interest as compared to an area of non-specific uptake. These figures provide an objective measure of dopaminergic function and give an insight into the likelihood of disease being present. Several studies have suggested that the addition of semi-quantification can improve reporting performance, particularly in terms of reduced equivocal reporting rates and improved inter-observer variability [3][4][5][6][7][8].
However, semi-quantification is a relatively limited tool for interpreting and classifying (I123)FP-CIT images into different diagnostic groups. Information related to the shape and particular pattern of striatal uptake, which may be important for diagnosis, is not reflected in the SBR results. The figures produced may also be highly dependent on the accuracy of the image registration used, particularly if tight, sub-striatal regions of interest are applied. Semi-quantification software typically produces multiple SBR results from different brain regions, alongside associated normal ranges. The clinician must interpret each SBR result, in light of the normal ranges, to come to an overall decision on image classification.
These shortcomings can potentially be overcome through machine learning algorithms, which can receive multiple input variables describing different features to produce a single metric, such as a probability value, relevant to image classification.
All screening examinations from the PPMI database were downloaded (209 healthy controls (HC) and 448 patients with Parkinson's disease (PD)), including data acquired from multiple different centres, using the same acquisition settings (see Table 1). SBRs were derived from figures supplied by the core lab, whose methods are detailed elsewhere [9]. In short, images were reconstructed in HOSEM software (Hermes Medical, Stockholm, Sweden) using OSEM with eight iterations and eight subsets, with Chang attenuation correction but without scatter correction or resolution modelling. Images were then passed to PMOD software (PMOD technologies, Zurich, Switzerland) for non-rigid registration to the Montreal Neurological Institute (MNI) template (with manual adjustment), before combining eight axial slices and applying regions of interest in 2D in the putamen, caudate and occipital regions. Images and SBRs from each patient were calibrated using a striatal phantom scanned on the same equipment. Importantly, the diseased group only included patients for whom the SPECT images had been visually assessed as having features consistent with PD.
For the local analysis, all (I123)FP-CIT images were downloaded from the archives at Sheffield Teaching Hospitals and anonymised for inclusion in the study. This included data acquired from four different dual-headed gamma cameras (3 GE Infinia and 1 GE Millenium, GE Healthcare, Chicago, USA), using the same acquisition settings (see Table 1). No camera-specific calibration was performed. However, the similarity in the collimators and detectors between systems should ensure that systematic differences between scanners were small. Details on administered activity and injection-to-scan delay are summarised in Table 1 for both the local database and the PPMI database, alongside image acquisition parameters.
Local images were reconstructed using Xeleris software version 2.1 (GE Healthcare, Chicago, USA), with 2 iterations and 10 subsets, as per the local clinical protocol. No attenuation correction, scatter correction or resolution modelling was performed. Each dataset was registered to a template using an affine transformation derived from the Sheffield Image Registration Toolkit (ShIRT; [10]). The registration was performed in stages, transforming the whole brain first and then focusing on individual hemispheres. Registration parameters were set through iterative optimisation, using visual analysis and Dice coefficients to compare results. Regions of interest were derived from those used in DaTSCAN neuro analysis in MIM software v6.7.3 (MIM software Inc., Cleveland, USA), propagating to the template space through non-linear registration. These were applied to image data in 3D to derive SBR values.
Diagnosis was based on the image report, which was produced in a group reporting setup with at least two reporters present in each case. The reporters had full access to previous imaging and other clinical information from the referrer. Cases where significant vascular disease or significant artefacts were identified were excluded. In total, 304 images were retained (113 patients without pre-synaptic dopaminergic deficit (PDD) and 191 with PDD) and 17 excluded. Patients were referred with a range of indications but differential diagnosis of Parkinsonian syndrome vs. essential tremor was the most common. Table 2 provides a summary of the patient population demographics for both the local data and PPMI data. These sets of data present different challenges to semi-quantification and machine learning algorithms. Accuracy is likely to be superior for the PPMI dataset as patient diagnosis is well-established through screening, and diseased patients without obvious dopaminergic deficit are excluded. The local clinical database is more heterogeneous, with less certain diagnostic information, deliberately limited exclusion criteria and no quantitative calibration between scanners. This is likely to give rise to a wider array of uptake patterns, with more crossover between normal and abnormal groups, suggesting that accuracy will be lower. However, it is the relative performance of semi-quantification and machine learning that is of most interest, rather than absolute results.

Semi-quantification methods
There is a range of semi-quantification methods described in the literature and used in commercially available tools. These techniques calculate SBRs from regions of interest applied to the full SPECT volume or selected slices, typically after automated registration to a chosen template. In the clinic, results are usually compared with those of a group of 'normal' patients, which may be age-matched, as suggested by EANM guidelines [2].
Normal ranges are often calculated using simple statistical measures (for example, mean SBR ± 2 standard deviations). Usually, the limits of the normal ranges are used as a 'soft' cut-off, providing an indication of where the limit of normality lies but open to interpretation by the clinician. Some institutions may define a single cut-off between normal and abnormal groups by considering previously collected data from both healthy and diseased individuals and balancing sensitivity and specificity.
In order to provide objective figures on the accuracy of semi-quantification, hard limits must be defined on SBR figures, with rigid rules on overall classification. In this study, it was assumed that any SBR outside a normal limit cut-off would lead to an overall classification of abnormal. All SBRs must be within normal limits for an overall classification of normal. Although most clinicians would not treat semi-quantification results in this rigid manner, such results provide an indication of the accuracy of the software as an aid to clinical reporting. However, its precise influence is ultimately dependent on the reporting clinician.
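The rigid rule described above can be written down in a few lines. This is only a sketch: the function name, the region keys and the numerical limits are hypothetical, and it assumes that "outside a normal limit" means falling below a lower cut-off.

```python
def classify_scan(sbrs, lower_limits):
    """Return 'abnormal' if any SBR falls below its normal limit, else 'normal'.

    sbrs and lower_limits are dicts keyed by region name (hypothetical keys).
    """
    for region, value in sbrs.items():
        if value < lower_limits[region]:
            return "abnormal"
    return "normal"


# Illustrative values only, not taken from the study.
limits = {"left_putamen": 1.8, "right_putamen": 1.8}
print(classify_scan({"left_putamen": 2.4, "right_putamen": 2.1}, limits))
print(classify_scan({"left_putamen": 2.4, "right_putamen": 1.5}, limits))
```

A single low region is enough to flag the scan, which is why the number of SBRs included in the rule matters for specificity.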
In this study, two different approaches to defining SBR cut-offs are investigated: normal limits based on training data from normal subjects only and limits based on data from both diseased and healthy populations. This reflects the different ways in which semi-quantification is used clinically. When using data from normal subjects only, limits are set based on different numbers of standard deviations from the mean or based on a minimum SBR value. Without consideration of SBR figures from diseased patients, this is a naïve approach to classification and is unlikely to achieve the best accuracy. For the second approach, using data from both normal and abnormal patient groups, the best cut-off is defined from the optimal operating point on the receiver operator characteristic (ROC) curve, where the highest classification accuracy is achieved. Only SBRs from individual putamina (with or without caudate results) are considered. It should be noted, however, that due to limitations in SPECT resolution, it is impractical to isolate uptake in the putamen from that of the adjacent globus pallidus. Thus, all results in this work which refer to the putamen are actually based on uptake in the whole lentiform nucleus. The convention of describing combined putamen and pallidum uptake as that of the putamen alone is maintained to ensure consistent terminology with the literature.
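As an illustration of the second approach, the sketch below scans candidate cut-offs on a single SBR and keeps the one with the highest training accuracy, which is one way of reading "optimal operating point". The function name, the choice of midpoints as candidates, the tie-breaking rule and the example values are all assumptions made for this sketch.

```python
def best_cutoff(normal_sbrs, abnormal_sbrs):
    """Return (cutoff, accuracy) for the cut-off with the highest accuracy.

    A scan is called abnormal when its SBR is below the cut-off.
    Ties are resolved in favour of the lowest cut-off (an arbitrary choice).
    """
    values = sorted(set(normal_sbrs) | set(abnormal_sbrs))
    # Candidate cut-offs: midpoints between neighbouring observed values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    total = len(normal_sbrs) + len(abnormal_sbrs)
    best_c, best_acc = None, -1.0
    for c in candidates:
        true_neg = sum(v >= c for v in normal_sbrs)
        true_pos = sum(v < c for v in abnormal_sbrs)
        acc = (true_pos + true_neg) / total
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc


# Illustrative training values only, not taken from the study.
normal = [2.6, 2.4, 2.2, 2.0, 1.7]
abnormal = [1.9, 1.5, 1.3, 1.1, 0.9]
cutoff, accuracy = best_cutoff(normal, abnormal)
print(f"cut-off {cutoff:.2f}, training accuracy {accuracy:.2f}")
```

Because the cut-off is tuned on the training data, its accuracy has to be checked on separate test cases, as is done in the cross-validation used in this study.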
Inclusion of other ratios for performance assessment of semi-quantification (such as right to left ratio and caudate to putamen ratio) is likely to increase the chances of type I error, and so these ratios were excluded from the analysis. The putamen is the region of the brain that often displays the first signs of dopaminergic degeneration, so its SBR should be the most sensitive value.
Given the natural decline in SBRs with increasing patient age [11], the semi-quantitative methods investigated account for this confounding variable by either limiting the normal comparison set to an age-matched subset of the training data (test patient age ± 5 years), or they perform a linear regression of SBR against age to derive a mean value from the normal population for the particular test case. The different semi-quantification approaches are summarised in Table 3, grouped according to the method of defining the SBR cut-off. By testing multiple different approaches with different numbers of SBR values and different comparison sets, a comprehensive evaluation of the potential performance of semi-quantitative software can be established.
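The normal-only limits described above can be sketched as follows. The function names and example data are hypothetical. In particular, "standard error" is taken here to mean the residual standard error of the regression, which is an assumption about the method rather than a detail confirmed by the text.

```python
import math
import statistics


def _matched(controls, patient_age, window):
    """SBRs of controls whose age is within +/- window years of the patient."""
    return [sbr for age, sbr in controls if abs(age - patient_age) <= window]


def age_matched_limit(controls, patient_age, k=2.0, window=5.0):
    """Mean minus k standard deviations of the age-matched control SBRs."""
    matched = _matched(controls, patient_age, window)
    if len(matched) < 2:
        raise ValueError("too few age-matched controls")
    return statistics.fmean(matched) - k * statistics.stdev(matched)


def age_matched_minimum(controls, patient_age, window=5.0):
    """Lowest SBR seen among the age-matched controls."""
    return min(_matched(controls, patient_age, window))


def regression_limit(controls, patient_age, k=2.0):
    """Predicted normal SBR at the patient's age minus k residual standard errors."""
    ages = [a for a, _ in controls]
    sbrs = [s for _, s in controls]
    n = len(controls)
    mean_age, mean_sbr = statistics.fmean(ages), statistics.fmean(sbrs)
    sxx = sum((a - mean_age) ** 2 for a in ages)
    sxy = sum((a - mean_age) * (s - mean_sbr) for a, s in controls)
    slope = sxy / sxx
    intercept = mean_sbr - slope * mean_age
    # Residual standard error of the fit (assumed meaning of "standard error").
    rss = sum((s - (intercept + slope * a)) ** 2 for a, s in controls)
    se = math.sqrt(rss / (n - 2))
    return intercept + slope * patient_age - k * se


# Illustrative (age, SBR) pairs only, not taken from the study.
controls = [(58, 2.5), (60, 2.3), (62, 2.7), (70, 2.2), (80, 1.9)]
print(age_matched_limit(controls, 60, k=2.0))
print(age_matched_minimum(controls, 60))
print(regression_limit(controls, 60, k=2.0))
```

Varying k over 1, 1.5 and 2 in either function reproduces the family of cut-offs grouped in Table 3.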

Machine learning algorithms
In line with general trends seen in Table 8, SVM was used as a classification method, in both conventional linear form and using a radial basis function (RBF) kernel. The simplest image features cited in Table 8 are arguably: image voxel intensities, striatal binding ratios and principal component analysis of image voxels. This study applies these features and classifiers using a pipeline described in Fig. 1. Patient age is used as an added input variable in order to force the classifier to model changes in image appearance with age.
For algorithms taking SBRs as the input, pre-processing involved normalising the binding ratios in each putamen and caudate such that the mean value was zero with a standard deviation of 1. This ensured that each region of uptake was treated with equal importance by the SVM. For the other sets of features, additional pre-processing of the images was first required. Regions of interest were placed over the left and right striata. If necessary, images were flipped about the central axis of the brain to ensure that the most diseased striatum (with the lowest uptake) was always on the left side of the image, as described by Towey et al. [12].
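A minimal sketch of these two pre-processing steps is given below. The function names are hypothetical. The flip test compares total counts in the two image halves as a simplification of the striatal regions of interest used in the study, and the array convention (first columns are the left side) is assumed. The scaling statistics are learned from training rows only so that test cases do not influence them.

```python
import statistics


def fit_scaler(train_rows):
    """Column-wise mean and standard deviation learned from training data only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds


def apply_scaler(rows, means, sds):
    """Scale each feature to zero mean and unit SD using the training statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def most_affected_on_left(image):
    """Mirror a 2D slice left-right if the right half has the lower total uptake."""
    half = len(image[0]) // 2
    left = sum(sum(row[:half]) for row in image)
    right = sum(sum(row[-half:]) for row in image)
    return [row[::-1] for row in image] if right < left else image


# Illustrative rows of (left putamen, right putamen) SBRs, not study data.
train = [[2.0, 3.0], [1.0, 2.5], [3.0, 3.5]]
means, sds = fit_scaler(train)
scaled = apply_scaler(train, means, sds)
print(scaled)
print(most_affected_on_left([[5, 4, 1, 2], [6, 5, 2, 1]]))
```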
The voxel intensities of each image were scaled to the mean value in the occipital lobe. The central area of the brain, containing the striata, was masked with a single, loose region of interest, thus excluding areas that were not considered to be diagnostically important. The remaining normalised voxels, or the coefficients corresponding to their principal components (the first 3, 5, 10, 15 or 20 components), were standardised to a mean value of zero (SD of 1). In the case of features based on voxel values, only a linear SVM was used. Given the very large number of voxel value inputs, the addition of a kernel was unnecessary and could have led to reduced performance due to overfitting. For all other features, both a standard SVM and SVM with RBF kernel were trained and validated.

Performance comparison
A fair and unbiased comparison between classification techniques is crucial. Classification boundaries should be defined from training data, independent of test data. In this study, each semi-quantitative method and each machine learning algorithm was trained and validated using both sets of clinical databases. A repeated, nested and stratified k-fold cross-validation approach was chosen. This technique splits the available data into different training and test subsections (i.e. different folds) such that classification rules are derived from and applied to different combinations of patient cases. Nesting is used for machine learning algorithms where hyperparameters must be chosen. Here, the training data is further subdivided in order to find the particular combination of hyperparameter values that gives the best accuracy. In this study, a 10-fold cross-validation strategy was chosen. This was repeated 10 times (though not for the inner, nested loops due to limitations in computational resources). All training and testing procedures were carried out with Matlab software (Matlab, Natick, USA), using the libSVM library [13] for defining the SVM classifiers. The hyperparameters of each machine learning algorithm (the 'C' regularisation term in the SVM objective function and the gamma term in the RBF kernel) were selected through a coarse grid search in each nested loop. Values between 2^−3 and 2^8 were tested for the C parameter and 2^−8 to 2^3 for the gamma parameter. The highest mean F-score was used as a metric for selecting the most appropriate values. Figure 2 provides an overview of the testing methodology adopted.

Results
Tables 4 and 5 show cross-validation results from the semi-quantitative methods, using local and PPMI data respectively. The mean accuracy of the methods for classification of local data varied from 0.78 to 0.87, which as expected was less than that for the PPMI data where mean accuracies varied between 0.89 and 0.95.
In general, there appeared to be little influence on performance results when SBR results from the caudate were added to those of the putamen. Tables 6 and 7 show cross-validation results from the machine learning methods, using local and PPMI data respectively. Once again, mean accuracies for the local database are lower than those for the PPMI dataset (0.88 to 0.92 and 0.95 to 0.97 respectively). Importantly, every machine learning algorithm exceeded or matched the accuracy results of every semi-quantification method. Standard deviation figures are also smaller than those of the semi-quantification methods in most cases. Figures 3 and 4 summarise the accuracy results of the semi-quantification methods and machine learning algorithms.
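For reference, the performance figures quoted here follow from the confusion counts of each test fold. The sketch below uses the standard definitions, with abnormal coded as 1 and normal as 0. The function name is hypothetical, and the F-score is assumed to be the balanced F1 form, since the text does not say which variant was used.

```python
def performance(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 score for binary labels (1 = abnormal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Illustrative labels only, not taken from the study.
result = performance([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(result)
```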

Discussion
This study directly compares the performance of a range of semi-quantification approaches and three machine learning algorithms for classification of (I123)FP-CIT images into normal and abnormal groups. For local data, classification was between patients with pre-synaptic dopaminergic deficit and those without. For the PPMI database, the classification task involved separating patients with Parkinson's disease from healthy controls. In contrast to much of the literature, the validation method used for comparison was carefully chosen to reduce possible bias. Performing just one iteration of k-fold cross-validation is known to be associated with increased variance [14], and so in this case, the process was repeated 10 times (in the outer validation loops). Stratifying samples in order to maintain similar proportions of normal and abnormal patients in train and test sets has been shown to reduce cross-validation bias [15] and so was also adopted in this study. Nesting the cross-validation, such that any hyperparameter selection was carried out separately in each fold, and with different data to training and testing steps, was also vital for ensuring that bias in performance results was kept to a minimum. This form of validation has been shown to provide an almost unbiased estimate of true classifier error [16].
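The validation scheme described above (stratified outer folds, hyperparameter selection nested inside each training fold, and repetition of the outer loop only) can be sketched as follows. This is not the study's Matlab/libSVM code: the classifier is a deliberately simple one-feature threshold model standing in for the SVM, its `shift` hyperparameter stands in for C and gamma, and all names and data are hypothetical.

```python
import random
import statistics


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def f_score(y_true, y_pred):
    """F1 score with abnormal coded as 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_threshold(xs, ys, shift):
    """Stand-in model (NOT an SVM): cut-off at the midpoint of class means, moved by shift."""
    mean_abn = statistics.fmean([x for x, y in zip(xs, ys) if y == 1])
    mean_norm = statistics.fmean([x for x, y in zip(xs, ys) if y == 0])
    cut = (mean_abn + mean_norm) / 2 + shift
    return lambda x: 1 if x < cut else 0


def cv_score(xs, ys, folds, shift):
    """Mean F-score of the stand-in model over the given (inner) folds."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(ys)) if i not in held_out]
        model = fit_threshold([xs[i] for i in train_idx], [ys[i] for i in train_idx], shift)
        scores.append(f_score([ys[i] for i in test_idx], [model(xs[i]) for i in test_idx]))
    return statistics.fmean(scores)


def nested_cv(xs, ys, grid, k_outer=10, k_inner=10, repeats=10, seed=0):
    """Repeated, stratified, nested k-fold cross-validation; returns mean and SD of accuracy."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):  # only the outer loop is repeated
        for test_idx in stratified_folds(ys, k_outer, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(ys)) if i not in held_out]
            tx = [xs[i] for i in train_idx]
            ty = [ys[i] for i in train_idx]
            # Inner loop: the hyperparameter is chosen from training data only.
            inner = stratified_folds(ty, k_inner, rng)
            best = max(grid, key=lambda h: cv_score(tx, ty, inner, h))
            model = fit_threshold(tx, ty, best)
            correct = sum(model(xs[i]) == ys[i] for i in test_idx)
            accuracies.append(correct / len(test_idx))
    return statistics.fmean(accuracies), statistics.stdev(accuracies)


# Synthetic one-feature data (0 = normal, 1 = abnormal), for illustration only.
data_rng = random.Random(1)
xs = [data_rng.gauss(2.5, 0.4) for _ in range(40)] + [data_rng.gauss(1.2, 0.4) for _ in range(40)]
ys = [0] * 40 + [1] * 40
mean_acc, sd_acc = nested_cv(xs, ys, grid=[-0.4, -0.2, 0.0, 0.2, 0.4],
                             k_outer=5, k_inner=5, repeats=2)
print(f"accuracy {mean_acc:.2f} (SD {sd_acc:.2f})")
```

The key property is that the test fold of the outer loop is never seen while the hyperparameter is being chosen, so the reported accuracy is not inflated by the selection step.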
Fig. 2 Overview of performance comparison method
Clinically, multiple SBRs and other derived ratios may be provided by semi-quantitative software to guide diagnosis. Typically, SBRs from the whole striatum as well as individual caudates and putamina on the left and right side are given. In addition, the caudate to putamen ratio and the right to left ratio may also be displayed. If all these individual SBRs and their associated normal limits are treated as individual tests, the final semi-quantification classification is likely to be overly sensitive (increasing the risk of type I error) and may give a pessimistic view on this form of analysis. Therefore, in this study, only SBRs from individual putamina (with or without caudate results) were considered.
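The effect of treating many ratios as separate tests can be illustrated with a short calculation. It rests on two simplifying assumptions that are not made in the study: that normal SBRs are Gaussian, and that the tests are independent. Regional SBRs from one patient are correlated, so the figures overstate the inflation and are indicative only.

```python
from statistics import NormalDist


def familywise_false_positive_rate(n_tests, k_sd):
    """P(at least one of n independent normal results falls below mean - k_sd * SD)."""
    alpha = NormalDist().cdf(-k_sd)  # false positive rate of a single test
    return 1 - (1 - alpha) ** n_tests


# Two putaminal SBRs versus a hypothetical panel of eight ratios, at a 2 SD cut-off.
for n in (1, 2, 8):
    print(n, round(familywise_false_positive_rate(n, 2.0), 3))
```

Under these assumptions, the chance of a healthy scan being flagged grows quickly with the number of ratios tested, which is the reason for restricting the rule to a small number of SBRs.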
As expected (see Tables 4 and 5), semi-quantification performance was superior for the PPMI dataset as compared to the local clinical database, reaching a maximum accuracy of 0.95 for PPMI and 0.87 for the local data. Variance on performance was also substantially lower for the PPMI data. These differences highlight the substantial difference between performing measurements on well-screened research data acquired according to a rigid protocol with healthy controls and realistic clinical data without an equivalent gold-standard diagnosis and without inter-camera calibration. Results from semi-quantitative evaluation of the local database are similar to those found by other researchers for evaluation of data from a mixed clinical cohort [17], adding confidence to these findings.
Semi-quantitative methods gave a relatively narrow range of accuracy scores across all the methods tested, with a wider range of sensitivities and specificities. Deciding on the 'best' performing method depends on the intended application. In clinic for example, a higher specificity than sensitivity may be preferred such that the false positive rate is low. There is no method that stands out in terms of its performance. However, it is interesting to note that two of the methods which treat classification as a two-class problem, generating cut-offs from both normal and abnormal putaminal SBRs (i.e. methods SQ 15 and SQ 17), produced some of the highest accuracy figures, with lower variance and well balanced sensitivity and specificity values. This is perhaps unsurprising as all other semi-quantitative methods (which are more reflective of commercially available tools) define cut-offs from the normal population only, with no knowledge of the distribution or likely crossover of abnormal data. In general, the addition of caudate data to semi-quantitative calculations caused a slight increase in sensitivity and slight reduction in specificity with little effect on accuracy, other than for methods based on ROC curve calculations, which saw a drop in performance. This suggests that the vast majority of diagnostically useful information can be gleaned from consideration of putamen uptake only. Again, this is unsurprising as image appearances often show more marked reduction in the putamen uptake than in the caudate [18].
It is worth noting that the Southampton semi-quantification method [19] was not investigated in this study. Recent research [17] suggests that the sensitivity of this approach is very low when calibration is not performed between different camera systems and is also significantly reduced when correction (including scatter correction) is not performed. Unfortunately, camera-specific calibration data were not available for the local database of images and scatter data were not accessible for the PPMI dataset and so the method was excluded.
Fig. 3 Accuracy results for all semi-quantification and machine learning methods applied to local data. Semi-quantification results are grouped to the left of the graph and machine learning algorithms to the right. Whiskers represent one standard deviation
The three chosen machine learning approaches are relatively simple and are largely based on previously described algorithms. Undoubtedly, they are not state-of-the-art. In recent years, techniques such as convolutional neural networks have become the dominant technology used by researchers for a range of classification tasks [20]. However, (I123)FP-CIT images have relatively low resolution, with limited variation seen in both normal and abnormal data. Therefore, advanced machine learning techniques may not be necessary to justify consideration for clinical translation. If superior performance can be demonstrated with these classical techniques, then there is a good argument for switching research emphasis from the creation of ever more complex algorithms to clinical evaluation of existing tools.
As shown by Tables 6 and 7 (and Figs. 3 and 4), the machine learning algorithms produced performance metrics that generally exceeded those of the semi-quantitative methods on the same data. All the machine learning algorithms gave accuracies as high as or higher than any of the semi-quantitative methods. Accuracy, sensitivity and specificity were generally high and well balanced for each machine learning tool, with smaller standard deviation values, providing evidence that these approaches are more accurate and less variable than semi-quantification. Machine learning performance metrics for the PPMI data matched the best performing algorithms produced by other authors (see Table 8), with results that are comparable with current state-of-the-art. As with the semi-quantitative results, performance for the PPMI database was substantially higher than for the local data, reinforcing the assertion that classification of the PPMI dataset is a simpler task than that seen in clinical reality.
For both databases, algorithms using different numbers of principal components as features gave the highest accuracies (methods ML 1 to ML 10), though the addition of larger numbers of principal components and the use of a non-linear RBF kernel appeared to have little additional impact on results. Although this study considered three principal components as a minimum, preliminary work using just one or two principal components demonstrated relatively high performance figures: mean accuracies (and standard deviations) of 0.87 (0.03) and 0.96 (0.02) for linear SVM algorithms trained on PPMI data, using one and two PCs respectively, and mean accuracies of 0.86 (0.06) and 0.89 (0.06) for linear SVM algorithms trained on local data, using one and two PCs respectively. Taken together, these results imply that linear separation between groups can be achieved with very limited numbers of variables.
Fig. 4 Accuracy results for all semi-quantification and machine learning methods applied to PPMI data. Semi-quantification results are grouped to the left of the graph and machine learning algorithms to the right. Whiskers represent one standard deviation
Algorithms using only (I123)FP-CIT SPECT data are considered, multimodal inputs are excluded. Literature lacking accuracy data are grouped at the bottom of the table
Features based on raw voxel values and SBRs gave slightly lower performance values in general, more so for the PPMI data. Using voxel intensities as a direct input to a classifier dictates that the problem is ill-posed (due to the very large number of voxel values in comparison to the number of training images). Even with regularisation, performance may still be affected by over-fitting, which may explain the slightly reduced accuracy. Classifiers based on SBRs are likely to suffer from limitations that are similar to those of semi-quantitative methods, in particular, that information on uptake patterns or striatal shape is lost.
Although machine learning algorithms appeared to perform better than the semi-quantification tools, the clinical context needs to be understood in order to appreciate the significance and value of the results. Firstly, the level of classification performance improvement offered by the machine learning tools is relatively small in this study. It is difficult to determine whether differences were statistically significant due to the re-use of data in each test run. However, examination of the standard deviation on performance results (see Figs. 3 and 4) suggests that there is some crossover in accuracy of the machine learning and semi-quantitative methods. Given that standalone semi-quantification accuracy is approximately 87% for clinical data (and 95% for research data), the margin available for performance gains is real but narrow. Even with the introduction of more advanced tools, there cannot be a substantial gain in accuracy over the algorithms presented here.
Considering that (I123)FP-CIT is a low volume test, used on relatively few clinical patients, the investment required to develop a new clinical reporting tool and pass necessary regulatory hurdles (such as CE marking) may not be commercially justified. In addition, standalone classification performance is a relatively narrow and limited measure of clinical utility. As well as being untested with radiologists in a realistic reporting scenario, the machine learning classifiers presented here, in common with most of the literature, only provide a decision score as to whether an image is likely to be abnormal or not. Localisation information, providing an indication of the location of any potential abnormalities, is not usually given. This contrasts with semi-quantification approaches which usually provide data on the quadrant(s) of the striata that is (are) affected, which may also be useful for determining the disease subtype. Furthermore, semi-quantification lends itself to use in research as a simple means of grading the severity of disease in response to an intervention. Although machine learning could achieve similar goals (see for example [21]), this aspect of (I123)FP-CIT imaging is usually considered as a separate problem.
However, machine learning can offer other benefits. Firstly, these algorithms simplify the information that is shown to the clinician. Rather than having to examine and interpret multiple SBR results and other ratio data, along with their normal ranges, clinicians are presented with a single number representing the overall likelihood of abnormality. Semi-quantification figures are known to be substantially influenced by factors such as the acquisition hardware and reconstruction parameters used [22][23][24][25][26][27], dictating that normal databases are often acquired separately by individual hospitals. It is possible that machine learning algorithms may be more robust to differences between hospital equipment and protocols, particularly if derived features such as striatal shape are used as input. More work is needed to verify the extent to which such benefits are realisable, which may augment the advantages offered by small increases in classification performance.
In addition, machine learning algorithms can learn disease patterns from multiple heterogeneous inputs. It is possible that by including patient clinical symptoms or results from other tests, diagnostic accuracy and robustness could be further improved. Furthermore, by learning classification models from subtle image features, it may be possible to distinguish between different Parkinsonian syndrome subtypes, such as multiple system atrophy (MSA) and progressive supranuclear palsy (PSP) from (I123)FP-CIT data. Despite the promising research that has been conducted using multimodality inputs [28] and in distinguishing Parkinsonian subtypes [29], rigorous tests on a range of realistic clinical data are lacking.
Although the gain in raw classification performance offered by machine learning may not be sufficient to justify moving away completely from semi-quantification, the results presented here do justify further exploration of machine learning tools. In addition to addressing gaps in our knowledge that have already been mentioned, an interesting avenue of future research would be to combine machine learning and semi-quantification software in such a way as to enhance the information provided to the clinician. In the local context of Sheffield Teaching Hospitals NHS Foundation Trust, the authors will continue to advance machine learning towards the clinic by evaluating the impact of machine learning output on radiologists' decision-making.

Conclusions
This study has compared a range of semi-quantification approaches with three selected machine learning methods in order to establish whether classical machine learning techniques are a superior means of classifying (I123)FP-CIT data into normal and abnormal groups. A research database and a local clinical database were used for repeated 10-fold cross-validation.
Results showed that classification performance was lower for the local database than the research database for both semi-quantitative and machine learning algorithms. However, for both databases, the majority of the machine learning methods generated higher mean accuracies (with lower variance) than any of the semi-quantification approaches. Mean accuracies for semi-quantification varied from 0.78 to 0.87 for the local database and from 0.89 to 0.95 for the PPMI database. The machine learning algorithms gave mean accuracies from 0.88 to 0.92 and from 0.95 to 0.97 for local and PPMI data respectively. In addition, sensitivity and specificity were generally well balanced for the machine learning tools, while they varied more significantly for semi-quantification. This study was performed with machine learning baseline algorithms that can readily be modified for improved performance.
The gain in accuracy from using machine learning algorithms as compared to semi-quantification was relatively small and may not be sufficient to justify a move to exploiting machine learning in the clinical context. A case for clinical translation would have to recognise that machine learning might offer other benefits, such as greater robustness to differences in acquisition conditions.