Main findings
A deep learning segmentation model achieved fully automatic, accurate CAG segmentation, both when checked against a reference segmentation obtained with validated software and when assessed with a broad assessment score we previously developed [5].
Diameters at both healthy segments (proximal and distal lesion borders) and diseased segments (diameter at the maximum obstruction zone) were similar between the two groups, with a statistically significant difference only at the proximal obstruction border. In absolute terms, however, that difference was very small (0,19 mm, a < 10% difference relative to the proximal diameter in either group), and we therefore believe it is unlikely to be of clinical significance. Stenosis severity as assessed by percentage stenosis differed by < 5% in absolute terms, a difference that was meaningful neither statistically nor clinically. The latter is perhaps the single most important finding, as percentage diameter stenosis is the fundamental criterion assessed in clinical practice when deciding whether to proceed with either revascularization or functional testing, as recommended in current guidelines [10]. Importantly, there were no significant differences in performance regarding target vessel, stenosis severity or centers.
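As an aside on how this percentage is derived: percentage diameter stenosis is conventionally computed from the minimal lumen diameter and a reference diameter taken from the healthy lesion borders. The sketch below uses purely illustrative values; averaging the proximal and distal border diameters is an assumption for illustration, not necessarily the interpolation method used by the QCA software.

```python
def percent_diameter_stenosis(mld_mm, prox_ref_mm, dist_ref_mm):
    """Percent diameter stenosis from calliper-style measurements.

    The reference vessel diameter (RVD) is approximated here as the
    mean of the healthy proximal and distal border diameters.
    """
    rvd = (prox_ref_mm + dist_ref_mm) / 2
    return (1 - mld_mm / rvd) * 100

# Illustrative values (not from the study): a 0,19 mm shift in the
# proximal border diameter changes the computed stenosis percentage
# by only about 1-2 percentage points.
baseline = percent_diameter_stenosis(1.2, 3.0, 2.6)
shifted = percent_diameter_stenosis(1.2, 3.19, 2.6)
```

This illustrates why a sub-0,2 mm border difference translates into a percentage-stenosis difference well under the clinically meaningful range.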
When considering the overlap between the segmented image and the original image, accuracy, specificity and negative predictive value scored close to 100%. This was expected, because most of the image is composed of background rather than artery. As a result, we believe metrics that do not take true negatives into account provide a more faithful indication of actual model performance. In that regard, sensitivity and positive predictive value still scored quite high, at approximately 95%. The metric that most directly assesses true overlap between the original and segmented images in the region of interest (correctly identifying all of the vessel while avoiding non-artery pixels) is the intersection over union criterion, which fell just short of 90%. Lastly, the Dice Score places greater emphasis on the fundamental task of segmentation, i.e. correctly identifying the target structure (true positives), in this case the coronary tree. With an average score of approximately 95%, and considering all the remaining metrics, we believe our model can be described as accurate. Importantly, the Dice Score in our previous study was 93%, very similar to what we now found [5]. There were statistically significant differences in the IoU and Dice Scores across target vessels and stenosis severities. Notwithstanding, the absolute differences were very minor (around 1–2%) and therefore of little or no clinical relevance.
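The behaviour of these metrics can be made concrete with a toy example (not the study's evaluation code): on a background-dominated frame, even a clearly imperfect segmentation achieves near-perfect accuracy, while Dice and IoU remain sensitive to the error.

```python
import numpy as np

def overlap_metrics(pred, truth):
    """Pixel-wise overlap metrics for binary segmentation masks."""
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "ppv": tp / (tp + fp),
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }

# Toy 100x100 frame: a thin "vessel" occupies 2% of the pixels and the
# rest is background, mimicking the class imbalance of a CAG frame.
truth = np.zeros((100, 100), dtype=bool)
truth[:, 49:51] = True            # 200 vessel pixels
pred = np.zeros_like(truth)
pred[:, 50:52] = True             # prediction shifted by one column

m = overlap_metrics(pred, truth)
```

Here the prediction misses half the vessel, yet accuracy stays at 98% because true negatives dominate; Dice (50%) and IoU (33%) expose the error, which is why we consider them more faithful indicators.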
With regards to the GSS, our model achieved a high score, with a median of 92/100 points, very similar to what we had previously described in the dataset used to train and develop the model. The model scored very high in almost all tasks, with minor imperfections limited to mild gaps in collateral branches, which were very frequent. Catheter segmentation was not as good as coronary segmentation, as small catheter artifacts or gaps in the vicinity of the coronary tree origin were common. This was due not only to contrast backflow, but also to how AI models are trained and function. Indeed, performance depends heavily on class frequency. Because the catheter is a less frequent class (i.e. it corresponds to far fewer pixels), the model receives a smaller penalty for errors in its segmentation than for errors in the coronary tree. This is partly mitigated by the use of an appropriate loss function, but the imbalance nevertheless persists to some extent. Once again, this was very similar to what we saw in the training dataset [5]. With regards to precise catheter measurements, the differences between original and segmented images (for both 5 and 6 Fr catheters) were not statistically significant, suggesting the catheter's segmentation is accurate from a calliper precision point of view. However, due to the above-mentioned limitations and to a small number of images where only a small portion (or none at all) of the catheter was discernible, our sample was somewhat reduced, thereby limiting this assessment.
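The class-frequency effect can be illustrated with a back-of-the-envelope calculation, using purely hypothetical pixel counts rather than values measured in our data:

```python
# Hypothetical per-class pixel counts for one frame: background
# dominates and the catheter is the rarest class.
pixels = {"background": 9500, "vessel": 450, "catheter": 50}

# Suppose the model is equally "wrong" on 10% of each class's pixels
# and each wrong pixel contributes one unit of unweighted loss.
loss_share = {cls: 0.10 * n for cls, n in pixels.items()}
total = sum(loss_share.values())
shares = {cls: contrib / total for cls, contrib in loss_share.items()}
```

With these counts, catheter errors contribute only 0.5% of the total loss even though the catheter is just as badly segmented as the other classes, which is why an unweighted objective under-trains rare structures and why class-aware loss functions only partially compensate.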
Other studies in the field
There are very few studies published in medical/biology journals to date with which our results can be compared. With regards to the GSS in particular, to our knowledge no similar application has ever been undertaken.
The largest published study [
3] included a dataset of 1050 images distributed across all incidences and vessels for performance evaluation. An average accuracy of 98% was obtained. While specificity and negative predictive value scored very highly, sensitivity and positive predictive value were closer to 80%. Performance was slightly inferior in more distal vessels. Intersection over union and Dice Score were not reported. Importantly, however, that study's evaluation used the baseline human annotation as reference, rather than externally validated software, thereby not enabling the identification of bias or imperfections that might have become embedded in their AI model. In our previous study, we demonstrated that even with a small group of annotators and continuous review of the annotations, there is always some degree of imperfection in human annotation [5], hence the relevance of comparing against automated and validated external software. Additionally, the reported accuracy focuses on the overlap across the entire coronary tree rather than on the percentage stenosis of diseased segments. This is advantageous in the sense that globally accurate performance can be tested. Notwithstanding, we believe testing only diseased segments actually renders the comparison more demanding: the segmentation of stenotic segments is technically harder, and the number of true positive pixels is necessarily smaller in such segments, so overlap metrics are penalized more heavily by each error. Whichever interpretation is made, an exact comparison with Du et al. [3] is not possible. However, broadly speaking, the accuracy of both models seems quite high and our model seems at least as accurate, if not more so.
Su Yang et al. [
4] also produced AI models for CAG segmentation. Their validation dataset was somewhat larger than ours (181 images), but their performance seems slightly lower, with all overlap metrics generally scoring just short of 90% and a Dice Score of 89%. Importantly, they also segmented only diseased segments, with a minimum lesion of 30%, and used the same reference software as we did. Thus, their results are more directly comparable to ours, and our model seems to have superior performance. Two other works [1, 2], derived from the same baseline dataset, also developed AI-based CAG segmentation, this time with a validation dataset of 550 images. While the model performed well, with an accuracy of 98% and a sensitivity of 87%, they also based their validation dataset on human annotation of the coronary tree without using external software. Thus, the above-mentioned considerations for Du et al. [3] also apply.
Recently, Gao et al. [
11] published the results of a CAG segmentation model trained on only 130 images. Their methodology, however, is somewhat different, since they combined features from deep learning segmentation models with non-AI image filters to perform pixel-wise classification using gradient-boosting decision trees [12] and deep forests [13]. Their results also show good performance, with a Dice Score of 87,4%, a sensitivity of 90,2% and a specificity of 99,2%. This highlights that merging deep learning with traditional computer vision methods can yield good results when working with relatively small datasets. However, no external validation software was used and the whole coronary tree was evaluated. As a result, once more, the previous considerations for Du et al. [3] apply.
Other works on the application of AI to coronary segmentation are primarily technical and featured in engineering publications. A detailed review of these falls outside the scope of this paper and can be found in our previous technical publication [7]. However, some considerations regarding them provide further context for our findings.
Xian et al. [
14] used a very large dataset of 3200 manually annotated images and also experimented with the U-Net architecture, reporting a sensitivity of 90,1%, a positive predictive value of 89,8% and a Dice Score of 90%. However, the annotations were undertaken with software designed to coarsely signal the vessel route, and focused only on the main vessels. Since we achieved higher performance metrics, a smaller but higher-quality dataset, built with very precise (if cumbersome) manual annotations, may be the better approach.
Yang et al. [15] obtained a sensitivity, positive predictive value and Dice Score of 91,3%, 92,5% and 91,9%, respectively, by using popular image classification backbones pre-trained on ImageNet in place of the U-Net's encoder, while also using a modified generalized Dice loss function. Their findings influenced our training method, as we used a combination of their proposed loss function and the focal loss [16]. Other authors have explored the use of dense connections, improving on the performance of the standard U-Net [17]. This approach is also present in U-Net++ [18], which we used in our approach.
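A minimal numpy sketch of such a combined objective is given below. The formulations follow the generalized Dice loss and the binary focal loss in their commonly published forms; the mixing weight `alpha` is a hypothetical placeholder, not the value used in our training, and our actual implementation operates on deep learning framework tensors.

```python
import numpy as np

def generalized_dice_loss(probs, truth, eps=1e-6):
    """Two-class generalized Dice loss. Class weights are the inverse
    squared class volumes, so rare classes count proportionally more."""
    num, den = 0.0, 0.0
    for g, p in [(truth, probs), (1 - truth, 1 - probs)]:
        w = 1.0 / (g.sum() ** 2 + eps)
        num += w * (p * g).sum()
        den += w * (p + g).sum()
    return 1.0 - 2.0 * num / (den + eps)

def focal_loss(probs, truth, gamma=2.0, eps=1e-6):
    """Binary focal loss: down-weights easy, well-classified pixels."""
    pt = np.where(truth == 1, probs, 1 - probs)
    return np.mean(-((1 - pt) ** gamma) * np.log(pt + eps))

def combined_loss(probs, truth, alpha=0.5):
    """Convex combination of the two terms; alpha is illustrative."""
    return alpha * generalized_dice_loss(probs, truth) + \
        (1 - alpha) * focal_loss(probs, truth)
```

The Dice term directly optimizes the overlap metric reported above, while the focal term concentrates the gradient on hard pixels, which complements the class-rebalancing of the Dice weights.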
In none of the above studies were vessel diameter metrics reported; thus, a direct comparison with this study in that regard is not possible. M'hiri et al. addressed CAG diameter measurements in the context of diameter variation across the cardiac cycle due to vessel distensibility. They focused mainly on measuring specific segments of the coronary tree, as we did. However, they used a graph-based segmentation method and then tracked changes across the cardiac cycle with a spatio-temporal segmentation method. They obtained a Dice Score of 98%, with a very small mean diameter error (0,18 mm) [19]. However, they did not focus on diseased regions. While their study does not focus on AI methods, it highlights that other methods may be of use for accurate CAG segmentation, potentially in combination with AI tools [11].
In light of all these studies, the performance of our model seems at least as good as, if not better than, previously proposed AI models. We believe this is related to its neural network architecture, which was carefully chosen over a series of experiments [7], taking into consideration the invaluable contributions of the previously mentioned studies. We also believe our manual annotation methodology was essential, as it allowed us to obtain a highly reliable training dataset: a small number of annotators (to reduce heterogeneity), well trained in the interpretation of coronary angiograms; very careful review of annotations, with recurrent iterations of quality checks and improvements; and further manual refinement of the already accurate segmentation images produced by an earlier AI model, thus combining the best of AI and human annotation into the final training dataset, as described in the methods section and our previous publication [5].
Limitations
Our study is not without limitations. Despite the multicentric approach, our dataset is relatively small compared with previously published studies. We also tested model performance against validated software only at diseased locations, rather than on the whole coronary tree; therefore, we cannot affirm that performance would be identical elsewhere. However, as previously explained, segmenting zones with lesions is actually more challenging for the model than segmenting broad, mostly healthy segments. In addition, we found no differences regarding target vessel or lesion severity, and, considering the results of the GSS, the overall CAG segmentation performance was quite appropriate. Thus, we believe it is unlikely that performance would have differed significantly had we tested the whole coronary tree. Importantly, had we chosen to segment whole vessels, some manual corrections would very likely have been necessary, which might have introduced bias or imperfections into the reference images. Hence, the decision to proceed as described was deliberate. The assessment of catheter segmentation was also more limited than that of the coronary tree, as described above.
The exclusion of potential sources of artifacts from devices or previous cardiac surgery means our model is not yet applicable to such patients. Notwithstanding, we did not exclude cases with previously implanted stents, although we did not perform detailed measurements on such segments.
The total number of patients/images who fully met exclusion criteria was somewhat high, thereby limiting the final number of images available for analysis, which may raise questions as to whether this sample is representative of everyday CAGs and therefore constitutes an adequate validation dataset. This was the result of somewhat stringent criteria, which we nonetheless felt were necessary for basic feasibility (such as excluding single-vessel complete occlusion cases, where QCA is not applicable, or excluding imaging artifacts for which the models are not yet trained), for reduction of bias (such as not allowing manual QCA correction), or for excluding patients with normal/near-normal arteries (where testing would be much less challenging or useful for future clinical application). Notwithstanding, we included patients consecutively rather than selectively, and the clinical characteristics of included patients are in agreement with everyday clinical practice. We therefore believe our sample to be reasonably representative of real-world practice. Furthermore, we exceeded the minimum validation target of 100 images, yielding relative proportions of training vs. validation cases in agreement with other AI studies [2‐4].
The imbalance in sample size limits the comparison between centers.
It has long been established that operators differ significantly in their interpretation of lesion severity and tend to overestimate the importance of a stenosis [20‐24], as we also observed in this study. Indeed, while all lesions were visually interpreted as > 50% stenoses, a significant proportion of the sample actually had < 50% stenosis, which further reflects the real-world nature of the dataset.
Lastly, the distance between 2D centerlines and the distance to the closest vessel edge would also have been good metrics for assessing model performance in this setting; we did not perform such testing.
In light of all of the above, concerns may be raised regarding generalization from this dataset. However, we believe that the absence of statistically significant differences across all subgroups at least partially attenuates this concern.
Future directions
We are currently working on automatic anatomical interpretation, lesion severity assessment based on automatic QCA, and integration with physiology. We believe that, without effective segmentation models, none of these will be possible: much as for human interpretation of CAG, separating the coronary arteries from everything else in the image is an essential first step. Our ultimate goal is to produce an intelligence augmentation tool that helps physicians perform a more objective and streamlined interpretation of CAG, hopefully contributing to better patient outcomes. As we continuously improve its performance while adding new capabilities, clinical application may become possible in the near future, opening the way to a new and potentially more accurate method of assessing coronary artery disease.
We are also continuously working to expand and improve the model, as segmentation alone is not a final goal in itself, but rather a fundamental step. We hope to release a public version in the near future, which other researchers may use for whichever applications they deem useful. Importantly, comparing, or even merging, with future models from other groups may also be very relevant. Since it relies on an inherently data-hungry deep learning model, our coronary artery segmentation system would surely benefit from training on a larger volume of data. Manual annotation of coronary angiography images, however, is very cumbersome and time-consuming, making much larger labeled datasets difficult to obtain. Hence, significant improvements to the model could probably be achieved, for example, by using self-supervised learning on existing very large volumes of unlabeled data. These possibilities are described in detail in our previous technical publication [7].
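As a sketch of what such self-supervision could look like, the snippet below builds a masked-reconstruction pretext pair from a single unlabeled frame: the network would be asked to restore the blanked patches, requiring no manual labels. The patch size and mask ratio are assumed illustrative parameters, not our planned configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_pretext_pair(frame, patch=8, mask_ratio=0.4):
    """Build one self-supervised training pair from an unlabeled frame:
    the target is the original image and the input has random patches
    blanked out; the pretext task is to reconstruct them."""
    h, w = frame.shape
    masked = frame.copy()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() < mask_ratio:
                masked[y:y + patch, x:x + patch] = 0.0
    return masked, frame

# Any unlabeled angiographic frame can be turned into supervision;
# here a random array stands in for a real frame.
frame = rng.random((64, 64)).astype(np.float32)
inp, target = masked_pretext_pair(frame)
```

In this scheme, every archived unlabeled angiogram contributes training signal for the encoder, which could then be fine-tuned on the much smaller manually annotated segmentation dataset.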