Optimizing a Retinal Disease Image Recognition Model

Introduction

This is the second part of my series on creating a better model for the MURED dataset. The first part, which provides context and a basic understanding of the research goals, is available at https://josedavidlomelin.blogspot.com/2022/11/introduction-to-my-research-project.html. My goal with the research was to create a more effective model for the MURED dataset. In addition, the MURED dataset was made from very high quality images, which do not always reflect real world conditions. To account for this, I also created an image noise generator reflective of real world conditions. Lastly, making a better CNN model, even if better on only a few metrics, would imply that Transformer models are not always the best option for certain image recognition tasks. There is an ongoing debate about which type of model is better for certain image recognition tasks[1], and this would show that more complex models are not necessary to reach similar accuracy.



New Model Changes

In the first part, I described the initial model architecture that was going to be used. However, due to Colab memory limitations and other problems, including low accuracy and a model that was hard to train, I decided to go with a much simpler model. The model has the following architecture.

New Model Architecture 


There are several differences between the original model and the current one. First, the new model has data augmentations and an image noise step. I will go into more detail later in the blog, but for now keep in mind that the images are much better prepared than in the initial model. The second large change is that the model now combines only an Xception model's output with a ResNet model's output. Both models are quite small, yet still very accurate.

Backbone Data[2]:

Model       No. of Parameters   Top-1 Accuracy   Top-5 Accuracy
Inception   23.62 million       0.782            0.941
VGG         138 million         0.715            0.901
Xception    22.85 million       0.79             0.945
ResNet      23 million          0.77             0.933

Before, there were four models being combined in different ways, but after many iterations I found that the simpler the model, the better it was at learning. Lastly, I increased the number of fully connected layers from two to four, adding many more units. Although there are a few smaller changes I did not go over, and some are not shown on the graphic, these were by far the changes that made the biggest difference in model performance.
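To make the architecture concrete, here is a minimal Keras sketch of the described setup: two pretrained backbones whose pooled outputs are concatenated and passed through four fully connected layers. The input size, unit counts, ResNet variant (ResNet50), and the 19-label multi-label head are assumptions on my part, since the post does not list exact values.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import Xception, ResNet50

# Assumptions: 299x299 inputs, ImageNet weights, and a multi-label
# sigmoid head; the exact sizes used in the final model are not specified.
IMG_SIZE = (299, 299, 3)
NUM_CLASSES = 19  # assumption: MURED label count after dropping "OTHER"

inputs = layers.Input(shape=IMG_SIZE)

# Two lightweight backbones; global average pooling gives one vector each.
xception = Xception(include_top=False, weights="imagenet", pooling="avg")
resnet = ResNet50(include_top=False, weights="imagenet", pooling="avg")

# Combine the two backbones' outputs into a single feature vector.
features = layers.Concatenate()([xception(inputs), resnet(inputs)])

# Four fully connected layers (unit counts are illustrative).
x = features
for units in (2048, 1024, 512, 256):
    x = layers.Dense(units, activation="relu")(x)

# Sigmoid outputs: each disease label is predicted independently.
outputs = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(multi_label=True)])
```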



Data Augmentation, Noise and Images

Data augmentation, or the practice of creating "new" data from existing data, was a crucial part of increasing the model's metrics. Although the Transformer based model utilizes data augmentation, many of the augmentations used do not reflect real world conditions. For example, a vertical flip and a rotation of up to 30 degrees were used on the dataset as augmentations. There is no reason for a retinal scan to be taken at a 30 degree angle, much less upside down. Additionally, most of their augmentations were applied at a rate of 0.3 or 0.5, while my model's augmentations were applied across the entire dataset. The augmentations used include width, height and zoom shifts, to account for slightly differing eye sizes and inconsistent pictures; a 50% horizontal flip, to train the model on both right and left eyes; and a brightness range, samplewise centering and samplewise standard normalization, to account for inconsistent lighting between scans. In addition, the "OTHER" class was removed from the training set, minimizing the number of diseases with a disproportionately small sample size.

Eye Scans Before Image Augmentations
Eye Scans After Image Augmentations (samplewise_std_normalization set to False)
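For reference, here is a sketch of how the augmentations described above could be configured with Keras' ImageDataGenerator. The parameter names match the augmentations listed, but the exact ranges are assumptions, since the post only names the augmentation types.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Sketch of the augmentation pipeline; the specific ranges below are
# assumptions, since only the augmentation types are given in the post.
datagen = ImageDataGenerator(
    width_shift_range=0.05,             # inconsistent framing between scans
    height_shift_range=0.05,
    zoom_range=0.1,                     # slightly differing eye sizes
    horizontal_flip=True,               # flips 50% of images: left/right eyes
    brightness_range=(0.8, 1.2),        # inconsistent lighting
    samplewise_center=True,             # per-image mean subtraction
    samplewise_std_normalization=True,  # per-image contrast normalization
)

# Usage (directory path assumed): flow augmented batches during training.
# train_gen = datagen.flow_from_directory("mured/train", target_size=(299, 299))
```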



The MURED dataset is built from many high quality images, and a full breakdown of how the dataset was assembled can be found at https://arxiv.org/pdf/2207.02335.pdf. Although clean data is nice to train a model on, it does not teach the model how to recognize retinal diseases when given a non-perfect image. To account for this, 20% of the images were passed through an artificial noise generator to simulate real world eye scanning conditions. An example of scans with the added noise is shown below, with image augmentations applied. (The samplewise centering augmentation greatly reduces the visibility of the added noise, but the noise is still present.)

Eye Scans After Image Augmentations with Noise Function
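The noise function itself is not listed in the post, so below is a hypothetical sketch of one way such a generator could work: Gaussian sensor noise plus a mild blur, applied to a random 20% of the training images. The noise types and magnitudes are assumptions, not the exact code used.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def add_scan_noise(image: np.ndarray, rng=None) -> np.ndarray:
    """Hypothetical noise generator: adds Gaussian sensor noise and a
    mild blur to simulate imperfect real-world scan conditions. The
    noise types and magnitudes here are assumptions."""
    rng = rng or np.random.default_rng()
    # Additive sensor noise on a 0-255 pixel scale.
    noisy = image.astype(np.float32) + rng.normal(0.0, 10.0, image.shape)
    # Mild blur over height and width only, leaving channels untouched.
    noisy = gaussian_filter(noisy, sigma=(1.0, 1.0, 0))
    return np.clip(noisy, 0, 255).astype(image.dtype)

# Apply to a random 20% of the training set, as described above:
# mask = np.random.default_rng(0).random(len(images)) < 0.2
# images[mask] = [add_scan_noise(img) for img in images[mask]]
```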



Model Metrics 

The main goal of the research was to create a more effective model on the MURED dataset while also making the model account better for real life situations. Before comparing the two models, keep in mind that these metrics were taken with the addition of random noise and augmentations that closely reflect the real world, which the Transformer model did not have. The main issue with the Transformer model, outlined in the first research blog, was low precision, shown in both the mean average precision metric and in the model's F1 score. Their model's mAP score was 0.685, while their F1 score was 0.573. Although not specified, I assumed these scores were based on the dataset's validation data. In comparison, as of February 14, 2023, our best model's mAP score is 0.733, improving the precision. However, our model's F1 score is 0.4614, which is worse than the Transformer model's metric. Additionally, our AUC of 0.9233 was close to their 0.962. Lastly, the ML Score, introduced in the previous MURED research paper, is defined as ML Score = (ML mAP + ML AUC)/2. While the proposed Transformer based model scored a maximum ML Score of 0.824, the CNN based model scored a maximum ML Score of 0.834, an improvement over the Transformer based model.
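As a quick sanity check of the formula, the Transformer baseline's reported mAP and AUC reproduce its published ML Score:

```python
def ml_score(map_score: float, auc: float) -> float:
    """ML Score as defined in the MURED paper: the mean of mAP and AUC."""
    return (map_score + auc) / 2

print(ml_score(0.685, 0.962))  # Transformer baseline -> 0.8235, i.e. ~0.824
```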


Conclusion

The results are significant because our model improves on the Transformer based model, thanks to the improved image augmentations and an overall improvement in the model metrics. Although not perfect, the model is an important step in using machine learning to help doctors diagnose retinal diseases, because it shows that RNNs and Transformers are not necessary to do so. Those models typically take more space and more data to be effective, and their main advantage is that they can "understand context" in an image. However, that matters less here, since most eyes have a very similar structure to one another. Additionally, the model was trained to account for variations in retinal scans through data augmentation. In conclusion, all of the goals set initially were achieved: the CNN based model was better than the Transformer model on most of the metrics used, a data augmentation pipeline and noise generator that reflect the real world were created, and the results demonstrate that Transformer and RNN based image recognition models are not always better than CNN based ones.

This research will be published sometime over the next month or two, and more updates will come as soon as it is published.



Citations

[1] https://becominghuman.ai/transformers-in-vision-e2e87b739feb

[2] https://analyticsindiamag.com/a-comparison-of-4-popular-transfer-learning-models/

