Multi-task Fundus Image Quality Assessment via Transfer Learning and Landmarks Detection

  • Yaxin Shen
  • Ruogu Fang
  • Bin Sheng
  • Ling Dai
  • Huating Li
  • Jing Qin
  • Qiang Wu
  • Weiping Jia
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11046)


The quality of fundus images is critical for diabetic retinopathy diagnosis. Fundus image quality depends on several factors, including image artifacts, clarity, and field definition. In this paper, we propose a multi-task deep learning framework for automated assessment of fundus image quality. The network classifies whether an image is gradable and provides interpretable information about the individual quality factors. The proposed method uses images in both rectangular and polar coordinates, and fine-tunes the network from a model trained for diabetic retinopathy grading. Detection of the optic disk and fovea assists the field definition task through coarse-to-fine feature encoding. The experimental results demonstrate that our framework outperforms single-task convolutional neural networks and can reject ungradable images in automated diabetic retinopathy diagnostic systems.


Fundus image quality assessment · Multi-task learning · Optic disk detection · Fovea detection

1 Introduction

Fundus images are widely used in the diagnosis of Diabetic Retinopathy (DR), age-related macular degeneration, and other retinal diseases [3]. Diagnosis requires high-quality images, because blurred images may cause misdiagnosis. Automatic DR diagnosis has recently attracted significant research and clinical interest, and image quality plays a key role in its accuracy. Even though digital fundus imaging technology has improved, non-biological factors resulting from improper operation can still reduce image quality. Therefore, image quality assessment (IQA) is essential for identifying ungradable images.

Fundus IQA is a no-reference problem: quality must be judged without an additional reference image. Lalonde et al. [6] used a global edge histogram. Kohler et al. [5] used vessel segmentation. However, these methods consider only specific quality factors through hand-crafted features and neglect the holistic picture of image quality. Deep convolutional neural networks (CNNs) have shown promising results in computer vision tasks. Tennakoon et al. [9] proposed shallow CNNs to classify fundus image quality. Yu et al. [10] combined CNN features with saliency maps. These approaches do not analyze the individual quality factors. Hence, automatic retinal image quality assessment remains an open and important research direction.

In this paper, we propose a deep learning framework that extracts fundus image features in both rectangular and polar coordinates to classify whether input images are “gradable”, as shown in Fig. 1. We analyze image quality in terms of artifact, clarity, and field definition. To address the lack of data, we initialize the weights from a DR grading network. The field definition task is not reliable when the optic disk (OD) is invisible. The framework therefore selects clear, artifact-free images for fovea and OD localization, which yields an accurate field definition analysis. Our experimental results indicate that the auxiliary tasks contribute to the image quality task while avoiding overfitting, and that learning with both coordinate systems outperforms learning with either one alone. These observations suggest that our framework provides reliable image quality assessment together with quality factor analysis.
Fig. 1.

The framework of the automatic fundus image quality assessment system.

2 Dataset

The dataset was collected from the Shanghai Diabetic Retinopathy Screening Program (SDRSP). We used 11,653 retinal images for quality assessment and 275,146 for DR grading. Image quality is graded in terms of artifact, clarity, and field definition according to the standards shown in Table 1. Figure 2 shows sample images with quality issues. Graders also evaluate whether each fundus image is adequate for grading. The OD and fovea localization tasks use the Indian Diabetic Retinopathy Image Dataset (IDRiD), which includes 516 images without quality defects.
Fig. 2.

From left to right, the first three images have problems with artifact, clarity, and field definition, respectively. Image (d) is a gradable image.

Table 1.

Image quality scoring criteria


| Quality factor | Image quality specification |
| --- | --- |
| Artifact | Does not contain artifacts |
| Artifact | Outside the aortic arch with range less than 1/4 of the image |
| Artifact | Does not affect the macular area with scope less than 1/4 |
| Artifact | Covers more than 1/4, less than 1/2 of the image |
| Artifact | Covers more than 1/2 without fully covering the posterior pole |
| Artifact | Covers the entire posterior pole |
| Clarity | Only Level 1 vascular arch can be identified |
| Clarity | Can identify Level 2 vascular arch and a small number of lesions |
| Clarity | Can identify Level 3 vascular arch and some lesions |
| Clarity | Can identify Level 3 vascular arch and most lesions |
| Clarity | Can identify Level 3 vascular arch and all lesions |
| Field definition | Does not include the optic disc and macula |
| Field definition | Contains only the optic disc or the macula |
| Field definition | Contains both optic disc and macula |
| Field definition | The optic disc and macula are within 2PD of the center |
| Field definition | The optic disc and macula are within 1PD of the center |


3 Method

3.1 Multi-task Convolutional Neural Networks

Multi-task Learning. We propose an architecture with two input branches, one in rectangular and one in polar coordinates, and four output task branches, as shown in Fig. 3. CNNs exploit spatially local correlation; after polar transformation, they can better discard the black background of the original fundus image while extracting features of ringing artifacts. We preprocess images by cropping the circular region and rescaling it to \(224\times 224\) pixels. The proposed architecture uses hard parameter sharing [7] to optimize four tasks simultaneously: image artifact, clarity, field definition, and overall quality. For the first three tasks, the network makes multiple predictions. The clarity task is a multi-label binary classification that predicts whether the clarity score is greater than 1, 4, 6, and 8, respectively; the artifact and field definition tasks are defined analogously from the scores in Table 1. The overall image quality task makes a binary prediction of whether the image is gradable, where label 0 indicates gradable and label 1 ungradable.
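The polar resampling described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `to_polar`, the nearest-neighbour sampling, and the use of the image center as the origin are all assumptions made for the example.

```python
import numpy as np

def to_polar(image, n_radii=224, n_angles=224):
    """Resample a fundus image onto a (radius, angle) grid.

    Rows of the output index the radius (0 at the image center, maximum at
    the inscribed circle); columns index the angle in [0, 2*pi).
    Nearest-neighbour sampling keeps the sketch dependency-free.
    """
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_r = min(cy, cx)  # radius of the inscribed circle
    radii = np.linspace(0.0, max_r, n_radii)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    r, a = np.meshgrid(radii, angles, indexing="ij")
    # Map each (radius, angle) sample back to the nearest source pixel.
    ys = np.clip(np.rint(cy + r * np.sin(a)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + r * np.cos(a)).astype(int), 0, w - 1)
    return image[ys, xs]
```

Because only points inside the inscribed circle are sampled, the black corners of the rectangular image do not appear in the polar image, and structures that form a ring around the center become roughly horizontal bands.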

Transfer Learning. These tasks share the same CNN encoders to extract structural features of retinal images. The CNN encoder of each coordinate branch consists of the first 11 convolutional layers of ResNet-18 [1], as shown in Fig. 3. To speed up training and to cope with the limited training data, we use a two-stage transfer learning method. (1) We train a DR grading network starting from a ResNet-18 pretrained on the ImageNet dataset. (2) We use the parameters of the trained DR grading network to initialize the CNN encoders for image quality assessment.
Fig. 3.

Multi-task deep learning architecture of our proposed fundus image quality assessment network.

Architecture. Figure 3 shows the overall network architecture of our proposed multi-task retinal image quality assessment framework. Let \(\varvec{X}=\left\{ \left( x_i,y_i \right) \right\} _{i=0}^{N}\) represent the labeled dataset of N samples. Let p(x) denote the polar transformation function. Subscripts r and p denote the rectangular and polar coordinate branches respectively, and a, c, f and q denote the artifact, clarity, field definition, and image quality tasks. Let \(E(x, \varvec{\theta }_{e})\) be a function parameterized by \(\varvec{\theta }_{e}\) that maps an image x to a hidden representation shared by the tasks. Batch normalization is used after each convolution layer. The polar coordinate branch encodes features into a 256-D embedding and the rectangular coordinate branch encodes features into a 1024-D embedding. The concatenation of these two embeddings is used for task-specific prediction. To ensure that the representation of each single branch is useful and to improve generalizability, we optimize the losses after each branch and after the concatenated representation at the training stage, whereas at the test stage only the concatenated representation makes the final prediction for the quality factors. As depicted in Fig. 3, layers marked in yellow are used only during training. The overall image quality task is optimized on the rectangular branch, because the polar image loses some local structural features of the vessels, OD, and fovea.

3.2 Optic Disc and Fovea Detection

The field definition task relies on clear images, and its prediction is not reliable when it depends only on the network described in Sect. 3.1. According to clinical requirements, we use a score of 8 as the threshold for acceptable field definition. We use the multi-task quality assessment network to select images with adequate clarity and no artifacts, and then localize the OD center and fovea center to improve field definition performance by deciding whether they are within 2PD of the center of the image. The architecture is depicted in Fig. 4.
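The 2PD decision can be illustrated with a short check. This is a sketch under stated assumptions rather than the authors' rule: PD is taken to be one optic disc diameter expressed in pixels, "within 2PD" is read as the Euclidean distance from each landmark to the image center, and the function name and the `pd_pixels` parameter are hypothetical.

```python
import math

def within_pd_of_center(od_center, fovea_center, image_size, pd_pixels, max_pd=2.0):
    """Return True if both landmarks lie within max_pd disc diameters of the image center.

    od_center, fovea_center: (x, y) pixel coordinates of the predicted centers.
    image_size: (width, height) of the image in pixels.
    pd_pixels: assumed disc diameter in pixels (hypothetical input).
    """
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    limit = max_pd * pd_pixels
    return all(
        math.hypot(x - cx, y - cy) <= limit
        for x, y in (od_center, fovea_center)
    )
```

Setting `max_pd=1.0` gives the stricter 1PD criterion listed in Table 1.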
Fig. 4.

The architecture for the detection of optic disc and fovea. The figure shows the data flow at the inference stage. The global CNN encoder, the local optic disc encoder, and the local fovea encoder are trained separately.

Global Encoder. The global CNN encoder is the component that locates the optic disc and fovea centers simultaneously. It extracts features from the entire image and predicts both centers, using ResNet-50 as the backbone network.

Local Encoder. The local encoders are components that each focus on a single object and refine the predicted center of the optic disc or the fovea, respectively. The retinal images in the IDRiD dataset are labeled only with the target center coordinates. The shapes of the OD and fovea are essentially invariant, although their sizes in the image may vary with the photographer's operation. Thus the bounding box can be located from the center coordinates together with the approximate shape and scale. Square RoIs are cropped directly from the original image, so they are not deformed when the image size is reduced. The backbone network of the local encoders is VGG-16 [8]. See Fig. 1 in the supplement for details on RoI selection.
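A square RoI crop around a predicted center might look like the sketch below. The RoI side length is specified only in the supplement, so `side` is a hypothetical parameter here, and shifting the window to keep it inside the image is an assumption of this example, not a documented step.

```python
import numpy as np

def crop_square_roi(image, center_xy, side):
    """Crop a side x side square around center_xy, shifted to stay inside the image.

    Returns the crop and its top-left offset (x0, y0), so that a center
    predicted inside the crop can be mapped back to image coordinates.
    """
    h, w = image.shape[:2]
    side = min(side, h, w)
    x0 = int(round(center_xy[0] - side / 2.0))
    y0 = int(round(center_xy[1] - side / 2.0))
    # Clamp the window so that it never leaves the image.
    x0 = max(0, min(x0, w - side))
    y0 = max(0, min(y0, h - side))
    return image[y0:y0 + side, x0:x0 + side], (x0, y0)
```

In a coarse-to-fine setup of this kind, the global prediction supplies `center_xy`, and a local prediction `(u, v)` inside the crop corresponds to `(x0 + u, y0 + v)` in the original image.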

For all encoders, we replace every max pooling layer of the original network architecture with an average pooling layer, because max pooling may discard pixel-level information that is useful for regressing the coordinates. The described algorithm won first place in the ISBI 2018 challenge for fovea and optic disc detection.

3.3 Learning

The goal of training the multi-task quality assessment network is to minimize the total loss:
$$\begin{aligned} L=\alpha \sum _{t=a,c,f}L_{tp}+\beta \sum _{t=a,c,f}L_{tr}+\gamma \sum _{t=a,c,f}L_{t}+\delta L_q \end{aligned}$$
where \(\alpha \), \(\beta \), \(\gamma \) and \(\delta \) are the weights of the loss terms, subscripts r and p denote the rectangular and polar coordinate branches respectively, and a, c, f and q denote the artifact, clarity, field definition, and image quality tasks. The classification losses \(L_{tp}\) and \(L_{tr}\) correspond to task-specific predictions from the polar and rectangular branches respectively, and \(L_t\) corresponds to predictions from the concatenated representation. The quality task loss \(L_q\) is the negative log-likelihood of the gradability label:
$$\begin{aligned} L_q=-\sum _{i=0}^{N}\left( y_i^q\cdot \log \hat{y}_i^q+(1-y_i^q)\cdot \log (1-\hat{y}_i^q)\right) \end{aligned}$$
where \(y_i^q\in \left\{ 0,1 \right\} \) is the class label and \(\hat{y}_i^q=\frac{1}{1+e^{-h_i}}\) is the sigmoid prediction computed from the network output \(h_i\). The artifact, clarity, and field definition tasks are learned by multi-label classification, and their losses use the sigmoid cross-entropy loss function as well.
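The multi-label encoding and the weighted total loss can be sketched in NumPy as below. This is an illustration, not the authors' Caffe code: the function names are hypothetical, the standard logistic sigmoid is assumed, and the small `eps` is added only for numerical safety. The thresholds come from Sect. 3.1 and the default weights from the values reported in Sect. 4.

```python
import numpy as np

CLARITY_THRESHOLDS = (1, 4, 6, 8)  # score thresholds listed in Sect. 3.1

def threshold_labels(score, thresholds=CLARITY_THRESHOLDS):
    """Encode a quality score as binary labels [score > t for each threshold t]."""
    return np.array([float(score > t) for t in thresholds])

def sigmoid_cross_entropy(logits, labels, eps=1e-12):
    """Sum of binary cross-entropies over all labels (same form as L_q)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    p = 1.0 / (1.0 + np.exp(-logits))  # standard logistic sigmoid
    return float(-np.sum(labels * np.log(p + eps)
                         + (1.0 - labels) * np.log(1.0 - p + eps)))

def total_loss(l_polar, l_rect, l_concat, l_q,
               alpha=0.2, beta=0.5, gamma=1.0, delta=4.0):
    """Weighted sum of the total loss; the first three arguments map task name -> loss."""
    return (alpha * sum(l_polar.values())
            + beta * sum(l_rect.values())
            + gamma * sum(l_concat.values())
            + delta * l_q)
```

For example, `threshold_labels(6)` yields the labels for "greater than 1" and "greater than 4" set to 1 and the remaining two set to 0, so a single clarity score supervises four binary outputs.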
For the OD and fovea localization task, the regression loss for the center location is the Euclidean loss:
$$\begin{aligned} L=\frac{1}{2N}\sum _{i=1}^{N}\left\| y_i-\hat{y}_i \right\| _{2}^{2} \end{aligned}$$
where N is 2, \(y_1\) and \(y_2\) are the ground truth coordinates, and \(\hat{y}_1\) and \(\hat{y}_2\) are the predicted coordinates. The loss functions are:
$$\begin{aligned} L_{global}&=0.0045\cdot (L_{OD}+L_{fovea}) \\ L_{local}&=0.0045\cdot L_{landmark} \end{aligned}$$
We scale the losses because the unscaled Euclidean distance is too large in practice for training to converge.
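A direct NumPy transcription of the scaled Euclidean loss is given below as a sketch. The function names are hypothetical, and each center is assumed to be passed as a pair of coordinates, so that N = 2 in the formula above.

```python
import numpy as np

LOSS_SCALE = 0.0045  # scaling factor stated in the text

def euclidean_loss(y_true, y_pred):
    """1/(2N) times the sum of squared coordinate differences, N = number of coordinates."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2) / (2 * y_true.size))

def global_loss(od_true, od_pred, fovea_true, fovea_pred):
    """Scaled sum of the OD and fovea losses used for the global encoder."""
    return LOSS_SCALE * (euclidean_loss(od_true, od_pred)
                         + euclidean_loss(fovea_true, fovea_pred))

def local_loss(landmark_true, landmark_pred):
    """Scaled loss for a single landmark, used for each local encoder."""
    return LOSS_SCALE * euclidean_loss(landmark_true, landmark_pred)
```

With coordinates in pixels, an error of a few hundred pixels gives an unscaled loss in the tens of thousands, which illustrates why a small scaling factor helps.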

4 Experiments and Results

The quality-labeled dataset contains 11,653 images, of which 3,373 are ungradable and 8,280 are gradable. The training set contains 10,000 images selected by stratified sampling. We compare our proposed multi-task framework with four baselines: (a) the image quality task using the rectangular coordinate, (b) all tasks using the rectangular coordinate, (c) all tasks using both coordinates, and (d) the image quality task using the rectangular coordinate and the quality factor tasks using both coordinates, without transfer learning. Experiment (a) uses single-task learning, and (b) to (d) use multi-task learning with different architectures. For OD and fovea localization, we randomly select 350 images from the IDRiD dataset as the training set.

All models are implemented in the Caffe framework [4] and trained using stochastic gradient descent with momentum. The multi-task quality networks are fine-tuned from the DR grading network and trained for 15 epochs. Newly added weights are initialized as in [2]. We use a mini-batch size of 128 and a weight decay of 0.0001. For the OD and fovea localization task, we train the global encoder for 200 epochs and the local encoders for 30 epochs. The batch size is 16 for the global encoder and 64 for the two local encoders. The learning rate is set to 0.01 and is divided by 10 when the error plateaus.
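The "divide by 10 when the error plateaus" rule can be written as a small scheduler. The text does not say how a plateau is detected, so the `patience` and `min_delta` parameters below are hypothetical, and the class is a framework-independent sketch rather than the Caffe solver configuration that was actually used.

```python
class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` epochs without improvement."""

    def __init__(self, lr=0.01, factor=0.1, patience=3, min_delta=1e-4):
        self.lr = lr                # initial learning rate from the text
        self.factor = factor        # 0.1 means "divided by 10"
        self.patience = patience    # assumed number of stalled epochs tolerated
        self.min_delta = min_delta  # assumed minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, error):
        """Record one validation error and return the learning rate to use next."""
        if error < self.best - self.min_delta:
            self.best = error
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation error keeps the rate at 0.01 while the error improves and lowers it once improvement stalls.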
Table 2.

Experimental results for the overall image quality task, evaluated on all methods. The first two rows report our replication of the experiments from [9] and [10]. The method proposed in this paper is reported in the last row.





  • Tennakoon et al. [9]
  • Yu et al. [10]
  • Quality task only (rectangular), single-task learning
  • All tasks (rectangular)
  • All tasks (rectangular + polar)
  • Quality task (rectangular), sub-tasks (rectangular + polar), without transfer learning
  • Quality task (rectangular), sub-tasks (rectangular + polar)




Receiver operating characteristic (ROC) curves for the compared methods and for our proposed architecture are shown in Fig. 5. The weights of the loss terms in Eq. 1 are set to \(\alpha = 0.2\), \(\beta =0.5\), \(\gamma =1.0\) and \(\delta =4.0\). Figure 5(a) shows the ROC curves for the image quality task; the Area Under Curve (AUC) of the proposed method is 0.93168, the highest among the experiments. Table 2 reports sensitivity, specificity, and AUC on the quality task for all compared methods. The ROC curves for artifact, clarity, and field definition are also shown in Fig. 5. Images with a field definition score greater than 6 are considered valid with respect to field definition. The ROC curve of score 6 for field definition is corrected by OD and fovea detection, which improves the AUC value by \(0.4\%\). The experimental results of OD and fovea localization are in the supplement.
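The three reported metrics can be computed as in the following NumPy sketch. It is an illustration and not the evaluation script behind Table 2: the function names are hypothetical, the decision threshold of 0.5 is an assumption, and the positive class is taken to be label 1 (ungradable), following the label convention of Sect. 3.1.

```python
import numpy as np

def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score, dtype=float) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    return float(tp / (tp + fn)), float(tn / (tn + fp))

def auc(y_true, y_score):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. The pairwise comparison uses memory proportional to
    (#positives x #negatives), which is fine for a few thousand images.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(y_score, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes the whole ROC curve, which is why the curves in Fig. 5 are compared by AUC.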

The proposed model outperforms all compared approaches in our experiments. Multi-task learning improves performance by learning a representation that captures all tasks and avoids overfitting to the overall image quality task. It also acts as implicit data augmentation, since the model learns a more general representation for multiple tasks.
Fig. 5.

ROC plots for all tasks. (a) plots quality classification for the different approaches. (b), (c), (d) plot the sub-tasks of our proposed method for the different quality factor scores described in Sect. 3.1. ROC\(\_i\) plots the ROC curve of the binary classification of whether the quality factor score is greater than score i.

5 Conclusion

In this paper, we have proposed an automated image quality assessment framework based on deep multi-task learning, polar transformation, and transfer learning. The framework predicts image quality for DR diagnosis and provides interpretable quality factor scores in terms of artifact, clarity, and field definition. The experimental results demonstrate that the proposed multi-task architecture improves performance on image quality classification. As image quality is an important prerequisite for diabetic retinal image grading, the proposed technique can be incorporated into automated retinal image grading systems for DR screening programs.



This work is partially supported by National Key Research and Development Program of China (No: 2016YFC1300302, 2017YFE0104000) and by National Natural Science Foundation of China (No: 61525106, 61427807).

Supplementary material

Supplementary material 1 (pdf 756 KB)


References

  1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
  2. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: ICCV, pp. 1026–1034 (2015)
  3. Jelinek, H.F., Cree, M.J.: Automated Image Detection of Retinal Pathology. CRC Press, Boca Raton (2009)
  4. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J.: Caffe: convolutional architecture for fast feature embedding. In: ACM Multimedia, pp. 675–678 (2014)
  5. Kohler, T., Budai, A., Kraus, M.F., Odstrcilik, J.: Automatic no-reference quality assessment for retinal fundus images using vessel segmentation. In: IEEE International Symposium on Computer-Based Medical Systems, pp. 95–100 (2013)
  6. Lalonde, M., Gagnon, L., Boucher, M.C.: Automatic visual quality assessment in optical fundus images. In: Vision Interface (2001)
  7. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
  8. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. Comput. Sci. (2014)
  9. Tennakoon, R., Mahapatra, D., Roy, P., Sedai, S., Garnavi, R.: Image quality classification for DR screening using convolutional neural networks. In: MICCAI Workshop on OMIA 2016, pp. 113–120 (2016)
  10. Yu, F.L., Sun, J., Li, A., Cheng, J., Cheng, W., Liu, J., et al.: Image quality classification for DR screening using deep learning. Eng. Med. Biol. Soc. 664–667 (2017)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
  2. Department of Biomedical Engineering, University of Florida, Gainesville, USA
  3. Shanghai Jiao Tong University Affiliated Sixth People’s Hospital, Shanghai, China
  4. School of Nursing, The Hong Kong Polytechnic University, Hong Kong, China
