Cassava Leaf Disease Classification 3/3

Introduction

Hello, we are one of the competing teams in Kaggle's Cassava Leaf Disease Classification challenge. The team members are Yue Wang and Tianqi Tang, both from Brown University's Data Science Initiative.

Cassava roots are a good source of carbohydrates, vitamin C, thiamine, riboflavin, and niacin. Cassava leaves, if prepared properly, can contain up to 25 percent protein. As a resilient crop, cassava is resistant to heat and does not require much fertilizer. However, it is vulnerable to bacterial and viral diseases. One way to detect these diseases is to examine the appearance of the cassava leaves. It is therefore important to identify the different diseases affecting cassava from images of its leaves, which, with the help of deep learning, is exactly what this project tries to accomplish. We hope that our modest contribution will be useful for the development of cassava disease treatments.

We built a deep learning model based on ResNet-50; it achieved an accuracy above 0.72 on the test sets. In the following sections, we discuss the model's architecture, hyperparameters, and optimization, as well as possible future improvements. Links relevant to our work appear at the end of this notebook.

Model Architecture

We employed an ensemble consisting of three independent sub-models with the same structure but different hyperparameters and different random_state values in the train_test_split function from the sklearn library. Each sub-model uses ResNet-50 with weights pre-trained on ImageNet as its base model.

We usually want deeper networks so they can learn the subtle features we want them to notice. However, deeper networks can overfit easily, and accuracy starts to decrease beyond a certain depth because later layers keep re-processing what earlier layers have already extracted. Imagine a CNN trained to tell a basketball apart from a room door. The network first learns that they have different shapes, but later layers dig into details such as color, texture, and the presence of a doorknob, where the classifier can make mistakes: a drop of water on the basketball might look like a doorknob, and a door might be wrapped in rubber just like the cover of a basketball. Obviously, for this simple task, the most important feature the classifier needs is the shape of the two objects. It is easy for the model, or for a human, to distinguish a sphere from a cuboid, so we would not want this feature to be drowned out by over-training the network.

ResNet is a model architecture that, even with more than 100 layers, can still achieve better results than other comparably deep convolutional networks. The reason is that each residual block adds its input to its output through a skip connection, so features learned by earlier layers are preserved after every new set of convolutional layers, counteracting the problems of overly deep networks.
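To make this concrete, here is a minimal sketch of a simplified residual block. ResNet-50 itself uses bottleneck blocks (1x1, 3x3, 1x1 convolutions), but the skip-connection principle is the same; the helper below is illustrative, not our training code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # The skip connection carries the block's input forward unchanged.
    # (Assumes x already has `filters` channels so the Add shapes match.)
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])  # element-wise addition, not concatenation
    return layers.Activation("relu")(y)
```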

Our model uses ResNet-50 as the base model, without the top (classification) layer that was trained on 1000-class ImageNet. After ResNet-50's convolutional layers, we add a global average pooling layer, flatten the result, and add two dense layers with ReLU activations. The dense layers have output sizes of 512 and 256, and each is followed by a batch normalization layer and a dropout layer. The final layer is another dense layer with 5 nodes, matching the number of classes; it uses softmax activation to output a probability for each class.

Fig. 2 Structure of Our Model
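For reference, a minimal sketch of how this architecture might be assembled in Keras; the input size and dropout rates are assumptions, since they are not specified above:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.ResNet50(
    include_top=False,          # drop the 1000-class ImageNet head
    weights="imagenet",
    input_shape=(224, 224, 3),  # assumed input size
)
base.trainable = False          # base stays frozen (see "Unfreeze base model layers?")

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Flatten(),
    layers.Dense(512, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),        # assumed rate
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.3),        # assumed rate
    layers.Dense(5, activation="softmax"),  # one probability per class
])
```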

We use Adam as the optimizer. The loss and metric are sparse_categorical_crossentropy and sparse_categorical_accuracy, respectively.

For fitting, we adopt learning rate scheduling and early stopping.

We fit the model for 500 epochs but the training terminates well before that limit due to early stopping.
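A minimal sketch of the compile-and-fit setup, assuming the `model` above and pre-split arrays x_train, y_train, x_val, y_val; the early-stopping patience is an assumption, and the learning-rate schedule is discussed under "Learning Rate?" and "Decay Rate?" below:

```python
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,                # assumed patience
    restore_best_weights=True,
)

history = model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    epochs=500,                 # early stopping ends training well before this
    callbacks=[early_stop],
)
```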

Finally, we take the 3 trained sub-models and average their predicted probabilities to form the prediction of our final ensemble model.

Fig. 3 Function to Ensemble Models
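A minimal sketch of that averaging step, assuming `sub_models` holds the three trained sub-models (each trained on a split produced with a different random_state):

```python
import numpy as np

def ensemble_predict(sub_models, x):
    # Average the predicted class probabilities across sub-models:
    # (n_models, n_samples, n_classes) -> (n_samples, n_classes)
    probs = np.mean([m.predict(x) for m in sub_models], axis=0)
    return np.argmax(probs, axis=1)  # predicted class index per sample
```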

Hyperparameter Selection

Unfreeze base model layers?

We did not unfreeze the original ResNet layers after comparing the metrics of 'unfreeze first 30 layers', 'unfreeze last 30 layers', and 'unfreeze all'. Small learning rates were used, yet the models with unfrozen layers still could not beat an accuracy of 64%. A StackOverflow thread about unfreezing layers (linked in the references) was consulted, and one of the responses says that there are cases where even a small learning rate can be too much. We are not sure whether this applies to our situation as well.

Fig. 4 Unfreeze All (Training accuracy far exceeds validation accuracy.)

Fig. 5 Unfreeze First 30 Layers (Validation accuracy stayed at 0.1–0.2 while training loss kept decreasing.)

When we unfroze the last 30 layers, validation accuracy stayed around 0.6, peaking at 0.6462 after 100 epochs.
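For illustration, a sketch of how the three strategies might be applied to the `base` model from the architecture sketch; `set_unfreeze` is a hypothetical helper, and the small learning rate is an assumption:

```python
def set_unfreeze(base, mode):
    # Apply one of the three unfreezing strategies compared above.
    n = len(base.layers)
    for i, layer in enumerate(base.layers):
        if mode == "all":
            layer.trainable = True
        elif mode == "first30":
            layer.trainable = i < 30
        elif mode == "last30":
            layer.trainable = i >= n - 30

set_unfreeze(base, "last30")
# Re-compile so the trainable changes take effect.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # assumed small LR
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
```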

Optimizer?

We tested four optimizers, namely Adagrad, Adam, Adadelta, and RMSprop, for 15 epochs each while every other hyperparameter remained the same, in order to determine which one finds the minimum in the most stable way. Here is a plot of their behavior:

Fig. 6 Optimizer Performances of Adagrad, Adam, Adadelta, and RMSprop

Adadelta seems to perform better than the others, which makes it the winner here.
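A sketch of how this comparison might be run; `build_model` is a hypothetical helper that returns a fresh copy of the architecture above:

```python
histories = {}
for name in ["adagrad", "adam", "adadelta", "rmsprop"]:
    m = build_model()      # fresh, identically-configured model each time
    m.compile(
        optimizer=name,    # Keras resolves these strings to optimizer instances
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )
    histories[name] = m.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=15, verbose=0,
    )
```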

Learning Rate?

Using a similar approach, we compared different learning rates over 15 epochs; the plot below shows their behavior:

Fig. 7 Performances of Different Learning Rates

One of the learning rates was considered more stable in performance. However, we observed that all learning rates gave bumpy behavior, so before making a final decision, we used a LearningRateScheduler callback to graph the training and validation loss as the learning rate grows:

```python
tf.keras.callbacks.LearningRateScheduler(lambda epoch: 1e-8 * 10 ** (epoch / 20))
```

The resulting plot is shown below:

Fig. 8 Callback to Compare LR

All learning rates between the lower and upper bounds visible in the plot led us to lower losses smoothly, so we chose our learning rate from within that band.
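For reference, a sketch of how such a sweep can be run and plotted; the 100-epoch sweep length is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

# The callback from Fig. 8: starting at 1e-8, the learning rate is
# multiplied by 10 every 20 epochs.
lr_callback = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10 ** (epoch / 20))

history = model.fit(x_train, y_train, epochs=100,
                    callbacks=[lr_callback], verbose=0)

# Plot loss against the learning rate used at each epoch to find the band
# where the loss decreases smoothly.
lrs = 1e-8 * 10 ** (np.arange(100) / 20)
plt.semilogx(lrs, history.history["loss"])
plt.xlabel("learning rate")
plt.ylabel("training loss")
plt.show()
```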

Decay Rate?

After that, we also compared different decay rates in tf.keras.optimizers.schedules.ExponentialDecay. However, since 15 epochs is a small number, the comparison only covers the flat-ish beginning of the exponential decay curve, so the decay rates were hard to tell apart.
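For reference, a sketch of how such a decay schedule plugs into the optimizer; all three numeric values are assumptions for illustration:

```python
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,  # assumed
    decay_steps=1000,            # optimizer steps between decays (assumed)
    decay_rate=0.9,              # multiplier applied every decay_steps (assumed)
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
```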

Final Model

Hyperparameters were chosen as described above. Additionally, as mentioned, we ensembled 3 models trained under different random states in the train/test split, so that the final model is less biased by how our data happened to be split.

Here are the accuracies on half of the given dataset:

Fig. 9 Accuracies of Local Dataset

Here are the accuracies after submission; the one on the bottom is the ensembled model.

Fig. 10 Accuracies of Kaggle Dataset

Next Steps

Due to a lack of time, we were unable to implement some techniques that might boost the accuracy of our model further.

  1. Create a more balanced dataset. Our dataset is imbalanced: more than 60% of the data belongs to a single one of the 5 classes. This can bias our model, since predicting every sample as the majority class already yields a decent accuracy while performing poorly on the minority classes. Constructing a balanced dataset avoids this concern (see the sketch after this list).

  2. Apply image augmentation image-by-image instead of batch-by-batch.

  3. More hyperparameter tuning. We've explored many options, but we haven't found a set of parameters that gives a better result when unfreezing the ResNet-50 model.

  4. Exploring more models. We tried various models, including VGG, MobileNet, and EfficientNet; among them, ResNet yields the best results. But this may not be conclusive, as we did not fully explore many of the models.

  5. Collaborating with other teams. We wish to collaborate with other Kaggle teams to produce a better model.
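As a sketch for item 1, one simple way to balance the training set is to oversample the minority classes with sklearn's resample; x_train and y_train are assumed to be the training arrays:

```python
import numpy as np
from sklearn.utils import resample

classes, counts = np.unique(y_train, return_counts=True)
target = counts.max()  # oversample every class up to the majority class size

xs, ys = [], []
for c in classes:
    x_c = x_train[y_train == c]
    x_up = resample(x_c, replace=True, n_samples=target, random_state=0)
    xs.append(x_up)
    ys.append(np.full(target, c))

x_bal = np.concatenate(xs)
y_bal = np.concatenate(ys)
```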

Kaggle Submission

Links

GitHub Repo

Kaggle Notebook Saved in Git Repo

References

Sample Submission from Dan

Sample model from Kaggle

Ensemble Keras models

Ensemble models with scikit-learn

Save and load sklearn models

StackOverflow about unfreezing layers

Appendix