Cassava Leaf Disease Classification 1/3

Cassava roots are a good source of carbohydrates, vitamin C, thiamine, riboflavin, and niacin, and properly prepared cassava leaves can contain up to 25 percent protein. Cassava is a resilient crop: it is resistant to heat and does not require much fertilizer. However, it is vulnerable to bacterial and viral diseases. One way to detect these diseases is to examine the appearance of the leaves, so this project uses deep learning to identify the disease affecting a cassava plant from an image of its leaves. We hope that our modest contribution will be useful for the development of cassava disease treatments.

Exploratory Data Analysis

# load data
import pandas as pd
df = pd.read_csv("train.csv")
df.head()
Python
len(df)
Python

There are 21397 images in our training dataset, and each of them is labeled with an integer that identifies one of five classes (0 to 4).

To create a baseline model, we use the mode of the labels, which the pandas .mode() method returns:

df.label.mode()
Python

So 3 is the most common label in our dataset, and a model trained on this data is likely to be biased towards predicting 3. A histogram and a pie chart show how much more often 3 occurs than the other labels:

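import matplotlib.pyplot as plt
# One bin per label, centered on the integers 0 to 4 so the bars line up with the ticks.
plt.hist(df['label'], bins=[-0.5, 0.5, 1.5, 2.5, 3.5, 4.5], rwidth=0.8)
plt.title('Classification Histogram')
plt.xticks(range(5))
plt.xlabel('labels')
plt.ylabel('counts')
# Caption placed below the axes.
txt="This histogram shows the respective counts of each label in this dataset. The x-axis represents different labels and the y-axis represents the counts."
plt.figtext(0.5, -0.1, txt, wrap=True, horizontalalignment='center', fontsize=10)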
Python
# Count rows per label; the image_id column then holds the number of images for each label.
label_count = df.groupby('label').count().reset_index()
Python
ser = pd.read_json("label_num_to_disease_map.json", typ='series')
names = ser.to_frame('disease')
names
Python
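
The counts and the disease names are loaded separately, so any chart that pairs them relies on both being in the same label order. As a minimal sketch, the counts can instead be re-indexed by disease name explicitly. The toy frame, the placeholder disease names, and the variable names below are assumptions for illustration, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and label_num_to_disease_map.json.
toy_df = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"],
    "label": [3, 3, 0, 3, 1],
})
toy_names = pd.Series({0: "Disease A", 1: "Disease B", 3: "Disease C"})

# Count images per numeric label, sorted by label.
counts = toy_df["label"].value_counts().sort_index()

# Replace each numeric label in the index with its disease name.
counts.index = counts.index.map(toy_names)

# Share of each class among all images.
shares = counts / counts.sum()
print(shares)
```

With the real data, the same lookup would use names['disease'] in place of toy_names, assuming the mapping is indexed by the integer labels.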
import matplotlib.pyplot as plt
# Pie chart of the class shares; slices are plotted counter-clockwise in label order.
labels = tuple(names['disease'])
sizes = list(label_count['image_id'])
explode = (0.1, 0.1, 0.1, 0.1, 0.1) 
colors = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral','#A481CF']
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
Python

From the pie chart, it's clear that label 3, Cassava Mosaic Disease, accounts for 61.5% of the images. A baseline model that predicts 3 regardless of the input therefore reaches 61.5% accuracy, and our final deep learning model should at least outperform that.

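from sklearn.dummy import DummyClassifier
# The dummy classifier ignores the features, so the image ids only serve as a placeholder input.
X = df[['image_id']]
y = df['label']
# Always predict the most frequent label seen during fitting.
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X, y)
# Accuracy on the training labels, i.e. the share of the majority class.
dummy_clf.score(X, y)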
Python
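
The score above is computed on the same labels the baseline was fitted on. As a hedged sketch, the block below scores the same kind of baseline on a held-out stratified split instead. The toy label array, the placeholder feature array, and the 80/20 split are assumptions for illustration and are not part of the original notebook:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy labels (assumption): 60% of the samples carry label 3, mimicking the imbalance.
y_toy = np.array([3] * 60 + [0] * 10 + [1] * 10 + [2] * 10 + [4] * 10)
# Placeholder features; the dummy classifier never looks at them.
X_toy = np.zeros((len(y_toy), 1))

# Stratify so the held-out split keeps the same class proportions.
X_train, X_val, y_train, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Fit on the training part only, then score on the held-out part.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
val_accuracy = baseline.score(X_val, y_val)
print(f"held-out baseline accuracy: {val_accuracy:.3f}")
```

With a stratified split, the held-out accuracy stays close to the majority-class share, which is the number any trained model has to beat.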

Next Steps:

  • Image augmentation using ImageDataGenerator. By applying transformations such as rotation, translation, and denoising to the images, we hope to achieve a smoother learning process.

  • Model training from scratch. We will build our own model from scratch with convolutional and max-pooling layers, and tune its hyperparameters.

  • Adopt pretrained models such as VGG and ResNet and perform transfer learning. Many well-trained models are readily available. We will adopt some of them, apply transfer learning, and see how well the adapted models perform.
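
To make the first step concrete, here is a rough NumPy-only illustration of what augmentation does to a single image. It is not the ImageDataGenerator API: the augment function, its max_shift parameter, and the random test image are hypothetical, and the wrap-around shift is only a crude stand-in for a real translation.

```python
import numpy as np

def augment(image, rng, max_shift=4):
    """Return a randomly flipped, rotated, and shifted copy of a square HxWxC image."""
    out = image
    # Mirror the image left to right half of the time.
    if rng.random() < 0.5:
        out = np.fliplr(out)
    # Rotate by a random multiple of 90 degrees in the height/width plane.
    k = int(rng.integers(0, 4))
    out = np.rot90(out, k=k, axes=(0, 1))
    # Shift by a few pixels; np.roll wraps pixels around the border.
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # random stand-in for a leaf photo
augmented = augment(image, rng)
print(image.shape, augmented.shape)
```

Each call yields a slightly different view of the same leaf, so the model sees more variety without any new labeled images.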


Kaggle Submission

Kaggle Notebook

Git Repo: https://github.com/YueWangpl/DATA2040

References

Cassava usage

Sample Submission from Dan

Sample model from Kaggle

To load Kaggle dataset to colab

Authors:

Yue Wang, Tianqi Tang.

DATA 2040
