[SOLVED] inf264 -  Homework 1 - k-NN for a classification problem on the Iris dataset

30.00 $

Category:

Description

5/5 - (1 vote)

Iris is a small dataset consisting of 150 vectors describing iris flowers, split into three different classes representing three species of the iris family. Each vector comes with a label (the name of the species) and a set of four features which are measurements of different parts of the flower.

Left: The three species in the Iris dataset

Right: The four features in the Iris dataset (petal and sepal width and length)

Those measurements tend to differ between the different species, thus it is possible to train and evaluate a classifier from this dataset whose task is to predict the species of an iris flower represented by aforementioned set of features. In this exercice we will use k-NN classifier.

  1. Iris Dataset:
    • Load the Iris dataset directly from sklearn. You can alternatively download the dataset here: https://archive.ics.uci.edu/ml/datasets/iris.
    • Store the first 2 features (sepal length and sepal width) in a matrix X and labels in a vector Y .
    • Split the dataset into 3 datasets: training set, validation set and a testing set, i.e. split X and Y into Xtrain, Xval, Xtest and Ytrain, Yval, Ytest You can for instance use a train/validation/test ratio of 0.7/0.15/0.15.
  2. Perform a k-NN classification of your dataset for each k in 1,5,10,20,30:
    • Plot both training and validation Iris datapoints with respect to the two selected features. Since there are three classes, you will need three different colors.
    • Create an instance of the KNeighborsClassifier class
    • Train your instance of k-nn on your training data set
    • Plot the decision boundaries as decided by the trained k-nn.

1

  • Compute model accuracy on training dataset and validation dataset
  • Which model (i.e which k) would you select? Compute model accuracy on testing dataset 3. Interpretation:
  • Plot a curve representing the training accuracy as a function of k and same for the validation accuracy.
  • From your observations, for which values of k does k-NN overfit ?
  • For k =1, k-NN train accuracy should be equal to 1 (100% correct predictions). Explain why this is not the case here.

2