Description
1 Overview
In this assignment, you will implement a decision tree algorithm and use it to determine the indoor location of a user based on WIFI signal strengths collected from a mobile phone. See Figure 1 for an illustration of the experimental scenario.
2 Setup
We recommend that you work on the Ubuntu workstations in the lab. This assignment and all code were tested on Linux and Mac OS machines. We cannot guarantee compatibility with Windows machines and cannot promise any support if you do choose to work on one.
2.1 Working on DoC lab workstations (recommended)
You can also work from home by connecting to the lab workstations. See the list of workstations at https://www.doc.ic.ac.uk/csg/facilities/lab/workstations and ssh into one of the machines.
In order to load all the packages that you might need for the course work, you can run the following command:
export PYTHONUSERBASE=/vol/lab/ml/mlenv
We installed numpy, matplotlib, and pytorch, which should cover most of the packages you will need for all the courseworks in this course. If you don't want to type that command every time you connect to your lab machine, you can add it to your .bashrc:
echo "export PYTHONUSERBASE=/vol/lab/ml/mlenv" >> ~/.bashrc
This way, every terminal you open will have that environment variable set. It is recommended to use "python3" exclusively. The current python3 version on the lab machines is 3.8.5. To test the configuration:
Figure 1: Illustration of the scenario. The WIFI signal strengths from 7 emitters are recorded from a mobile phone. The objective of this coursework is to learn decision trees that predict in which of the 4 rooms the user is standing.
python3 -c "import numpy as np; print(np)"
This should print:
<module 'numpy' from '/vol/lab/ml/mlenv/lib/python3.8/site-packages/numpy/__init__.py'>
2.2 Working on your own system
If you decide to work locally on your own machine, then you will need to give us explicit instructions on how to run your code. Anything we cannot run will result in the points for that question being reduced by 30%.
Python >= 3.5: All provided code has been tested on Python versions 3.5 and 3.6. Make sure to install Python 3.5 or 3.6 on your local machine; otherwise, you might encounter errors! If you are working on Mac OS X, you can use Homebrew: brew install python3.
Part 1: Implementation
Step 1: Loading data
You can load the datasets from the files "WIFI_db/clean_dataset.txt" and "WIFI_db/noisy_dataset.txt". Each contains a 2000x8 array representing a dataset of 2000 samples. Each sample is composed of 7 WIFI signal strengths, while the last column indicates the room number in which the user is standing (i.e., the label of the sample). All the features in the dataset are continuous except the room number. You can load the text files with the "loadtxt" function from Numpy. Given the nature of the dataset, you will have to build decision trees capable of dealing with continuous attributes and multiple labels.
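As a sketch, loading could look like the following. The helper name load_dataset is our own, and the exact file paths depend on where you unpacked the data; the synthetic rows at the bottom only stand in for the real files:

```python
import numpy as np
import os
import tempfile

def load_dataset(path):
    """Load a WIFI dataset: 7 continuous signal-strength columns plus a room label."""
    data = np.loadtxt(path)
    x = data[:, :-1]             # 7 WIFI signal strengths
    y = data[:, -1].astype(int)  # room number (the label)
    return x, y

# Demo with a tiny synthetic file, since the real data may not be present here;
# in the coursework you would pass e.g. "WIFI_db/clean_dataset.txt" instead.
sample = np.array([[-64., -56., -61., -66., -71., -82., -81., 1.],
                   [-68., -57., -61., -65., -71., -85., -85., 1.]])
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    path = f.name
np.savetxt(path, sample)
x, y = load_dataset(path)
os.remove(path)
print(x.shape, list(y))  # (2, 7) [1, 1]
```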
Step 2: Creating Decision Trees
To create the decision tree, you will write a recursive function called decision_tree_learning() that takes as arguments a matrix containing the dataset and a depth variable (which is used to compute the maximal depth of the tree, for plotting purposes for instance). The label of each training sample is assumed to be in the last column of the matrix. The pseudo-code of this function is described below. This pseudo-code is taken from Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig (Figure 18.5), but modified to take into account that the considered dataset contains continuous attributes (see section 18.3.6 of the book).
Algorithm 1 Decision tree learning
1: procedure DECISION_TREE_LEARNING(training_dataset, depth)
2:   if all samples have the same label then
3:     return (a leaf node with this value, depth)
4:   else
5:     split ← FIND_SPLIT(training_dataset)
6:     node ← a new decision tree with root as split value
7:     l_branch, l_depth ← DECISION_TREE_LEARNING(l_dataset, depth+1)
8:     r_branch, r_depth ← DECISION_TREE_LEARNING(r_dataset, depth+1)
9:     return (node, max(l_depth, r_depth))
10:   end if
11: end procedure
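A minimal Python sketch of Algorithm 1 might look like the following. The find_split here is only a placeholder (a midpoint split on the first attribute) so that the example is self-contained; your real FIND_SPLIT must search for the split with the highest information gain, as described in the handout:

```python
import numpy as np

def find_split(dataset):
    # PLACEHOLDER ONLY: split the first attribute at its midpoint.
    # The real FIND_SPLIT must search all attributes and values for
    # the split that maximises information gain.
    value = (dataset[:, 0].min() + dataset[:, 0].max()) / 2
    left = dataset[dataset[:, 0] > value]
    right = dataset[dataset[:, 0] <= value]
    return 0, value, left, right

def decision_tree_learning(training_dataset, depth=0):
    """Recursive tree builder following Algorithm 1; labels in the last column."""
    labels = training_dataset[:, -1]
    if np.all(labels == labels[0]):  # all samples share one label -> leaf
        return {"leaf": True, "label": int(labels[0])}, depth
    attribute, value, l_dataset, r_dataset = find_split(training_dataset)
    l_branch, l_depth = decision_tree_learning(l_dataset, depth + 1)
    r_branch, r_depth = decision_tree_learning(r_dataset, depth + 1)
    node = {"leaf": False, "attribute": attribute, "value": value,
            "left": l_branch, "right": r_branch}
    return node, max(l_depth, r_depth)

# Tiny demo: one attribute that perfectly separates rooms 1 and 2.
toy = np.array([[1.0, 1], [2.0, 1], [8.0, 2], [9.0, 2]])
tree, max_depth = decision_tree_learning(toy)
print(max_depth)  # 1
```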
The function FIND_SPLIT chooses the attribute and the value that result in the highest information gain. Because the dataset has continuous attributes, the decision-tree learning algorithm searches for the split point (defined by an attribute and a value) that gives the highest information gain. For instance, if you have two attributes (A0 and A1) with values that both range from 0 to 10, the algorithm might determine that splitting the dataset according to "A1 > 4" gives the most information. An efficient method for finding good split points is to sort the values of the attribute, and then consider only split points that are between two examples in sorted order, while keeping track of the running totals of positive and negative examples on each side of the split point.
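Using Numpy, enumerating the candidate split points for one attribute could be sketched as follows (the function name candidate_splits is ours; here each candidate is the midpoint between two consecutive distinct values in sorted order):

```python
import numpy as np

def candidate_splits(dataset, attribute):
    """Midpoints between consecutive distinct sorted values of one attribute."""
    values = np.unique(dataset[:, attribute])  # unique values, already sorted
    return (values[:-1] + values[1:]) / 2

# Example: attribute values 1, 4, 4, 7 give candidates 2.5 and 5.5.
toy = np.array([[1.0, 0], [4.0, 0], [4.0, 1], [7.0, 1]])
print(candidate_splits(toy, 0))  # [2.5 5.5]
```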
To evaluate the information gain, suppose that the training dataset S_all has K different labels. We can define two subsets (S_left and S_right) of the dataset depending on the splitting rule (for instance "A1 > 4"), and for each dataset and subset we can compute the distribution (or probability) of each label, {p_1, p_2, ..., p_K}, where p_k is the number of samples with label k divided by the total number of samples in the set considered. The information gain is then defined using the general definition of the entropy as follows:

Gain(S_all, S_left, S_right) = H(S_all) - Remainder(S_left, S_right)

H(dataset) = - sum_{k=1}^{K} p_k * log2(p_k)

Remainder(S_left, S_right) = |S_left| / (|S_left| + |S_right|) * H(S_left) + |S_right| / (|S_left| + |S_right|) * H(S_right)

where |S| represents the number of samples in subset S.
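These definitions translate directly into Numpy. The function names below are our own; note that labels with zero count simply do not appear in the histogram, which handles the 0 * log2(0) = 0 convention:

```python
import numpy as np

def entropy(labels):
    """H(S) = - sum_k p_k * log2(p_k) over the labels present in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def remainder(left_labels, right_labels):
    """Size-weighted average entropy of the two subsets."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left / total) * entropy(left_labels) + \
           (n_right / total) * entropy(right_labels)

def gain(all_labels, left_labels, right_labels):
    """Information gain of a split: H(S_all) - Remainder(S_left, S_right)."""
    return entropy(all_labels) - remainder(left_labels, right_labels)

# A perfectly separating split of two equally frequent labels gains 1 bit.
print(gain(np.array([1, 1, 2, 2]), np.array([1, 1]), np.array([2, 2])))  # 1.0
```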
Implementation:
You are only allowed to use Numpy and Matplotlib for this coursework. Any other library, such as scikit-learn, is NOT allowed. For the implementation of the tree (in Python), it is advised to use dictionaries to store each node as a single object, for instance with a structure like {'attribute', 'value', 'left', 'right'}. In this case, 'left' and 'right' will themselves be nodes. You might also want to add a Boolean field named 'leaf' that indicates whether the node is a leaf (terminal node). However, this is not strictly necessary, as there are other ways to determine whether a node is terminal.
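For instance, a hand-built two-leaf tree using this dictionary structure, together with a sketch of a prediction routine (both hypothetical; the split convention "x[attribute] > value goes left" is an assumption), could look like:

```python
# Leaves carry only their label; internal nodes carry the split test.
leaf_room1 = {"leaf": True, "label": 1}
leaf_room2 = {"leaf": True, "label": 2}
root = {"leaf": False, "attribute": 1, "value": -55.0,
        "left": leaf_room1, "right": leaf_room2}

def predict(node, sample):
    """Walk the tree until a leaf is reached; 'x[attribute] > value' goes left."""
    while not node["leaf"]:
        node = node["left"] if sample[node["attribute"]] > node["value"] else node["right"]
    return node["label"]

print(predict(root, [-60.0, -50.0]))  # -50 > -55 -> left  -> room 1
print(predict(root, [-60.0, -60.0]))  # -60 <= -55 -> right -> room 2
```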
Bonus (5 points): Write a function to visualize the tree. See Figure 2 for an example. You can only use Numpy and Matplotlib for this question as well.
Part 2: Questions
Answer the following questions in your report:
Noisy-Clean Datasets Question Is there any difference in performance between the clean and noisy datasets? Explain why or why not. Discuss the differences in both overall and per-class performance.
Pruning Question Briefly explain how you implemented your pruning function and detail the influence of pruning on the accuracy of your tree on both datasets.
Depth Question Comment on the maximal depth of the trees generated for both datasets, both before and after pruning. What can you say about the relationship between maximal depth and prediction accuracy?