Description
Our first project focuses on preprocessing, analysing and visualizing real-world datasets, applying basic statistical methods, and extracting a useful summary of the data. In the third part of the project you are expected to employ multiple datasets and extract insights from them.
Dataset
Requirements for the dataset used in Parts I and II:
- Represents real-world data
- Contains at least 50k entries
- Is different from the one used in class
Possible resources include:
- Open Data Buffalo – https://data.buffalony.gov/
- Google Dataset Search – https://datasetsearch.research.google.com/
- US Government’s Data – https://www.data.gov/
Tasks
Part I: Perform data analysis of the dataset
- How many entries and variables does the data set comprise?
- What types of data are included?
- Are there any data missing?
- Provide the main statistics about the entries of the dataset (mean, std, etc.)
- Visualize the data (min 3 graphs), e.g. correlation between different variables. Are there any interesting patterns?
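The Part I questions above can be answered with a few pandas calls. The following is a minimal sketch, not a required approach: the `summarize` helper, the column names, and the randomly generated stand-in data are all hypothetical and should be replaced by your own dataset (for example, loaded with `pd.read_csv`).

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts Part I asks about (hypothetical helper)."""
    return {
        "n_entries": df.shape[0],
        "n_variables": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        # count, mean, std, min, quartiles and max of the numeric columns
        "numeric_stats": df.describe(),
        # pairwise Pearson correlation of the numeric columns only
        "correlation": df.select_dtypes("number").corr(),
    }


# Stand-in data (assumption): replace with your real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.normal(100, 15, size=200),
    "quantity": rng.integers(1, 10, size=200).astype(float),
    "category": rng.choice(["a", "b", "c"], size=200),
})
df.loc[df.index[::20], "price"] = np.nan  # inject some missing values

report = summarize(df)
print(report["n_entries"], "entries,", report["n_variables"], "variables")
print(report["dtypes"])
print(report["missing_per_column"])
print(report["numeric_stats"])
print(report["correlation"])
```

The correlation matrix is one natural input for a graph (for example, a heatmap); histograms and scatter plots of individual columns are other common choices.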
Part II: Apply ML analysis
- Choose the features and targets in the dataset.
- Preprocess the dataset for training (e.g. cleaning, filling in missing values, splitting into training/validation/test sets).
- Apply ML algorithms (min 3 algorithms) to model the target variable. This can be either a classification or a regression task. You can use any library with built-in ML functions.
- Provide a comparison of the results of the different ML models you used. This can take the form of graphs together with your reasoning about the results.
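One possible shape for the Part II workflow is sketched below with scikit-learn, which is only one of many suitable libraries. Everything specific here is an assumption made for illustration: the synthetic classification data from `make_classification`, the 60/20/20 split, median imputation, and the three chosen algorithms. Your own features, target, and models will differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for features X and target y (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # simulate roughly 5% missing values

# 60% training, 20% validation, 20% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0, stratify=y_tmp
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Imputer and scaler are fitted on the training data only,
    # so no information leaks from the validation or test sets.
    pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), model)
    pipe.fit(X_train, y_train)
    results[name] = {
        "val_accuracy": accuracy_score(y_val, pipe.predict(X_val)),
        "test_accuracy": accuracy_score(y_test, pipe.predict(X_test)),
    }

for name, scores in results.items():
    print(f"{name}: val={scores['val_accuracy']:.3f} test={scores['test_accuracy']:.3f}")
```

The `results` dictionary can be turned into a bar chart for the comparison. For a regression task, the classifiers and `accuracy_score` would be swapped for regressors and a metric such as mean squared error.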
Part III: Employ multiple datasets and extract insights
- Choose any dataset related to your current one. Combine the two into one dataset. The combined dataset has no size requirement.
- Choose the variables you expect to be correlated.
- Perform a statistical analysis of the correlation between selected features from both datasets. Examples:
  - Find the correlation between crime and the number of schools in the area.
  - Find the correlation between traffic and the population in the area.
- Analyse the results and describe any interesting patterns.
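Combining two datasets usually means bringing both to the same granularity (for example, one row per area) and joining on a shared key. The sketch below illustrates this with pandas under loud assumptions: the `area` key, the column names, and all values are made up for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical dataset 1: one row per recorded incident.
incidents = pd.DataFrame({"area": list("AAABBCDDDD")})

# Hypothetical dataset 2: one row per area with a school count.
schools = pd.DataFrame({"area": list("ABCDE"), "n_schools": [2, 3, 1, 5, 2]})

# Aggregate incidents to one row per area so both tables share a granularity.
incident_counts = (
    incidents.groupby("area").size().rename("n_incidents").reset_index()
)

# Inner join keeps only areas present in both tables;
# validate= raises an error if the key is unexpectedly duplicated.
combined = incident_counts.merge(
    schools, on="area", how="inner", validate="one_to_one"
)

# Pearson correlation between the two selected variables.
r = combined["n_incidents"].corr(combined["n_schools"])
print(combined)
print(f"Pearson r = {r:.3f}")
```

With only a handful of areas, a correlation coefficient is not reliable, so the real analysis should report how many matched rows the merge produced. A correlation between two variables also does not by itself show that one causes the other.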



