Description
Download the HW1 Skeleton before you begin.
Homework Overview
Vast amounts of digital data are generated each day, but raw data are often not immediately “usable”. Instead, we are interested in the information content of the data: what patterns are captured? This assignment covers a few useful tools for acquiring, cleaning, storing, and visualizing datasets.
In Question 1 (Q1), you will collect data using an API for The Movie Database (TMDb). You will construct a graph representation of this data that will show which actors have acted together in various movies, and use Argo Lite to visualize this graph and highlight patterns that you find. This exercise demonstrates how visualizing and interacting with data can help with discovery.
In Q2, you will construct a TMDb database in SQLite, with tables capturing information such as how well each movie did, which actors acted in each movie, and what the movie was about. You will also partition and combine information in these tables in order to more easily answer questions such as “which actors acted in the highest number of movies?”.
In Q3, you will visualize temporal trends in movie releases, using a JavaScript-based library called D3. This part will show how creating interactive rather than static plots can make data more visually appealing, engaging and easier to parse.
Data analysis and visualization is only as good as the quality of the input data. Real-world data often contain missing values, invalid fields, or entries that are not relevant or of interest. In Q4, you will use OpenRefine to clean data from Mercari, and construct GREL queries to filter the entries in this dataset.
Finally, in Q5, you will build a simple web application that displays a table of TMDb data on a single-page website. To do this, you will use Flask, a Python framework for building web applications that allows you to connect Python data processing on the back end with serving a site that displays these results.
Q1 [40 points] Collect data from TMDb and visualize co-actor network
Q1.1 [30 points] Collect data from TMDb and build a graph
Complete the Graph class, the TMDbAPIUtils class, and the two global functions in the provided .py file. The Graph class will serve as a reusable way to represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the TMDb API for data retrieval.
NOTE: You must use a version of Python ≥ 3.7.0 and < 3.8 for this question. This question has been developed and tested with these versions. You must not use any other version (e.g., Python 3.8). While we would like to support more Python versions, the specified versions are the ones we can definitively support at this time.
- [10 pts] Implementation of the Graph class according to the instructions in the provided .py file.
- [10 pts] Implementation of the TMDbAPIUtils class according to the instructions in the provided .py file. You will use version 3 of the TMDb API to download data about actors and their co-actors. To use the TMDb API:
- Create a TMDb account and obtain your client id / client secret, which are required to obtain an authentication token. Refer to this document for detailed instructions (log in using your GT account).
- Refer to the TMDB API Documentation as you work on this question. The documentation contains a helpful ‘try-it-out’ feature for interacting with the API calls.
- [10 pts] Producing correct nodes.csv and edges.csv. You must upload your nodes.csv and edges.csv files to Argo Lite as directed in Q1.2.
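As a concrete starting point, here is a minimal sketch of what the Graph class could look like. The method names and internal storage are illustrative assumptions — follow the exact signatures in the skeleton — but the `id,name` and `source,target` CSV headers match what the Q1.2 import steps expect.

```python
import csv

class Graph:
    """Sketch of a reusable co-actor graph container (names are illustrative)."""

    def __init__(self):
        self.nodes = {}     # node id -> actor name
        self.edges = set()  # undirected edges, stored as sorted (source, target) pairs

    def add_node(self, node_id, name):
        self.nodes[node_id] = name

    def add_edge(self, source, target):
        if source != target:  # no self-loops
            self.edges.add(tuple(sorted((source, target))))

    def write_nodes_file(self, path="nodes.csv"):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "name"])  # header the Q1.2 import expects
            for node_id, name in sorted(self.nodes.items()):
                writer.writerow([node_id, name])

    def write_edges_file(self, path="edges.csv"):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["source", "target"])
            for source, target in sorted(self.edges):
                writer.writerow([source, target])
```

Storing each undirected edge as a sorted tuple in a set automatically de-duplicates (A, B) versus (B, A) pairs.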
NOTE: Q1.2 builds on the results of Q1.1
Q1.2 [10 points] Visualizing a graph of co-actors using Argo-Lite
Using Argo Lite, visualize a network of actors and their co-actors.
You will produce an Argo Lite graph snapshot using your edges.csv and nodes.csv from Q1.1.c.
- To get started, review Argo Lite’s readme on GitHub. Argo Lite has been open-sourced.
- Importing your Graph
- Launch Argo Lite
- From the menu bar, click ‘Graph’ → ‘Import CSV’. In the dialogue that appears:
  - Select ‘I have both nodes and edges file’
- Under Nodes, use ‘Choose File’ to select your nodes.csv file from your computer
  - Leave ‘Has Headers’ selected
  - Verify ‘Column for Node ID’ is ‘id’
- Under Edges, use ‘Choose File’ to select your edges.csv file from your computer
  - Verify ‘Column for Source ID’ is ‘source’
  - Set ‘Column for Target ID’ to ‘target’
  - Verify ‘Selected Delimiter’ is ‘,’
- At the bottom of the dialogue, verify that ‘After import, show’ is set to ‘All Nodes’
- The graph will load in the window. Note that the layout is paused by default; you can select to ‘Resume’ or ‘Pause’ layout as needed.
- Dragging a node will ‘pin’ it, freezing its position. Selecting a pinned node, right clicking it, then choosing ‘unpin selected’ will unpin that node, so its position will once again be computed by the graph layout algorithm. Experiment with pinning and unpinning nodes.
NOTE: If a malformed .csv is uploaded, Argo Lite may become unresponsive. If you suspect this is the case, open your browser’s developer tools and review any console error messages.
- [7 points] Setting graph display options
- On the “Graph Options” panel, under ‘Nodes’ → ‘Modifying All Nodes’, expand the ‘Color’ menu
  - Select Color by ‘degree’, with scale ‘Linear Scale’
  - Select a color gradient of your choice that assigns lighter colors to nodes with higher degrees and darker colors to nodes with lower degrees
- Collapse the ‘Color’ options and expand the ‘Size’ options
  - Set ‘Scale by’ to ‘degree’, with scale ‘Linear Scale’
  - Select meaningful Size Range values of your choice, or use the default range
- Collapse the ‘Size’ options
- On the Menu, click ‘Tools’ → ‘Data Sheet’. Within the ‘Data Sheet’ dialogue:
  - Click ‘Hide All’
  - Set ‘10 more nodes with highest degree’
  - Click ‘Show’ and then close the ‘Data Sheet’ dialogue
- Click and drag a rectangle selection around the visible nodes
- With the nodes selected, configure their label visibility as follows:
  - Go to ‘Graph Options’ → ‘Labels’
  - Click ‘Show Labels of Selected Nodes’
  - At the bottom of the menu, set ‘Label By’ to ‘name’
  - Adjust the ‘Label Length’ so that the full text of each actor’s name is displayed
- Show only non-leaf vertices. On the Menu, click ‘Tools’ → ‘Data Sheet’ → ‘Show k More Nodes with Highest Degree’, where k is chosen so that only nodes with a degree > 1 are visible. To make this easier, we suggest writing a utility function in your Graph class that counts leaf nodes, in order to determine how many nodes should be shown.
This workflow yields a graph in which sizing and coloring depend on node degree, and the nodes with the highest degree are emphasized by showing their labels.
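The leaf-counting utility suggested above can be sketched as follows, assuming edges are available as (source, target) pairs; adapt the functions to however your Graph class actually stores its edges:

```python
from collections import Counter

def count_leaf_nodes(edges):
    """Count nodes of degree 1 (leaves), given an iterable of (source, target) pairs."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sum(1 for d in degree.values() if d == 1)

def nodes_to_show(total_nodes, edges):
    """The k to enter in 'Show k More Nodes with Highest Degree'."""
    return total_nodes - count_leaf_nodes(edges)
```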
- [3 points] Designing a meaningful graph layout
Using the following guidelines, create a visually meaningful and appealing layout:
- Reduce as much edge crossing as possible
- Do not allow any nodes to overlap
- Keep the graph as compact and symmetric as possible
- Use the nodes’ spatial positions to convey information (e.g., “clusters” or groups)
- Experiment with showing additional node labels. If showing all node labels creates too much visual complexity, show at least 10 “important” node labels. You may decide what “importance” means to you. For example, you may consider nodes (actors) having higher connectivity as potentially more “important” (based on how the graph is built).
The objective of this task is to familiarize yourself with basic, important graph visualization features. Therefore, this is an open-ended task and most designs are acceptable. You should experiment with Argo Lite’s features, changing node size and shape, etc. In practice, it is not possible to create “perfect” visualizations for most graph datasets. The above guidelines are ones that generally help.
However, like most design tasks, creating a visualization is about making selective design compromises. Some guidelines could create competing demands and following all guidelines may not guarantee a “perfect” design.
If you want to save your Argo Lite graph visualization snapshot locally to your device, so you can continue working on it later, we recommend the following workflow.
- Select ‘Graph’ → ‘Save Snapshot’
  - In the ‘Save Snapshot’ dialog, click ‘Copy to Clipboard’
- Open an external text editor program such as TextEdit or Notepad. Paste the clipboard contents of the graph snapshot, and save it to a file with a .json extension. You should be able to accomplish this with a default text editor on your computer by overriding the default file extension and manually entering ‘.json’.
- You may save your progress by saving a snapshot and loading it back into Argo Lite to continue your work.
- To load a snapshot, choose ‘Graph’ → ‘Open Snapshot’ and select the graph snapshot you created.
- Publish and share your graph snapshot
- Name your graph: on the top navigation bar, click the label ‘Untitled Graph’. In the ‘Rename Snapshot’ dialogue window that appears, enter your GTUsername as the ‘Snapshot Name’ and click ‘Done’
- Select ‘Graph’ → ‘Publish and Share Snapshot’ → ‘Share’
- Next, click ‘Copy to Clipboard’ to copy the generated URL
- Return the URL in the return_argo_lite_snapshot() function in the provided .py file. If you modify your graph after you publish and share a URL, you will need to re-publish and obtain a new URL for your latest graph.
Q2 [35 points] SQLite
SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It is one of the world’s most popular embedded database systems. It is convenient to share data stored in an SQLite database — just one cross-platform file which does not need to be parsed explicitly (unlike CSV files, which have to be parsed).
You will modify the given Q2_SQL.py file by adding SQL statements to it. We suggest that you consider testing your SQL locally on your computer using interactive tools to speed up testing and debugging, such as DB Browser for SQLite (https://sqlitebrowser.org).
- [9 points] Create tables and import data.
- [2 points] Create two tables (via two separate methods, part_ai_1 and part_ai_2, respectively, in Q2_SQL.py) named movies and movie_cast with columns having the indicated data types:
  - movies
    1. id (integer)
    2. title (text)
    3. score (real)
  - movie_cast
    1. movie_id (integer)
    2. cast_id (integer)
    3. cast_name (text)
    4. birthday (text)
    5. popularity (real)
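The CREATE TABLE statements can be sketched with Python’s built-in sqlite3 module. The in-memory connection below is only for illustration; in Q2_SQL.py you will execute the statements inside the provided method stubs against the database the grader supplies.

```python
import sqlite3

# Self-contained illustration: an in-memory database stands in for the
# connection Q2_SQL.py provides.
connection = sqlite3.connect(":memory:")

connection.execute("""
    CREATE TABLE movies (
        id    INTEGER,
        title TEXT,
        score REAL
    );
""")

connection.execute("""
    CREATE TABLE movie_cast (
        movie_id   INTEGER,
        cast_id    INTEGER,
        cast_name  TEXT,
        birthday   TEXT,
        popularity REAL
    );
""")
```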
- [5 points] Vertical Database Partitioning. Database partitioning is an important technique that divides large tables into smaller tables, which may help speed up queries. For this question, you will create a new table cast_bio from the movie_cast table (i.e., the columns in cast_bio will be a subset of those in movie_cast). Do not edit the movie_cast table. Be sure that when you insert into the new cast_bio table, the values are unique. Please read this page for an example of vertical database partitioning.
cast_bio
- cast_id (integer)
- cast_name (text)
- birthday (date)
- popularity (real)
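A sketch of the partitioning step, with a few made-up rows to show the effect; SELECT DISTINCT is one way to satisfy the uniqueness requirement while leaving movie_cast untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_cast "
             "(movie_id INTEGER, cast_id INTEGER, cast_name TEXT, "
             "birthday TEXT, popularity REAL)")
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)", [
    (1, 10, "Harrison Ford", "1942-07-13", 20.0),
    (2, 10, "Harrison Ford", "1942-07-13", 20.0),  # same actor, second movie
    (1, 11, "Carrie Fisher", "1956-10-21", 15.0),
])

# Vertical partition: keep only the biography columns, one row per actor.
conn.execute("CREATE TABLE cast_bio "
             "(cast_id INTEGER, cast_name TEXT, birthday DATE, popularity REAL)")
conn.execute("""
    INSERT INTO cast_bio
    SELECT DISTINCT cast_id, cast_name, birthday, popularity
    FROM movie_cast;
""")
```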
- [1 point] Create indexes. Create the following indexes for the tables specified below. This step increases the speed of subsequent operations; though the improvement in speed may be negligible for this small database, it is significant for larger databases.
- i. movie_index for the id column in the movies table
- ii. cast_index for the cast_id column in the movie_cast table
- iii. cast_bio_index for the cast_id column in the cast_bio table
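All three indexes follow one CREATE INDEX pattern; a runnable sketch (the table definitions are repeated here only so the snippet is self-contained):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.execute("CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
             "cast_name TEXT, birthday TEXT, popularity REAL)")
conn.execute("CREATE TABLE cast_bio (cast_id INTEGER, cast_name TEXT, "
             "birthday DATE, popularity REAL)")

# One CREATE INDEX per table/column pair named above.
conn.execute("CREATE INDEX movie_index ON movies (id)")
conn.execute("CREATE INDEX cast_index ON movie_cast (cast_id)")
conn.execute("CREATE INDEX cast_bio_index ON cast_bio (cast_id)")
```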
- [3 points] Calculate a proportion. Find the proportion of movies having a score > 50 and ‘war’ in their title. Treat each row as a different movie. The proportion should be based only on the total number of rows in the movies table. Format all decimals to two places using printf(). Do NOT use the ROUND() function, as it does not behave the same on every OS.
Output format and sample value:
7.70
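One way to express this pattern in SQLite, shown against a tiny made-up table. The sample value above suggests the proportion is reported as a percentage-style figure; confirm that reading against the expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "War Games", 58.49),   # counted: score > 50 and 'war' in the title
    (2, "War Horse", 40.00),   # not counted: score too low
    (3, "Star Peace", 90.00),  # not counted: no 'war' in the title
])

# SUM over a boolean expression counts matching rows; printf('%.2f', ...)
# formats to two decimal places portably, unlike ROUND().
(proportion,) = conn.execute("""
    SELECT printf('%.2f',
        100.0 * SUM(score > 50 AND title LIKE '%war%') / COUNT(*))
    FROM movies;
""").fetchone()
```

Note that SQLite’s LIKE is case-insensitive for ASCII by default, so '%war%' also matches “War Games”.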
- [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie appearances that have a popularity > 10. Sort the results by the number of appearances in descending order, then by cast_name in alphabetical order.
Output format and sample values (cast_name,appearance_count):
Harrison Ford,2
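A sketch of the filter-group-sort pattern this calls for, against made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
             "cast_name TEXT, birthday TEXT, popularity REAL)")
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)", [
    (1, 10, "Harrison Ford", "1942-07-13", 20.0),
    (2, 10, "Harrison Ford", "1942-07-13", 20.0),
    (1, 11, "Carrie Fisher", "1956-10-21", 15.0),
    (2, 12, "Alec Guinness", "1914-04-02", 5.0),  # popularity <= 10: filtered out
])

# WHERE filters rows, GROUP BY collects them per actor, ORDER BY applies
# the two-level sort, LIMIT keeps the top five.
rows = conn.execute("""
    SELECT cast_name, COUNT(*) AS appearance_count
    FROM movie_cast
    WHERE popularity > 10
    GROUP BY cast_id, cast_name
    ORDER BY appearance_count DESC, cast_name ASC
    LIMIT 5;
""").fetchall()
```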
- [4 points] Find the highest scoring movies with the smallest cast. List the 5 highest-scoring movies that have the fewest cast members. Sort the results by score in descending order, then by number of cast members in ascending order, then by movie name in alphabetical order. Format all decimals to two places using printf().
Output format and sample values (movie_title,movie_score,cast_count):
Star Wars: Holiday Special,75.01,12
War Games,58.49,33
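A sketch of the join-and-count pattern, with a toy dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.execute("CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
             "cast_name TEXT, birthday TEXT, popularity REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Star Wars: Holiday Special", 75.01),
    (2, "War Games", 58.49),
])
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)",
    [(1, i, "Actor %d" % i, None, 1.0) for i in (10, 11)] +
    [(2, i, "Actor %d" % i, None, 1.0) for i in (20, 21, 22)])

# Join each movie to its cast, count cast members per movie, then apply
# the three-level sort from the question.
rows = conn.execute("""
    SELECT m.title, printf('%.2f', m.score) AS movie_score,
           COUNT(*) AS cast_count
    FROM movies m
    JOIN movie_cast mc ON mc.movie_id = m.id
    GROUP BY m.id, m.title, m.score
    ORDER BY m.score DESC, cast_count ASC, m.title ASC
    LIMIT 5;
""").fetchall()
```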
- [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie scores. Format all decimals to two places using printf().
- Sort the output by average score in descending order, then by cast_name in alphabetical order.
- Do not include movies with score < 25 in the average score calculation.
- Exclude cast members who have appeared in two or fewer movies.
Output format and sample values (cast_id,cast_name,average_score):
8822,Julia Roberts,53.00
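The WHERE-then-HAVING pattern sketched below is one reading of these requirements; in particular, whether appearances are counted before or after the score filter is worth double-checking against the spec.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.execute("CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
             "cast_name TEXT, birthday TEXT, popularity REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    (1, "Movie 1", 53.0), (2, "Movie 2", 53.0),
    (3, "Movie 3", 53.0), (4, "Movie 4", 10.0),  # score < 25: excluded from AVG
])
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)",
    [(i, 8822, "Julia Roberts", None, 1.0) for i in (1, 2, 3, 4)])

# WHERE removes low-scoring movies before AVG; HAVING then removes cast
# members left with too few appearances.
rows = conn.execute("""
    SELECT mc.cast_id, mc.cast_name,
           printf('%.2f', AVG(m.score)) AS average_score
    FROM movie_cast mc
    JOIN movies m ON m.id = mc.movie_id
    WHERE m.score >= 25
    GROUP BY mc.cast_id, mc.cast_name
    HAVING COUNT(*) > 2
    ORDER BY AVG(m.score) DESC, mc.cast_name ASC
    LIMIT 10;
""").fetchall()
```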
- [6 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of actors who have had a good collaboration as defined here. Each row in the view describes one pair of actors who appeared in at least 3 movies together AND the average score of these movies is >= 40.
The view should have the format: good_collaboration( cast_member_id1, cast_member_id2, movie_count, average_movie_score)
For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There should not be any “self pair” where the value of cast_member_id1 is the same as that of cast_member_id2.
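A sketch of the view: a self-join of movie_cast on movie_id with the condition a.cast_id < b.cast_id handles both the mirror-pair and self-pair requirements in one stroke, since it keeps exactly one row per unordered pair of distinct actors.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, score REAL)")
conn.execute("CREATE TABLE movie_cast (movie_id INTEGER, cast_id INTEGER, "
             "cast_name TEXT, birthday TEXT, popularity REAL)")
# Actors 1 and 2 appear together in three movies, each scoring >= 40.
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)",
    [(i, "Movie %d" % i, 60.0) for i in (1, 2, 3)])
conn.executemany("INSERT INTO movie_cast VALUES (?, ?, ?, ?, ?)",
    [(i, c, "Actor %d" % c, None, 1.0) for i in (1, 2, 3) for c in (1, 2)])

# a.cast_id < b.cast_id keeps one row per unordered pair, dropping both
# mirror pairs and self pairs in a single condition.
conn.execute("""
    CREATE VIEW good_collaboration AS
    SELECT a.cast_id AS cast_member_id1,
           b.cast_id AS cast_member_id2,
           COUNT(*)  AS movie_count,
           AVG(m.score) AS average_movie_score
    FROM movie_cast a
    JOIN movie_cast b ON a.movie_id = b.movie_id AND a.cast_id < b.cast_id
    JOIN movies m ON m.id = a.movie_id
    GROUP BY a.cast_id, b.cast_id
    HAVING COUNT(*) >= 3 AND AVG(m.score) >= 40;
""")
pairs = conn.execute("SELECT * FROM good_collaboration").fetchall()
```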
Q3 [15 points] D3 (v5) Warmup
Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition (sign in using your GT account, e.g., [email protected]). You may also briefly review chapters 1-3 if you need additional background on web development. This reading provides an important foundation for Homework 2. This question uses D3 version v5, while the book covers D3 v4; what you learn from the book is transferable to v5. In Homework 2, you will work with D3 extensively.
Q4 [5 points] OpenRefine
OpenRefine is a Java application and requires Java JRE to run. Download and install Java if you do not have it (you can verify by typing ‘java -version’ in your computer’s terminal or command prompt).
- Watch the videos on OpenRefine’s homepage for an overview of its features. Then, download and install OpenRefine release 3.3. Do not use version 3.4 (which is in beta status).
- Import Dataset
- Run OpenRefine and point your browser at 127.0.0.1:3333.
- We use a products dataset from Mercari, derived from a Kaggle competition (Mercari Price Suggestion Challenge). If you are interested in the details, visit the data description page.
We have sampled a subset of the dataset provided as “properties.csv”.
- Choose “Create Project” → This Computer → “properties.csv”. Click “Next”.
- You will now see a preview of the data. Click “Create Project” at the upper right corner.
- Clean/Refine the data
Q5 [5 points] Introduction to Python Flask
Flask is a lightweight web application framework written in Python that provides you with tools, libraries and technologies to quickly build a web application. It allows you to scale up your application as needed.
You will modify the given file:
- wrangling_scripts/wrangling.py
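As a sketch of the kind of back-end processing such a file performs — the function name data_wrangling and the column layout below are assumptions for illustration; follow the skeleton’s actual stubs — the idea is to parse the raw data into rows that a Flask route can pass to its template:

```python
import csv
import io

# Hypothetical back-end helper: parse raw CSV text into a header plus the
# list of rows a Flask route would hand to its HTML template.
def data_wrangling(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row holds the column names
    rows = [row for row in reader if len(row) == len(header)]  # drop ragged rows
    return header, rows

sample = "id,title,score\n1,War Games,58.49\n2,Star Peace,90.0\n"
header, rows = data_wrangling(sample)
```

A Flask view function would then return something like `render_template('index.html', header=header, rows=rows)`, letting the template loop over the rows to build the table.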







