Name: INF553 - Foundations and Applications of Data Mining - Assignment 3 - Solved
SKU: 81126
Availability: InStock

Description

5/5 - (1 vote)

1. Overview of the Assignment

In Assignment 3, you will complete three tasks. The goal is to let you be familiar with Min-Hash, Locality Sensitive Hashing (LSH), and various types of recommendation systems.

2. Requirements

2.1 Programming Requirements

You must use Python to implement all the tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed).
You are required to only use Spark RDD, i.e. no point if using Spark DataFrame or DataSet.
There will be 10% bonus for Scala implementation in each task. You can get the bonus only when both Python and Scala implementations are correct.

Programming Environment

Python 3.6, Scala 2.11, and Spark 2.3.0

3. Yelp Data

In this assignment, we generated the review data from the original Yelp review dataset with some filters, such as the condition: “state” == “CA”. We randomly took 80% of the data for training, 10% of the data for testing, and 10% of the data as the blind dataset.

You can access and download the following JSON files either under the directory on the Vocareum: resource/asnlib/publicdata/ or in the Google Drive (USC email only): https://drive.google.com/open?id=146Re0IDgtHB2OImmKOpzU43pGl12ZLqF a. train_review.json

json – containing only the target user and business pairs for prediction tasks
json – containing the ground truth rating for the testing pairs
json – containing the average stars for the users in the train dataset
json – containing the average stars for the businesses in the train dataset
We do not share the blind dataset

4.1 Task1: Min-Hash + LSH (2pts)

4.1.1 Task description

In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard similarity to find similar business pairs in the train_review.json file. We focus on the 0 or 1 ratings rather than the actual ratings/stars in the reviews. Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the user hasn’t rated the business, the contribution is 0.

Table 1 shows an example. Your task is to identify business pairs whose Jaccard similarity is >= 0.05.

Table 1: The left table shows the original ratings; the right table shows the 0 or 1 ratings.

You can define any collection of hash functions that you think would result in a consistent permutation of the row entries of the characteristic matrix. Some potential hash functions are:

𝑓(𝑥) = (𝑎𝑥 + 𝑏) % 𝑚

𝑓(𝑥) = ((𝑎𝑥 + 𝑏) % 𝑝) % 𝑚

where 𝑝 is any prime number; 𝑚 is the number of bins. You can define any combination for the parameters (𝑎, 𝑏, 𝑝, or 𝑚) in your implementation.

After you have defined all the hash functions, you will build the signature matrix using Min-Hash. Then you will divide the matrix into 𝒃 bands with 𝒓 rows each, where 𝒃 × 𝒓 = 𝒏 (𝒏 is the number of hash functions). You need to set 𝒃 and 𝒓 properly to balance the number of candidates and the computational cost. Two businesses become a candidate pair if their signatures are identical in at least one band.

Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 1 shows an example of calculating the Jaccard similarity between two businesses. Your final outputs will be the business pairs whose Jaccard similarity is >= 0.05.

user1 user2 user3 user4

business1	0	1	1	1
business2	0	1	0	0

Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3

4.1.2 Execution commands

Python	$ spark-submit task1.py <input_file> <output_file>
Scala	$ spark-submit –class task1 hw3.jar <input_file> <output_file>
	<input_file>: the train review set <output_file>: the similar business pairs and their similarities

4.1.3 Output format

You must write a business pair and its similarity in the JSON format using exactly the same tags as the example in Figure 1. Each line represents for a business pair (“b1”, “b2”). There is no need to have an output for (“b2”, “b1”).

Figure 1: An example output for Task 1 in the JSON format

4.2 Task2: Content-based Recommendation System (2pts)

4.2.1 Task description

In this task, you will build a content-based recommendation system by generating profiles from review texts for users and businesses in the train review set. Then you will use the system/model to predict if a user prefers to review a given business, i.e., computing the cosine similarity between the user and item profile vectors.

During the training process, you will construct business and user profiles as the model:

Concatenating all the review texts for the business as the document and parsing the document, such as removing the punctuations, numbers, and stopwords. Also, you can remove extremely rare words to reduce the vocabulary size, i.e., the count is less than 0.0001% of the total words.
Measuring word importance using TF-IDF, i.e., term frequency * inverse doc frequency
Using top 200 words with highest TF-IDF scores to describe the document
Creating a Boolean vector with these significant words as the business profile
Creating a Boolean vector for representing the user profile by aggregating the profiles of the items that the user has reviewed

During the predicting process, you will estimate if a user would prefer to review a business by computing the cosine distance between the profile vectors. The (user, business) pair will be considered as a valid pair if their cosine similarity is >= 0.01. You should only output these valid pairs.

4.2.3 Output format:

Model format: There is no strict format for the content-based model.

Prediction format:

You must write the results in the JSON format using exactly the same tags as the example in Figure 2. Each line represents for a predicted pair of (“user_id”, “business_id”).

Figure 2: An example prediction output for Task 2 in JSON format

[SOLVED] INF553 - Foundations and Applications of Data Mining - Assignment 3

If Helpful Share:

Description

1. Overview of the Assignment

2. Requirements

3. Yelp Data

4.1 Task1: Min-Hash + LSH (2pts)

4.2 Task2: Content-based Recommendation System (2pts)

Related products

INF553 Foundations and Applications of Data Mining Assignment5,Clustering

INF 553 Assignment 4- Clustering Yelp Reviews Solved

INF553 Foundations and Applications of Data Mining Assignment4

Related in this category

More in this category

INF553 Assignment 3- Frequent Itemsets

INF553- Foundations And Applications Of Data Mining: Assignment 5

INF553 Assignment 2

INF553 Foundations and Applications of Data Mining Assignment1

INF553- Foundations and Applications of Data Mining: Assignment 1

INF553 Assignment 1-Analysis using Spark