DSCI553 - Foundations and Applications of Data Mining - Assignment 3

1.   Overview of the Assignment

In Assignment 3, you will complete three tasks. You will first implement Min-Hash and Locality Sensitive Hashing (LSH) to find similar businesses efficiently. Then you will implement various types of recommendation systems.

 

2.   Requirements

2.1 Programming Requirements

  1. You must use Python & Spark to implement all tasks. You can only use the standard Python libraries (i.e., external libraries like numpy or pandas are not allowed).
  2. You are required to use only Spark RDDs; no points will be awarded for solutions that use Spark DataFrame or DataSet.
  3. There will be a 10% bonus for a Scala implementation of each task. You can get the bonus only when both the Python and Scala implementations are correct.

 

2.2 Programming Environment

Python 3.6, Scala 2.11, and Spark 2.3.0

 

3.   Yelp Data

For this assignment, we have generated sample review data from the original Yelp review dataset using some filters, such as the condition: “state” == “CA”. We randomly took 80% of the sampled reviews for training, 10% for testing, and 10% as the blind dataset. (We do not share the blind dataset.) You can access and download the following files either from the directory resource/asnlib/publicdata/ on Vocareum or from Google Drive (USC email only):

  1. train_review.json
  2. A JSON file containing only the target user and business pairs for prediction tasks
  3. A JSON file containing the ground truth rating for the testing pairs
  4. A JSON file containing the average stars for the users in the train dataset
  5. A JSON file containing the average stars for the businesses in the train dataset
  6. stopwords

 

4.   Tasks

You need to submit the following files on Vocareum: (all in lowercase)

  1. Python scripts: task1.py, task2train.py, task2predict.py, task3train.py, task3predict.py
  2. Model files: task2.model, task3item.model, task3user.model
  3. Result files: task1.res, task2.predict, task3item.predict, task3user.predict
  4. Scala scripts: task1.scala, task2train.scala, task2predict.scala, task3train.scala, task3predict.scala; one jar package: hw3.jar
  5. Model files: task2.scala.model, task3item.scala.model, task3user.scala.model
  6. Result files: task1.scala.res, task2.scala.predict
  7. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).

 

4.1  Task1: Min-Hash + LSH (2pts)

4.1.1 Task description

In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard similarity to find similar business pairs in the train_review.json file. We focus on 0/1 ratings rather than the actual rating values in the reviews. In other words, if a user has rated a business, the user’s contribution in the characteristic matrix is 1; otherwise, the contribution is 0 (Table 1). Your task is to identify business pairs whose Jaccard similarity is >= 0.05.

 

Table 1: The left table shows the original ratings; the right table shows the converted 0 and 1 ratings.

 

You can define any collection of hash functions to permute the row entries of the characteristic matrix and generate Min-Hash signatures. Some potential hash functions are:

𝑓(𝑥) = (𝑎𝑥 + 𝑏) % 𝑚

𝑓(𝑥) = ((𝑎𝑥 + 𝑏) % 𝑝) % 𝑚

where 𝑝 is any prime number; 𝑚 is the number of bins. You can define any combination for the parameters (𝑎, 𝑏, 𝑝, or 𝑚) in your implementation.
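As a rough illustration, the second hash family above and its use for a Min-Hash signature could be sketched as follows. The function names, the prime 1000003, and the assumption that each user has already been mapped to an integer row index are all hypothetical choices for this sketch, not part of the assignment.

```python
import random

def make_hash_funcs(n, m, p=1000003, seed=0):
    """Return n functions of the form f(x) = ((a*x + b) % p) % m.

    p is an assumed prime; a and b are drawn at random (illustrative choice).
    """
    rng = random.Random(seed)
    params = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(n)]
    # Bind a and b as default arguments so each lambda keeps its own pair.
    return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

def minhash_signature(row_indices, hash_funcs):
    """Signature of one business: for each hash function, the minimum
    hashed value over the row indices (users) that rated the business."""
    return [min(h(x) for x in row_indices) for h in hash_funcs]

# Example: a business rated by the users with row indices 1, 5 and 9.
hash_funcs = make_hash_funcs(n=4, m=100)
signature = minhash_signature({1, 5, 9}, hash_funcs)
```

In a Spark RDD pipeline the same `minhash_signature` logic would typically run inside a `map` over (business, set of user indices) pairs.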

After you have defined all hash functions, you will build the signature matrix using Min-Hash. Then you will divide the matrix into 𝒃 bands with 𝒓 rows each, where 𝒃 × 𝒓 = 𝒏 (𝒏 is the number of hash functions). You need to set 𝒃 and 𝒓 properly to balance the number of candidates and the computational cost. Two businesses become a candidate pair if their signatures are identical in at least one band.
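The banding step described above could look roughly like this in plain Python. This is a minimal sketch that assumes the signatures are already available as a dictionary; `lsh_candidates` and the sample IDs are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b, r):
    """signatures: dict mapping business_id -> signature list of length b * r.

    Two businesses become a candidate pair if their signatures are
    identical in at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for business_id, sig in signatures.items():
            # The r signature values of this band form the bucket key.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(business_id)
        for members in buckets.values():
            for pair in combinations(sorted(members), 2):
                candidates.add(pair)
    return candidates

signatures = {
    "b1": [1, 2, 3, 4],
    "b2": [1, 2, 9, 9],  # matches b1 in the first band only
    "b3": [7, 7, 7, 7],
}
pairs = lsh_candidates(signatures, b=2, r=2)
```

Under the usual Min-Hash analysis, a pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^r)^b, so for a low threshold a small 𝒓 and a large 𝒃 tend to keep recall high at the cost of more candidates to verify.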

Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 2 shows an example of calculating the Jaccard similarity between two businesses. Your final outputs will be the business pairs whose Jaccard similarity is >= 0.05.

             user1   user2   user3   user4
business1      0       1       1       1
business2      0       1       0       0

Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3
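The calculation in Table 2 can be reproduced with Python sets. This is a small sketch in which each business is represented by the set of users who rated it; the function name is illustrative.

```python
def jaccard(users_a, users_b):
    """Jaccard similarity = size of intersection / size of union."""
    union = users_a | users_b
    if not union:
        return 0.0
    return len(users_a & users_b) / len(union)

# Table 2: business1 was rated by user2, user3, user4; business2 by user2.
business1 = {"user2", "user3", "user4"}
business2 = {"user2"}
sim = jaccard(business1, business2)  # 1 shared user out of 3 in total
```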

 

4.1.2 Execution commands

Python $ spark-submit task1.py <input_file> <output_file>

Scala     $ spark-submit --class task1 hw3.jar <input_file> <output_file>

<input_file>: the train review set

 

<output_file>: the similar business pairs and their similarities

 

4.1.3 Output format

You must write each business pair and its similarity in JSON format using exactly the same tags as the example in Figure 1. Each line represents a business pair, e.g., “b1” and “b2”. For each business pair “b1” and “b2”, you do not need to generate the output for “b2” and “b1”, since the similarity value is the same. You do not need to truncate decimals for the ‘sim’ values.
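One possible way to produce such lines is sketched below. The tags "b1", "b2" and "sim" are taken from the text above; the helper name and the choice to order each pair alphabetically (so that a pair is emitted only once) are assumptions of this sketch.

```python
import json

def format_pairs(pair_sims):
    """pair_sims: dict mapping (business_id, business_id) -> similarity.

    Returns one JSON string per unordered pair.
    """
    seen = set()
    lines = []
    for (x, y), sim in pair_sims.items():
        b1, b2 = sorted((x, y))  # canonical order, assumed for illustration
        if (b1, b2) in seen:
            continue  # skip the mirrored ("b2", "b1") entry
        seen.add((b1, b2))
        lines.append(json.dumps({"b1": b1, "b2": b2, "sim": sim}))
    return lines

lines = format_pairs({("b2", "b1"): 0.25, ("b1", "b2"): 0.25, ("b3", "b1"): 0.5})
```

Each returned string can then be written to `<output_file>` on its own line.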

 

 

Figure 1: An example output for Task 1 in the JSON format

4.2  Task2: Content-based Recommendation System (2pts)

4.2.1 Task description

In this task, you will build a content-based recommendation system by generating profiles from review texts for users and businesses in the train_review.json file. Then you will use the model to predict if a user prefers to review a given business by computing the cosine similarity between the user and item profile vectors.

During the training process, you will construct the business and user profiles as follows:

  1. Concatenate all reviews for a business into one document and parse the document, e.g., by removing punctuation, numbers, and stopwords. You can also remove extremely rare words to reduce the vocabulary size. Rare words could be those whose frequency is less than 0.0001% of the total number of words.
  2. Measure word importance using TF-IDF, i.e., term frequency multiplied by inverse document frequency.
  3. Use the top 200 words with the highest TF-IDF scores to describe the document.
  4. Create a Boolean vector with these significant words as the business profile.
  5. Create a Boolean vector representing the user profile by aggregating the profiles of the items that the user has reviewed.
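The profile-building steps above might be sketched as follows. The assignment does not fix the exact TF-IDF variant, so this sketch assumes term frequency normalized by the most frequent word in the document and a base-2 logarithm for IDF; it also assumes that "aggregating" means taking the union of business profiles. All function names are hypothetical, and the input is assumed to be already tokenized and cleaned.

```python
import math
from collections import Counter

def top_tfidf_words(docs, top_k=200):
    """docs: dict mapping business_id -> list of cleaned tokens.

    Returns dict business_id -> set of the top_k words by TF-IDF,
    i.e. a Boolean profile stored as a set.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each word once per document
    profiles = {}
    for business_id, tokens in docs.items():
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        scores = {
            word: (count / max_count) * math.log2(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
        # Sort by score (descending), breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        profiles[business_id] = set(ranked[:top_k])
    return profiles

def user_profile(reviewed_businesses, business_profiles):
    """Union of the profiles of the businesses a user reviewed (assumed)."""
    profile = set()
    for business_id in reviewed_businesses:
        profile |= business_profiles.get(business_id, set())
    return profile

docs = {"b1": ["pizza", "pizza", "good"], "b2": ["sushi", "good"]}
profiles = top_tfidf_words(docs, top_k=1)
```

In this toy example "good" appears in every document, so its IDF is zero and it never enters a profile.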

During the prediction process, you will estimate whether a user would prefer to review a business by computing the cosine similarity between the profile vectors. The (user, business) pair is valid if their cosine similarity is >= 0.01. You should only output these valid pairs.
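For Boolean vectors, the cosine similarity reduces to the number of shared words divided by the product of the square roots of the two profile sizes. A minimal sketch, assuming both profiles are stored as sets of words (the function name is illustrative):

```python
import math

def cosine_similarity(user_words, business_words):
    """Cosine similarity of two Boolean vectors represented as sets."""
    if not user_words or not business_words:
        return 0.0
    overlap = len(user_words & business_words)  # dot product of 0/1 vectors
    return overlap / (math.sqrt(len(user_words)) * math.sqrt(len(business_words)))

sim = cosine_similarity({"pizza", "pasta"}, {"pasta", "wine"})
is_valid = sim >= 0.01  # threshold from the task description
```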

 

4.2.2 Execution commands

Training commands:

Python $ spark-submit task2train.py <train_file> <model_file> <stopwords>

Scala     $ spark-submit --class task2train hw3.jar <train_file> <model_file> <stopwords>

<train_file>: the train review set

<model_file>: the output model

<stopwords>: the file containing the stopwords that can be removed

 

Predicting commands:

Python $ spark-submit task2predict.py <test_file> <model_file> <output_file>

Scala     $ spark-submit --class task2predict hw3.jar <test_file> <model_file> <output_file>

<test_file>: the test review set (only target pairs)

<model_file>: the model generated during the training process

<output_file>: the output results

4.2.3 Output format:

Model format:  There is no strict format requirement for the content-based model.

Prediction format:

You must write the results in JSON format using exactly the same tags as the example in Figure 2. Each line represents a predicted pair of (“user_id”, “business_id”). You do not need to truncate decimals for ‘sim’ values.

 

Figure 2: An example prediction output for Task 2 in JSON format

 

4.3  Task3: Collaborative Filtering Recommendation System (4pts)

4.3.1 Task description

In this task, you will build collaborative filtering (CF) recommendation systems using the train_review.json file. After building the systems, you will use the systems to predict the ratings for a user and business pair. You are required to implement 2 cases:

  • Case 1: Item-based CF recommendation system (2pts)

During the training process, you will build a recommendation system by computing the Pearson correlation for the business pairs with at least three co-rated users. During the predicting process, you will use the system to predict the rating for a given pair of user and business. You must use at most N business neighbors, namely the N businesses most similar to the target business, for prediction (you can try various N, e.g., 3 or 5).
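A rough sketch of Case 1 follows. The assignment does not specify how the means are computed or how neighbors are combined, so this sketch assumes means over the co-rated users only and a similarity-weighted average over the top N neighbors with positive similarity. Both function names are hypothetical.

```python
import math

def pearson(ratings_a, ratings_b, min_co_rated=3):
    """ratings_*: dict mapping user_id -> stars for one business.

    Returns None when there are too few co-rated users or the
    correlation is undefined (zero variance).
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < min_co_rated:
        return None
    mean_a = sum(ratings_a[u] for u in common) / len(common)
    mean_b = sum(ratings_b[u] for u in common) / len(common)
    num = sum((ratings_a[u] - mean_a) * (ratings_b[u] - mean_b) for u in common)
    den_a = math.sqrt(sum((ratings_a[u] - mean_a) ** 2 for u in common))
    den_b = math.sqrt(sum((ratings_b[u] - mean_b) ** 2 for u in common))
    if den_a == 0 or den_b == 0:
        return None
    return num / (den_a * den_b)

def predict_item_based(user_ratings, neighbor_sims, n=3):
    """user_ratings: business_id -> stars given by the target user.
    neighbor_sims: business_id -> similarity to the target business.

    Weighted average over at most n rated neighbors (assumed formula).
    """
    scored = [
        (sim, user_ratings[b])
        for b, sim in neighbor_sims.items()
        if b in user_ratings and sim > 0
    ]
    top = sorted(scored, reverse=True)[:n]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * stars for sim, stars in top) / total

a = {"u1": 1, "u2": 2, "u3": 3}
b = {"u1": 2, "u2": 4, "u3": 6, "u4": 5}
w = pearson(a, b)  # perfectly correlated on the three co-rated users
```

When `predict_item_based` returns None, a fallback such as the business or user average from the provided average-star files is one option.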

  • Case 2: User-based CF recommendation system with Min-Hash LSH (2pts)

During the training process, you should combine the Min-Hash and LSH algorithms in your user-based CF recommendation system, since the number of potential user pairs might be too large to compute. You need to (1) identify similar user pairs based on their co-rated businesses, without considering their rating scores (similar to Task 1), which reduces the number of user pairs you need to compare for the final Pearson correlation score; and (2) compute the Pearson correlation for the user pair candidates with Jaccard similarity >= 0.01 and at least three co-rated businesses. The predicting process is similar to Case 1.
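The filter in step (2) could be expressed as a small predicate applied to each candidate user pair before computing the Pearson correlation. This is a sketch that assumes each user is represented by the set of businesses they rated; the function name is hypothetical.

```python
def qualifies_for_pearson(businesses_a, businesses_b,
                          min_jaccard=0.01, min_co_rated=3):
    """True if two users share enough businesses, both in absolute
    count and relative to the union of the businesses they rated."""
    co_rated = businesses_a & businesses_b
    union = businesses_a | businesses_b
    if not union or len(co_rated) < min_co_rated:
        return False
    return len(co_rated) / len(union) >= min_jaccard

user_a = {"b1", "b2", "b3", "b4"}
user_b = {"b1", "b2", "b3"}
ok = qualifies_for_pearson(user_a, user_b)  # 3 co-rated, Jaccard 3/4
```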

 

 

 

4.3.2 Execution commands

Training commands:

Python $ spark-submit task3train.py <train_file> <model_file> <cf_type>

Scala     $ spark-submit --class task3train hw3.jar <train_file> <model_file> <cf_type>

<train_file>: the train review set

<model_file>: the output model

<cf_type>: either “item_based” or “user_based”

 

Predicting commands:

Python $ spark-submit task3predict.py <train_file> <test_file> <model_file> <output_file> <cf_type>

Scala     $ spark-submit --class task3predict hw3.jar <train_file> <test_file> <model_file> <output_file> <cf_type>

<train_file>: the train review set

<test_file>: the test review set (only target pairs)

<model_file>: the model generated during the training process

<output_file>: the output results

<cf_type>: either “item_based” or “user_based”

 

4.3.3 Output format:

Model format:

You must write the model in JSON format using exactly the same tags as the example in Figure 3. Each line represents a business pair (“b1”, “b2”) for the item-based model (Figure 3a) or a user pair (“u1”, “u2”) for the user-based model (Figure 3b). There is no need to include (“b2”, “b1”) or (“u2”, “u1”). You do not need to truncate decimals for ‘sim’ values.

(a)

(b)

Figure 3: (a) is an example of item-based model and (b) is an example of user-based model

 

Prediction format:

You must write each target pair and its prediction in JSON format using exactly the same tags as the example in Figure 4. Each line represents a predicted pair of (“user_id”, “business_id”). You do not need to truncate decimals for ‘stars’ values.

 

Figure 4: An example output for task3 in JSON format