Description
Creating an Inverted Index of words occurring in a set of web pages
- Get hands-on experience in GCP App Engine
We’ll be using a subset of 74 files from a total of 408 files (text extracted from HTML tags) derived from the Stanford WebBase project that is available here. It was obtained from a web crawl done in February 2007. It is one of the largest collections totaling more than 100 million web pages from more than 50,000 websites. This version has been cleaned for the purpose of this assignment.
These files will be placed in a bucket on your Google cloud storage and the Hadoop job will be instructed to read the input from this bucket.
- Uploading the input data into the bucket
https://drive.google.com/drive/u/1/folders/1Z4KyalIuddPGVkIm6dUjkpD_FiXyNIcq
You may use your USC account to get access to the data from the Google Drive link. Compressed full data is around 1.1GB. Uncompressed, it is 3.12 GB of data for the files for this project. So on balance you should download the zipped file, not the folder.
- Unzip the contents. You will find two folders inside named ‘development’ and ‘full data’. Each of the folders contains the actual data (txt files). We suggest you use the development data initially while you are testing your code. Using the full data will take up to few minutes for each run of the Map-Reduce job and you may risk spending all your cloud credits while testing the code.
- Click on ‘Dataproc’ in the left navigation menu under . Next, locate the address of the default Google cloud storage staging bucket for your cluster in the Figure 1 below. If you’ve previously disabled billing, you need to re-enable it before you can upload the data. Refer to the “Enable and Disable Billing account” section to see how to do this. d.
Figure 1: The default Cloud Storage bucket.
- Go to the storage section in the left navigation bar and select your cluster’s default bucket from the list of buckets. At the top you should see menu items UPLOAD FILES, UPLOAD FOLDER, CREATE FOLDER, etc (Figure 2). Click on the UPLOAD FOLDER button and upload the dev_data folder and full_data folder individually. This will take a while, but there will be a progress bar (Figure 3). You may not see this progress bar as soon as you start the upload but, it will show up eventually.
Figure 2: Cloud Storage Bucket.
Figure 3: Progress of uploading
Inverted Index Implementation using Map-Reduce
Now that you have the cluster and the files in place, you need to write the actual code for the job. As of now, Google Cloud allows us to submit jobs via the UI, only if they are packaged as a jar file. The following steps are focused on submitting a job written in Java via the Cloud console UI.
Refer to the examples below and write a Map-Reduce job in java that creates an Inverted Index given a collection of text files. You can very easily tweak a word-count example to create an inverted index instead (Hint: Change the mapper to output word docID instead of word count and in the reducer use a HashMap).
Here are some helpful examples of Map-Reduce Jobs
- https://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
- https://hadoop.apache.org/docs/stable/hadoop–mapreduce–client/hadoop–mapreduce–clienthttps://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.htmlcore/MapReduceTutorial.html
The example in the following pages explains a Hadoop word count implementation in detail. It takes one text file as input and returns the word count for every word in the file. Refer to the comments in the code for explanation.
The Mapper Class:
The Reducer Class:
Main Class
The input data is cleaned, that is all the \n\r s is removed but one or more \t might still be present (which needs to be handled). There will be punctuation and you are required to handle this in your code. Replace all the occurrences of special characters and numerals by space character, convert all the words to the lowercase. The ‘\t’ separates the key(Document ID) from the value(Document). The input files are in a key value format as below:
| DocumentID | document |
Sample document:
The mapper’s output is expected to be as follows:
The above example indicates that the word aspect occurred 1 time in the document with docID 5722018411 and economics 2 times.
The reducer takes this as input, aggregates the word counts using a Hashmap and creates the Inverted index. The format of the index is as follows.
| word docID:count | docID:count | docID:count… |
The above sample shows a portion of the inverted index created by the reducer.
To write the Hadoop java code you can use the VI or nano editors that come pre-installed on the master node. You can test your code on the cluster itself. Be sure to use the development data while testing the code. You are expected to write a simple Hadoop job. You can just tweak this example if you’d like, but make sure you understand it first.
Creating a jar for your code
Now that your code for the job is ready we’ll need to run it. The Google Cloud console requires us to upload a Map-Reduce job as a jar file. In the following example the Mapper and Reducer are in the same file called InvertedIndexJob.java.To create a jar for the Java class implemented please follow the instructions below. The following instructions were executed on the cluster’s master node on the Google Cloud.
- Say your Java Job file is called InvertedIndex.java. Create a JAR as follows: ○ hadoop com.sun.tools.javac.Main InvertedIndexJob.java
If you get the following Notes you can ignore them
Note: InvertedIndexJob.java uses or overrides a deprecated API. Note: Recompile with -Xlint:deprecation for details.
○ jar cf invertedindex.jar InvertedIndex*.class
Now you have a jar file for your job. You need to place this jar file in the default cloud bucket of your cluster. Just create a folder called JAR on your bucket and upload it to that folder. If you created your jar file on the cluster’s master node itself use the following commands to copy it to the JAR folder.
○ hadoop fs -copyFromLocal ./invertedindex.jar
○ hadoop fs -cp ./invertedindex.jar gs://dataproc-69070…/JAR
The highlighted part is the default bucket of your cluster. It needs to be prepended by the gs:// to tell the Hadoop environment that it is a bucket and not a regular location on the filesystem.
Note: This is not the only way to package your code into a jar file. You can follow any method that will create a single jar file that can be uploaded to the Google cloud.
Figure 4: Dataproc jobs section
- Fill the job parameters as follows (see Figure 13 for reference): ○ Cluster: Select the cluster you created
○ Job Type: Hadoop
○ Jar File: Full path to the jar file you uploaded earlier to the Google storage bucket. Don’t forget the gs://
○ Main Class or jar: The name of the java class you wrote the mapper and reducer in.
○ Arguments: This takes two arguments
- Input: Path to the input data you uploaded
- Output: Path to the storage bucket followed by a new folder name. The folder is created during execution. You will get an error if you give the name of an existing folder. ○ Leave the rest at their default settings
○
Figure 7: Job progress
- Once the job executes copy all the log entries that were generated to a text file called txt. You need to submit this log along with the java code. You need to do this only for the job you run on the full data.
- The output files will be stored in the output folder on the bucket. If you open this folder you’ll notice that the inverted index is in several segments.(Delete the _SUCCESS file in the folder before merging all the output files)
To merge the output files, run the following command in the master nodes command line(SSH)
○ hadoop fs -getmerge gs://dataproc-69070458-bbe2-…/output ./output.txt
○ hadoop fs -copyFromLocal ./output.txt
○ hadoop fs -cp ./output.txt gs://dataproc-69070458-bbe2-…/output.txt
The output.txt file in the bucket contains the full Inverted Index for all the files.
- Sort your output.txt file using the command
sort -o output_sorted.txt output.txt
- Use grep to search for the words mentioned in the submissions section. Using grep is the fastest way to get the entries associated with the words. For example to search for “string” use
grep -w ‘^string ’ output_sorted.txt
Note:-> In the above grep command, the word to be searched should be followed by a tab character.
Part II: Inverted Index of Bigrams using Map-Reduce
Now that you are familiar with setting up and running Hadoop jobs on GCP, you will now modify your InvertedIndexJob.java script to generate an inverted index of bigrams (instead of unigrams).
Your existing Mapper class emits (word, docID) pairs which are then aggregated in the Reducer class. You will have to modify your Mapper class to emit (“word1 word2”, docID) pairs instead. The reducer remains unchanged.
Once you modify your class(es), create the jar invertedindex__bigrams.jar and dispatch a Hadoop job on the devdata/ in the same manner as before.
The output will look something like this:
To get credit for this task, create another text file index_bigrams.txt with the index entries for the following bigram phrases (generated from devdata/):
- computer science
- information retrieval
- power politics
- los angeles
- bruce willis
You can apply grep on the output file in the same way you did for the previous exercises.




