CS 4395 Intro to NLP Homework 6

  1. Build a web crawler function that starts with a URL representing a topic (a sport, your favorite film, a celebrity, a political issue, etc.) and outputs a list of at least 15 relevant URLs. The URLs can be pages within the original domain, but a few should be outside the original domain.

  2. Write a function that loops through your URLs and scrapes all text off each page. Store each page’s text in its own file.

  3. Write a function to clean up the text from each file. You might need to delete newlines and tabs first. Extract sentences with NLTK’s sentence tokenizer. Write the sentences for each file to a new file. That is, if you have 15 files in, you have 15 files out.

  4. Write a function to extract at least 25 important terms from the pages using an importance measure such as term frequency or tf-idf. First, it’s a good idea to lowercase everything and remove stopwords and punctuation. Print the top 25-40 terms.

  5. Manually determine the top 10 terms from step 4, based on your domain knowledge.

  6. Build a searchable knowledge base of facts that a chatbot (to be developed later) can share related to the 10 terms. The “knowledge base” can be as simple as a Python dict, which you can pickle. More points for something more sophisticated like SQL.

  7. In a doc: (1) describe how you created your knowledge base, include screenshots of the knowledge base, and indicate your top 10 terms; (2) write up a sample dialog you would like to create with a chatbot based on your knowledge base.

  8. Create a link to the report and code on your index page.
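
For the crawler step, one possible starting point is sketched below using only the Python standard library. The names `extract_links`, `fetch_page`, and `crawl` are hypothetical, and treating a link as "relevant" when its URL contains a keyword is an assumption made for illustration, not a requirement of the assignment. The `fetch` argument is injectable so the crawl logic can be tried without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return absolute http(s) links from html, in page order, without duplicates."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(url).scheme in ("http", "https") and url not in links:
            links.append(url)
    return links


def fetch_page(url):
    """Download a page and decode it as text (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, keyword, max_urls=15, fetch=fetch_page):
    """Breadth-first crawl that keeps links whose URL contains keyword.

    The keyword test is an assumed stand-in for "relevant"; replace it
    with whatever relevance check suits the topic.
    """
    found, seen, queue = [], {start_url}, deque([start_url])
    while queue and len(found) < max_urls:
        page = queue.popleft()
        try:
            html = fetch(page)
        except Exception:
            continue  # skip pages that fail to download
        for link in extract_links(html, page):
            if link in seen:
                continue
            seen.add(link)
            if keyword.lower() in link.lower():
                found.append(link)
                queue.append(link)
                if len(found) == max_urls:
                    break
    return found
```

A call such as `crawl(start_url, "chess")` would return up to 15 URLs. Because links are not restricted to the starting domain, outside pages can appear in the result when the start page links to them.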
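
For the scraping and clean-up steps, the sketch below pulls visible text out of HTML, collapses newlines and tabs, and writes one sentence per line to a new file for each input file. All function names are hypothetical. The default `regex_sentences` splitter is a naive stand-in so the example runs without extra downloads; the assignment asks for NLTK's sentence tokenizer, which could be passed in as the `tokenizer` argument (for example `nltk.tokenize.sent_tokenize`, assuming NLTK and its tokenizer data are installed).

```python
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Gathers text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML page, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def regex_sentences(text):
    """Naive fallback: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def clean_text(raw, tokenizer=regex_sentences):
    """Collapse newlines, tabs and repeated spaces, then split into sentences."""
    flat = re.sub(r"\s+", " ", raw).strip()
    return tokenizer(flat)


def clean_files(in_dir, out_dir, tokenizer=regex_sentences):
    """Write one cleaned file (one sentence per line) per .txt file in in_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(in_dir).glob("*.txt")):
        sentences = clean_text(path.read_text(encoding="utf-8"), tokenizer)
        target = out_dir / path.name
        target.write_text("\n".join(sentences), encoding="utf-8")
        written.append(target)
    return written
```

Keeping the same file name in the output directory preserves the "15 files in, 15 files out" correspondence.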
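
For the term-extraction step, a minimal tf-idf sketch follows. It makes several assumptions that are not part of the assignment: the tiny `STOPWORDS` set is a placeholder (NLTK's stopword list would be the more usual choice), the idf uses a smoothed form, `log((1 + N) / (1 + df)) + 1`, and each term's score is its tf-idf summed over all documents. The names `tokenize` and `top_terms` are hypothetical.

```python
import math
import re
from collections import Counter

# Placeholder stopword list, assumed for illustration only.
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "is", "it", "to", "on", "for",
    "with", "as", "by", "that", "this", "was", "are", "be", "or", "at", "from",
}


def tokenize(text):
    """Lowercase, keep alphabetic runs only, and remove stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(documents, n=25):
    """Rank terms by tf-idf summed over all documents; return (term, score) pairs."""
    token_lists = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(token_lists)
    scores = Counter()
    for tokens in token_lists:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            scores[term] += tf * idf
    return scores.most_common(n)
```

Calling `top_terms(page_texts, n=40)` and printing the result gives a ranked list from which the final 10 terms can be picked by hand.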
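
For the knowledge-base step, the sketch below shows both options the assignment mentions: a Python dict that can be pickled, and a SQL store (here SQLite through the standard `sqlite3` module). The function names, the `facts(term, fact)` table layout, and the choice to treat any sentence containing a term as a "fact" about it are all assumptions for illustration.

```python
import pickle
import sqlite3


def build_kb(terms, sentences):
    """Map each term to the sentences that mention it (case-insensitive)."""
    return {
        term: [s for s in sentences if term.lower() in s.lower()]
        for term in terms
    }


def save_pickle(kb, path):
    """Persist the dict-based knowledge base to disk."""
    with open(path, "wb") as handle:
        pickle.dump(kb, handle)


def kb_to_sqlite(kb, db_path=":memory:"):
    """Store the same facts in a SQLite table named facts(term, fact)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (term TEXT, fact TEXT)")
    rows = [(term, fact) for term, facts in kb.items() for fact in facts]
    conn.executemany("INSERT INTO facts (term, fact) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, term):
    """Return every stored fact for a term, in insertion order."""
    cursor = conn.execute(
        "SELECT fact FROM facts WHERE term = ? ORDER BY rowid", (term,)
    )
    return [row[0] for row in cursor]
```

A chatbot could later call `lookup(conn, "rook")` (or index the dict directly) to retrieve sentences to share about a term the user mentions.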

Caution: All course work is run through plagiarism detection software comparing students’ work as well as work from previous semesters and other sources.

Be prepared to present your results to class:

– what was your starter site

– what kind of data did you get

– how did you clean up the data

– what were your top terms

– show us your knowledge base

– how might you use this data for a chatbot