CS4395 Intro to NLP Homework 8

  1. Read in the csv file using pandas. Convert the author column to categorical data. Display the first few rows. Display the counts by author.

  2. Divide the data into train and test sets, with 80% in train. Use random state 1234. Display the shapes of train and test.

  3. Process the text by removing stop words and performing tf-idf vectorization. Fit the vectorizer to the training data only, then apply it to both train and test. Output the training set shape and the test set shape.

  4. Try a Bernoulli Naïve Bayes model. What is your accuracy on the test set?

  5. The results from step 4 will be disappointing. The classifier just guessed the predominant class, Hamilton, every time. Looking at the train data shape above, there are 7876 unique words in the vocabulary. This may be too many, and many of those words may not be helpful. Redo the vectorization with the max_features option set to use only the 1000 most frequent words. In addition to the words, add bigrams as a feature. Try Naïve Bayes again on the new train/test vectors and compare your results.

  6. Try logistic regression. Adjust at least one parameter in the LogisticRegression() model to see if you can improve results over having no parameters. What are your results?

  7. Try a neural network. Try different topologies until you get good results. What is your final accuracy?
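
The steps above could be sketched with pandas and scikit-learn roughly as follows. This is a minimal illustration, not a reference solution: the toy DataFrame stands in for the assignment's csv file, the column names `author` and `text` are assumptions, and the hyperparameters chosen for logistic regression and the neural network are arbitrary starting points. Accuracies on the toy data say nothing about results on the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier

# Hypothetical toy data; in the assignment this would come from pd.read_csv(...).
# The column names "author" and "text" are assumptions.
texts = [
    "The powers of the union must be energetic and national",
    "Ambition must be made to counteract ambition in government",
    "Foreign nations respect a people united under one government",
    "A feeble executive implies a feeble execution of the laws",
    "The accumulation of all powers in the same hands is tyranny",
]
authors = ["HAMILTON", "MADISON", "JAY", "HAMILTON", "MADISON"]
df = pd.DataFrame({"author": authors * 4, "text": texts * 4})

# Step 1: categorical author column, first rows, counts by author.
df["author"] = df["author"].astype("category")
print(df.head())
print(df["author"].value_counts())

# Step 2: 80/20 split with a fixed random state.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"], df["author"], test_size=0.2, random_state=1234
)
print(X_train_text.shape, X_test_text.shape)

# Step 3: tf-idf with English stop words removed, fit on train only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)
print(X_train.shape, X_test.shape)


def fit_and_score(model, X_tr, X_te):
    """Fit on the training vectors and return accuracy on the test vectors."""
    model.fit(X_tr, y_train)
    return model.score(X_te, y_test)


# Step 4: Bernoulli Naive Bayes on the full vocabulary.
acc_nb = fit_and_score(BernoulliNB(), X_train, X_test)

# Step 5: cap the feature count and add bigrams. Note that max_features
# applies to unigrams and bigrams combined.
vectorizer2 = TfidfVectorizer(
    stop_words="english", max_features=1000, ngram_range=(1, 2)
)
X_train2 = vectorizer2.fit_transform(X_train_text)
X_test2 = vectorizer2.transform(X_test_text)
acc_nb2 = fit_and_score(BernoulliNB(), X_train2, X_test2)

# Step 6: logistic regression with some non-default parameters (arbitrary picks).
log_reg = LogisticRegression(C=2.0, class_weight="balanced", max_iter=1000)
acc_lr = fit_and_score(log_reg, X_train2, X_test2)

# Step 7: a small feed-forward network; the topology (15, 7) is only a guess.
mlp = MLPClassifier(
    hidden_layer_sizes=(15, 7), solver="lbfgs", max_iter=500, random_state=1234
)
acc_mlp = fit_and_score(mlp, X_train2, X_test2)

print("BernoulliNB (full vocabulary):", acc_nb)
print("BernoulliNB (1000 features, bigrams):", acc_nb2)
print("LogisticRegression:", acc_lr)
print("MLPClassifier:", acc_mlp)
```

Fitting the vectorizer on the training text only, and calling just `transform` on the test text, keeps test-set vocabulary and document frequencies from leaking into the features.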