Description
Text Classification with the perceptron
The goal of this lab session is to implement the binary perceptron and apply it to sentiment analysis, in particular to predicting the sentiment of movie reviews. The data you will use for this purpose are available from http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz. Download the archive and, for each class, take the first 800 instances (in alphabetical file order) as training data and the remaining 200 as test data. Then implement and evaluate the following:
- standard binary perceptron with bag-of-words representation (2 marks)
- randomizing the order of the training instances (use Python’s random library to fix the random seed so that your results are reproducible!) (1 mark)
- multiple passes over the training instances (show the learning progress in a graph) (1 mark)
- instead of using the last weight vector for testing, taking the average of all the weight vectors calculated for each class (0.5 mark)
- implement two feature types beyond bag-of-words. Discuss your choice of features. Do any of them help improve accuracy? (1 mark)
- What are the most positively-weighted features for each class? Give the top 10 for each class and comment on whether they make sense (if they don’t, you might have a bug!). If we were to apply the classifier we learnt to a different domain, such as laptop reviews or restaurant reviews, do you think these features would generalize well? Can you propose better features for the new domain? (0.5 mark)
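
As a starting point, the first four items above could be sketched roughly as follows. This is a minimal illustration, not a reference solution. It assumes that each review has already been tokenized into a list of strings, that labels are encoded as +1 (positive) and -1 (negative), and that a single weight vector is kept as a sparse dictionary. The function names (bag_of_words, predict, train_perceptron, top_features) are hypothetical and chosen only for this example. The averaging shown is the naive version, which sums the whole weight vector after every instance and is therefore slow on the full data set.

```python
import random
from collections import Counter


def bag_of_words(tokens):
    """Map a token list to a sparse {feature: count} dictionary."""
    return Counter(tokens)


def predict(weights, features):
    """Return +1 or -1 from the sign of the dot product (a score of 0 maps to +1)."""
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1 if score >= 0 else -1


def train_perceptron(data, epochs=5, seed=0):
    """Train on (features, label) pairs with labels in {+1, -1} (assumed encoding).

    Returns (last_weights, averaged_weights, training_errors_per_epoch).
    """
    rng = random.Random(seed)   # fixed seed, so the shuffles are reproducible
    data = list(data)           # copy, so the caller's list is not reordered
    weights = Counter()
    totals = Counter()          # running sum of the weight vector after each instance
    steps = 0
    errors_per_epoch = []
    for _ in range(epochs):     # multiple passes over the training instances
        rng.shuffle(data)
        errors = 0
        for features, label in data:
            if predict(weights, features) != label:
                errors += 1
                for f, v in features.items():
                    weights[f] += label * v   # perceptron update on a mistake
            for f, w in weights.items():      # naive averaging: O(|weights|) per step
                totals[f] += w
            steps += 1
        errors_per_epoch.append(errors)
    averaged = {f: t / steps for f, t in totals.items()}
    return dict(weights), averaged, errors_per_epoch


def top_features(weights, k=10, positive=True):
    """Return the k features with the highest (or, if positive=False, lowest) weights."""
    return sorted(weights, key=weights.get, reverse=positive)[:k]
```

The per-epoch error counts returned by train_perceptron can be plotted to show the learning progress, and top_features can be called with positive=True and positive=False to inspect the features pulling towards each class.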



