Description
Tasks
You are welcome to try any model you like on this task, and you are free to use any libraries you like to extract features. However, you must meet the following requirements:
- You must implement a Bernoulli Naive Bayes model (i.e., the Naive Bayes model from Lecture 5) from scratch (i.e., without using any external libraries such as scikit-learn). You are free to use any text preprocessing that you like with this model. Hint 1: you may want to use Laplace smoothing with your Bernoulli Naive Bayes model. Hint 2: you can choose the vocabulary for your model (i.e., which words you include vs. ignore), but you should provide justification for the vocabulary you use.
- You must run experiments using at least two different classifiers from the scikit-learn package (other than Bernoulli Naive Bayes). Possible options are:
- Logistic regression
(https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- Decision trees
(https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- Support vector machines [to be introduced in Lecture 10 on Oct. 7th]
(https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)
- You must develop a model validation pipeline (e.g., using k-fold cross validation or a held-out validation set) and report on the performance of the above-mentioned model variants.
- You should evaluate all the model variants above (i.e., Naive Bayes and the scikit-learn models) using your validation pipeline (i.e., without submitting to Kaggle) and report on these comparisons in your write-up. Ideally, you should only run your “best” model on the Kaggle competition, since you are limited to two submissions to Kaggle per day.
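As an illustration of the first and third requirements, the following is a minimal sketch (not a reference solution) of a Bernoulli Naive Bayes classifier with Laplace smoothing, together with a simple k-fold cross-validation loop. It uses only the Python standard library. The names BernoulliNaiveBayes, k_fold_accuracy, alpha and make_model are hypothetical and chosen for this example only. The sketch assumes that documents are already tokenized into lists of words, that the vocabulary is simply every word seen in the training data, and that k is no larger than the number of documents. Your own preprocessing and vocabulary choices will likely differ.

```python
import math
import random
from collections import Counter


class BernoulliNaiveBayes:
    """Bernoulli Naive Bayes over binary word-presence features (sketch)."""

    def __init__(self, alpha=1.0):
        # alpha > 0 is the Laplace smoothing strength (alpha = 1 is add-one).
        self.alpha = alpha

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of class labels.
        self.classes = sorted(set(labels))
        # Assumption: the vocabulary is every word seen in training.
        self.vocab = sorted({w for doc in docs for w in doc})
        n_docs = Counter(labels)
        doc_freq = {c: Counter() for c in self.classes}
        for doc, y in zip(docs, labels):
            doc_freq[y].update(set(doc))  # count each word once per document
        self.log_prior = {c: math.log(n_docs[c] / len(docs)) for c in self.classes}
        self.log_p, self.log_not_p = {}, {}
        for c in self.classes:
            denom = n_docs[c] + 2 * self.alpha  # two outcomes: present/absent
            self.log_p[c], self.log_not_p[c] = {}, {}
            for w in self.vocab:
                p = (doc_freq[c][w] + self.alpha) / denom
                self.log_p[c][w] = math.log(p)
                self.log_not_p[c][w] = math.log(1.0 - p)
        return self

    def predict(self, docs):
        predictions = []
        for doc in docs:
            present = set(doc)
            best_class, best_score = None, -math.inf
            for c in self.classes:
                # Sum log-likelihoods over the whole vocabulary: absent words
                # also contribute, which is what makes the model Bernoulli.
                score = self.log_prior[c]
                for w in self.vocab:
                    score += self.log_p[c][w] if w in present else self.log_not_p[c][w]
                if score > best_score:
                    best_class, best_score = c, score
            predictions.append(best_class)
        return predictions


def k_fold_accuracy(make_model, docs, labels, k=5, seed=0):
    """Mean validation accuracy over k folds; make_model builds a fresh model."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = make_model().fit([docs[i] for i in train], [labels[i] for i in train])
        preds = model.predict([docs[i] for i in held_out])
        correct = sum(p == labels[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / k


if __name__ == "__main__":
    # Toy data for illustration only; it is not the competition dataset.
    toy_docs = [["good", "great"], ["good", "fun"], ["bad", "awful"], ["bad", "boring"]]
    toy_labels = ["pos", "pos", "neg", "neg"]
    model = BernoulliNaiveBayes(alpha=1.0).fit(toy_docs, toy_labels)
    print(model.predict([["good"], ["bad"]]))
    print(k_fold_accuracy(BernoulliNaiveBayes, toy_docs * 3, toy_labels * 3, k=3))
```

Words that appear only in a validation document are ignored at prediction time, because the loop runs over the training vocabulary. Passing a factory (here the class itself) to k_fold_accuracy ensures each fold trains a fresh model, so no information leaks from one fold to the next.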
Project write-up
Your team must submit a project write-up that is a maximum of five pages (single-spaced, 10pt font or larger; extra pages for references/bibliographical content and appendices can be used). We highly recommend that students use LaTeX to complete their write-ups and use BibTeX for citations. You are free to structure the report however you see fit; the guidelines and recommendations below are only a suggested structure, and you may deviate from them.
Abstract (100-250 words) Summarize the project task and your most important findings.
Introduction (5+ sentences) Summarize the project task, the dataset, and your most important findings. This should be similar to the abstract but more detailed.
Related work (4+ sentences) Summarize previous literature related to the sentiment classification problem.
Dataset and setup (3+ sentences) Very briefly describe the dataset and any basic data pre-processing methods that are common to all your approaches (e.g., tokenizing). Note: You do not need to explicitly verify that the data satisfies the i.i.d. assumption (or any of the other formal assumptions for linear classification).
Proposed approach (7+ sentences) Briefly describe the different models you implemented/compared and the features you designed, providing citations as necessary. If you use or build upon an existing model based on previously published work, it is essential that you properly cite and acknowledge this previous work. Discuss algorithm selection and implementation. Include any decisions about training/validation split, regularization strategies, any optimization tricks, setting hyper-parameters, etc. It is not necessary to provide detailed derivations for the models you use, but you should provide at least a few sentences of background (and motivation) for each model.
Results (7+ sentences, possibly with figures or tables) Provide results on the different models you implemented (e.g., accuracy on the validation set, runtimes). You should report your leaderboard test set accuracy in this section, but most of your results should be on your validation set (or from cross validation).
Discussion and Conclusion (3+ sentences) Summarize the key takeaways from the project and possibly directions for future investigation.
Statement of Contributions (1-3 sentences) State the breakdown of the workload.





