Name: Data Science-Project 1: Spam/Ham Classification Solved

In this project, you will use what you’ve learned in class to create a classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (non-spam) emails. We provide some skeleton code to fill in, and we will evaluate your work based on your model’s accuracy and your written responses in this notebook.

After this project, you should feel comfortable with the following:

Feature engineering with text data

Using sklearn libraries to process data and fit models

Validating the performance of your model and minimizing overfitting

Generating and analyzing precision-recall curves

Part I – Initial Analysis

Loading in the Data

In email classification, our goal is to classify emails as spam or not spam (referred to as “ham”) using features generated from the text in the email.

The dataset consists of email messages and their labels (0 for ham, 1 for spam). Your labeled training dataset contains 8348 labeled examples, and the test set contains 1000 unlabeled examples.

Run the following cells to load in the data into DataFrames.

The train DataFrame contains labeled data that you will use to train your model. It contains four columns:

id : An identifier for the training example
subject : The subject of the email
email : The text of the email
spam : 1 if the email is spam, 0 if the email is ham (not spam)

The test DataFrame contains 1000 unlabeled emails. You will predict labels for these emails and submit your predictions to Kaggle for evaluation.

In [23]: original_training_data = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()

original_training_data.head()

Out[23]:

   id  subject                                           email                                              spam
0  0   Subject: A&L Daily to be auctioned in bankrupt…   url: http://boingboing.net/#85534171\n date: n…    0
1  1   Subject: Wired: “Stronger ties between ISPs an…   url: http://scriptingnews.userland.com/backiss…    0
2  2   Subject: It’s just too small …                    <html>\n <head>\n </head>\n <body>\n <font siz…    1
3  3   Subject: liberal defnitions\n                     depends on how much over spending vs. how much…    0
4  4   Subject: RE: [ILUG] Newbie seeks advice – Suse…   hehe sorry but if you hit caps lock twice the …    0

Question 1a

First, let’s check if our data contains any missing values.

Fill in the cell below to print the number of NaN values in each column.

If there are NaN values, replace them with appropriate filler values (i.e., NaN values in the subject or email columns should be replaced with empty strings). Print the number of NaN values in each column after this modification to verify that there are no NaN values left.

Note that while there are no NaN values in the spam column, we should be careful when replacing NaN labels. Doing so without consideration may introduce significant bias into our model when fitting.

The provided test checks that there are no missing values in your dataset.

In [41]: # BEGIN YOUR CODE
# -----------------------
print('Before imputation:')
print(original_training_data.isnull().sum())

# Replace missing subjects with empty strings
original_training_data["subject"] = original_training_data["subject"].fillna("")

print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())
# -----------------------
# END YOUR CODE

Before imputation:
id         0
subject    6
email      0
spam       0
dtype: int64
------------
After imputation:
id         0
subject    0
email      0
spam       0
dtype: int64

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————–

Test summary

Passed: 1

Failed: 0

[ooooooooook] 100.0% passed

Question 1b

In the cell below, print the text of the first ham and the first spam email in the original training set.

The provided tests just ensure that you have assigned first_ham and first_spam to rows in the data, but only the hidden tests check that you selected the correct observations.

In [275]: # BEGIN YOUR CODE
# -----------------------
# Select the ham rows and the spam rows, then take the email text (column 2) of the first row of each
first_ham_frame = original_training_data.loc[original_training_data['spam'] == 0]
first_spam_frame = original_training_data.loc[original_training_data['spam'] == 1]
first_ham = first_ham_frame.iloc[0, 2]
first_spam = first_spam_frame.iloc[0, 2]
# -----------------------
# END YOUR CODE

print('The text of the first Ham:')
print('------------')
print(first_ham)

print('The text of the first Spam:')
print('------------')
print(first_spam)

The text of the first Ham:

————

url: http://boingboing.net/#85534171
date: not supplied

arts and letters daily, a wonderful and dense blog, has folded up its tent due
to the bankruptcy of its parent company. a&l daily will be auctioned off by the
receivers. link[1] discuss[2] (_thanks, misha!_)

(http://www.quicktopic.com/boing/h/zlfterjnd6jf)

The text of the first Spam:

————

<html>

<head>

</head>

<body>

a man endowed with a 7-8" hammer is simply
better equipped than a man with a 5-6"hammer.
would you rather have more than enough to get the job done or fall =
short. it's totally up to you. our methods are guaranteed to increase y=
our size by 1-3" <a href=3d"http://209.163.187.47/cgi-bin/index.php?10=
004">come in here and see how</a>

</body>

</html>

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~

Running tests

——————————————————————

—

Test summary Passed: 2

Failed: 0

[ooooooooook] 100.0% passed

Question 1c

Discuss one thing you notice that is different between the two emails that might relate to the identification of spam.

Answer: The spam email is written in HTML (it contains tags such as <html> and <body>), while the ham email is plain text. The spam email also contains phrases that are common in spam, such as “increase your size”.

Training Validation Split

The training data we downloaded is all the data we have available for both training models and validating the models that we train. We therefore need to split the training data into separate training and validation datasets. You will need this validation data to assess the performance of your classifier once you are finished training.

Note that we set the seed (random_state) to 42. This will produce a pseudo-random sequence of numbers that is the same for every student. Do not modify this in the following questions, as our tests depend on this random seed.

In [44]: from sklearn.model_selection import train_test_split

train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)

Basic Feature Engineering

We would like to take the text of an email and predict whether the email is ham or spam. This is a classification problem, and here we use logistic regression to train a classifier.

Recall that to train a logistic regression model we need:

a numeric feature matrix 𝑋

a vector of corresponding binary labels 𝑦.

Unfortunately, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression:

Each row of 𝑋 is an email.

Each column of 𝑋 contains one feature for all the emails.

We’ll guide you through creating a simple feature, and you’ll create more interesting ones when you are trying to increase your accuracy.

Question 2

Create a function called words_in_texts that takes in a list of words and a pandas Series of email texts. It should output a 2-dimensional NumPy array containing one row for each email text. The row should contain either a 0 or a 1 for each word in the list: 0 if the word doesn’t appear in the text and 1 if the word does. For example:

>>> words_in_texts(['hello', 'bye', 'world'],
...                pd.Series(['hello', 'hello worldhello']))
array([[1, 0, 0],
       [1, 0, 1]])

Hint: pandas.Series.str.contains (https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html)

The provided tests make sure that your function works correctly, so that you can use it for future questions.

In [130]: def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in

    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    # BEGIN YOUR CODE
    # -----------------------
    temp_array = []
    for text in texts:
        # Build one row of 0/1 indicators per email (substring match)
        row = []
        for word in words:
            if word in text:
                row.append(1)
            else:
                row.append(0)
        temp_array.append(row)
    indicator_array = np.array(temp_array)
    # -----------------------
    # END YOUR CODE
    return indicator_array

words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello worldhello']))

Out[130]: array([[1, 0, 0],
       [1, 0, 1]])

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————–

Test summary

Passed: 2

Failed: 0

[ooooooooook] 100.0% passed
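
The hint for Question 2 points at pandas.Series.str.contains. As a rough sketch of how that hint could be used (this is an alternative to the loop above, not the graded solution; the name words_in_texts_vectorized is made up for illustration), one indicator column can be built per word and the columns stacked:

```python
import numpy as np
import pandas as pd


def words_in_texts_vectorized(words, texts):
    """Return an (n, p) array of 0/1 indicators: does word j occur in text i?"""
    # regex=False treats each word as a literal substring, so entries such
    # as '$' are not interpreted as regular-expression syntax.
    # na=False makes a missing text count as "word not present".
    columns = [texts.str.contains(word, regex=False, na=False) for word in words]
    # The list has one boolean Series per word (shape p x n), so transpose it.
    return np.array(columns).T.astype(int)


example = words_in_texts_vectorized(
    ['hello', 'bye', 'world'],
    pd.Series(['hello', 'hello worldhello']),
)
print(example)
```

Like the loop version, this matches substrings rather than whole words, so 'world' is found inside 'worldhello'.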

Basic EDA

We need to identify some features that allow us to distinguish spam emails from ham emails. One idea is to compare the distribution of a single feature in spam emails to the distribution of the same feature in ham emails.

If the feature is itself a binary indicator (such as whether a certain word occurs in the text), this amounts to comparing the proportion of spam emails with the word to the proportion of ham emails with the word.

The following plot (which was created using sns.barplot) compares the proportion of emails in each class containing a particular set of words.

In [132]: from IPython.display import display, Markdown
df = pd.DataFrame({
    'word_1': [1, 0, 1, 0],
    'word_2': [0, 1, 0, 1],
    'type': ['spam', 'ham', 'ham', 'ham']
})
display(Markdown("> Our original DataFrame has some word columns and a type column. You can think of each row as a sentence, and the value of 1 or 0 indicates the number of occurrences of the word in this sentence."))
display(df);
display(Markdown("> `melt` will turn columns into variables; notice how word_1 and word_2 become `variable`, and their values are stored in the value column."))
display(df.melt("type"))

Our original DataFrame has some word columns and a type column. You can think of each row as a sentence, and the value of 1 or 0 indicates the number of occurrences of the word in this sentence.

   word_1  word_2  type
0  1       0       spam
1  0       1       ham
2  1       0       ham
3  0       1       ham

melt will turn columns into variables; notice how word_1 and word_2 become variable, and their values are stored in the value column.

   type  variable  value
0  spam  word_1    1
1  ham   word_1    0
2  ham   word_1    1
3  ham   word_1    0
4  spam  word_2    0
5  ham   word_2    1
6  ham   word_2    0
7  ham   word_2    1

We can create a bar chart like the one above comparing the proportion of spam and ham emails containing certain words. Choose a set of words that are different from the ones above, but also have different proportions for the two classes. Make sure that we only consider emails from train.
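
The numbers behind such a bar chart can be sketched as follows. The four toy emails and the two words are invented stand-ins for train and for your own word list; the sketch only computes the per-class proportions, which is what the bars would show.

```python
import pandas as pd

# Hypothetical toy data standing in for the `train` DataFrame.
toy = pd.DataFrame({
    'email': ['free money now', 'meeting at noon',
              'claim your free prize', 'lunch money?'],
    'spam': [1, 0, 1, 0],
})
words = ['free', 'money']

# One 0/1 indicator column per word, plus a readable class label.
indicators = pd.DataFrame({
    word: toy['email'].str.contains(word, regex=False).astype(int)
    for word in words
})
indicators['type'] = toy['spam'].map({0: 'ham', 1: 'spam'})

# Long format: one row per (email, word) pair, as in the melt example.
long_form = indicators.melt('type')

# The mean of a 0/1 indicator is the proportion of emails containing the word.
proportions = long_form.groupby(['variable', 'type'])['value'].mean()
print(proportions)
```

In this toy data 'free' separates the classes completely, while 'money' appears in half of each class and would make a poor feature. Passing long_form to sns.barplot with x='variable', y='value', and hue='type' should draw the corresponding chart, assuming seaborn is imported as sns.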

Question 3

When the feature is binary, it makes sense to compare its proportions across classes (as in the previous question). Otherwise, if the feature can take on numeric values, we can compare the distributions of these values for different classes.

Create a class conditional density plot like the one above (using sns.distplot ), comparing the distribution of the length of spam emails to the distribution of the length of ham emails in the training set. Set the x-axis limit from 0 to 50000.

In [142]: # BEGIN YOUR CODE
# -----------------------
# Compare the distribution of email length (in characters) for ham and spam
sns.distplot(train[train['spam'] == 0]['email'].str.len(), label='Ham')
sns.distplot(train[train['spam'] == 1]['email'].str.len(), label='Spam')
plt.xlabel("Length of email body")
plt.ylabel("Distribution")
plt.legend()
plt.xlim(0, 50000)
# -----------------------
# END YOUR CODE

/Users/temirlan/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `kdeplot` (an axes-level function for kernel density plots).
  warnings.warn(msg, FutureWarning)

Out[142]: (0.0, 50000.0)

Basic Classification

Notice that the output of words_in_texts(words, train[’email’]) is a numeric matrix containing features for each email. This means we can use it directly to train a classifier!

Question 4

We’ve given you 5 words that might be useful as features to distinguish spam/ham emails. Use these words as well as the train DataFrame to create two NumPy arrays: X_train and Y_train .

X_train should be a matrix of 0s and 1s created by using your words_in_texts function on all the emails in the training set.

Y_train should be a vector of the correct labels for each email in the training set.

The provided tests check that the dimensions of your feature matrix (X) are correct, and that your features and labels are binary (i.e. consists of 0 and 1, no other values). It does not check that your function is correct; that was verified in a previous question.

In [144]: some_words = ['drug', 'bank', 'prescription', 'memo', 'private']

# BEGIN YOUR CODE
# -----------------------
X_train = words_in_texts(some_words, train['email'])
Y_train = train['spam'].values
# -----------------------
# END YOUR CODE

X_train[:5], Y_train[:5]

Out[144]: (array([[0, 0, 0, 0, 0], [0, 0, 0, 0, 0],

[0, 0, 0, 0, 0],

[0, 0, 0, 1, 0]]), array([0, 0, 0, 0, 0]))

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————–

Test summary

Passed: 3

Failed: 0

[ooooooooook] 100.0% passed

Question 5

Now we have matrices we can give to scikit-learn!

Using the LogisticRegression (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier, train a logistic regression model using X_train and Y_train. Then, output the accuracy of the model (on the training data) in the cell below. You should get an accuracy around 0.75.

The provided test checks that you initialized your logistic regression model correctly.

In [146]: from sklearn.linear_model import LogisticRegression

# BEGIN YOUR CODE

# ———————–

model = LogisticRegression(fit_intercept=True)

model.fit(X_train, Y_train)

training_accuracy = model.score(X_train, Y_train)

# ———————–

# END YOUR CODE

print("Training Accuracy: ", training_accuracy)

Training Accuracy: 0.7576201251164648

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————

—

Test summary

Passed: 1

Failed: 0

[ooooooooook] 100.0% passed

Evaluating Classifiers

That doesn’t seem too shabby! But the classifier you made above isn’t as good as this might lead us to believe. First, we are evaluating accuracy on the training set, which may lead to a misleading accuracy measure, especially if we used the training set to identify discriminative features. In future parts of this analysis, it will be safer to hold out some of our data for model validation and comparison.

Presumably, our classifier will be used for filtering, i.e. preventing messages labeled spam from reaching someone’s inbox. There are two kinds of errors we can make:

False positive (FP): a ham email gets flagged as spam and filtered out of the inbox.

False negative (FN): a spam email gets mislabeled as ham and ends up in the inbox.

These definitions depend both on the true labels and the predicted labels. False positives and false negatives may be of differing importance, leading us to consider more ways of evaluating a classifier, in addition to overall accuracy:

Precision measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FP}}$ of emails flagged as spam that are actually spam.

Recall measures the proportion $\frac{\text{TP}}{\text{TP} + \text{FN}}$ of spam emails that were correctly flagged as spam.

False-alarm rate measures the proportion $\frac{\text{FP}}{\text{FP} + \text{TN}}$ of ham emails that were incorrectly flagged as spam.

The following image might help:

Note that a true positive (TP) is a spam email that is classified as spam, and a true negative (TN) is a ham email that is classified as ham.
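
As a small worked example of these definitions, the sketch below counts the four outcomes on eight made-up labels and predictions (not the project data) and derives the three metrics from the counts. The helper name confusion_counts is made up for illustration.

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Count (TP, FP, FN, TN), treating 1 (spam) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # spam flagged as spam
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # ham flagged as spam
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # spam labeled as ham
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # ham labeled as ham
    return tp, fp, fn, tn


# Made-up labels: 3 spam emails and 5 ham emails.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)          # 2 / 3
recall = tp / (tp + fn)             # 2 / 3
false_alarm_rate = fp / (fp + tn)   # 1 / 5
print(tp, fp, fn, tn, precision, recall, false_alarm_rate)
```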

Question 6a

Suppose we have a classifier zero_predictor that always predicts 0 (never predicts positive). How many false positives and false negatives would this classifier have if it were evaluated on the training set and its results were compared to Y_train ? Fill in the variables below (answers can be hard-coded):

Tests in Question 6 only check that you have assigned appropriate types of values to each response variable, but do not check that your answers are correct.

1918

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~

Running tests

——————————————————————

—

Test summary

Passed: 2

Failed: 0

[ooooooooook] 100.0% passed

Question 6b

What are the accuracy and recall of zero_predictor (classifies every email as ham) on the training set? Do NOT use any sklearn functions.
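
One way to reason about zero_predictor without sklearn is sketched below on a made-up label vector; in the notebook, Y_train would take the place of y_toy. Because the classifier never predicts 1, it can have no true or false positives, so every spam email becomes a false negative.

```python
import numpy as np

# Made-up labels standing in for Y_train: 2 spam emails, 6 ham emails.
y_toy = np.array([0, 0, 1, 0, 1, 0, 0, 0])
zero_pred = np.zeros_like(y_toy)  # always predict ham (0)

tp = int(np.sum((zero_pred == 1) & (y_toy == 1)))  # 0: nothing is ever flagged
fp = int(np.sum((zero_pred == 1) & (y_toy == 0)))  # 0: nothing is ever flagged
fn = int(np.sum((zero_pred == 0) & (y_toy == 1)))  # every spam email is missed

accuracy = float(np.mean(zero_pred == y_toy))  # equals the proportion of ham
recall = tp / (tp + fn)                        # 0 whenever there is any spam
print(fp, fn, accuracy, recall)
```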

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————

—

Test summary

Passed: 2

Failed: 0

[ooooooooook] 100.0% passed

Question 6c

Provide brief explanations of the results from 6a and 6b. Why do we observe each of these values (FP, FN, accuracy, recall)?

Answer: zero_predictor never predicts spam, so it cannot produce any false positives (FP = 0), and every spam email in the training set becomes a false negative (FN equals the number of spam emails). Its accuracy equals the proportion of ham emails in the training set, because those are the only emails it labels correctly. Its recall is 0 because it has no true positives: none of the spam emails are flagged as spam.

Question 6d

Compute the precision, recall, and false-alarm rate of the LogisticRegression classifier created and trained in Question 5. Note: Do NOT use any sklearn functions.

In [253]: # BEGIN YOUR CODE
# -----------------------
predict = model.predict(X_train)

# Count each outcome, treating spam (1) as the positive class
true_pos = sum((predict == Y_train) & (Y_train == 1))
false_pos = sum((predict != Y_train) & (Y_train == 0))
false_neg = sum((predict != Y_train) & (Y_train == 1))
true_neg = sum((predict == Y_train) & (Y_train == 0))

logistic_predictor_precision = true_pos / (true_pos + false_pos)
logistic_predictor_recall = true_pos / (true_pos + false_neg)
logistic_predictor_far = false_pos / (false_pos + true_neg)
# -----------------------
# END YOUR CODE

In [254]: ok.grade(“q6d”);

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————–

Test summary

Passed: 3

Failed: 0

[ooooooooook] 100.0% passed

Question 6e

Our logistic regression classifier got about 75.8% prediction accuracy (number of correct predictions / total). How does this compare with predicting 0 for every email?
Given the word features we gave you above, name one reason this classifier is performing poorly. Hint: Think about how prevalent these words are in the email set.
Which of these two classifiers would you prefer for a spam filter and why? Describe your reasoning and relate it to at least one of the evaluation metrics you have computed so far.

Part II – Moving Forward

With this in mind, it is now your task to make the spam filter more accurate. In order to get full credit on the accuracy part of this assignment, you must get at least 77% accuracy on the test set. To see your accuracy on the test set, you will use your classifier to predict every email in the test DataFrame and upload your predictions to Kaggle.

Here are some ideas for improving your model:

1. Finding better features based on the email text. Some example features are:
   A. Number of characters in the subject / body
   B. Number of words in the subject / body
   C. Use of punctuation (e.g., how many ‘!’ were there?)
   D. Number / percentage of capital letters
   E. Whether the email is a reply to an earlier email or a forwarded email
2. Finding better words to use as features. Which words are the best at distinguishing emails? This requires digging into the email text itself.
3. Better data processing. For example, many emails contain HTML as well as text. You can consider extracting out the text from the HTML to help you find better words. Or, you can match HTML tags themselves, or even some combination of the two.
4. Model selection. You can adjust parameters of your model (e.g. the regularization parameter) to achieve higher accuracy. Recall that you should use cross-validation to do feature and model selection properly! Otherwise, you will likely overfit to your training data.
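
A few of the example features listed above could be computed with plain pandas string methods, roughly as in the sketch below. The function name, the feature names, and the two toy emails are invented for illustration. Note that the email bodies in this notebook were lowercased when the data was loaded, so the capital-letter feature here is computed on the subject instead.

```python
import numpy as np
import pandas as pd


def simple_text_features(subjects, emails):
    """Build a small numeric feature matrix from subject and body text."""
    subjects = subjects.fillna('')
    emails = emails.fillna('')
    features = pd.DataFrame({
        # Number of characters and whitespace-separated words in the body
        'body_chars': emails.str.len(),
        'body_words': emails.str.split().str.len(),
        # Number of exclamation marks in the body
        'exclamations': emails.str.count('!'),
        # Crude reply indicator: the subject contains "re:" (case-insensitive)
        'is_reply': subjects.str.lower().str.contains('re:', regex=False).astype(int),
        # Fraction of subject characters that are capital letters
        'subject_caps_frac': subjects.str.count(r'[A-Z]') / subjects.str.len().clip(lower=1),
    })
    return features.to_numpy()


toy_subjects = pd.Series(['Subject: RE: lunch', 'Subject: WIN CASH NOW!!!'])
toy_emails = pd.Series(['see you at noon', 'click here!!! free cash!'])
toy_features = simple_text_features(toy_subjects, toy_emails)
print(toy_features)
```

The resulting array can be stacked next to the output of words_in_texts with np.hstack before fitting; whatever is done to the training emails has to be repeated for the validation and test emails.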

You may use whatever method you prefer in order to create features, but you are not allowed to import any external feature extraction libraries. In addition, you are only allowed to train logistic regression models. No random forests, k-nearest-neighbors, neural nets, etc.

We have not provided any code to do this, so feel free to create as many cells as you need in order to tackle this task. However, answering questions 7, 8, and 9 should help guide you.

Note: You should use the validation data to evaluate your model and get a better sense of how it will perform on the Kaggle evaluation.
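
For the model-selection idea, a cross-validated comparison of regularization strengths might look like the sketch below. The random toy matrix stands in for a real feature matrix, and the candidate values of C are arbitrary assumptions; only LogisticRegression is used, in line with the rules above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy binary features standing in for the output of words_in_texts.
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))
# Toy labels: "spam" when either of the first two words is present.
y_toy = X_toy[:, 0] | X_toy[:, 1]

cv_scores = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    # In scikit-learn, smaller C means stronger regularization.
    candidate = LogisticRegression(C=C, max_iter=1000)
    fold_scores = cross_val_score(candidate, X_toy, y_toy, cv=5, scoring='accuracy')
    cv_scores[C] = fold_scores.mean()

best_C = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_C)
```

Selecting C by cross-validated accuracy on the training split keeps the validation split available for a final check before submitting.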

Question 7: EDA

In the cell below, show a visualization that you used to select features for your model.

Include both

A plot showing something meaningful about the data that helped you during feature / model selection.
2-3 sentences describing what you plotted and what its implications are for your features.

Feel free to create as many plots as you want in your process of feature selection, but select one for the response cell below.

You should not just produce an identical visualization to question 3. Specifically, don’t show us a bar chart of proportions, or a one-dimensional class-conditional density plot. Any other plot is acceptable, as long as it comes with thoughtful commentary. Here are some ideas:

Consider the correlation between multiple features (look up correlation plots and heatmap).
Try to show redundancy in a group of features (e.g. body and html might co-occur relatively frequently, or you might be able to design a feature that captures all html tags and compare it to these).
Visualize which words have high or low values for some useful statistic.
Visually depict whether spam emails tend to be wordier (in some sense) than ham emails.

Generate your visualization in the cell below and provide your description in a comment.

In [273]: # Write your description (2-3 sentences) as a comment here:
# I used certain words to find their correlation and involvement to a
# mails. In many cases, words such as 'extra' and 'money' have highe
# to be a spam if both of them met in a mail.

# Write the code to generate your visualization here:
words = ['extra', 'money', 'free', 'help', 'submit']

# Work on a copy so that the train DataFrame itself is not modified
train_words = train.copy()
for word in words:
    # words_in_texts returns an (n, 1) array here; keep its single column
    train_words[word] = words_in_texts([word], train['email'])[:, 0]

train_plot = train_words[words]
sns.heatmap(train_plot.corr(), annot=True, cmap="YlGnBu")

Out[273]: <AxesSubplot:>

Question 8: Precision-Recall Curve

We can trade off between precision and recall. In most cases we won’t be able to get both perfect precision (i.e. no false positives) and recall (i.e. no false negatives), so we have to compromise.

Recall that logistic regression calculates the probability that an example belongs to a certain class.

Then, to classify an example we say that an email is spam if our classifier gives it ≥ 0.5 probability of being spam.

However, we can adjust that cutoff: we can say that an email is spam only if our classifier gives it ≥ 0.7 probability of being spam.

This is how we can trade off false positives and false negatives.
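
The effect of moving the cutoff can be seen on a handful of made-up probabilities (in the notebook these would come from the classifier's predicted spam probabilities; the helper name classify is invented here):

```python
import numpy as np

# Made-up spam probabilities and true labels for five emails.
spam_probability = np.array([0.10, 0.55, 0.65, 0.80, 0.95])
y_true = np.array([0, 0, 1, 1, 1])


def classify(probabilities, cutoff):
    """Label an email as spam (1) when its probability is at least the cutoff."""
    return (probabilities >= cutoff).astype(int)


pred_05 = classify(spam_probability, 0.5)
pred_07 = classify(spam_probability, 0.7)

for name, pred in [('cutoff 0.5', pred_05), ('cutoff 0.7', pred_07)]:
    false_pos = int(np.sum((pred == 1) & (y_true == 0)))
    false_neg = int(np.sum((pred == 0) & (y_true == 1)))
    print(name, pred, 'FP =', false_pos, 'FN =', false_neg)
```

With the cutoff at 0.5 this toy example has one false positive and no false negatives; raising it to 0.7 removes the false positive but misses one spam email.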

The precision-recall curve shows this trade off for each possible cutoff probability. In the cell below, plot a precision-recall curve (http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html#plot-the-precision-recall-curve) for your final classifier.

In [274]: from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

# Note that you'll want to use the .predict_proba(…) method for your classifier
# instead of .predict(…) so you get probabilities, not classes

# BEGIN YOUR CODE
# -----------------------
# END YOUR CODE

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-274-6983a2dbba4f> in <module>
     17
     18 y_score = model.predict_proba(X_test)
---> 19 precision, recall, _ = precision_recall_curve(y_test, y_score[:, 1])
     20 plt.step(recall, precision, where='post')
     21 plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')

NameError: name 'y_test' is not defined
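
The cell above fails because y_test does not exist: the test emails are unlabeled, so there is nothing to compare the scores against. A possible fix is to compute the curve on held-out labeled data instead, for example the validation split. The sketch below illustrates this on synthetic data; X_all, y_all, and the noise level are made up, and in the notebook the features and labels of val would be used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for word-indicator features and spam labels.
rng = np.random.default_rng(0)
X_all = rng.integers(0, 2, size=(400, 4))
flip = rng.random(400) < 0.1                       # 10% label noise
y_all = np.where(flip, 1 - X_all[:, 0], X_all[:, 0])

# Hold out labeled data, since a precision-recall curve needs true labels.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42
)

clf = LogisticRegression().fit(X_tr, y_tr)
val_scores = clf.predict_proba(X_val)[:, 1]        # probability of class 1 (spam)

precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
print(len(thresholds), precision[:3], recall[:3])
```

The resulting precision and recall arrays can then be drawn with plt.step(recall, precision, where='post'), as in the failed cell.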

Question 9: Submitting to Kaggle

The following code will write your predictions on the test dataset to a CSV, which you can submit to Kaggle. You may need to modify it to suit your needs.

Save your predictions in a 1-dimensional array called test_predictions. Even if you are not submitting to Kaggle, please make sure you’ve saved your predictions to test_predictions as this is how your score for this question will be determined.

Remember that if you’ve performed transformations or featurization on the training data, you must also perform the same transformations on the test data in order to make predictions. For example, if you’ve created features for the words “drug” and “money” on the training data, you must also extract the same features in order to use scikit-learn’s .predict(…) method.

You should submit your CSV files to https://www.kaggle.com/c/cose471sp21project1

The provided tests check that your predictions are in the correct format, but you must submit to Kaggle to evaluate your classifier accuracy.

In [367]: # BEGIN YOUR CODE
# -----------------------
words = ['penis', 'viagra', 'subscribe', 'increase', 'cash', 'sex',
         'length', 'refund', '<body>', 'extra', '$', '$$$', 'small', 'dick',
         'cancel', 'call', 'extra', 'won', 'refund', 'verify',
         'affordable', 'call free', 'casino', 'cure', 'make money',
         'watch free', 'double your', 'while you sleep', 'money',
         # (one entry beginning with 're is cut off in the source listing)
         'drug', 'bank', 'memo', 'private', 'adult', 'movie', 'click',
         'congrats', 'discount', 'sale', 'income', 'please']

# Build the same word-indicator features for the training and test emails
X_train = words_in_texts(words, train.loc[:, 'email'])
Y_train = train.loc[:, 'spam']
X_test = words_in_texts(words, test['email'])

model = LogisticRegression()
model.fit(X_train, Y_train)

test_predictions = model.predict(X_test)
# -----------------------
# END YOUR CODE

In [368]: ok.grade(“q9”);

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~

Running tests

——————————————————————–

Test summary

Passed: 3

Failed: 0

[ooooooooook] 100.0% passed

In [366]: from datetime import datetime

# Assuming that your predictions on the test set are stored in a 1-dimensional array called
# test_predictions. Feel free to modify this cell as long as you create a CSV in the right format.

# Construct and save the submission:
submission_df = pd.DataFrame({
    "Id": test['id'],
    "Class": test_predictions,
}, columns=['Id', 'Class'])

timestamp = datetime.isoformat(datetime.now()).split(".")[0]
submission_df.to_csv("submission_{}.csv".format(timestamp), index=False)

print('Created a CSV file: {}.'.format("submission_{}.csv".format(timestamp)))
print('You may now upload this CSV file to Kaggle for scoring.')

Created a CSV file: submission_2021-05-22T01:23:10.csv.

You may now upload this CSV file to Kaggle for scoring.

Question 10: Attach Your Leaderboard Screenshot

Take a screenshot of your submission to Kaggle as follows. This screenshot should contain your testing score.

You should replace images/leaderboard_example.png with your screenshot!

Note that, in order to get full credit on the accuracy part of this assignment, you must get at least 88% accuracy on the test set.
