Description
Introduction
We will go through the iterative process of specifying, fitting, and analyzing the performance of a model.
In the first portion of the assignment, we will guide you through some basic exploratory data analysis (EDA), laying out the thought process that leads to certain modeling decisions. Next, you will add a new feature to the dataset, before specifying and fitting a linear model to a few features of the housing data to predict housing prices. Finally, we will analyze the error of the model and brainstorm ways to improve the model’s performance.
After this homework, you should feel comfortable with the following:
- Simple feature engineering
- Using sklearn to build linear models
- Building a data pipeline using pandas
Next homework will continue working with this dataset to address more advanced and subtle issues with modeling.
In [3]:
```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Plot settings
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.size'] = 12
```
The Ames Housing Price Dataset
The Ames dataset consists of 2930 records taken from the Ames, Iowa, Assessor’s
Office describing houses sold in Ames from 2006 to 2010. The data set has 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers) — 82 features in total.
An explanation of each variable can be found in the included codebook.txt file. The information was used in computing assessed values for individual residential properties sold in Ames, Iowa from 2006 to 2010. Some noise has been added to the actual sale price, so prices will not match official records.
The data are split into training and test sets with 2000 and 930 observations, respectively.
In [4]:
```python
training_data = pd.read_csv("./data/ames_train.csv")
test_data = pd.read_csv("./data/ames_test.csv")
```
As a good sanity check, we should at least verify that the data shape matches the description.
In [5]:
```python
# 2000 observations and 82 features in training data
assert training_data.shape == (2000, 82)
# 930 observations and 81 features in test data
assert test_data.shape == (930, 81)
# SalePrice is hidden in the test data
assert 'SalePrice' not in test_data.columns.values
# Every other column in the test data should be in the training data
assert len(np.intersect1d(test_data.columns.values,
                          training_data.columns.values)) == 81
```
The next order of business is getting a feel for the variables in our data. The Ames dataset contains information that typical homebuyers would want to know.
A more detailed description of each variable is included in codebook.txt . You should take some time to familiarize yourself with the codebook before moving forward.
In [6]:
```python
training_data.columns.values
```

Out[6]:
```
array(['Order', 'PID', 'MS_SubClass', 'MS_Zoning', 'Lot_Frontage',
       'Lot_Area', 'Street', 'Alley', 'Lot_Shape', 'Land_Contour',
       'Utilities', 'Lot_Config', 'Land_Slope', 'Neighborhood',
       'Condition_1', 'Condition_2', 'Bldg_Type', 'House_Style',
       'Overall_Qual', 'Overall_Cond', 'Year_Built', 'Year_Remod/Add',
       'Roof_Style', 'Roof_Matl', 'Exterior_1st', 'Exterior_2nd',
       'Mas_Vnr_Type', 'Mas_Vnr_Area', 'Exter_Qual', 'Exter_Cond',
       'Foundation', 'Bsmt_Qual', 'Bsmt_Cond', 'Bsmt_Exposure',
       'BsmtFin_Type_1', 'BsmtFin_SF_1', 'BsmtFin_Type_2', 'BsmtFin_SF_2',
       'Bsmt_Unf_SF', 'Total_Bsmt_SF', 'Heating', 'Heating_QC',
       'Central_Air', 'Electrical', '1st_Flr_SF', '2nd_Flr_SF',
       'Low_Qual_Fin_SF', 'Gr_Liv_Area', 'Bsmt_Full_Bath',
       'Bsmt_Half_Bath', 'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr',
       'Kitchen_AbvGr', 'Kitchen_Qual', 'TotRms_AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace_Qu', 'Garage_Type', 'Garage_Yr_Blt',
       'Garage_Finish', 'Garage_Cars', 'Garage_Area', 'Garage_Qual',
       'Garage_Cond', 'Paved_Drive', 'Wood_Deck_SF', 'Open_Porch_SF',
       'Enclosed_Porch', '3Ssn_Porch', 'Screen_Porch', 'Pool_Area',
       'Pool_QC', 'Fence', 'Misc_Feature', 'Misc_Val', 'Mo_Sold',
       'Yr_Sold', 'Sale_Type', 'Sale_Condition', 'SalePrice'],
      dtype=object)
```
Part 1: Exploratory Data Analysis
In this section, we will make a series of exploratory visualizations and interpret them.
Note that we will perform EDA on the training data so that information from the test data does not influence our modeling decisions.
Sale Price
We begin by examining a raincloud plot (a combination of a KDE, a histogram, a strip plot, and a box plot) of our target variable SalePrice . At the same time, we also take a look at some descriptive statistics of this variable.
In [10]:
```python
fig, axs = plt.subplots(nrows=2)

sns.distplot(
    training_data['SalePrice'],
    ax=axs[0]
)
sns.stripplot(
    x=training_data['SalePrice'],
    jitter=0.4,
    size=3,
    ax=axs[1],
    alpha=0.3
)
sns.boxplot(
    x=training_data['SalePrice'],
    width=0.3,
    ax=axs[1],
    showfliers=False,
)

# Align axes
spacer = np.max(training_data['SalePrice']) * 0.05
xmin = np.min(training_data['SalePrice']) - spacer
xmax = np.max(training_data['SalePrice']) + spacer
axs[0].set_xlim((xmin, xmax))
axs[1].set_xlim((xmin, xmax))

# Remove some axis text
axs[0].xaxis.set_visible(False)
axs[0].yaxis.set_visible(False)
axs[1].yaxis.set_visible(False)

# Put the two plots together
plt.subplots_adjust(hspace=0)

# Adjust boxplot fill to be white
axs[1].artists[0].set_facecolor('white')
```
In [11]:
```python
training_data['SalePrice'].describe()
```

Out[11]:
```
count      2000.000000
mean     180775.897500
std       81581.671741
min        2489.000000
25%      128600.000000
50%      162000.000000
75%      213125.000000
max      747800.000000
Name: SalePrice, dtype: float64
```
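Notice that the mean (≈180,776) sits well above the median (162,000). As a quick illustration of why that matters for skew, the toy example below draws a synthetic right-skewed sample (lognormal, standing in for SalePrice; the values are made up, not the Ames data) and checks that its long right tail pulls the mean above the median:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices": a lognormal sample standing in for SalePrice
prices = rng.lognormal(mean=12.0, sigma=0.5, size=2000)

# A long right tail pulls the mean above the median
mean_above_median = prices.mean() > np.median(prices)
```

The same comparison on the real SalePrice column would point the same way, which is worth keeping in mind for the question below.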
Question 1
To check your understanding of the graph and summary statistics above, answer the following True or False questions:
- The distribution of SalePrice in the training set is left-skewed.
- The mean of SalePrice in the training set is greater than the median.
- At least 25% of the houses in the training set sold for more than $200,000.00.
The provided tests for this question do not confirm that you have answered correctly; only that you have assigned each variable to True or False .
In [12]:
```python
# These should be True or False
q1statement1 = False
q1statement2 = True
q1statement3 = True
```

In [13]:
```python
ok.grade("q1");
```

```
Running tests
Test summary
    Passed: 4
    Failed: 0
[ooooooooook] 100.0% passed
```
SalePrice vs Gr_Liv_Area
Next, we visualize the association between SalePrice and Gr_Liv_Area . The codebook.txt file tells us that Gr_Liv_Area measures “above grade (ground) living
area square feet.”
This variable represents the square footage of the house excluding anything underground. Some additional research (into real estate conventions) reveals that this value also excludes the garage space.
In [14]:
```python
sns.jointplot(
    x='Gr_Liv_Area',
    y='SalePrice',
    data=training_data,
    kind="reg",
    ratio=4,
    space=0,
    scatter_kws={
        's': 3,
        'alpha': 0.25
    },
    line_kws={
        'color': 'black'
    }
);
```
There’s certainly an association, and perhaps it’s linear, but the spread is wider at larger values of both variables. Also, there are two particularly suspicious houses above 5000 square feet that look too inexpensive for their size.
Question 2
What are the Parcel Identification Numbers for the two houses with Gr_Liv_Area greater than 5000 sqft?
The provided tests for this question do not confirm that you have answered correctly; only that you have assigned q2house1 and q2house2 to two integers that are in the range of PID values.
In [21]:
```python
# BEGIN YOUR CODE
# -----------------------
# Hint: You can answer this question in one line
q2house1, q2house2 = training_data.loc[training_data['Gr_Liv_Area'] > 5000, 'PID']
# -----------------------
# END YOUR CODE
q2house1, q2house2
```

Out[21]:
```
(908154235, 908154195)
```

In [22]:
```python
ok.grade("q2");
```

```
Running tests
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed
```
Question 3
The codebook tells us how to manually inspect the houses using an online database called Beacon. These two houses are true outliers in this data set: they aren't the same type of entity as the rest. They were partial sales, priced far below market value. If you would like to inspect the valuations, follow the directions at the bottom of the codebook to access Beacon and look up houses by PID.
For this assignment, we will remove these outliers from the data. Write a function remove_outliers that removes outliers from a data set based on a threshold value of a variable. For example, remove_outliers(training_data, 'Gr_Liv_Area', upper=5000) should return a data frame with only the observations whose Gr_Liv_Area is less than or equal to 5000.
The provided tests check that training_data was updated correctly, so that future analyses are not corrupted by a mistake. However, the provided tests do not check that you have implemented remove_outliers correctly so that it works with any data, variable, lower, and upper bound.
In [13]:
```python
def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """
    Input:
        data (data frame): the table to be filtered
        variable (string): the column with numerical outliers
        lower (numeric): observations with values lower than this will be removed
        upper (numeric): observations with values higher than this will be removed
    Output:
        a winsorized data frame with outliers removed

    Note: This function should not mutate the contents of data.
    """
    # BEGIN YOUR CODE
    # -----------------------
    return data.loc[(data[variable] <= upper) & (data[variable] >= lower)]
    # -----------------------
    # END YOUR CODE

training_data = remove_outliers(training_data, 'Gr_Liv_Area', upper=5000)
```

In [14]:
```python
ok.grade("q3");
```

```
Running tests
Test summary
    Passed: 5
    Failed: 0
[ooooooooook] 100.0% passed
```
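As a quick check that the filtering logic behaves as specified, here is a self-contained toy example (hypothetical square footages; the function is redefined so the snippet stands alone):

```python
import numpy as np
import pandas as pd

def remove_outliers(data, variable, lower=-np.inf, upper=np.inf):
    """Keep only rows where data[variable] lies within [lower, upper]."""
    return data.loc[(data[variable] >= lower) & (data[variable] <= upper)]

# Four toy houses; two are above the 5000 sq ft cutoff
toy = pd.DataFrame({'Gr_Liv_Area': [900, 5600, 1400, 5200]})
filtered = remove_outliers(toy, 'Gr_Liv_Area', upper=5000)
# Only the 900 and 1400 sq ft rows survive, and `toy` itself is not mutated
```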
Part 2: Feature Engineering
In this section we will create a new feature out of existing ones through a simple data transformation.
Bathrooms
Let’s create a groundbreaking new feature. Due to recent advances in Universal WC Enumeration Theory, we now know that Total Bathrooms can be calculated as:
TotalBathrooms = (BsmtFullBath + FullBath) + 0.5 ⋅ (BsmtHalfBath + HalfBath)
The actual proof is beyond the scope of this class, but we will use the result in our model.
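As a quick sanity check of the formula on a single hypothetical house (one basement full bath, two full baths, one half bath), with half baths weighted 0.5 to match the weights used in the solution below:

```python
# Hypothetical bathroom counts for one house
bsmt_full_bath, full_bath = 1, 2
bsmt_half_bath, half_bath = 0, 1

# Full baths count fully; half baths count for 0.5 each
total_bathrooms = (bsmt_full_bath + full_bath) + 0.5 * (bsmt_half_bath + half_bath)
# total_bathrooms == 3.5
```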
Question 4
Write a function add_total_bathrooms(data) that returns a copy of data with an additional column called TotalBathrooms computed by the formula above.
The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.
In [15]:
```python
def add_total_bathrooms(data):
    """
    Input:
        data (data frame): a data frame containing at least 4 numeric columns
            Bsmt_Full_Bath, Full_Bath, Bsmt_Half_Bath, and Half_Bath
    """
    with_bathrooms = data.copy()
    bath_vars = ['Bsmt_Full_Bath', 'Full_Bath', 'Bsmt_Half_Bath', 'Half_Bath']
    weights = pd.Series([1, 1, 0.5, 0.5], index=bath_vars)
    with_bathrooms = with_bathrooms.fillna({var: 0 for var in bath_vars})
    # BEGIN YOUR CODE
    # -----------------------
    with_bathrooms['TotalBathrooms'] = with_bathrooms[bath_vars].dot(weights)
    # -----------------------
    # END YOUR CODE
    return with_bathrooms

training_data = add_total_bathrooms(training_data)
```

In [16]:
```python
ok.grade("q4");
```

```
Running tests
Test summary
    Passed: 4
    Failed: 0
[ooooooooook] 100.0% passed
```
Question 5
Create a visualization that clearly and succinctly shows that TotalBathrooms is associated with SalePrice . Your visualization should avoid overplotting.
In [29]:
```python
# BEGIN YOUR CODE
# -----------------------
sns.boxplot(x='TotalBathrooms', y='SalePrice', data=training_data);
# -----------------------
# END YOUR CODE
```
Part 3: Modeling
We’ve reached the point where we can specify a model. But first, we will load a fresh copy of the data, just in case our code above produced any undesired side-effects. Run the cell below to store a fresh copy of the data from ames_train.csv in a dataframe named full_data . We will also store the number of rows in full_data in the variable full_data_len .
In [30]:
```python
# Load a fresh copy of the data and get its length
full_data = pd.read_csv("./data/ames_train.csv")
full_data_len = len(full_data)
full_data.head()
```

Out[30]:

| | Order | PID | MS_SubClass | MS_Zoning | Lot_Frontage | Lot_Area | Street | Alley | … |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 526301100 | 20 | RL | 141.0 | 31770 | Pave | NaN | … |
| 1 | 2 | 526350040 | 20 | RH | 80.0 | 11622 | Pave | NaN | … |
| 2 | 3 | 526351010 | 20 | RL | 81.0 | 14267 | Pave | NaN | … |
| 3 | 4 | 526353030 | 20 | RL | 93.0 | 11160 | Pave | NaN | … |
| 4 | 5 | 527105010 | 60 | RL | 74.0 | 13830 | Pave | NaN | … |

5 rows × 82 columns
Question 6
Now, let’s split the data set into a training set and test set. We will use the training set to fit our model’s parameters, and we will use the test set to estimate how well our model will perform on unseen data drawn from the same distribution. If we used all the data to fit our model, we would not have a way to estimate model performance on unseen data.
“Don’t we already have a test set in ames_test.csv ?” you might wonder. The sale prices for ames_test.csv aren’t provided, so we’re constructing our own test set for which we know the outputs.
In the cell below, split the data in full_data into two DataFrames named train and test . Let train contain 80% of the data, and let test contain the remaining 20%
of the data.
To do this, first create two NumPy arrays named train_indices and test_indices . train_indices should contain a random 80% of the indices in full_data , and test_indices should contain the remaining 20% of the indices. Then, use these arrays to index into full_data to create your final train and test DataFrames.
The provided tests check that you not only answered correctly, but ended up with the exact same train/test split as our reference implementation. Later testing is easier this way.
In [35]:
```python
# This makes the train-test split in this section reproducible across different runs
# of the notebook. You do not need this line to run train_test_split in general.
np.random.seed(1337)
shuffled_indices = np.random.permutation(full_data_len)

# Set train_indices to the first 80% of shuffled_indices and test_indices to the rest
# BEGIN YOUR CODE
# -----------------------
train_indices = shuffled_indices[:int(len(shuffled_indices) * 0.8)]
test_indices = shuffled_indices[int(len(shuffled_indices) * 0.8):]
# -----------------------
# END YOUR CODE

# Create `train` and `test` by indexing into `full_data` using
# `train_indices` and `test_indices`
# BEGIN YOUR CODE
# -----------------------
train = full_data.loc[train_indices]
test = full_data.loc[test_indices]
# -----------------------
# END YOUR CODE
```

In [36]:
```python
ok.grade("q6");
```

```
Running tests
Test summary
    Passed: 6
    Failed: 0
[ooooooooook] 100.0% passed
```
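The comment in the split cell mentions train_test_split; for reference, scikit-learn's helper performs the same 80/20 shuffle-and-split in one call. A sketch on a toy frame (hypothetical data; because it uses a different RNG scheme, it will not reproduce the exact permutation-based split graded above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for full_data
df = pd.DataFrame({'Gr_Liv_Area': range(100), 'SalePrice': range(100)})

# 80% train, 20% test; random_state makes the split reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1337)
```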
Reusable Pipeline
Throughout this assignment, you should notice that your data flows through a single processing pipeline several times. From a software engineering perspective, it’s best to define functions/methods that can apply the pipeline to any dataset. We will now encapsulate our entire pipeline into a single function process_data_gm . gm is shorthand for “guided model”. We select a handful of features to use from the many that are available.
In [37]:
```python
def select_columns(data, *columns):
    """Select only columns passed as arguments."""
    return data.loc[:, columns]


def process_data_gm(data):
    """Process the data for a guided model."""
    data = remove_outliers(data, 'Gr_Liv_Area', upper=5000)

    # Transform data, select features
    data = add_total_bathrooms(data)
    data = select_columns(data,
                          'SalePrice',
                          'Gr_Liv_Area',
                          'Garage_Area',
                          'TotalBathrooms',
                          )

    # Return predictors and response variables separately
    X = data.drop(['SalePrice'], axis=1)
    y = data.loc[:, 'SalePrice']

    return X, y
```
Now, we can use process_data_gm to clean our data, select features, and add our
TotalBathrooms feature all in one step! This function also splits our data into X , a matrix of features, and y , a vector of sale prices.
Run the cell below to feed our training and test data through the pipeline, generating
X_train , y_train , X_test , and y_test .
In [38]:
```python
# Pre-process our training and test data in exactly the same way
# Our functions make this very easy!
X_train, y_train = process_data_gm(train)
X_test, y_test = process_data_gm(test)
```
Fitting Our First Model
We are finally going to fit a model! The model we will fit can be written as follows:
SalePrice = θ₀ + θ₁ ⋅ Gr_Liv_Area + θ₂ ⋅ Garage_Area + θ₃ ⋅ TotalBathrooms
In vector notation, the same equation would be written:
y = θ ⋅ x
where y is the SalePrice, θ is a vector of all fitted weights, and x contains a 1 for the bias followed by each of the feature values.
Note: Notice that all of our variables are continuous, except for TotalBathrooms , which takes on discrete ordered values (0, 0.5, 1, 1.5, …). In this homework, we’ll treat
TotalBathrooms as a continuous quantitative variable in our model, but this might not be the best choice. The next homework may revisit the issue.
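To make the vector form concrete, here is a small NumPy sketch of how prepending a bias column turns y = θ ⋅ x into an ordinary least-squares problem (the feature values and weights are made up for illustration, not fitted Ames coefficients; `np.linalg.lstsq` is essentially what LinearRegression solves internally):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Hypothetical feature matrix: Gr_Liv_Area, Garage_Area, TotalBathrooms
X = np.column_stack([
    rng.uniform(800, 3000, n),
    rng.uniform(0, 900, n),
    rng.integers(2, 8, n) / 2,
])

# Made-up "true" weights: bias theta_0, then one weight per feature
theta_true = np.array([20000.0, 80.0, 50.0, 15000.0])

# Prepend a column of ones so theta_0 plays the role of the intercept
X_bias = np.hstack([np.ones((n, 1)), X])
y = X_bias @ theta_true  # noiseless targets generated from theta_true

# Least squares recovers theta_true exactly here because y has no noise
theta_hat, *_ = np.linalg.lstsq(X_bias, y, rcond=None)
```

With real, noisy data the recovered weights would only approximate any underlying relationship, which is exactly why we measure error on a held-out test set.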
Question 7a
We will use a sklearn.linear_model.LinearRegression object as our linear model. In the cell below, create a LinearRegression object and name it
linear_model .
Hint: See the fit_intercept parameter and make sure it is set appropriately. The intercept of our model corresponds to θ₀ in the equation above.
The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.
In [40]:
```python
from sklearn import linear_model as lm

# BEGIN YOUR CODE
# -----------------------
linear_model = lm.LinearRegression(fit_intercept=True)
# -----------------------
# END YOUR CODE
```

In [41]:
```python
ok.grade("q7a");
```

```
Running tests
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed
```
Question 7b
Now, remove the commenting and fill in the ellipses … below with X_train , y_train , X_test , or y_test .
With the ellipses filled in correctly, the code below should fit our linear model to the training data and generate the predicted sale prices for both the training and test datasets.
The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.
In [44]:
```python
# Uncomment the lines below and fill in the ... with X_train, y_train, X_test, or y_test
# BEGIN YOUR CODE
# -----------------------
linear_model.fit(X_train, y_train)
y_fitted = linear_model.predict(X_train)
y_predicted = linear_model.predict(X_test)
# -----------------------
# END YOUR CODE
```

In [45]:
```python
ok.grade("q7b");
```

```
Running tests
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed
```
Question 8a
Is our linear model any good at predicting house prices? Let’s measure the quality of our model by calculating the Root-Mean-Square Error (RMSE) between our predicted house prices and the true prices stored in SalePrice .
RMSE = √( (1/n) ∑ᵢ (actualᵢ − predictedᵢ)² )
In the cell below, write a function named rmse that calculates the RMSE of a model.
Hint: Make sure you are taking advantage of vectorized code. This question can be answered without any for statements.
The provided tests check that you answered correctly, so that future analyses are not corrupted by a mistake.
In [51]:
```python
from sklearn.metrics import mean_squared_error as mse

def rmse(actual, predicted):
    """
    Calculates RMSE from actual and predicted values
    Input:
        actual (1D array): vector of actual values
        predicted (1D array): vector of predicted/fitted values
    Output:
        a float, the root-mean-square error
    """
    # BEGIN YOUR CODE
    # -----------------------
    return mse(actual, predicted, squared=False)
    # -----------------------
    # END YOUR CODE
```

In [52]:
```python
ok.grade("q8a");
```

```
Running tests
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed
```
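For reference, the same computation written directly in NumPy (a sketch equivalent to the sklearn call above) makes the vectorization in the hint explicit — no loops, just elementwise operations:

```python
import numpy as np

def rmse_np(actual, predicted):
    """Root-mean-square error computed with vectorized NumPy operations."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Elementwise difference -> square -> mean -> square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```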
Question 8b
Now use your rmse function to calculate the training error and test error in the cell below.
The provided tests for this question do not confirm that you have answered correctly; only that you have assigned each variable to a non-negative number.
In [53]:
```python
# BEGIN YOUR CODE
# -----------------------
training_error = rmse(y_train, y_fitted)
test_error = rmse(y_test, y_predicted)
# -----------------------
# END YOUR CODE
(training_error, test_error)
```

Out[53]:
```
(46710.597505875856, 46146.64265682625)
```

In [54]:
```python
ok.grade("q8b");
```

```
Running tests
Test summary
    Passed: 4
    Failed: 0
[ooooooooook] 100.0% passed
```
Question 8c
How much does including TotalBathrooms as a predictor reduce the RMSE of the model on the test set? That is, what's the difference between the RMSE of a model that only includes Gr_Liv_Area and Garage_Area versus one that includes all three predictors?
The provided tests for this question do not confirm that you have answered correctly; only that you have assigned the answer variable to a non-negative number.
In [59]:
```python
# BEGIN YOUR CODE
# -----------------------
# Fit a second model that leaves out TotalBathrooms, then compute its test RMSE
model_no_bath = lm.LinearRegression(fit_intercept=True)
model_no_bath.fit(X_train[['Gr_Liv_Area', 'Garage_Area']], y_train)
y_predicted_no_bath = model_no_bath.predict(X_test[['Gr_Liv_Area', 'Garage_Area']])
test_error_no_bath = rmse(y_test, y_predicted_no_bath)
# -----------------------
# END YOUR CODE
test_error_difference = test_error_no_bath - test_error
test_error_difference
```

Out[59]:
```
2477.008463647042
```

In [60]:
```python
ok.grade("q8c");
```

```
Running tests
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed
```
Residual Plots
One way of understanding the performance (and appropriateness) of a model is through a residual plot. Run the cell below to plot the actual sale prices against the residuals of the model for the test data.
In [55]:
```python
residuals = y_test - y_predicted
ax = sns.regplot(x=y_test, y=residuals)
ax.set_xlabel('Sale Price (Test Data)')
ax.set_ylabel('Residuals (Actual Price - Predicted Price)')
ax.set_title("Residuals vs. Sale Price on Test Data");
```
Ideally, we would see a horizontal line of points at 0 (perfect prediction!). The next best thing would be a homogeneous band of points centered at 0.

But alas, our simple model is probably too simple: the most expensive homes are systematically more expensive than our predictions.
Question 8d
What changes could you make to your linear model to improve its accuracy and lower the test error? Suggest at least two things you could try in the cell below, and carefully explain how each change could potentially improve your model’s accuracy.
Answer:
- Add more informative features to the model, in particular ones that capture what distinguishes expensive houses (for example, overall quality), so the systematic underprediction of the priciest homes shrinks.
- Add the location of the houses as an additional feature (for example, Neighborhood), since the prices of houses in the same neighborhood tend to be similar.
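The second suggestion can be sketched with one-hot encoding (toy data here; in practice this step would be added inside process_data_gm before splitting off X and y):

```python
import pandas as pd

# Toy frame with a categorical location column (hypothetical values)
df = pd.DataFrame({
    'Neighborhood': ['NAmes', 'OldTown', 'NAmes'],
    'Gr_Liv_Area': [1500, 1200, 1800],
})

# One indicator column per neighborhood; a linear model can then learn
# a separate price offset for each location
encoded = pd.get_dummies(df, columns=['Neighborhood'], prefix='Nbhd')
```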