Description
In this project, you will use what you’ve learned in class to create a regression model that predicts the travel time of a taxi ride in New York. Some questions in this project are more substantial than those of past projects.
After this project, you should feel comfortable with the following:
The data science lifecycle: data selection and cleaning, EDA, feature engineering, and model selection.
Using sklearn to process data and fit linear regression models.
Embedding linear regression as a component in a more complex model.
First, let’s import:
The Data
Run the following cell to load the cleaned Manhattan data.
In [3]: manhattan_taxi = pd.read_csv(‘manhattan_taxi.csv’)
Attributes of all yellow taxi (https://en.wikipedia.org/wiki/Taxicabs_of_New_York_City) trips in January 2016 are published by the NYC Taxi and Limosine Commission (https://www1.nyc.gov/site/tlc/about/tlctrip-record-data.page).
Columns of the manhattan_taxi table include:
pickup_datetime: date and time when the meter was engaged dropoff_datetime: date and time when the meter was disengaged pickup_lon: the longitude where the meter was engaged pickup_lat: the latitude where the meter was engaged dropoff_lon: the longitude where the meter was disengaged dropoff_lat: the latitude where the meter was disengaged
passengers: the number of passengers in the vehicle (driver entered value) distance: trip distance duration: duration of the trip in seconds
Your goal will be to predict duration from the pick-up time, pick-up and drop-off locations, and distance.
Out[4]:
| pickup_datetime | dropoff_datetime | pickup_lon | pickup_lat | dropoff_lon | dropoff_ | |
| 0 | 2016-01-30
22:47:32 |
2016-01-30
23:03:53 |
-73.988251 | 40.743542 | -74.015251 | 40.70980 |
| 1 | 2016-01-04
04:30:48 |
2016-01-04
04:36:08 |
-73.995888 | 40.760010 | -73.975388 | 40.78220 |
| 2 | 2016-01-07
21:52:24 |
2016-01-07
21:57:23 |
-73.990440 | 40.730469 | -73.985542 | 40.73851 |
| 3 | 2016-01-08
18:46:10 |
2016-01-08
18:54:00 |
-74.004494 | 40.706989 | -74.010155 | 40.71675 |
| 4 | 2016-01-02
12:39:57 |
2016-01-02
12:53:29 |
-73.958214 | 40.760525 | -73.983360 | 40.76040 |
A scatter diagram of only Manhattan taxi rides has the familiar shape of Manhattan Island.
Part 1: Exploratory Data Analysis
In this part, you’ll choose which days to include as training data in your regression model.
Your goal is to develop a general model that could potentially be used for future taxi rides. There is no guarantee that future distributions will resemble observed distributions, but some effort to limit training data to typical examples can help ensure that the training data are representative of future observations.
Note that January 2016 had some atypical days.
New Years Day (January 1) fell on a Friday.
Martin Luther King Jr. Day was on Monday, January 18.
A historic blizzard (https://en.wikipedia.org/wiki/January_2016_United_States_blizzard) passed through New York that month.
Using this dataset to train a general regression model for taxi trip times must account for these unusual phenomena, and one way to account for them is to remove atypical days from the training data.
Question 1a
Add a column labeled date to manhattan_taxi that contains the date (but not the time) of pickup, formatted as a datetime.date value (docs (https://docs.python.org/3/library/datetime.html#dateobjects)).
The provided tests check that you have extended manhattan_taxi correctly.
Out[6]:
| pickup_datetime | dropoff_datetime | pickup_lon | pickup_lat | dropoff_lon | dropoff_ | |
| 0 | 2016-01-30
22:47:32 |
2016-01-30
23:03:53 |
-73.988251 | 40.743542 | -74.015251 | 40.70980 |
| 1 | 2016-01-04
04:30:48 |
2016-01-04
04:36:08 |
-73.995888 | 40.760010 | -73.975388 | 40.78220 |
| 2 | 2016-01-07
21:52:24 |
2016-01-07
21:57:23 |
-73.990440 | 40.730469 | -73.985542 | 40.73851 |
| 3 | 2016-01-08
18:46:10 |
2016-01-08
18:54:00 |
-74.004494 | 40.706989 | -74.010155 | 40.71675 |
| 4 | 2016-01-02
12:39:57 |
2016-01-02
12:53:29 |
-73.958214 | 40.760525 | -73.983360 | 40.76040 |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Question 1b
Create a data visualization that allows you to identify which dates were affected by the historic blizzard of January 2016. Make sure that the visualization type is appropriate for the visualized data.
Finally, we have generated a list of dates that should have a fairly typical distribution of taxi rides, which
excludes holidays and blizzards. The cell below assigns final_taxi to the subset of manhattan_taxi that is on these days. (No changes are needed; just run this cell.)
January 2016
Mo Tu We Th Fr Sa Su
4 5 6 7 8 9 10
11 12 13 14 15 16 17
19 20 21 22
27 28 29 30 31
Part 2: Feature Engineering
In this part, you’ll create a design matrix (i.e., feature matrix) for your linear regression model. You decide to predict trip duration from the following inputs: start location, end location, trip distance, time of day, and day of the week (Monday, Tuesday, etc.).
You will ensure that the process of transforming observations into a design matrix is expressed as a Python function called design_matrix, so that it’s easy to make predictions for different samples in later parts of the project.
Because you are going to look at the data in detail in order to define features, it’s best to split the data into training and test sets now, then only inspect the training set.
In [10]: import sklearn.model_selection
train, test = sklearn.model_selection.train_test_split( final_taxi, train_size=0.8, test_size=0.2, random_state=42) print(‘Train:’, train.shape, ‘Test:’, test.shape)
Train: (53680, 10) Test: (13421, 10)
Question 2a
Use sns.boxplot to create a box plot that compares the distributions of taxi trip durations for each day using train only. Individual dates shoud appear on the horizontal axis, and duration values should appear on the vertical axis. Your plot should look like this:
Question 2b
In one or two sentences, describe the assocation between the day of the week and the duration of a taxi trip.
Note: The end of Part 2 showed a calendar for these dates and their corresponding days of the week.
Answer: your answer here…
Below, the provided augment function adds various columns to a taxi ride dataframe.
hour: The integer hour of the pickup time. E.g., a 3:45pm taxi ride would have 15 as the hour. A 12:20am ride would have 0.
day: The day of the week with Monday=0, Sunday=6. weekend: 1 if and only if the day is Saturday or Sunday. period: 1 for early morning (12am-6am), 2 for daytime (6am-6pm), and 3 for night (6pm-12pm). speed: Average speed in miles per hour.
No changes are required; just run this cell.
In [12]: def speed(t):
“””Return a column of speeds in miles per hour.””” return t[‘distance’] / t[‘duration’] * 60 * 60
def augment(t):
“””Augment a dataframe t with additional columns.””” u = t.copy()
pickup_time = pd.to_datetime(t[‘pickup_datetime’])
u.loc[:, ‘hour’] = pickup_time.dt.hour
u.loc[:, ‘day’] = pickup_time.dt.weekday
u.loc[:, ‘weekend’] = (pickup_time.dt.weekday >= 5).astype(int)
u.loc[:, ‘period’] = np.digitize(pickup_time.dt.hour, [0, 6, 18
])
u.loc[:, ‘speed’] = speed(t) return u
train = augment(train) test = augment(test) train.iloc[0,:] # An example row
Out[12]: pickup_datetime 2016-01-21 18:02:20 dropoff_datetime 2016-01-21 18:27:54 pickup_lon -73.9942 pickup_lat 40.751 dropoff_lon -73.9637 dropoff_lat 40.7711 passengers 1 distance 2.77 duration 1534 date 2016-01-21 hour 18 day 3 weekend 0 period 3 speed 6.50065
Name: 14043, dtype: object
Question 2c
Use sns.distplot to create an overlaid histogram comparing the distribution of average speeds for taxi rides that start in the early morning (12am-6am), day (6am-6pm; 12 hours), and night (6pm-12am; 6 hours). Your plot should look like this:
It looks like the time of day is associated with the average speed of a taxi ride.
Question 2d (PCA)
Manhattan can roughly be divided into Lower, Midtown, and Upper regions. Instead of studying a map, let’s approximate by finding the first principal component of the pick-up location (latitude and longitude).
Add a region column to train that categorizes each pick-up location as 0, 1, or 2 based on the value of each point’s first principal component, such that an equal number of points fall into each region.
Read the documentation of pd.qcut (https://pandas.pydata.org/pandasdocs/version/0.23.4/generated/pandas.qcut.html), which categorizes points in a distribution into equal-frequency bins.
You don’t need to add any lines to this solution. Just fill in the assignment statements to complete the implementation.
The provided tests ensure that you have answered the question correctly.
In [14]: # Find the first principle component
D = train[[‘pickup_lon’, ‘pickup_lat’]].values pca_n = D.shape[0]
pca_means = np.mean(D, axis=0)
X = (D – pca_means) / np.sqrt(pca_n) u, s, vt = np.linalg.svd(X, full_matrices=False)
def add_region(t):
“””Add a region column to t based on vt above.”””
# BEGIN YOUR CODE
# ———————–
D = t[[‘pickup_lon’, ‘pickup_lat’]].values
assert D.shape[0] == t.shape[0], ‘You set D using the incorrect table’
# Always use the same data transformation used to compute vt
X = …
first_pc = …
# ———————–
# END YOUR CODE
t.loc[:,’region’] = pd.qcut(first_pc, 3, labels=[0, 1, 2])
add_region(train) add_region(test)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 7
Failed: 0
[ooooooooook] 100.0% passed
Let’s see how PCA divided the trips into three groups. These regions do roughly correspond to
Lower Manhattan (below 14th street)
Midtown Manhattan (between 14th and the park) Upper Manhattan (bordering Central Park).
No prior knowledge of New York geography was required!
Finally, we create a design matrix that includes many of these features.
Quantitative features are converted to standard units
Categorical features are converted to dummy variables using one-hot encoding.
Note that,
The period is not included because it is a linear combination of the hour.
The weekend variable is not included because it is a linear combination of the day. The speed is not included because it was computed from the duration (it’s impossible to know the speed without knowing the duration, given that you know the distance).
In [17]: from sklearn.preprocessing import StandardScaler
num_vars = [‘pickup_lon’, ‘pickup_lat’, ‘dropoff_lon’, ‘dropoff_lat
‘, ‘distance’]
cat_vars = [‘hour’, ‘day’, ‘region’]
scaler = StandardScaler() scaler.fit(train[num_vars])
def design_matrix(t):
“””Create a design matrix from taxi ride dataframe t.””” scaled = t[num_vars].copy()
scaled.iloc[:,:] = scaler.transform(scaled) # Convert to standa rd units
categoricals = [pd.get_dummies(t[s], prefix=s, drop_first=True) for s in cat_vars]
return pd.concat([scaled] + categoricals, axis=1) design_matrix(train).iloc[0,:]
Out[17]: pickup_lon -0.805821 pickup_lat -0.171761 dropoff_lon 0.954062 dropoff_lat 0.624203 distance 0.626326 hour_1 0.000000 hour_2 0.000000 hour_3 0.000000 hour_4 0.000000 hour_5 0.000000 hour_6 0.000000 hour_7 0.000000 hour_8 0.000000 hour_9 0.000000 hour_10 0.000000 hour_11 0.000000 hour_12 0.000000 hour_13 0.000000 hour_14 0.000000 hour_15 0.000000 hour_16 0.000000 hour_17 0.000000 hour_18 1.000000 hour_19 0.000000 hour_20 0.000000 hour_21 0.000000 hour_22 0.000000 hour_23 0.000000 day_1 0.000000 day_2 0.000000 day_3 1.000000 day_4 0.000000 day_5 0.000000 day_6 0.000000 region_1 1.000000 region_2 0.000000 Name: 14043, dtype: float64
Part 3: Model Selection
In this part, you will select a regression model to predict the duration of a taxi ride.
Important: Tests in this part do not confirm that you have answered correctly. Instead, they check that you’re somewhat close in order to detect major errors. It is up to you to calculate the results correctly based on the question descriptions.
Question 3a
Assign constant_rmse to the root mean squared error on the test set for a constant model that always predicts the mean duration of all training set taxi rides.
Out[18]: 399.14375723526661
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Question 3b
Assign simple_rmse to the root mean squared error on the test set for a simple linear regression model that uses only the distance of the taxi ride as a feature (and includes an intercept).
Terminology Note: Simple linear regression means that there is only one covariate. Multiple linear regression means that there is more than one. In either case, you can use the LinearRegression model from sklearn to fit the parameters to data.
Out[20]: 276.78411050003422
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Question 3c
Assign linear_rmse to the root mean squared error on the test set for a linear regression model fitted to the training set without regularization, using the design matrix defined by the design_matrix function from Part 3.
The provided tests check that you have answered the question correctly and that your design_matrix function is working as intended.
Out[22]: 255.19146631882754
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 3
Failed: 0
[ooooooooook] 100.0% passed
Question 3d
For each possible value of period, fit an unregularized linear regression model to the subset of the training set in that period. Assign period_rmse to the root mean squared error on the test set for a model that first chooses linear regression parameters based on the observed period of the taxi ride, then predicts the duration using those parameters. Again, fit to the training set and use the design_matrix function for features.
Out[24]: 246.62868831165173
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
This approach is a simple form of decision tree regression, where a different regression function is estimated for each possible choice among a collection of choices. In this case, the depth of the tree is only 1.
Question 3e
In one or two sentences, explain how the period regression model could possibly outperform linear regression when the design matrix for linear regression already includes one feature for each possible hour, which can be combined linearly to determine the period value.
Answer: your answer here…
Question 3f
Instead of predicting duration directly, an alternative is to predict the average speed of the taxi ride using linear regression, then compute an estimate of the duration from the predicted speed and observed distance for each ride.
Assign speed_rmse to the root mean squared error in the duration predicted by a model that first predicts speed as a linear combination of features from the design_matrix function, fitted on the training set, then predicts duration from the predicted speed and observed distance.
Hint: Speed is in miles per hour, but duration is measured in seconds. You’ll need the fact that there are 60 * 60 = 3,600 seconds in an hour.
Out[26]: 243.01798368514949
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~
Running tests
——————————————————————–
Test summary
Passed: 2
Failed: 0
[ooooooooook] 100.0% passed
Here’s a summary of your results:
Congratulations!
You’ve carried out the entire data science lifecycle for a challenging regression problem.
In Part 1 on EDA, you used the data to assess the impact of a historical event—the 2016 blizzard—and filtered the data accordingly.
In Part 2 on feature engineering, you used PCA to divide up the map of Manhattan into regions that roughly corresponded to the standard geographic description of the island. In Part 3 on model selection, you found that using linear regression in practice can involve more than just choosing a design matrix. Tree regression made better use of categorical variables than linear regression. The domain knowledge that duration is a simple function of distance and speed allowed you to predict duration more accurately by first predicting speed.
Hopefully, it is apparent that all of these steps are required to reach a reliable conclusion about what inputs and model structure are helpful in predicting the duration of a taxi ride in Manhattan.






