Description
Download Rdata files from Piazza or Coursera
The assignment is related to the Boston Housing data. The original data is from the R library “mlbench”, which has 506 observations on 19 variables.
| Variable | Description |
| --- | --- |
| crim | per capita crime rate by town |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft |
| indus | proportion of non-retail business acres per town |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
| nox | nitric oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted distances to five Boston employment centres |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per USD 10,000 |
| ptratio | pupil-teacher ratio by town |
| b | 1000(B − 0.63)^2 where B is the proportion of blacks by town |
| lstat | percentage of lower status of the population |
| medv | median value of owner-occupied homes in USD 1000's |
| cmedv | corrected median value of owner-occupied homes in USD 1000's |
| town | name of town |
| tract | census tract |
| lon | longitude of census tract |
| lat | latitude of census tract |
First, we apply some suggested transformations on the data, then remove three variables medv, town, and tract, and use cmedv as the response variable Y.
Consider the following 10 procedures:
- Full: run a linear regression model using all features,
- AIC.F and AIC.B: Forward/backward selection with AIC,
- BIC.F and BIC.B: Forward/backward selection with BIC,
- R.min and R.1se: Ridge regression using lambda.min or lambda.1se,
- L.min and L.1se: Lasso using lambda.min or lambda.1se,
- L.Refit: Refit the model selected by Lasso using lambda.1se.
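The ten procedures above can be sketched on a single training set as follows. This is a minimal illustration under stated assumptions, not the required solution: it assumes the `glmnet` package is installed, uses a small synthetic data frame named `train` in place of the Boston data, and the object names (`aic_f`, `cv_lasso`, `refit`, and so on) are made up for the example.

```r
library(glmnet)  # assumed to be installed; provides cv.glmnet()

set.seed(1)
# Synthetic stand-in for a training set (hypothetical; not the Boston data)
n <- 120; p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
train <- data.frame(Y = drop(X[, 1:2] %*% c(2, -1)) + rnorm(n), X)
x <- as.matrix(train[, -1]); y <- train$Y

# Full: linear regression on all features
full  <- lm(Y ~ ., data = train)
null  <- lm(Y ~ 1, data = train)
scope <- list(lower = ~ 1, upper = formula(full))

# AIC.F / AIC.B: forward and backward search with penalty k = 2
aic_f <- step(null, scope = scope, direction = "forward", k = 2, trace = 0)
aic_b <- step(full, direction = "backward", k = 2, trace = 0)

# BIC.F / BIC.B: same search with penalty k = log(n)
bic_f <- step(null, scope = scope, direction = "forward", k = log(n), trace = 0)
bic_b <- step(full, direction = "backward", k = log(n), trace = 0)

# R.min / R.1se (alpha = 0) and L.min / L.1se (alpha = 1):
# one cross-validated fit each, evaluated at two values of lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
pred_r_min <- predict(cv_ridge, newx = x, s = "lambda.min")
pred_l_1se <- predict(cv_lasso, newx = x, s = "lambda.1se")

# L.Refit: ordinary least squares on the variables Lasso keeps at lambda.1se
b     <- as.matrix(coef(cv_lasso, s = "lambda.1se"))[-1, 1]  # drop intercept
sel   <- names(b)[b != 0]
refit <- lm(reformulate(if (length(sel)) sel else "1", response = "Y"),
            data = train)
```

Here the predictions are made on the training matrix only to show the call; in the simulation they would be made on the held-out test set.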
- Load Rdata, which has 16 variables including the response variable Y. The data has been pre-processed, so there is no need to apply any transformation.
- Repeat the following simulation 50 times. In each iteration, randomly split the data into two parts, 75% for training and 25% for testing. For each procedure, fit the model on the training data, obtain predictions on the test data, and record three quantities: the mean squared prediction error (MSPE) on the test set, the selected model size (or the effective dimension, for Ridge), and the computation time.
Exclude the intercept when computing the model size or effective dimension.
- Summarize your results on MSPE and model size graphically, e.g., using boxplot or stripchart.
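The simulation and the graphical summary can be organized along the following lines. This is a hedged sketch, assuming the `glmnet` package is installed: it runs on a synthetic data frame `dat` rather than the assignment data, uses 5 iterations instead of 50 to stay quick, and covers only three of the procedures (Full, L.min, L.1se). The result matrices `mspe`, `size`, and `secs` are hypothetical names, and the effective dimension for Ridge is not computed here.

```r
library(glmnet)  # assumed to be installed

set.seed(2)
# Synthetic stand-in data (hypothetical); replace with the loaded data frame
n <- 200; p <- 8
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
dat <- data.frame(Y = drop(X[, 1:3] %*% c(2, -1, 0.5)) + rnorm(n), X)

n_iter  <- 5  # the assignment asks for 50; kept small for illustration
methods <- c("Full", "L.min", "L.1se")
blank   <- matrix(NA_real_, n_iter, length(methods),
                  dimnames = list(NULL, methods))
mspe <- size <- secs <- blank
lambda_of <- c(L.min = "lambda.min", L.1se = "lambda.1se")

for (i in seq_len(n_iter)) {
  # 75% / 25% random split
  test_id <- sample(n, round(0.25 * n))
  train <- dat[-test_id, ]; test <- dat[test_id, ]
  x_tr <- as.matrix(train[, -1]); x_te <- as.matrix(test[, -1])

  # Full model: time the fit and the prediction together
  secs[i, "Full"] <- system.time({
    fit  <- lm(Y ~ ., data = train)
    pred <- predict(fit, newdata = test)
  })["elapsed"]
  mspe[i, "Full"] <- mean((test$Y - pred)^2)
  size[i, "Full"] <- length(coef(fit)) - 1  # exclude intercept

  # Lasso: one cross-validated fit, evaluated at both lambdas
  t_cv <- system.time(cv <- cv.glmnet(x_tr, train$Y, alpha = 1))["elapsed"]
  for (m in names(lambda_of)) {
    s <- lambda_of[[m]]
    t_pred <- system.time(pred <- predict(cv, newx = x_te, s = s))["elapsed"]
    mspe[i, m] <- mean((test$Y - pred)^2)
    size[i, m] <- sum(as.matrix(coef(cv, s = s))[-1, 1] != 0)  # no intercept
    secs[i, m] <- t_cv + t_pred
  }
}

# Graphical summary: one box (or strip) per procedure
op <- par(mfrow = c(1, 2))
boxplot(mspe, main = "MSPE", ylab = "Test MSPE")
stripchart(as.data.frame(size), method = "jitter", vertical = TRUE,
           pch = 1, main = "Model size")
par(op)
```

Charging the shared cross-validation time to both L.min and L.1se is one possible convention; whichever is chosen, it should be applied the same way in every iteration.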
- Load Rdata, which has 135 variables including the response variable Y. In addition to the original 15 predictors, the data contains their quadratic and all pairwise interaction terms.
Repeat steps (a-b) above, i.e. the simulation and the graphical summary, for only five methods: R.min, R.1se, L.min, L.1se, and L.Refit.
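As a sanity check on the variable count, the sketch below builds such an expansion on synthetic data. It rests on an assumption the text does not state: that the square of a binary predictor (such as a 0/1 dummy) is omitted because it equals the predictor itself. Under that assumption, 15 main effects, 105 pairwise interactions, and 14 quadratic terms plus Y give 135 variables. The column names and the `_sq` suffix are hypothetical.

```r
set.seed(4)
# Synthetic stand-in with 15 predictors, one of them binary (hypothetical)
n <- 20; p <- 15
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
X$x4 <- rbinom(n, 1, 0.5)  # 0/1 dummy stand-in

# Main effects plus all pairwise interactions (intercept column dropped)
inter <- model.matrix(~ .^2, data = X)[, -1]

# Quadratic terms, skipping binary columns (assumption: v^2 == v for 0/1)
is_bin <- sapply(X, function(v) length(unique(v)) <= 2)
quad <- X[, !is_bin, drop = FALSE]^2
names(quad) <- paste0(names(quad), "_sq")

expanded <- cbind(inter, as.matrix(quad))
n_vars <- ncol(expanded) + 1  # predictors plus the response Y
n_vars  # 135
```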
- Load Rdata, which has 635 variables including the response variable Y. In addition to the variables in BostonHousing2.Rdata, the data contains 500 noise features.
Repeat steps (a-b) above, i.e. the simulation and the graphical summary, for only five methods: R.min, R.1se, L.min, L.1se, and L.Refit.




