[SOLVED] MachineLearning - Programming Exercise 2 - Logistic Regression

30.00 $

Category:

Description

5/5 - (1 vote)

1         Problem 1: Univariate Linear Regression

1.1      1) import the libraries you will need:

/usr/local/lib/python3.7/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead. import pandas.util.testing as tm

1.2      2) Import the date poverty.csv dataset

<IPython.core.display.HTML object>

Saving real_estate.csv to real_estate (4).csv

1.3       3) Print the dataset indexed upon the location column.

[ ]:                                                      PovPct Brth15to17 Brth18to19 ViolCrime TeenBrth

Location

Alabama                                 20.1                31.5                 88.7               11.2            54.5

Alaska                                      7.1                18.9                 73.7                 9.1            39.5

Arizona                                  16.1                35.0               102.5               10.4            61.2

Arkansas                                14.9                31.6               101.7               10.4            59.9

California                              16.7                22.6                 69.1               11.2            41.1

Colorado                                  8.8                26.2                 79.1                 5.8            47.0

Connecticut                             9.7                14.1                 45.1                 4.6            25.8

Delaware                                10.3                24.7                 77.8                 3.5            46.3

District_of_Columbia           22.0                44.8               101.5               65.0            69.1

Florida                                   16.2                23.2                 78.4                 7.3            44.5

Georgia                                  12.1                31.4                 92.8                 9.5            55.7

Hawaii                                    10.3                17.7                 66.4                 4.7            38.2

Idaho                                      14.5                18.4                 69.1                 4.1            39.1

Illinois                                  12.4                23.4                 70.5               10.3            42.2

Indiana                                    9.6                22.6                 78.5                 8.0            44.6

Iowa                                       12.2                16.4                 55.4                 1.8            32.5

Kansas                                    10.8                21.4                 74.2                 6.2            43.0

Kentucky                                14.7                26.5                 84.8                 7.2            51.0

Louisiana                               19.7                31.7                 96.1               17.0            58.1

Maine                                     11.2                11.9                 45.2                 2.0            25.4

Maryland                                10.1                20.0                 59.6               11.8            35.4

Massachusetts                       11.0                12.5                 39.6                 3.6            23.3

Michigan                                12.2                18.0                 60.8                 8.5            34.8

Minnesota                                9.2                14.2                 47.3                 3.9            27.5

Mississippi                            23.5                37.6               103.3               12.9            64.7

Missouri                                  9.4                22.2                 76.6                 8.8            44.1

Montana                                 15.3                17.8                 63.3                 3.0            36.4

Nebraska                                  9.6                18.3                 64.2                 2.9            37.0

Nevada                                   11.1                28.0                 96.7               10.7            53.9

New_Hampshire                        5.3                  8.1                 39.0                 1.8            20.0

New_Jersey                              7.8                14.7                 46.1                 5.1            26.8

New_Mexico                           25.3                37.8                 99.5                 8.8            62.4

New_York                               16.5                15.7                 50.1                 8.5            29.5

North_Carolina                      12.6                28.6                 89.3                 9.4            52.2

North_Dakota                         12.0                11.7                 48.7                 0.9            27.2

Ohio                                       11.5                20.1                 69.4                 5.4            39.5

Oklahoma                               17.1                30.1                 97.6               12.2            58.0

Oregon                                   11.2                18.2                 64.8                 4.1            36.8

Pennsylvania                         12.2                17.2                 53.7                 6.3            31.6

Rhode_Island                         10.6                19.6                 59.0                 3.3            35.6

South_Carolina                      19.9                29.2                 87.2                 7.9            53.0

South_Dakota                         14.5                17.3                 67.8                 1.8            38.0

Tennessee                              15.5                28.2                 94.2               10.6            54.3

Texas                                     17.4                38.2               104.3                 9.0            64.4

Utah                                         8.4                17.8                 62.4                 3.9            36.8

Vermont                                 10.3                10.4                 44.4                 2.2            24.2

Virginia                                 10.2                19.0                 66.0                 7.6            37.6

Washington                            12.5                16.8                 57.6                 5.1            33.0

West_Virginia                       16.7                21.5                 80.7                 4.9            45.5

Wisconsin                                8.5                15.9                 57.1                 4.3            32.3

Wyoming                                12.2                17.7                 72.1                 2.1            39.9

1.4       4) Get useful descriptive statistial data on the dataset.

Hint: this is a single line, data._____

[ ]: poverty.describe()

[ ]:                             PovPct Brth15to17 Brth18to19 ViolCrime              TeenBrth

count 51.000000          51.000000            51.000000 51.000000 51.000000

mean 13.117647          22.282353       72.019608         7.854902 42.243137

std           4.277228         8.043499       18.975563         8.914131 12.318511

min          5.300000         8.100000       39.000000         0.900000 20.000000

25%        10.250000        17.250000       58.300000         3.900000 33.900000

50%        12.200000        20.000000       69.400000         6.300000 39.500000

75%        15.800000        28.100000       87.950000         9.450000 52.600000

max        25.300000              44.800000 104.300000 65.000000 69.100000

1.5     5) Print the columns

Index([‘PovPct’, ‘Brth15to17’, ‘Brth18to19’, ‘ViolCrime’, ‘TeenBrth’], dtype=’object’)

1.6              6) Create a regression line based upon the dependent and independent variables:

PovPct Brth18to19

In this step only create a scatterplot of the two variables, simply plotting the data.

Note: The variable PovPct is the percent of a state’s population in 2000 living in households with incomes below the federally defined poverty level.

[ ]: plt.scatter(poverty.PovPct, poverty.Brth18to19)

[ ]: <matplotlib.collections.PathCollection at 0x7efc09f48d50>

1.7       7) Lets create a new variable, x1, as well as the results variable:

Example would be 1. x1 = sm.add_constant(x) 2. results = sm.OLS(y, x1).fit() 3. results.summary() This gives you the OLS Regression results, the coefficients table, and some additional tests. The data that you are interested in is the coefficient values. This is the value for the constant you created is b0, and birth19to19 is b1 in the regression equation.

[ ]: x1 = sm.add_constant(poverty.PovPct)

/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117:

FutureWarning: In a future version of pandas all arguments of concat except for the argument ‘objs’ will be keyword-only

x = pd.concat(x[::order], 1)

[ ]: <class ‘statsmodels.iolib.summary.Summary’> “””

OLS Regression Results

==============================================================================

Dep. Variable:                  Brth18to19        R-squared: 0.422
Model: OLS Adj. R-squared: 0.410
Method:            Least Squares            F-statistic: 35.78
Date: Thu, 17 Feb 2022                Prob (F-statistic): 2.50e-07
Time:                       02:35:21         Log-Likelihood: -207.98
No. Observations: 51 AIC: 420.0
Df Residuals: 49 BIC: 423.8
Df Model: 1
Covariance Type: nonrobust

============================================================================== coef         std err   t           P>|t|      [0.025  0.975]

—————————————————————————–const 34.2124 6.641 5.151 0.000 20.866 47.559 PovPct 2.8822 0.482 5.982 0.000 1.914 3.850

==============================================================================

Omnibus:                                               1.175        Durbin-Watson:                                      2.161

Prob(Omnibus):                                     0.556          Jarque-Bera (JB):                                  0.988

Skew:                                                     0.088        Prob(JB):                                                0.610

Kurtosis:                                               2.341       Cond. No.                                                 45.1

==============================================================================

Warnings:

  • Standard Errors assume that the covariance matrix of the errors is correctly specified. “””

1.8    8) Taking the coeffient values for the new constant and the Y variable, create a scatterplot:

e.g. yhat = 0.1464*x + 0.25712 fig = plt.plot(x, yhat, lw=4, c=’red’, label = ’regression line’)

[ ]: plt.scatter(poverty.PovPct, poverty.Brth18to19) plt.plot([0,25],[34.2124,2.8822*25+34.2124], color=”magenta”)

[ ]: [<matplotlib.lines.Line2D at 0x7fb63c15ab10>]

2         Problem 2: Implement

2.1        1) Perform linear regression using the normal equation, as done in slides.

[ ]: [<matplotlib.lines.Line2D at 0x7fb641086710>]

  • 93813808]])

[ ]: plt.plot(X,y,”b.”) plt.plot([0,2],[theta_best[0],theta_best[1]*2+theta_best[0]], color=”magenta”)

[ ]: [<matplotlib.lines.Line2D at 0x7fb63be61190>]

2.2        2) Perform linear regression using Scikit-Learn, as done in the slides.

[ ]: (array([4.52769162]), array([[2.93813808]]))

/usr/local/lib/python3.7/dist-packages/numpy/core/shape_base.py:65:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify ‘dtype=object’ when creating the ndarray. ary = asanyarray(ary)

[ ]: [<matplotlib.lines.Line2D at 0x7fb63bf92d50>]

3         Problem 3: Multivariate Linear Regression

In this problem we will continue using the poverty dataset. Do poverty and violent crimes affect teen pregnancy?

3.1 1) import the libraries you will need: numpy pandas matplotlab.pyplot statsmodels.api

3.2      2) Import the dataset, poverty_2.csv, and print it.

[ ]:              PovPct ViolCrime TeenBrth

  • 1 11.2     54.5
  • 1 9.1       39.5
  • 1 10.4     61.2
  • 9 10.4     59.9
  • 7 11.2     41.1
  • 8 5.8       47.0
  • 7 4.6       25.8

 

  • 3 3.5       46.3
  • 0 65.0     69.1
  • 2 7.3       44.5
  • 1 9.5       55.7
  • 3 4.7       38.2
  • 5 4.1       39.1
  • 4 10.3     42.2
  • 6 8.0       44.6
  • 2 1.8       32.5
  • 8 6.2       43.0
  • 7 7.2       51.0
  • 7 17.0     58.1
  • 2 2.0       25.4
  • 1 11.8     35.4
  • 0 3.6       23.3
  • 2 8.5       34.8
  • 2 3.9       27.5
  • 5 12.9     64.7
  • 4 8.8       44.1
  • 3 3.0       36.4
  • 6 2.9       37.0
  • 1 10.7     53.9
  • 3 1.8       20.0
  • 8 5.1       26.8
  • 3 8.8       62.4
  • 5 8.5       29.5
  • 6 9.4       52.2
  • 0 0.9       27.2
  • 5 5.4       39.5
  • 1 12.2     58.0
  • 2 4.1       36.8
  • 2 6.3       31.6
  • 6 3.3       35.6
  • 9 7.9       53.0
  • 5 1.8       38.0
  • 5 10.6     54.3
  • 4 9.0       64.4
  • 4 3.9       36.8
  • 3 2.2       24.2
  • 2 7.6       37.6
  • 5 5.1       33.0
  • 7 4.9       45.5
  • 5 4.3       32.3
  • 2 2.1       39.9
    • 3) We need to normalize the input variables.
    • 4) Split the data into input variables, X, and the output variable, Y.
    • 5) Graph the dataset with a seed of 42.
    • 6) Implement Gradient Descent.

This section has be provided. Please run and understand the code.

iteration : 0 loss : 0.14248615838353396 iteration : 100 loss : 0.005841062708110685 iteration : 200 loss : 0.005374637890296291 iteration : 300 loss : 0.005059239296919674 iteration : 400 loss : 0.0048406904218634165

[ ]: array([[0.41375839],

[0.26569717],

3.7       7) Implement Stochastic Gradient Descent. Please run.

iteration : 0 loss : 0.0066577245043739405 iteration : 100 loss : 0.003102327706993443 iteration : 200 loss : 0.002532377208293092 iteration : 300 loss : 0.0023333911770596814 iteration : 400 loss : 0.0022626837845736957

4     Problem 4, predict house price.

  • import real_estate.csv
  • Are there any null values in the dataset? Drop any missing data if exist.
  • Create X as a 1-D array of the distance to the nearest MRT station, and y as the housing price
  • What is the number of samples in the data set? To do this, you can look at the “shape” of X and y
  • Split the data into train and test sets using sklearn’s train_test_split, with test_size = 1/3
  • Find the line of best fit using a Linear Regression and show the result of coefficients and intercept (you can use sklearn’s linear regression)
  • Using the predict method, make predictions for the test set and evaluate the performance (e.g., MSE or other metrics).

[4]:                             X1 transaction date … Y house price of unit area

No                                               …

1                                2012.917 … 37.9
2                                2012.917 … 42.2
3                                2013.583 … 47.3
4                                2013.500 … 54.8
5                                2012.833 … 43.1
..                                           … …
410                             2013.000 … 15.4
411                             2012.667 … 50.0
412                             2013.250 … 40.6
413                             2013.000 … 52.5
414                             2013.500 … 63.9

[414 rows x 7 columns]

[5]:                             X1 transaction date … Y house price of unit area

No                                               …

1                                2012.917 … 37.9
2                                2012.917 … 42.2
3                                2013.583 … 47.3
4                                2013.500 … 54.8
5                                2012.833 … 43.1
..                                           … …
410                             2013.000 … 15.4
411                             2012.667 … 50.0
412                             2013.250 … 40.6
413                             2013.000 … 52.5
414                             2013.500 … 63.9

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names

“X does not have valid feature names, but”

[109]: <matplotlib.collections.PathCollection at 0x7f578280b050>

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names

“X does not have valid feature names, but”

[112]: 73.4172288261123

/usr/local/lib/python3.7/dist-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but PolynomialFeatures was fitted with feature names

“X does not have valid feature names, but”

[113]: 72.16393207172295

Quartic has the least squared error, got larger for 5 or 6 powered, MSE is 72.1639

Mounted at /content/drive/

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts. Extracting templates from packages: 100%

[ ]: