Name: DATA303/473 Assignment 1-Solved

Data on US cancer mortality rates for over 3000 counties are available in the dataset cancer_reg.csv, available on Blackboard. The data were obtained from the Data World website (https://data.world/nrippner/ols-regression-challenge). Read the data set into R and use it to answer the questions that follow. We’ll use the subset of variables listed below:

incidencerate: Mean per capita (100,000) cancer diagnoses [1]
medincome: Median annual income (dollars) per county [2]
povertypercent: Percent of county population in poverty [2]
studypercap: Per capita number of cancer-related clinical trials per county [1]
medianage: Median age (in years) of county residents [2]
pctunemployed16_over: Percent of county residents aged 16 and over that are unemployed [2]
pctprivatecoverage: Percent of county residents with private health coverage [2]
pctbachdeg25_over: Percent of county residents aged 25 and over with bachelor’s degree as highest education attained [2]

target_deathrate: Response variable. Mean per capita (100,000) cancer mortalities [1]

[1] Years 2010-2016
[2] 2013 Census Estimates

Create a new dataset called cancer2 that contains only the subset of variables listed above.

a. Based on a summary of the variables in the dataset and the plots below, identify any variable or variables that have obviously incorrect values. For the variables you identify, write and implement code to filter out the incorrect values. Give the number of observations left in the dataset.

[Figure: eight panels, each plotting Mortality (vertical axis) against one predictor: mean cancer diagnoses per 100,000; median income per county; percent of population in poverty; number of cancer-related clinical trials per county; median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor’s degree as highest qualification.]
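The inspect-then-filter workflow asked for in the data-cleaning question can be sketched as follows. This is a minimal illustration on an invented toy data frame, not the real cancer2 data: the values, the choice of column, and the cut-off of 100 are all assumptions made for the example. Which variables to filter in the real data, and at what threshold, has to come from your own summary and plots.

```r
# Toy stand-in for cancer2; every value here is invented for illustration.
cancer2_toy <- data.frame(
  target_deathrate = c(165.2, 178.9, 201.4, 150.3, 190.0),
  medianage        = c(39.5, 41.2, 458.4, 36.8, 522.0),
  povertypercent   = c(12.1, 18.4, 15.0, 9.7, 21.3)
)

# Step 1: inspect the ranges. A maximum far outside what is physically
# plausible for the variable is a red flag.
summary(cancer2_toy)

# Step 2: keep only rows with plausible values.
# The cut-off of 100 is a hypothetical threshold chosen for this toy example.
cancer2_clean <- subset(cancer2_toy, medianage <= 100)

# Step 3: report how many observations remain.
nrow(cancer2_clean)
```

With real data, the same pattern applies after reading the file with `read.csv()` and selecting the listed columns.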

b. Some data cleaning is done on cancer2 and a new dataset cancer3.csv (available on Blackboard) is created. Construct a scatterplot matrix of all variables in the new dataset. List any key points of note from the scatterplot matrix, including any considerations you might make during a regression analysis.

c. Fit a linear model to the data in cancer3, including all predictors with no transformations or interactions. Present a summary of the model in a table. Give an estimate of σ², the error variance.
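As a rough sketch of the mechanics of fitting a model on all predictors and estimating the error variance, the example below uses simulated data rather than cancer3. The column names `x1`, `x2`, `y` and all numbers are assumptions for illustration. It estimates σ² as the squared residual standard error and cross-checks it against the residual sum of squares divided by the residual degrees of freedom.

```r
set.seed(1)

# Simulated stand-in data; names and values are hypothetical.
n <- 200
toy <- data.frame(
  x1 = rnorm(n, mean = 450, sd = 50),
  x2 = rnorm(n, mean = 15, sd = 5)
)
toy$y <- 50 + 0.2 * toy$x1 + 1.5 * toy$x2 + rnorm(n, sd = 10)

# "y ~ ." regresses y on every other column, with no transformations
# or interactions.
fit <- lm(y ~ ., data = toy)

# Coefficient table: estimate, standard error, t value, p-value.
coef_table <- summary(fit)$coefficients
coef_table

# Estimate of the error variance: squared residual standard error.
sigma2_hat <- sigma(fit)^2

# Equivalent calculation from first principles: RSS / (n - p - 1).
rss_based <- sum(residuals(fit)^2) / df.residual(fit)
```

For a formatted table in a report, `coef_table` can be passed to a table helper such as `knitr::kable()`, assuming that package is installed.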

d. Suppose two counties differ by 1 per 100,000 in mean cancer diagnoses with all else being equal. Based on the model fitted in part (c), what is the difference in expected cancer mortality for these two counties?
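For a linear model with no transformations or interactions, a comparison of this kind reduces to a single coefficient. A short derivation follows, writing x_1 for incidencerate and β_1 for its coefficient; this notation is an assumption, since the assignment does not fix symbols.

```latex
\begin{aligned}
E[Y \mid x_1 + 1, x_2, \dots, x_p] - E[Y \mid x_1, x_2, \dots, x_p]
  &= \Bigl(\beta_0 + \beta_1 (x_1 + 1) + \sum_{j=2}^{p} \beta_j x_j\Bigr)
   - \Bigl(\beta_0 + \beta_1 x_1 + \sum_{j=2}^{p} \beta_j x_j\Bigr) \\
  &= \beta_1 .
\end{aligned}
```

The estimated difference is therefore the fitted coefficient of incidencerate in the model summary.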

e. Does it make practical sense to interpret the intercept for the model in part (c)? Justify your answer.

f. The model fitted in part (c) is to be used to predict cancer mortality for a county with the predictor values below. Obtain 95% confidence and prediction intervals for such a county. Explain briefly why the prediction interval is wider than the confidence interval.

incidencerate: 452
medincome: 23000
povertypercent: 16
studypercap: 150
medianage: 40
pctunemployed16_over: 8
pctprivatecoverage: 70
pctbachdeg25_over: 50
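Both kinds of interval can be obtained from `predict()` on a fitted `lm` object. The sketch below uses simulated data with hypothetical columns `x1` and `x2`, not the cancer3 model, so the numbers it produces have no bearing on the assignment. It computes both intervals at the same new point and compares their widths.

```r
set.seed(2)

# Simulated stand-in data; names and values are hypothetical.
n <- 150
toy <- data.frame(x1 = runif(n, 300, 600), x2 = runif(n, 5, 30))
toy$y <- 40 + 0.25 * toy$x1 + 1.2 * toy$x2 + rnorm(n, sd = 12)
fit <- lm(y ~ x1 + x2, data = toy)

# New observation: one row, with column names matching the predictors.
new_county <- data.frame(x1 = 452, x2 = 16)

# Interval for the mean response at these predictor values.
conf_int <- predict(fit, newdata = new_county,
                    interval = "confidence", level = 0.95)

# Interval for a single new response at these predictor values.
pred_int <- predict(fit, newdata = new_county,
                    interval = "prediction", level = 0.95)

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
c(confidence = conf_width, prediction = pred_width)
```

Both intervals share the same centre (the fitted value). The prediction interval adds the variance of a new observation's error term to the uncertainty in the estimated mean, which is why it comes out wider.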

g. Assuming all regression assumptions hold, are the intervals you obtained in part (f) likely to be valid? Explain your answer briefly.

h. Based on a global usefulness test, is it worth going on to further analyse and interpret a model of target_deathrate against each of the predictors? Carry out the test, give the conclusion and justify your answer.
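A global usefulness test is commonly carried out as the overall F-test, which compares the full model with an intercept-only model. The sketch below assumes that interpretation and uses simulated data with hypothetical columns. It extracts the F statistic from the model summary, computes the p-value, and confirms the result with an explicit nested-model comparison.

```r
set.seed(3)

# Simulated stand-in data; names and values are hypothetical.
n <- 120
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = toy)
null <- lm(y ~ 1, data = toy)          # intercept-only model

# H0: all slope coefficients are zero. H1: at least one is non-zero.
# summary() stores the F statistic as a named vector: value, numdf, dendf.
f <- summary(full)$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# The same test expressed as a comparison of nested models.
cmp <- anova(null, full)
cmp
```

A small p-value leads to rejecting the null hypothesis that none of the predictors is useful, which supports analysing the model further.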

i. The plots below are constructed from the cleaned dataset cancer3. Which predictors, if any, would you consider applying log or polynomial transformations to? Explain your answer briefly.

[Figure: eight panels from the cleaned dataset, each plotting Mortality (vertical axis) against one predictor: mean cancer diagnoses per 100,000; median income per county; percent of population in poverty; number of cancer-related clinical trials per county; median age of county; % aged 16 and over who are unemployed; % with private health coverage; % aged 25 and over with Bachelor’s degree as highest qualification.]

Francis Galton’s 1866 dataset (cleaned) lists individual observations on height for 899 children. Galton coined the term “regression” following his study of how children’s heights related to heights of their parents. The data are available in the file galton.csv and contain the following variables:

familyID: Family ID
father: Height of father
mother: Height of mother
gender: Gender of child
height: Height of child
kids: Number of children in family
midparent: Mid-parent height calculated as (father + 1.08*mother)/2
adltchld: height if gender = M, otherwise 1.08*height if gender = F

All heights are measured in inches.

a. Read the data into R and fit a linear model for height with the variables father, mother, gender, kids and midparent as predictors. Provide a summary of the fitted model. You will notice that estimates for midparent are listed as NA. Why might this be the case and what regression problem does this point to?
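The NA behaviour can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not the galton.csv analysis: the heights are invented, and only the midparent formula is taken from the variable list. It builds a predictor that is an exact linear combination of two others and shows how `lm()` reports it.

```r
set.seed(4)

# Simulated heights in inches; all values are invented for illustration.
n <- 100
toy <- data.frame(
  father = rnorm(n, mean = 69, sd = 2.5),
  mother = rnorm(n, mean = 64, sd = 2.3)
)

# midparent is an exact linear function of father and mother.
toy$midparent <- (toy$father + 1.08 * toy$mother) / 2
toy$height <- 20 + 0.4 * toy$father + 0.3 * toy$mother + rnorm(n, sd = 2)

fit <- lm(height ~ father + mother + midparent, data = toy)

# The design matrix is rank deficient, so lm() cannot estimate a separate
# coefficient for the redundant column and reports it as NA.
coef(fit)

# alias() shows which linear dependency caused the problem.
alias(fit)
```

Here `fit$rank` is 3 although four coefficients were requested, which is the numerical signature of the exact dependency.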

b. What action might you take to resolve the problem identified in part (a)?
c. Based on the model fitted in part (a), give an interpretation of the coefficient for genderM.
d. Determine the number of families in the dataset.
e. The problem in part (a) is resolved and a new linear model is fitted. No observations are excluded. The plots below are obtained to investigate regression assumptions for this new model. Based on your answer in part (d) and the plots below, do the data meet all the regression assumptions? Explain your answer briefly.

[Figure: four regression diagnostic plots for the new model: Residuals vs Fitted; Normal Q-Q (standardized residuals against theoretical quantiles); Scale-Location; Residuals vs Leverage (with Cook’s distance).]
