CS8803-Assignment 2

In this assignment, you’ll begin the process of exploring relationships in data. You’ll accomplish this task by computing some basic statistical measures on one of three datasets. This is a good time to learn or reboot your Python coding skills.

 

Step 1: Select one of the following datasets to complete this assignment:

  • [mental-health-in-tech-survey.csv] Mental Health in Tech Survey: Survey on Mental Health in the Tech Workplace in 2014 – https://osmihelp.org/research/

 

Dependent Variables:

  • treatment: Have you sought treatment for a mental health condition? (Yes/No)
  • mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? (Yes/Maybe/No)
  • phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? (Yes/Maybe/No)

 

  • [diabetic_data.csv] Diabetes 130 US hospitals for years 1999-2008: Diabetes – readmission – https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

 

Dependent Variables:

  • time_in_hospital: a numeric value representing number of days between admission and discharge
  • readmitted: Days to inpatient readmission – “<30” if the patient was readmitted in less than 30 days, “>30” if the patient was readmitted in more than 30 days, and “No” for no record of readmission.

 

  • [compas-scores-two-years.csv] COMPAS Recidivism Racial Bias: Racial Bias in inmate COMPAS reoffense risk scores for Florida (ProPublica) – https://github.com/propublica/compasanalysis

 

Dependent Variables:

  • decile_score: a numeric value between 1 and 10 corresponding to the recidivism risk score generated by COMPAS software (a small number corresponds to a low risk, a larger number corresponds to a high risk).
  • two_year_recid: a numeric indicator of whether the defendant recidivated two years after previous charge (0: no, did not recidivate, 1: yes, did recidivate)

 

 

Step 2: Explore the data by answering the following questions:

  • Which dataset did you select?
  • How many observations are in the dataset?
  • How many variables are in the dataset?
  • Does this dataset seem to belong to a regulated domain in law as discussed in the lectures? If yes, which one?
  • How many variables in the dataset are associated with a legally recognized protected class? In a table, list the variables associated with a protected class, and for each one identify the protected class and the associated legal precedent/law as discussed in the lectures.

 

Example Output (associated with a different dataset):

Dataset: Housing Decisions in Metro-Atlanta

Number of Observations: 1,400

Number of Variables: 16

Regulated Domain in Law: Housing (Fair Housing Act)

Number of Protected Class Variables: 2

Variable         Protected Class   Law
nationality      National origin   Civil Rights Act of 1964, 1991
pregnant (y/n)   Pregnancy         Pregnancy Discrimination Act
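The first counts in Step 2 can be read straight from the CSV file. The following is a minimal sketch using only the Python standard library; the inline sample rows and their column names are made up for illustration and stand in for whichever dataset you selected.

```python
import csv
import io

# Made-up stand-in for the chosen CSV file (illustration only).
# For real data, pass open("your_file.csv", newline="") instead.
SAMPLE_CSV = """age,gender,treatment
34,Female,Yes
41,Male,No
29,Male,Yes
"""

def describe_csv(handle):
    """Return (number of observations, number of variables, column names)."""
    reader = csv.DictReader(handle)
    rows = list(reader)               # one dict per observation
    columns = reader.fieldnames or [] # header row gives the variables
    return len(rows), len(columns), columns

n_obs, n_vars, columns = describe_csv(io.StringIO(SAMPLE_CSV))
print(f"Number of Observations: {n_obs:,}")
print(f"Number of Variables: {n_vars}")
```

Deciding which of the listed columns map to a protected class is still a manual step based on the lectures; the code only supplies the counts.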

 

 

Step 3: Determine the relationships between dependent and independent variables

The frequency of a value represents the number of times a value occurs in a data set. Compute the frequency of each value associated with each dependent variable (listed in Step 1) as a function of all of the protected class variables (independent variables) identified in Step 2. Create histogram(s) comparing the frequency values of the dependent variable as a function of the independent variable. Hint: For variables that are continuous, you might consider creating intervals that represent the data. For categorical/ordinal/nominal values, you might consider converting to numerical values.
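The hint above can be sketched as two small helpers: one that groups a continuous value into fixed-width intervals, and one that converts categories to integer codes. Both function names and the interval width of 10 are assumptions chosen for illustration, not part of the assignment.

```python
def to_interval(value, width=10):
    """Map a number to a fixed-width interval label such as '30-39'."""
    low = (int(value) // width) * width
    return f"{low}-{low + width - 1}"

def encode_categories(values):
    """Give each distinct category an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes

# Hypothetical values, for illustration only.
ages = [34, 41, 29, 40]
answers = ["Yes", "No", "Yes", "Maybe"]

age_bins = [to_interval(a) for a in ages]
encoded, mapping = encode_categories(answers)
print(age_bins)
print(encoded, mapping)
```

Keeping the mapping returned by `encode_categories` makes it possible to translate numeric results back to the original labels when reporting.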

 

Example Output for One Dependent-Independent Variable Combination:  

Independent Variable (Protected Class Variable)   Dependent Variable: Housing Decision (Y/N)
Pregnant – Y                                      Frequency of Y: 50    Frequency of N: 120
Pregnant – N                                      Frequency of Y: 130   Frequency of N: 20
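A frequency table like the example output can be built by counting (independent value, dependent value) pairs. The sketch below rebuilds the housing example counts as a list of pairs and prints a rough text histogram; the function name `frequency_table` is hypothetical, and with real data the pairs would come from two columns of your dataset.

```python
from collections import Counter

def frequency_table(pairs):
    """Count (independent value, dependent value) pairs into nested dicts."""
    counts = Counter(pairs)
    table = {}
    for (group, outcome), n in counts.items():
        table.setdefault(group, {})[outcome] = n
    return table

# Rebuild the example counts as (pregnant, housing decision) pairs.
pairs = ([("Y", "Y")] * 50 + [("Y", "N")] * 120
         + [("N", "Y")] * 130 + [("N", "N")] * 20)

table = frequency_table(pairs)
for group, outcomes in table.items():
    for outcome, n in outcomes.items():
        # One '#' per 10 observations gives a quick text histogram.
        print(f"Pregnant - {group}: Frequency of {outcome}: {n:>3} {'#' * (n // 10)}")
```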

 

 

Step 4: Show how data can be manipulated

Select one protected class variable (independent variable) and one dependent variable.

  1) Create a graph to support the “fairness” hypothesis: The system is fair. There is no difference in the outcomes.
  2) Create a graph to support the bias hypothesis: The system is biased. There is a difference in the outcomes.

For each, provide a brief description of your manipulations.

 

Example Output:

 

  • Fair Hypothesis: As seen from this graph, housing decisions are not dependent on the pregnancy status of women. [Manipulations: Used line graph; Increased scale to ±50; Mapped the ratio of positive Y decisions (i.e. 50/180 versus 130/180); No label on the Y-axis].

[Figure: Difference in Housing Decisions Based on Pregnancy]

 

  • Bias Hypothesis: As seen from this graph, housing decisions are significantly dependent on the pregnancy status of women. [This hypothesis was easily supported by the data, so it did not require much manipulation: Used stacked bar graph; Reduced scale; Reworded labels].
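The ratio mentioned in the fair-hypothesis example (50/180 versus 130/180) divides by the total number of Y decisions. Dividing by the size of each group instead gives a different pair of numbers from the same counts. The short calculation below, using the Step 3 example counts, is only a sketch of how the choice of denominator changes what a graph would show; the variable names are made up.

```python
# Counts from the Step 3 example table: (pregnant, decision) -> frequency.
counts = {("Y", "Y"): 50, ("Y", "N"): 120, ("N", "Y"): 130, ("N", "N"): 20}

total_positive = counts[("Y", "Y")] + counts[("N", "Y")]  # 50 + 130 = 180

# Share of all Y decisions that went to each group (50/180 versus 130/180).
share_of_positive = {g: counts[(g, "Y")] / total_positive for g in ("Y", "N")}

# Rate of Y decisions within each group (50/170 versus 130/150).
rate_within_group = {
    g: counts[(g, "Y")] / (counts[(g, "Y")] + counts[(g, "N")])
    for g in ("Y", "N")
}

for g in ("Y", "N"):
    print(f"Pregnant - {g}: share of Y decisions = {share_of_positive[g]:.3f}, "
          f"Y rate within group = {rate_within_group[g]:.3f}")
```

Stating which denominator was used is part of the "brief description of your manipulations" the step asks for.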

Step 5: Given your selected protected class variable (independent variable), calculate the averages (mean, median, and mode) of the protected class group (Hint: Variables might need to be converted to numerical values). Use random sampling to draw 50% of the data and create a reduced dataset. Calculate the averages (mean, median, and mode) of the protected class group in the reduced dataset. Indicate whether each average differs between the original dataset and the reduced dataset. Provide all results.

 

Protected Class Variable (Pregnant)   Mean            Median       Mode
Original Data Set                     0 (NO)          0 (NO)       0 (NO)
Reduced Data Set                      0 (NO)          1 (YES)      0 (NO)
Difference                            No Difference   Difference   No Difference
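One possible way to carry out Step 5 with the Python standard library is sketched below. The encoded column is made up for illustration (1 = YES, 0 = NO), the helper names are hypothetical, and the fixed seed is an assumption added so the run is repeatable; the sampled averages will depend on the seed and on your data.

```python
import random
import statistics

def averages(values):
    """Mean, median, and mode of a numerically encoded column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }

def sample_half(values, seed=0):
    """Draw 50% of the values at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return rng.sample(values, k=len(values) // 2)

# Made-up encoded column for illustration only: 1 = YES, 0 = NO.
original = [1] * 30 + [0] * 70
reduced = sample_half(original)

orig_avg = averages(original)
red_avg = averages(reduced)
for name in ("mean", "median", "mode"):
    verdict = "Difference" if orig_avg[name] != red_avg[name] else "No Difference"
    print(f"{name}: original={orig_avg[name]}, reduced={red_avg[name]} -> {verdict}")
```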

 

Step 6: Given your reduced dataset from Step 5, repeat Step 3 (frequency and histogram) using your selected independent variable as a function of your selected dependent variable (from Step 4). Explain any differences (in no more than 2 sentences). If you used the random sampling method, would members associated with the protected class variable benefit or be harmed? Explain your reasoning (in no more than 2 sentences).
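For the comparison in Step 6, whole rows need to be sampled so that each independent value stays paired with its dependent value. The sketch below does this for the housing example counts from Step 3 and compares the per-group rate of Y decisions before and after sampling; the helper name `group_rates` and the seed are assumptions made for illustration.

```python
import random
from collections import Counter

def group_rates(pairs, positive="Y"):
    """Per-group share of rows whose outcome equals `positive`."""
    totals = Counter(group for group, _ in pairs)
    hits = Counter(group for group, outcome in pairs if outcome == positive)
    return {group: hits[group] / totals[group] for group in totals}

# Housing example counts from Step 3 as (pregnant, decision) rows.
full = ([("Y", "Y")] * 50 + [("Y", "N")] * 120
        + [("N", "Y")] * 130 + [("N", "N")] * 20)

rng = random.Random(0)  # fixed seed so the run is repeatable
reduced_rows = rng.sample(full, k=len(full) // 2)

full_rates = group_rates(full)
reduced_rates = group_rates(reduced_rows)
for group in full_rates:
    print(f"Pregnant - {group}: Y rate full={full_rates[group]:.3f}, "
          f"reduced={reduced_rates.get(group, float('nan')):.3f}")
```

A gap between the two rates for a group is one concrete basis for arguing whether that group would benefit or be harmed by decisions made on the reduced dataset.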