[SOLVED] Data Science Assignment 2

25.00 $

Category:

Description

5/5 - (1 vote)

1 NYC Restaurants

We will analyze restaurant inspections in New York City, starting November

1, 2014 and ending January 31, 2015. The data is in the file NYC Restaurants.csv.

Setup First, read in this data using:

  • df = pd.read csv(’NYC Restaurants.csv’, dtype=str).

This ensures that all fields are read in as strings, and loading the data is

relatively fast.

 Create a unique name for each restaurant. On the

DataFrame created above, add a new column to your DataFrame, called

’RESTAURANT’, that combines DBA, BUILDING, STREET, ZIPCODE,

and BORO fields. For example, “WENDY’S 469 FLATBUSH AVENUE

11225 BROOKLYN”. Print the first 10 values of the RESTAURANT column

of your DataFrame.

How many restaurants are included in the data?

Careful now:

  • A “Subway” in one particular address (i.e., building, street, zipcode,

and boro) counts as one restaurant; don’t combine all Subways into

one restaurant!

  • The data can have multiple violations for the same restaurant!

How many chains are there? Let us define a chain to be

the same restaurant name occurring in at least two different (building, street,

zipcode, boro) addresses (i.e., one DBA with multiple restaurant locations).

You’ll see multiple versions of the name ”DUNKIN DONUTS”. Just act

as if they are different chains.

Plot a bar graph of the top 20 most popular chains.

We already have the chains from the previous problem. Count the number

of restaurants for each chain as a measure of its popularity.

What fraction of all restaurants are chain restau

rants? You should think of a restaurant as being one restaurant at one

location. So “Wendy’s 5th Street” and “Wendy’s on 10th Street” are two

different restaurants. Essentially a restaurant is what you defined in Q1.

Plot the number of non-chain restaurants in each

boro. First, we need to figure out all the non-chain restaurants, then select

out only those restaurants, and finally plot the number of such restaurants

by boro. Make sure to look at the plot; we don’t want to see… oh… the

“missing” boro.

] Plot the fraction of non-chain restaurants in each

boro. The boro with the most non-chain restaurants might just be the

boro with the most restaurants in general. If we want to find the boro that

attracts the most “independent” restauranteurs, we must divide the number

of non-chain restaurants by the total number of restaurants in the boro. Plot

this.

Is the boro with the most independent restaurants also the one with the

highest ratio of independent restaurants?

 Plot the popularity of cuisines. Which cuisines are

the most well-represented among all restaurants? Define the popularity of a

cuisine as the number of restaurants serving that cuisine. Plot the popularity

of the top 20 cuisines.

Plot the cuisines among restaurants which never got

cited for violations. Ideally, you should explore and see what happens

when there is no violation, but here I will just tell you: the ’VIOLATION

CODE’ field is missing.

First, find the restaurants that were never cited for a code violation. The

compute the popularity of each cuisine among these “clean” restaurants. Plot

the popularity of the top-20 among these cuisines.

 What cuisines tend to be the “cleanest”?

  • Select all cuisines for which there were at least 20 restaurants repre

senting that cuisine.

  • For each such cuisine, compute the ratio of the counts in Q9 to Q8.

This is the ratio of restaurants that never got cited, versus total number

of restaurants, for each cuisine.

  • Find the top-10 cuisines with the highest ratios; these are that cuisines

whose restaurants are “most likely to be clean.”

What are the most common violations in each bor

ough? Create a table of the number of times each violation description

was observed in each borough, and figure out the most common violation

description for each borough.

To create the table, check out the crosstab function. We will see a more

general version of this when we discuss groupby in class.

Once you do have the table, you will still need to find the most common

violation description for each borough.

 What are the most common violations per borough,

after normalizing for the relative abundance of each violation?

Hopefully, the answer to the previous question left you unsatisfied, because

some violations are just very common, irrespective of borough. A better

approach would be to normalize the violation counts, as follows.• Get overall frequencies: Figure out how common each violation is, over

the entire dataset; let’s call this violationFrequency.

  • Normalize: Consider the table of number of violations by boro that

you created for the previous question. For each borough, divide the

number of violations of each type by the total number of violations for

that type; i.e., divide the series of violations by violationFrequency.

We want to do this for each borough.

  • Find the biggest violations: Now, after this normalization, for each

borough, figure out the most common violation description.

 How many phone area codes correspond to a single

zipcode? The first three digits of the restaurant phone numbers are their

area codes. The area codes do not generally align with zip codes, but some

area codes are only for a single zip code. You must figure out how many area

codes have this property.

  • To extract the first 3 characters of the phone number, recall that strings

are pretty similar to lists.

 Find common misspellings of street names

Some

times, it’s Avenue, and sometimes, it’s Ave. We will try to come up with an

automated way to find common misspellings. The idea is the following: if

Ave and Avenue are the same, they should show up often in similar-sounding

street names, e.g., Lexington Ave and Lexington Avenue.

  • Create a new column, called STREET TYPE, which is the the last word

in the name of the street. For example, if the street is “Astoria Boule

vard”, the street type should be “Boulevard”.

  • Create another column, called STREET BASE, which contains everything

but the last word in the name of the street. For example, if the street

is “Astoria Boulevard”, the street base should be “Astoria”.

  • Create a third column, called STREET BASE & ZIP, that combines the

street base and the zipcode.

  • Create a table containing just these three columns, and remove any

duplicates or instances where street base is empty. This table now

contains unique street names, along with the street type.• Merge this table with itself, on the STREET BASE & ZIP column. Thus,

in the new merged tabled, we will have two STREET TYPE fields for

each street base and zipcode. For example, if both Lexington Ave and

Lexington Avenue exist in the same zipcode, we will get a row for the

street base Lexington and the two street types Ave and Avenue.

  • From the merged table, select only the rows where the street types are

different.

  • Now, do a cross-tabulation of the two distinct street types (check out

the crosstab function in Pandas). This gives us the number of times

Ave and Avenue were used with the same STREET BASE & ZIP.

  • From this cross-tabulation table, find the most commonly street type

that occurs with each of the following: AVE, ST, RD, PL, BOULEARD,

and BULEVARD.