[SOLVED] DA5030 Problems 1

An organization has collected data on customer visits, transactions, operating system, and gender and desires
to build a model to predict revenue. For the moment, the goal is to prepare the data for modeling. Analyze the
data set in the following manner:
1.  Either install Base R and RStudio on your computer or create an account at RStudio.cloud and
then learn how to build R Markdown Notebooks to execute your code and organize your output into a
readable report. For those working on Windows, you may also use Microsoft R Open.
2.  Download this data set and then upload the data into RStudio Cloud. Each row represents a
customer’s interactions with the organization’s web store. The first column is the number of visits of a
customer, the second the number of transactions of that customer, the third column is the customer’s
operating system, and the fourth column is the customer’s reported gender, while the last column is
revenue, i.e., the total amount spent by that customer.
3. Calculate the following summary statistics: total transaction amount (revenue), mean number
of visits, median revenue, standard deviation of revenue, most common gender. Exclude any cases
where there is a missing value.
4.  Create a bar/column chart of gender (x-axis) versus revenue (y-axis). Omit missing values, i.e.,
where gender is NA or missing.
5. What is the Pearson product-moment correlation between number of visits and revenue? Comment on
the correlation.
6.  Which columns have missing data? How did you recognize them? How would you impute
missing values?
7.  Impute missing transaction and gender values. Use the mean for transaction (rounded to the
nearest whole number) and the mode for gender.
8.  Split the data set into two equally sized data sets where one can be used for training a model
and the other for validation. Take every odd numbered case and add them to the training data set and
every even numbered case and add them to the validation data set, i.e., row 1, 3, 5, 7, etc. are training
data while rows 2, 4, 6, etc. are validation data.
9. Calculate the mean revenue for the training and the validation data sets and compare them.
Comment on the difference.
10.  For many data mining and machine learning tasks, there are packages in R. Use the sample()
function to split the data set, so that 60% is used for training, 20% is used for testing, and another
20% is used for validation. To ensure that your code is reproducible and that everyone gets the same
result, use the number 77654 as your seed for the random number generator. Use the code fragment
below for reference:
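
Steps 3, 5, 6, and 7 can be sketched in base R roughly as follows. This is an illustration only and is not the course's reference fragment: the toy data frame, the column names (Visits, Transactions, OS, Gender, Revenue), and the helper stat_mode() are assumptions made for the example, since the actual data file is not reproduced here.

```r
# Hypothetical toy data; the real file and its column names may differ.
customers <- data.frame(
  Visits       = c(7, 20, 22, 24, 1, 13, 23, 14),
  Transactions = c(0, 1, 1, NA, 0, 1, 2, 1),
  OS           = c("Android", "iOS", "iOS", "iOS",
                   "Android", "Android", "iOS", "Android"),
  Gender       = c("Male", NA, "Female", "Female",
                   "Male", "Male", "Male", "Female"),
  Revenue      = c(0, 576.9, 850, 1050, 84.7, 460, 1000, 475),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value.
# Base R's mode() reports the storage mode, not the statistical mode.
stat_mode <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Step 3: summary statistics, excluding rows with any missing value.
complete       <- na.omit(customers)
total_revenue  <- sum(complete$Revenue)
mean_visits    <- mean(complete$Visits)
median_revenue <- median(complete$Revenue)
sd_revenue     <- sd(complete$Revenue)
common_gender  <- stat_mode(complete$Gender)

# Step 5: Pearson correlation between visits and revenue.
r_visits_revenue <- cor(customers$Visits, customers$Revenue,
                        method = "pearson")

# Step 6: number of missing values in each column.
missing_per_column <- colSums(is.na(customers))

# Step 7: impute transactions with the rounded mean and gender with the mode.
imputed <- customers
imputed$Transactions[is.na(imputed$Transactions)] <-
  round(mean(customers$Transactions, na.rm = TRUE))
imputed$Gender[is.na(imputed$Gender)] <- stat_mode(customers$Gender)
```

In the real data set, missing gender values might also be encoded as empty strings or some other placeholder, so it is worth inspecting the unique values of each column before relying on is.na() alone.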
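
For the chart in step 4, one possible base R approach is shown below. It is a sketch under stated assumptions rather than the expected solution: the toy data and column names are hypothetical, and the choice to plot total revenue per gender (instead of, say, mean revenue) is an interpretation of the task.

```r
# Hypothetical toy data; the real file and its column names may differ.
customers <- data.frame(
  Gender  = c("Male", NA, "Female", "Female", "Male", "", "Male"),
  Revenue = c(100, 50, 200, 150, 300, 75, 25),
  stringsAsFactors = FALSE
)

# Drop rows where gender is NA or an empty string.
gender_known <- customers[!is.na(customers$Gender) &
                            customers$Gender != "", ]

# Assumption: each bar shows the total revenue for that gender.
revenue_by_gender <- tapply(gender_known$Revenue,
                            gender_known$Gender,
                            sum, na.rm = TRUE)

barplot(revenue_by_gender,
        xlab = "Gender",
        ylab = "Total revenue",
        main = "Revenue by gender")
```

Swapping sum for mean in the tapply() call would plot average revenue per customer instead, which can be easier to compare when the groups differ in size.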
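
The two splitting strategies in steps 8 to 10 could look something like this in base R. This is a separate sketch and not the course's reference fragment: the ten-row toy data frame and all variable names are assumptions, and rounding the partition sizes with floor() is just one reasonable choice.

```r
# Hypothetical toy data with ten customers.
customers <- data.frame(
  Visits  = 1:10,
  Revenue = seq(10, 100, by = 10)
)
n <- nrow(customers)

# Step 8: odd-numbered rows for training, even-numbered rows for validation.
odd_rows   <- seq(1, n, by = 2)
training   <- customers[odd_rows, ]
validation <- customers[-odd_rows, ]

# Step 9: compare mean revenue between the two halves.
mean_training   <- mean(training$Revenue)
mean_validation <- mean(validation$Revenue)
mean_difference <- mean_training - mean_validation

# Step 10: random 60/20/20 split using sample() with the given seed.
set.seed(77654)
shuffled <- sample(n)                # random permutation of 1..n
n_train  <- floor(0.6 * n)
n_test   <- floor(0.2 * n)

train_idx <- shuffled[seq_len(n_train)]
test_idx  <- shuffled[n_train + seq_len(n_test)]
valid_idx <- shuffled[-seq_len(n_train + n_test)]  # whatever remains

train_set <- customers[train_idx, ]
test_set  <- customers[test_idx, ]
valid_set <- customers[valid_idx, ]
```

Shuffling the row indices once and slicing the permutation guarantees that the three partitions do not overlap and together cover every row. Calling set.seed() immediately before sample() is what makes the split repeatable.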