Lab 1F: A Diamond in the Rough
Directions: Follow along with the slides, completing the questions in blue on your computer, and answering the questions in red in your journal.
Messy data? Get used to it
- 
Since lab 1, the data we've been using has been pretty clean.
 - 
Why do we call it clean?
– Variables were named so we could understand what they were about.
– There didn't seem to be any typos in the values.
– Numerical variables were considered numbers.
– Categorical variables were composed of categories.
 - 
Unfortunately, more often than not, data is messy until YOU clean it.
 - 
In this lab, we'll learn a few essentials for cleaning dirty data.
 
Messy data?
- 
What do we mean by messy data?
 - 
Variables might have non-descriptive names
– Var01, V2, a, ...
 - 
Categorical variables might have misspelled categories
– "blue", "Blue", "blu", ...
 - 
Numerical variables might have been input incorrectly. For example, if we're talking about people's height in inches:
– 64.7, 6.86, 676, ...
 - 
Numerical variables might be incorrectly coded as categorical variables (or vice-versa)
– "64.7", "68.6", "67.6"
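Each kind of mess listed above can be spotted with a few base R commands. The sketch below uses a made-up data frame (the name messy, its columns, and the 36 to 96 inch "plausible height" range are all assumptions for illustration, not part of the ATU data).

```r
# Hypothetical toy data showing each kind of mess described above
messy <- data.frame(
  V1 = c("blue", "Blue", "blu", "blue"),   # misspelled categories
  V2 = c(64.7, 6.86, 676, 68.6),           # heights in inches, two typed wrong
  V3 = c("64.7", "68.6", "67.6", "70.1"),  # numbers stored as strings
  stringsAsFactors = FALSE
)

# Non-descriptive names
names(messy)

# Count each spelling of the category
table(messy$V1)

# Flag heights outside an assumed plausible range of 36 to 96 inches
suspicious <- messy$V2 < 36 | messy$V2 > 96
messy$V2[suspicious]

# Check how R has stored each column
sapply(messy, class)
```

The range check only flags values for a closer look. Deciding what the correct value should have been still takes human judgment.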
 
The American Time Use Survey
- 
To show you what dirty data looks like, we'll check out the American Time Use Survey, or ATU survey.
 - 
What is the ATU survey?
– It's a survey conducted by the US government (specifically, the Bureau of Labor Statistics).
– They survey thousands of people to find out exactly what activities they do throughout a single day.
– These thousands of people combined together give an idea about how much time the typical person living in the US spends doing various activities.
 
Load and go:
- 
Type the following commands into your console:
data(atu_dirty)
View(atu_dirty)
 - 
(1) Just by viewing the data, what parts of our ATU data do you think need cleaning?
 
Description of ATU Variables
- 
The description of the actual variables:
– caseid: Anonymous ID of survey taker.
– V1: The age of the respondent.
– V2: The sex of the respondent.
– V3: Whether the person is employed full-time or part-time.
– V4: Whether the person has a physical difficulty.
– V5: How long the person sleeps, in minutes.
– V6: How long the survey taker spent on homework, in minutes.
– V7: How long the respondent spent socializing, in minutes.
 
New name, same old data
- 
To fix the variable names, we need to assign a new set of names in place of the old ones.
– Below is an example of the rename function:
atu_cleaner <- rename(atu_dirty, age = V1, sex = V2)
 - 
Use the example code and the variable information on the previous slide to rename the rest of the variables in atu_dirty.
 - 
(2) Write down the new names you chose for the rest of the variables in atu_dirty.
– Names should be short, contain no spaces, and describe what the variable is related to. So use abbreviations to your heart's content.
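To see the renaming pattern on its own, here is a minimal sketch on a made-up data frame. It assumes the rename used in this lab behaves like the one in the dplyr package; the names pets_dirty, pets_cleaner, species and weight_kg are hypothetical.

```r
library(dplyr)  # provides rename(); assumed to be installed

# Hypothetical toy data frame with non-descriptive names
pets_dirty <- data.frame(a = c("cat", "dog"), b = c(4.2, 11.5))

# New names go on the left of each "=", old names on the right
pets_cleaner <- rename(pets_dirty, species = a, weight_kg = b)

names(pets_dirty)    # the original data frame is unchanged
names(pets_cleaner)  # the copy carries the new names
```

Because the result is assigned to a new name, the dirty version stays available in case a rename needs to be redone.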
 
Next up: Strings
- 
In programming, a string is sort of like a word.
– It's a value made up of characters (letters, digits, spaces, and other symbols).
 - 
The following are examples of strings. Notice that each string has quotes before and after.
"string" "A1B2c3" "Hot Cocoa" "0015"
 
Numbers are words? (Sometimes)
- 
In some cases, R will treat values that look like numbers as if they were strings.
 - 
Sometimes we do this on purpose.
– For example, we can code Yes/No variables as "1"/"0".
 - 
Sometimes we don't mean for this to happen.
– The number of siblings a person has should not be a string.
 - 
Look at the structure (str) of your data and the variable descriptions from a few slides back:
– (3) Write down the variables that should be numeric but are improperly coded as strings or characters.
 
Changing strings into numbers
- 
To fix this problem, we need to tell R to think of our "numeric" variables as numeric variables.
 - 
We can do this with the as.numeric function.
– An example using this function is below:
as.numeric("3.14")
## [1] 3.14
 - 
Notice: We started with a string, "3.14", but as.numeric was able to turn it back into a number.
 
Mutating in action
- 
(4) Look at the variables you thought should be numeric and select one. Then fill in the blanks below to see how we can correctly code it as a number:
atu_cleaner <- mutate(atu_cleaner, age = as.numeric(age), ___ = as.numeric(___))
 - 
Once you have this code working, use a similar line of code to correctly code the other numeric variables as numbers.
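Before converting a whole column, it can help to see how as.numeric treats different strings. The sketch below uses made-up values (the names heights and converted are hypothetical) and shows that a string that does not look like a number becomes NA.

```r
# Strings that look like numbers convert cleanly; leading zeros are dropped
as.numeric("0015")
## [1] 15

# A string that is not a number becomes NA, and R would normally print a warning
heights <- c("64.7", "sixty", "67.6")
converted <- suppressWarnings(as.numeric(heights))
converted

# Arithmetic now works on the values that converted
mean(converted, na.rm = TRUE)
## [1] 66.15
```

An unexpected NA after a conversion is a useful clue that the original column contained a typo or a stray word.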
 
Deciphering Categorical Variables
- 
We mentioned earlier that we sometimes code categorical variables as numbers.
– For example, our sex variable uses "01" and "02" for "Male" and "Female", respectively.
 - 
It's often much easier to analyze and interpret when we use more descriptive categories, such as "Male" and "Female".
 
Factors and Levels
- 
R has a special name for categorical variables, called factors.
 - 
R also has a special name for the different categories of a categorical variable.
– The individual categories are called levels.
 - 
To see the levels of sex and their counts type:
tally(~sex, data = atu_cleaner)
 - 
(5) Use similar code as we used above to write down the levels for the three factors in our data.
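The idea of factors and levels can also be explored with base R alone. The sketch below is an illustration on a made-up vector (answers and answers_factor are hypothetical names), and it uses table() as a stand-in for the tally function used in this lab.

```r
# Hypothetical survey answers stored as strings
answers <- c("yes", "no", "yes", "maybe", "yes")

# as.factor() turns the strings into a factor (a categorical variable)
answers_factor <- as.factor(answers)

# levels() lists the distinct categories, sorted alphabetically by default
levels(answers_factor)

# table() counts how many times each level appears
table(answers_factor)
```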
 
A level by any other name...
- 
If we know that '01' means 'Male' and '02' means 'Female' then we can use the following code to recode the levels of sex.
 - 
Type the following command into your console:
atu_cleaner <- mutate(atu_cleaner, sex = recode(sex, "01" = "Male", "02" = "Female"))
 - 
This code is definitely a bit of a mouthful. Let's break it down.
 
Allow me to explain
atu_cleaner <- mutate(atu_cleaner, sex =
        recode(sex, "01" = "Male",
            "02" = "Female"))
- 
This code is saying:
– Replace my current version of atu_cleaner ...
– with a mutated one where ...
– the sex variable's levels ...
– have been recoded ...
– where "01" will now be "Male" ...
– and "02" will now be "Female".
 
Finish it off!
- 
Recode the categorical variable about whether the person surveyed had a physical challenge or not. The coding is currently:
– "01": Person surveyed did not have a physical challenge.
– "02": Person surveyed did have a physical challenge.
 - 
Write a script that:
(1) Loads the atu_dirty data set
(2) Cleans the data as we have in this lab
(3) Saves a copy of the cleaned data (see next slide).
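A script with these three steps might be organized roughly as follows. This is only a sketch on a made-up data frame, assuming the rename, mutate and recode functions behave like those in the dplyr package; every name in it (trips_dirty, minutes, transport, and the "Bus"/"Bike" labels) is hypothetical, and the file goes to a temporary folder rather than your working directory.

```r
library(dplyr)  # provides rename(), mutate() and recode(); assumed to be installed

# (1) Load the data. A hypothetical toy data frame stands in for a real data set.
trips_dirty <- data.frame(
  V1 = c("12", "7", "30"),    # minutes, stored as strings
  V2 = c("01", "02", "01"),   # coded category
  stringsAsFactors = FALSE
)

# (2) Clean the data: rename, convert strings to numbers, recode categories
trips_cleaner <- rename(trips_dirty, minutes = V1, transport = V2)
trips_cleaner <- mutate(trips_cleaner, minutes = as.numeric(minutes))
trips_cleaner <- mutate(trips_cleaner,
                        transport = recode(transport, "01" = "Bus", "02" = "Bike"))

# Check the structure before saving
str(trips_cleaner)

# (3) Save a copy of the cleaned data (written to a temporary folder here)
trips_clean <- trips_cleaner
save(trips_clean, file = file.path(tempdir(), "trips_clean.Rda"))
```

Keeping the steps in this order means the whole script can be re-run from the top whenever the cleaned data needs to be rebuilt.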
 - 
NOTE: You can watch this video to learn about RScripts:
 
The final lines
- 
The last few lines of your script are extremely important because they will save all of your work.
 - 
Be sure to View your data and check its structure to make sure it looks clean and tidy before saving.
 - 
Run the code below:
atu_clean <- atu_cleaner
 - 
This code will create a new data frame in your Environment called atu_clean, which is a final copy of atu_cleaner.
– If atu_clean is swept from your Environment, all of the changes you made will NOT be saved.
– You would need to re-run the script to clean the data again.
 - 
To permanently save your changes you need to save the file as an R data file, or .Rda
 - 
Run the code below:
save(atu_clean, file = "atu_clean.Rda")
 - 
Look in your Files pane for the atu_clean.Rda file
– This is a permanent copy of your clean atu data.
– To load the data onto your Environment click on the file.
– A pop-up window confirming the upload will appear.
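Clicking the file is one way to bring the data back. The base R function load() does the same job from code. The sketch below shows the full round trip on a made-up data frame (toy_clean and path are hypothetical names), saved to a temporary folder so nothing is left behind.

```r
# Hypothetical small data frame standing in for a cleaned data set
toy_clean <- data.frame(id = 1:3, minutes = c(480, 510, 450))

# Save to a temporary file so this sketch leaves nothing behind
path <- file.path(tempdir(), "toy_clean.Rda")
save(toy_clean, file = path)

# Remove the object, as if the Environment had been cleared
rm(toy_clean)
exists("toy_clean")

# load() restores the object under its original name
load(path)
toy_clean
```

Note that load() brings the object back under the name it was saved with, so there is no need to assign the result to anything.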
 
Flex your skills
- 
Now that you have learned some cleaning data basics, it’s time to revisit the food data.
– Import your food data onto the Environment pane.
 - 
Run the code below:
histogram(~calories | healthy_level, data = food)
 - 
(6) Write and run code using the as.factor() function to convert healthy_level into a categorical variable and re-run the histogram function.
– Notice that the healthy_level categories are now numbers as opposed to tick-marks. This is an improvement but an even better solution would be to recode the categories.
 - 
(7) Write and run code to recode the healthy_level categories and re-run the histogram function.
– "1" = "Very Unhealthy"
– "2" = "Unhealthy"
– "3" = "Neutral"
– "4" = "Healthy"
– "5" = "Very Healthy"
 - 
If your food data is cleared from your Environment, the changes that you made to the healthy_level variable will not be saved.
 - 
To save your changes permanently, save your food file as an R data file.