by Joseph KliegmanJan 24 2019
Chief Scientist, Nextjournal

Data Cleaning in R

Following is a process for data cleaning, standard tools that are used to explore and discover characteristics that need to be cleaned and tools that are used to transform data into that cleaned state.

The dataset used in this demo came from this excellent course in data cleaning hosted on Datacamp

weather.rds
library(tidyverse)
weather <- readRDS(weather.rds)

Get a feel for the data

Before diving in, figure out the basic structure of the data. Look at things like the class() of the data object to make sure it is what we expect (usually a data.frame). Check the dimensions with dim() and the column names with names()

# Verify that weather is a data.frame
class(weather)
# Check the dimensions of the data set
dim(weather)
# View the column names because these will be useful handles
names(weather)

Summarize the data

Look at summaries of the data. Functions str(), glimpse(), and summary() each give insight into the data.

# View the structure of the data
str(weather)
# Look at the structure using dplyr's glimpse()
glimpse(weather)
# View a summary of the data
summary(weather)

Take a closer look

Functions head() and tail() allow you to view the top and bottom rows of the data.

# View first 10 rows
head(weather)
# View the last 10 rows
tail(weather)

Column names are values

This weather dataset suffers from one of the five most common symptoms of messy data: column names that are values. Column names X1-X31 represent days of the month. They should be values of a variable called day.

The tidyr package provides the gather() function for this scenario.

gather(df, time, val, t1:t3)

gather() allows you to select multiple columns to be gathered by using the : operator.

# Gather the columns
weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)

Values are variable names

Our data suffer from a second common symptom of messy data: values are variable names. Specifically, values in the measure column should be variables (i.e. column names) in our dataset.

The spread() function from tidyr is designed to help with this. To remind you of how this function works, we've loaded another small dataset called df2 (which is the result of applying gather() to the original df from last exercise). Give the following a try before attempting the instructions below.

spread(df2, time, val)

Note how the values of the time column now become column names. The tidyr package is already loaded.

# First remove column of row names
weather2 <- weather2[, -1]
# Spread the data
weather3 <- spread(weather2, measure, value)

Clean up dates

# Load these packages
library(tidyr)
library(dplyr)
library(stringr)
library(lubridate)
# Remove X's from day column
weather3$day <- str_replace(weather3$day, "X", "")
# Unite the year, month, and day columns
weather4 <- unite(weather3, date, year, month, day, sep = "-")
# Convert date column to proper date format using lubridates's ymd()
weather4$date <- ymd(weather4$date)
# Rearrange columns using dplyr's select()
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)

A closer look at column types

It's important for analysis that variables are coded appropriately. This is not yet the case with our weather data. Recall that functions such as as.numeric() and as.character() can be used to coerce variables into different types.

It's important to keep in mind that coercions are not always successful, particularly if there's some data in a column that you don't expect. For example, the following will cause problems:

as.numeric(c(4, 6.44, "some string", 222))

If you run the code above in the console, you'll get a warning message saying that R introduced an NA in the process of coercing to numeric. This is because it doesn't know how to make a number out of a string ("some string"). Watch out for this in our weather data!

# View the structure of weather5
str(weather5)
# Examine the first rows of weather5. Are most of the characters numeric?
head(weather5)
# See what happens if we try to convert PrecipitationIn to numeric
as.numeric(weather5$PrecipitationIn)

Column type conversion

As you saw in the last exercise, "T" was used to denote a trace amount (i.e. too small to be accurately measured) of precipitation in the PrecipitationIn column. In order to coerce this column to numeric, you'll need to deal with this somehow. To keep things simple, we will just replace "T" with zero, as a string ("0").

## The dplyr and stringr packages are already loaded

# Replace T with 0 (T = trace)
weather5$PrecipitationIn <- str_replace(weather5$PrecipitationIn, "T", "0")
# Convert characters to numerics
weather6 <- mutate_each(weather5, funs(as.numeric), CloudCover:WindDirDegrees)
# Look at result
str(weather6)

Finding missing values

Missing values in R should be represented by NA, but unclean datasets will have other values representing NA.

If missing values are properly coded as NA, the is.na() function will help you find them. You may need to try searching for some of the usual suspects like "", "#N/A", etc. You can also use the summary() and table() functions to turn up unexpected values in your data.

# Count missing values
sum(is.na(weather6))
# Find missing values
summary(is.na(weather6))
# Find indices of NAs in Max.Gust.SpeedMPH
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))
# Look at the full rows for records missing Max.Gust.SpeedMPH
weather6[ind, ]
summary(weather6)

Deal with obvious errors

Besides missing values, we want to know if there are values in the data that are too extreme or bizarre to be plausible. A great way to start the search for these values is with summary().

Once implausible values are identified, they must be dealt with in an intelligent and informed way. Sometimes the best way forward is obvious and other times it may require some research and/or discussions with the original collectors of the data.

# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)
# Look at the data for that day
weather6[ind, ]
# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100

More obvious errors

# Look at summary of Mean.VisibilityMiles
summary(weather6$Mean.VisibilityMiles)
# Get index of row with -1 value
ind <- which(weather6$Mean.VisibilityMiles == -1)
# Look at full row
weather6[ind,]
# Set Mean.VisibilityMiles to the appropriate value
weather6$Mean.VisibilityMiles[ind] <- 10

Check for other extreme values

In addition to dealing with obvious errors in the data, we want to see if there are other extreme values. In addition to the trusty summary() function, hist() is useful for quickly getting a feel for how different variables are distributed.

# Review summary of full data once more
summary(weather6)
# Look at histogram for MeanDew.PointF
hist(weather6$MeanDew.PointF)
# Look at histogram for Min.TemperatureF
hist(weather6$Min.TemperatureF)
# Compare to histogram for Mean.TemperatureF
hist(weather6$Mean.TemperatureF)

Finishing touches

Before officially calling our weather data clean, we want to put a couple of finishing touches on the data. These are a bit more subjective and may not be necessary for analysis, but they will make the data easier for others to interpret, which is generally a good thing.

There are a number of stylistic conventions in the R language. Depending on who you ask, these conventions may vary. Because the period (.) has special meaning in certain situations, we generally recommend using underscores (_) to separate words in variable names. We also prefer all lowercase letters so that no one has to remember which letters are uppercase or lowercase.

Finally, the events column (renamed to be all lowercase in the first instruction) contains an empty string ("") for any day on which there was no significant weather event such as rain, fog, a thunderstorm, etc. However, if it's the first time you're seeing these data, it may not be obvious that this is the case, so it's best for us to be explicit and replace the empty strings with something more meaningful.

names(weather6)
new_colnames <- c("date", "events"
, "cloud_cover"               ,"max_dew_point_f"           
,"max_gust_speed_mph"         ,"max_humidity"              
,"max_sea_level_pressure_in"  ,"max_temperature_f"         
,"max_visibility_miles"       ,"max_wind_speed_mph"        
,"mean_humidity"              ,"mean_sea_level_pressure_in"
,"mean_temperature_f"         ,"mean_visibility_miles"     
,"mean_wind_speed_mph"        ,"mean_dew_point_f"          
,"min_dew_point_f"            ,"min_humidity"              
,"min_sea_level_pressure_in"  ,"min_temperature_f"         
,"min_visibility_miles"       ,"precipitation_in"          
,"wind_dir_degrees")
# Clean up column names
names(weather6) <- new_colnames
# Replace empty cells in events column
weather6$events[weather6$events == ""] <- "None"
# Print the first rows of weather6
head(weather6)