Piping using the magrittr package
Data management using the dplyr package
- Data manipulation functions
  - filter
  - select
  - arrange
  - group_by
  - summarise
  - mutate
- Joining data frames

Piping using the magrittr package

The magrittr package gives you the pipe operator %>% which does one very simple thing. If you have an object x and a function f(), instead of writing f(x) you can now write x %>% f():

library("magrittr")
x <- 1
f <- function(x, add=1)
{
  x + add  
}
f(x)

## [1] 2

x %>% f()

## [1] 2

In essence, it simply takes whatever is on the left of %>% and uses it as the first argument of the function on the right hand side. If you only need one argument you can omit the parentheses:

x %>% f

## [1] 2

and multiple arguments to the function can be assigned as usual:

x %>% f(2)

## [1] 3

If you want to use the object on the left for a different argument than the first you can use the period . as placeholder:

10 %>% f(1, add = .)

## [1] 11

Although in the packages used in this document all functions are programmed such that this is not needed.

Using the function %>% is called piping, and a convenient way of pronouncing %>% is simply as “then”. In the case above, I first took the value 10 and then added 1. While it seems like writing x %>% f() is not at all easier than writing f(x), this style of coding becomes incredibly useful when sequentially applying multiple functions; in this case piping will allow one to think from left to right in the logical order of functions rather than from inside to outside in an ugly large nested statement of functions.

For example, suppose I want to first add 1, then add 2 and finally add 3 to my object x I can write this in both ways:

f(f(f(x,1),2),3)

## [1] 7

x %>% f(1) %>% f(2) %>% f(3)

## [1] 7

The second line clearly shows what happens whereas the first line takes quite some reading to understand. Especially in database management where we have to apply a lot of functions sequentially to get the correct subset of data this can be a very useful tool.

Data management using the dplyr package

The dplyr package is derived from plyr and aimed at the manipulation of data frames (note that currently it should never be used together with plyr as they both use the a few identical function names). The package is optimized to be used in conjunction with the pipe operator %>% and comes with some very fast and useful functions. In addition to data frame manipulation the package allows one to make a connection to a MySQL server (see MySQL in dplyr section).

To demonstrate some functionality I will simulate a toy dataset:

library("dplyr")

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

set.seed(1)
DF <- 
  data.frame(
    sex = sample(c("Male","Female"), 10, TRUE),
    age = sample(20:60, 10, TRUE),
    IQ = rnorm(10, 100, 15)
  ) %>% as.tbl

DF

## Source: local data frame [10 x 3]
## 
##       sex age        IQ
## 1    Male  28 122.67672
## 2    Male  27 105.84765
## 3  Female  48  90.68139
## 4  Female  35  66.77950
## 5    Male  51 116.87396
## 6  Female  40  99.32600
## 7  Female  49  99.75715
## 8  Female  60 114.15754
## 9  Female  35 112.31832
## 10   Male  51 108.90852

In RStudio View(DF) can be used to view the dataframe more user friendly. I converted the data.frame to a tbl class which is what dplyr works with. tbl is simple a data frame with a bit different printing options (only prints first 10 rows); all the dplyr functions could be used on a data.frame as well.

Data manipulation functions

filter

The filter() function can be used to select or filter out rows. This works almost exactly the same as the subset() argument in base R subset() except that it is faster and that different logical statements can be combined using , instead of &:

# Select only males:
DF %>% filter(sex == "Male")

## Source: local data frame [4 x 3]
## 
##    sex age       IQ
## 1 Male  28 122.6767
## 2 Male  27 105.8476
## 3 Male  51 116.8740
## 4 Male  51 108.9085

# Select the youngest person:
DF %>% filter(age == min(age))

## Source: local data frame [1 x 3]
## 
##    sex age       IQ
## 1 Male  27 105.8476

# Select only males older than 27:
DF %>% filter(sex == "Male", age > 27)

## Source: local data frame [3 x 3]
## 
##    sex age       IQ
## 1 Male  28 122.6767
## 2 Male  51 116.8740
## 3 Male  51 108.9085

select

The select() function is the dplyr version of the select argument in base R subset() with a few extra tricks. It can be used to select or filter out columns. The names of the variables can be used in combination with the - operator to drop columns or : to select a range of columns:

# Select IQ:
DF %>% select(IQ)

## Source: local data frame [10 x 1]
## 
##           IQ
## 1  122.67672
## 2  105.84765
## 3   90.68139
## 4   66.77950
## 5  116.87396
## 6   99.32600
## 7   99.75715
## 8  114.15754
## 9  112.31832
## 10 108.90852

# Drop age:
DF %>% select(-age)

## Source: local data frame [10 x 2]
## 
##       sex        IQ
## 1    Male 122.67672
## 2    Male 105.84765
## 3  Female  90.68139
## 4  Female  66.77950
## 5    Male 116.87396
## 6  Female  99.32600
## 7  Female  99.75715
## 8  Female 114.15754
## 9  Female 112.31832
## 10   Male 108.90852

# Select sex to age:
DF %>% select(sex:age)

## Source: local data frame [10 x 2]
## 
##       sex age
## 1    Male  28
## 2    Male  27
## 3  Female  48
## 4  Female  35
## 5    Male  51
## 6  Female  40
## 7  Female  49
## 8  Female  60
## 9  Female  35
## 10   Male  51

Of course, filter() and select() can be used sequentially:

# Select IQ of all males:
DF %>% filter(sex == "male") %>% select(IQ)

## Source: local data frame [0 x 1]

arrange

The arrange() function can be used to order data frames by columns, similar to DF[order(DF$sex),] but much easier to write. For example, to order the data frame first by gender and then by age:

DF %>% arrange(sex, age)

## Source: local data frame [10 x 3]
## 
##       sex age        IQ
## 1  Female  35  66.77950
## 2  Female  35 112.31832
## 3  Female  40  99.32600
## 4  Female  48  90.68139
## 5  Female  49  99.75715
## 6  Female  60 114.15754
## 7    Male  27 105.84765
## 8    Male  28 122.67672
## 9    Male  51 116.87396
## 10   Male  51 108.90852

group_by

For other uses of dplyr it is important to first group the data, which we can do using group_by. Grouping the data will cause the manipulation functions of dplyr to not apply to the whole dataset but to different sections of the dataset. For example, to group by gender:

DF %>% group_by(sex)

## Source: local data frame [10 x 3]
## Groups: sex
## 
##       sex age        IQ
## 1    Male  28 122.67672
## 2    Male  27 105.84765
## 3  Female  48  90.68139
## 4  Female  35  66.77950
## 5    Male  51 116.87396
## 6  Female  40  99.32600
## 7  Female  49  99.75715
## 8  Female  60 114.15754
## 9  Female  35 112.31832
## 10   Male  51 108.90852

Nothing really happened here except that the data frame is now grouped. Now, functions will be applied separately to males and females. For example:

# Select youngest person:
DF %>% filter(age == min(age))

## Source: local data frame [1 x 3]
## 
##    sex age       IQ
## 1 Male  27 105.8476

# Select youngest male and female(s):
DF %>% group_by(sex) %>% filter(age == min(age))

## Source: local data frame [3 x 3]
## Groups: sex
## 
##      sex age       IQ
## 1   Male  27 105.8476
## 2 Female  35  66.7795
## 3 Female  35 112.3183

Multiple arguments to group_by() indicating multiple grouping variables, and ungroup() can be used to remove grouping from a data frame.

summarise

The summarise() function can be used to create a single row for each group specified with group_by. Each argument of summarise() must be named and indicates a column in the summarized data frame, in which the names of existing columns can be used verbatim to compute something. For example:

# Median age for males and females:
DF %>% group_by(sex) %>% summarise(medianAge = median(age))

## Source: local data frame [2 x 2]
## 
##      sex medianAge
## 1 Female      44.0
## 2   Male      39.5

# Median age and mean IQ:
DF %>% group_by(sex) %>% 
  summarise(
    medianAge = median(age), 
    meanIQ = mean(IQ)
  )

## Source: local data frame [2 x 3]
## 
##      sex medianAge    meanIQ
## 1 Female      44.0  97.16998
## 2   Male      39.5 113.57671

The result is again a data frame which could further be piped and used. A useful function to use in summarise() is n() (without any arguments), which will simply return the number of rows:

DF %>% group_by(sex, age) %>% summarise(Amount = n())

## Source: local data frame [8 x 3]
## Groups: sex
## 
##      sex age Amount
## 1 Female  35      2
## 2 Female  40      1
## 3 Female  48      1
## 4 Female  49      1
## 5 Female  60      1
## 6   Male  27      1
## 7   Male  28      1
## 8   Male  51      2

mutate

The mutate() function works similar to the summarise() function but does not reduce the data for each variable to a single row. Rather, it can be used to change or add columns to the data using existing columns. For example, if for whatever reason we want to add an age times IQ variable we can do so as follows.

DF %>% mutate(agexIQ = age * IQ)

## Source: local data frame [10 x 4]
## 
##       sex age        IQ   agexIQ
## 1    Male  28 122.67672 3434.948
## 2    Male  27 105.84765 2857.887
## 3  Female  48  90.68139 4352.707
## 4  Female  35  66.77950 2337.283
## 5    Male  51 116.87396 5960.572
## 6  Female  40  99.32600 3973.040
## 7  Female  49  99.75715 4888.100
## 8  Female  60 114.15754 6849.453
## 9  Female  35 112.31832 3931.141
## 10   Male  51 108.90852 5554.335

We can again use grouping here to compute things per group:

# Add median age of gender as variable:
DF %>% group_by(sex) %>% mutate(medianAge = median(age))

## Source: local data frame [10 x 4]
## Groups: sex
## 
##       sex age        IQ medianAge
## 1    Male  28 122.67672      39.5
## 2    Male  27 105.84765      39.5
## 3  Female  48  90.68139      44.0
## 4  Female  35  66.77950      44.0
## 5    Male  51 116.87396      39.5
## 6  Female  40  99.32600      44.0
## 7  Female  49  99.75715      44.0
## 8  Female  60 114.15754      44.0
## 9  Female  35 112.31832      44.0
## 10   Male  51 108.90852      39.5

Joining data frames

Often data comes in different data frames. For example, consider the following three data frames:

set.seed(1)
log <- data.frame(
  user_id = sample(c(1,2,3), 10, TRUE),
  item_id = sample(c(1,2,3), 10, TRUE),
  correct = sample(c(0,1), 10, TRUE)
  ) %>% as.tbl

users <- data.frame(
  user_id = c(1,2,4),
  age = c(20,20,30)
    ) %>% as.tbl

items <- data.frame(
  item_id = 1:3,
  item = c("1+1","2*2","3/3")
  ) %>% as.tbl

This is an example of data as we have in the Mathgarden. We have a data frame called log which contains each response of users on items: `` {r} log ``` for both the users and the items we have separate data frames containing more information on them, such as the age and the item question:

users

## Source: local data frame [3 x 2]
## 
##   user_id age
## 1       1  20
## 2       2  20
## 3       4  30

items

## Source: local data frame [3 x 2]
## 
##   item_id item
## 1       1  1+1
## 2       2  2*2
## 3       3  3/3

In the database these data frames (often called tables) are stored separately to safe memory, but we might be interested in using information from multiple tables for some inference. To do this we need to join tables. This can be done between any two data frames that share at least one column with the same name

left_join

The left_join() function can be used to augment the data frame with all information from the second data frame; this is by far the most often used version of joining. For example, we can augment the log dataframe with information about age and items.

log %>% left_join(users, "user_id") %>% left_join(items, "item_id")

## Source: local data frame [10 x 5]
## 
##    item_id user_id correct age item
## 1        1       1       1  20  1+1
## 2        1       2       0  20  1+1
## 3        3       2       1  20  3/3
## 4        2       3       0  NA  2*2
## 5        3       1       0  20  3/3
## 6        2       3       0  NA  2*2
## 7        3       3       0  NA  3/3
## 8        3       2       0  20  3/3
## 9        2       2       1  20  2*2
## 10       3       1       0  20  3/3

If we had omitted the second argument in both calls left_join() would still have detected that these columns had equal names. However, this should be avoided in the Mathgarden as the Mathgarden tables typically contain column names with the same name that are not equal (for example, every table has an id column). More on this later.

inner_join

Note that above the age of user 3 is NA because we are missing that information in the users data frame. If we would rather have removed these rows entirely we could have used inner_join() instead:

log %>% inner_join(users, "user_id")

## Source: local data frame [7 x 4]
## 
##   user_id item_id correct age
## 1       1       1       1  20
## 2       2       1       0  20
## 3       2       3       1  20
## 4       1       3       0  20
## 5       2       3       0  20
## 6       2       2       1  20
## 7       1       3       0  20

inner_join() still augmented the log data frame with the additional column age, but also removed rows from log with user id’s not present in the users data frame.

semi_join and anti_join

If we are not interested in augmenting the data frame with information from the second data frame but are only interested in subsetting the data frame we can use semi_join() and anti_join():

log %>% semi_join(users, "user_id")

## Source: local data frame [7 x 3]
## 
##   user_id item_id correct
## 1       1       1       1
## 2       1       3       0
## 3       1       3       0
## 4       2       1       0
## 5       2       3       1
## 6       2       3       0
## 7       2       2       1

log %>% anti_join(users, "user_id")

## Source: local data frame [3 x 3]
## 
##   user_id item_id correct
## 1       3       2       0
## 2       3       2       0
## 3       3       3       0

Here, semi_join() only returned the part of log for which we had a user ID present in users, and anti_join() only returned the part of log for which there was no user ID in users.

Joining data frames with different names

In the Math garden the data looks a bit different. Every table (data frame) has an id column, and if one table refers to the column of another table the column name tablename_columnname is used (if the tablename is plural the singular version is used). For example, if a log table has a column referring to the id column in the items table the corresponding column name in the log table is item_id. Therefore, tables in the Mathgarden do not have equal names. For example:

set.seed(1)
log <- data.frame(
  id = 1:10,
  user_id = sample(c(1,2,3), 10, TRUE),
  item_id = sample(c(1,2,3), 10, TRUE),
  correct = sample(c(0,1), 10, TRUE)
  ) %>% as.tbl

users <- data.frame(
  id = c(1,2,4),
  age = c(20,20,30)
    ) %>% as.tbl

items <- data.frame(
  id = 1:3,
  item = c("1+1","2*2","3/3")
  ) %>% as.tbl

In the latest version, dplyr directly supports joining tables using columns of different names:

log %>% 
  left_join(users, c("user_id" = "id")) %>% 
  left_join(items, c("item_id" = "id"))

## Source: local data frame [10 x 6]
## 
##    item_id user_id id correct age item
## 1        1       1  1       1  20  1+1
## 2        1       2  2       0  20  1+1
## 3        3       2  3       1  20  3/3
## 4        2       3  4       0  NA  2*2
## 5        3       1  5       0  20  3/3
## 6        2       3  6       0  NA  2*2
## 7        3       3  7       0  NA  3/3
## 8        3       2  8       0  20  3/3
## 9        2       2  9       1  20  2*2
## 10       3       1 10       0  20  3/3

dplyr tutorial

Sacha Epskamp

07/07/2014