The magrittr
package gives you the pipe operator %>%
which does one very simple thing. If you have an object x
and a function f()
, instead of writing f(x)
you can now write x %>% f()
:
library("magrittr")
x <- 1
f <- function(x, add=1)
{
x + add
}
f(x)
## [1] 2
x %>% f()
## [1] 2
In essence, it simply takes whatever is on the left of %>%
and uses it as the first argument of the function on the right hand side. If you only need one argument you can omit the parentheses:
x %>% f
## [1] 2
and multiple arguments to the function can be assigned as usual:
x %>% f(2)
## [1] 3
If you want to use the object on the left for a different argument than the first you can use the period .
as placeholder:
10 %>% f(1, add = .)
## [1] 11
Although in the packages used in this document all functions are programmed such that this is not needed.
Using the function %>%
is called piping, and a convenient way of pronouncing %>%
is simply as “then”. In the case above, I first took the value 10 and then added 1. While it seems like writing x %>% f()
is not at all easier than writing f(x)
, this style of coding becomes incredibly useful when sequentially applying multiple functions; in this case piping will allow one to think from left to right in the logical order of functions rather than from inside to outside in an ugly large nested statement of functions.
For example, suppose I want to first add 1, then add 2 and finally add 3 to my object x
I can write this in both ways:
f(f(f(x,1),2),3)
## [1] 7
x %>% f(1) %>% f(2) %>% f(3)
## [1] 7
The second line clearly shows what happens whereas the first line takes quite some reading to understand. Especially in database management where we have to apply a lot of functions sequentially to get the correct subset of data this can be a very useful tool.
The dplyr
package is derived from plyr
and aimed at the manipulation of data frames (note that currently it should never be used together with plyr
as they both use the a few identical function names). The package is optimized to be used in conjunction with the pipe operator %>%
and comes with some very fast and useful functions. In addition to data frame manipulation the package allows one to make a connection to a MySQL server (see MySQL in dplyr section).
To demonstrate some functionality I will simulate a toy dataset:
library("dplyr")
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(1)
DF <-
data.frame(
sex = sample(c("Male","Female"), 10, TRUE),
age = sample(20:60, 10, TRUE),
IQ = rnorm(10, 100, 15)
) %>% as.tbl
DF
## Source: local data frame [10 x 3]
##
## sex age IQ
## 1 Male 28 122.67672
## 2 Male 27 105.84765
## 3 Female 48 90.68139
## 4 Female 35 66.77950
## 5 Male 51 116.87396
## 6 Female 40 99.32600
## 7 Female 49 99.75715
## 8 Female 60 114.15754
## 9 Female 35 112.31832
## 10 Male 51 108.90852
In RStudio View(DF)
can be used to view the dataframe more user friendly. I converted the data.frame
to a tbl
class which is what dplyr
works with. tbl
is simple a data frame with a bit different printing options (only prints first 10 rows); all the dplyr
functions could be used on a data.frame
as well.
The filter()
function can be used to select or filter out rows. This works almost exactly the same as the subset()
argument in base R subset()
except that it is faster and that different logical statements can be combined using ,
instead of &
:
# Select only males:
DF %>% filter(sex == "Male")
## Source: local data frame [4 x 3]
##
## sex age IQ
## 1 Male 28 122.6767
## 2 Male 27 105.8476
## 3 Male 51 116.8740
## 4 Male 51 108.9085
# Select the youngest person:
DF %>% filter(age == min(age))
## Source: local data frame [1 x 3]
##
## sex age IQ
## 1 Male 27 105.8476
# Select only males older than 27:
DF %>% filter(sex == "Male", age > 27)
## Source: local data frame [3 x 3]
##
## sex age IQ
## 1 Male 28 122.6767
## 2 Male 51 116.8740
## 3 Male 51 108.9085
The select()
function is the dplyr
version of the select argument in base R subset()
with a few extra tricks. It can be used to select or filter out columns. The names of the variables can be used in combination with the -
operator to drop columns or :
to select a range of columns:
# Select IQ:
DF %>% select(IQ)
## Source: local data frame [10 x 1]
##
## IQ
## 1 122.67672
## 2 105.84765
## 3 90.68139
## 4 66.77950
## 5 116.87396
## 6 99.32600
## 7 99.75715
## 8 114.15754
## 9 112.31832
## 10 108.90852
# Drop age:
DF %>% select(-age)
## Source: local data frame [10 x 2]
##
## sex IQ
## 1 Male 122.67672
## 2 Male 105.84765
## 3 Female 90.68139
## 4 Female 66.77950
## 5 Male 116.87396
## 6 Female 99.32600
## 7 Female 99.75715
## 8 Female 114.15754
## 9 Female 112.31832
## 10 Male 108.90852
# Select sex to age:
DF %>% select(sex:age)
## Source: local data frame [10 x 2]
##
## sex age
## 1 Male 28
## 2 Male 27
## 3 Female 48
## 4 Female 35
## 5 Male 51
## 6 Female 40
## 7 Female 49
## 8 Female 60
## 9 Female 35
## 10 Male 51
Of course, filter()
and select()
can be used sequentially:
# Select IQ of all males:
DF %>% filter(sex == "male") %>% select(IQ)
## Source: local data frame [0 x 1]
The arrange()
function can be used to order data frames by columns, similar to DF[order(DF$sex),]
but much easier to write. For example, to order the data frame first by gender and then by age:
DF %>% arrange(sex, age)
## Source: local data frame [10 x 3]
##
## sex age IQ
## 1 Female 35 66.77950
## 2 Female 35 112.31832
## 3 Female 40 99.32600
## 4 Female 48 90.68139
## 5 Female 49 99.75715
## 6 Female 60 114.15754
## 7 Male 27 105.84765
## 8 Male 28 122.67672
## 9 Male 51 116.87396
## 10 Male 51 108.90852
For other uses of dplyr
it is important to first group the data, which we can do using group_by
. Grouping the data will cause the manipulation functions of dplyr
to not apply to the whole dataset but to different sections of the dataset. For example, to group by gender:
DF %>% group_by(sex)
## Source: local data frame [10 x 3]
## Groups: sex
##
## sex age IQ
## 1 Male 28 122.67672
## 2 Male 27 105.84765
## 3 Female 48 90.68139
## 4 Female 35 66.77950
## 5 Male 51 116.87396
## 6 Female 40 99.32600
## 7 Female 49 99.75715
## 8 Female 60 114.15754
## 9 Female 35 112.31832
## 10 Male 51 108.90852
Nothing really happened here except that the data frame is now grouped. Now, functions will be applied separately to males and females. For example:
# Select youngest person:
DF %>% filter(age == min(age))
## Source: local data frame [1 x 3]
##
## sex age IQ
## 1 Male 27 105.8476
# Select youngest male and female(s):
DF %>% group_by(sex) %>% filter(age == min(age))
## Source: local data frame [3 x 3]
## Groups: sex
##
## sex age IQ
## 1 Male 27 105.8476
## 2 Female 35 66.7795
## 3 Female 35 112.3183
Multiple arguments to group_by()
indicating multiple grouping variables, and ungroup()
can be used to remove grouping from a data frame.
The summarise()
function can be used to create a single row for each group specified with group_by
. Each argument of summarise()
must be named and indicates a column in the summarized data frame, in which the names of existing columns can be used verbatim to compute something. For example:
# Median age for males and females:
DF %>% group_by(sex) %>% summarise(medianAge = median(age))
## Source: local data frame [2 x 2]
##
## sex medianAge
## 1 Female 44.0
## 2 Male 39.5
# Median age and mean IQ:
DF %>% group_by(sex) %>%
summarise(
medianAge = median(age),
meanIQ = mean(IQ)
)
## Source: local data frame [2 x 3]
##
## sex medianAge meanIQ
## 1 Female 44.0 97.16998
## 2 Male 39.5 113.57671
The result is again a data frame which could further be piped and used. A useful function to use in summarise()
is n()
(without any arguments), which will simply return the number of rows:
DF %>% group_by(sex, age) %>% summarise(Amount = n())
## Source: local data frame [8 x 3]
## Groups: sex
##
## sex age Amount
## 1 Female 35 2
## 2 Female 40 1
## 3 Female 48 1
## 4 Female 49 1
## 5 Female 60 1
## 6 Male 27 1
## 7 Male 28 1
## 8 Male 51 2
The mutate()
function works similar to the summarise()
function but does not reduce the data for each variable to a single row. Rather, it can be used to change or add columns to the data using existing columns. For example, if for whatever reason we want to add an age times IQ variable we can do so as follows.
DF %>% mutate(agexIQ = age * IQ)
## Source: local data frame [10 x 4]
##
## sex age IQ agexIQ
## 1 Male 28 122.67672 3434.948
## 2 Male 27 105.84765 2857.887
## 3 Female 48 90.68139 4352.707
## 4 Female 35 66.77950 2337.283
## 5 Male 51 116.87396 5960.572
## 6 Female 40 99.32600 3973.040
## 7 Female 49 99.75715 4888.100
## 8 Female 60 114.15754 6849.453
## 9 Female 35 112.31832 3931.141
## 10 Male 51 108.90852 5554.335
We can again use grouping here to compute things per group:
# Add median age of gender as variable:
DF %>% group_by(sex) %>% mutate(medianAge = median(age))
## Source: local data frame [10 x 4]
## Groups: sex
##
## sex age IQ medianAge
## 1 Male 28 122.67672 39.5
## 2 Male 27 105.84765 39.5
## 3 Female 48 90.68139 44.0
## 4 Female 35 66.77950 44.0
## 5 Male 51 116.87396 39.5
## 6 Female 40 99.32600 44.0
## 7 Female 49 99.75715 44.0
## 8 Female 60 114.15754 44.0
## 9 Female 35 112.31832 44.0
## 10 Male 51 108.90852 39.5
Often data comes in different data frames. For example, consider the following three data frames:
set.seed(1)
log <- data.frame(
user_id = sample(c(1,2,3), 10, TRUE),
item_id = sample(c(1,2,3), 10, TRUE),
correct = sample(c(0,1), 10, TRUE)
) %>% as.tbl
users <- data.frame(
user_id = c(1,2,4),
age = c(20,20,30)
) %>% as.tbl
items <- data.frame(
item_id = 1:3,
item = c("1+1","2*2","3/3")
) %>% as.tbl
This is an example of data as we have in the Mathgarden. We have a data frame called log
which contains each response of users on items: `` {r} log ``` for both the users and the items we have separate data frames containing more information on them, such as the age and the item question:
users
## Source: local data frame [3 x 2]
##
## user_id age
## 1 1 20
## 2 2 20
## 3 4 30
items
## Source: local data frame [3 x 2]
##
## item_id item
## 1 1 1+1
## 2 2 2*2
## 3 3 3/3
In the database these data frames (often called tables) are stored separately to safe memory, but we might be interested in using information from multiple tables for some inference. To do this we need to join tables. This can be done between any two data frames that share at least one column with the same name
The left_join()
function can be used to augment the data frame with all information from the second data frame; this is by far the most often used version of joining. For example, we can augment the log
dataframe with information about age and items.
log %>% left_join(users, "user_id") %>% left_join(items, "item_id")
## Source: local data frame [10 x 5]
##
## item_id user_id correct age item
## 1 1 1 1 20 1+1
## 2 1 2 0 20 1+1
## 3 3 2 1 20 3/3
## 4 2 3 0 NA 2*2
## 5 3 1 0 20 3/3
## 6 2 3 0 NA 2*2
## 7 3 3 0 NA 3/3
## 8 3 2 0 20 3/3
## 9 2 2 1 20 2*2
## 10 3 1 0 20 3/3
If we had omitted the second argument in both calls left_join()
would still have detected that these columns had equal names. However, this should be avoided in the Mathgarden as the Mathgarden tables typically contain column names with the same name that are not equal (for example, every table has an id
column). More on this later.
Note that above the age of user 3 is NA
because we are missing that information in the users
data frame. If we would rather have removed these rows entirely we could have used inner_join()
instead:
log %>% inner_join(users, "user_id")
## Source: local data frame [7 x 4]
##
## user_id item_id correct age
## 1 1 1 1 20
## 2 2 1 0 20
## 3 2 3 1 20
## 4 1 3 0 20
## 5 2 3 0 20
## 6 2 2 1 20
## 7 1 3 0 20
inner_join()
still augmented the log
data frame with the additional column age
, but also removed rows from log
with user id’s not present in the users
data frame.
If we are not interested in augmenting the data frame with information from the second data frame but are only interested in subsetting the data frame we can use semi_join()
and anti_join()
:
log %>% semi_join(users, "user_id")
## Source: local data frame [7 x 3]
##
## user_id item_id correct
## 1 1 1 1
## 2 1 3 0
## 3 1 3 0
## 4 2 1 0
## 5 2 3 1
## 6 2 3 0
## 7 2 2 1
log %>% anti_join(users, "user_id")
## Source: local data frame [3 x 3]
##
## user_id item_id correct
## 1 3 2 0
## 2 3 2 0
## 3 3 3 0
Here, semi_join()
only returned the part of log
for which we had a user ID present in users
, and anti_join()
only returned the part of log
for which there was no user ID in users
.
In the Math garden the data looks a bit different. Every table (data frame) has an id
column, and if one table refers to the column of another table the column name tablename_columnname
is used (if the tablename is plural the singular version is used). For example, if a log table has a column referring to the id
column in the items
table the corresponding column name in the log table is item_id
. Therefore, tables in the Mathgarden do not have equal names. For example:
set.seed(1)
log <- data.frame(
id = 1:10,
user_id = sample(c(1,2,3), 10, TRUE),
item_id = sample(c(1,2,3), 10, TRUE),
correct = sample(c(0,1), 10, TRUE)
) %>% as.tbl
users <- data.frame(
id = c(1,2,4),
age = c(20,20,30)
) %>% as.tbl
items <- data.frame(
id = 1:3,
item = c("1+1","2*2","3/3")
) %>% as.tbl
In the latest version, dplyr
directly supports joining tables using columns of different names:
log %>%
left_join(users, c("user_id" = "id")) %>%
left_join(items, c("item_id" = "id"))
## Source: local data frame [10 x 6]
##
## item_id user_id id correct age item
## 1 1 1 1 1 20 1+1
## 2 1 2 2 0 20 1+1
## 3 3 2 3 1 20 3/3
## 4 2 3 4 0 NA 2*2
## 5 3 1 5 0 20 3/3
## 6 2 3 6 0 NA 2*2
## 7 3 3 7 0 NA 3/3
## 8 3 2 8 0 20 3/3
## 9 2 2 9 1 20 2*2
## 10 3 1 10 0 20 3/3