Copyright: Materials used in these lessons were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0). Thomas Wright and Naupaka Zimmerman (eds): "Software Carpentry: R for Reproducible Scientific Analysis." Version 2016.06, June 2016, https://github.com/swcarpentry/r-novice-gapminder, 10.5281/zenodo.57520.

Before Starting The Workshop

Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.

  • Download and install the latest version of R here
  • Download and install RStudio here

Introduction to RStudio

Welcome to the R portion of the Software Carpentry workshop.

Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.

We’ll be using RStudio: a free, open source R integrated development environment. It provides a built in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.

Basic layout

When you first open RStudio, you will be greeted by three panels:

RStudio layout

Once you open files, such as R scripts, an editor panel will also open in the top left.

RStudio layout with .R file open

Work flow within RStudio

There are two main ways one can work within RStudio.

  1. Test and play within the interactive R console, then copy code into a .R file to run later.
  2. Start writing in an .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can:

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or Cmd+Return on OS X. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then click Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

Introduction to R

Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the shell lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.

Using R as a calculator

The simplest thing you could do with R is do arithmetic:

1 + 100
## [1] 101

And R will print out the answer, with a preceding “[1]”. Don’t worry about this for now, we’ll explain that later. For now think of it as indicating output.

Like bash, if you type in an incomplete command, R will wait for you to complete it:

> 1 +
+

Any time you hit return and the R session shows a “+” instead of a “>”, it means it’s waiting for you to complete the command. If you want to cancel a command you can simply hit “Esc” and RStudio will give you back the “>” prompt.

Tip: Cancelling commands

If you’re using R from the command line instead of from within RStudio, you need to use Ctrl+C instead of Esc to cancel the command. This applies to Mac users as well!

Cancelling a command isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.

When using R as a calculator, the order of operations is the same as you would have learned back in school.

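From highest to lowest precedence:

  • Parentheses: ( )
  • Exponents: ^ or **
  • Multiplication and division: * and /
  • Addition and subtraction: + and -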

3 + 5 * 2
## [1] 13

Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or to make clear what you intend.

(3 + 5) * 2
## [1] 16

This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read your code.

(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that follows after the hash (or octothorpe) symbol # is ignored by R when it executes code.

Really small or large numbers get a scientific notation:

2/10000
## [1] 2e-04

Here the e is shorthand for “multiplied by 10^XX”. So 2e-4 is shorthand for 2 * 10^(-4).

You can write numbers in scientific notation too:

5e3  # Note the lack of minus here
## [1] 5000

Mathematical functions

R has many built in mathematical functions. To call a function, we simply type its name, followed by open and closing parentheses. Anything we type inside the parentheses is called the function’s arguments:

sin(1)  # trigonometry functions
## [1] 0.841471
log(1)  # natural logarithm
## [1] 0
log10(10) # base-10 logarithm
## [1] 1
exp(0.5) # e^(1/2)
## [1] 1.648721

Don’t worry about trying to remember every function in R. You can simply look them up on Google, or if you can remember the start of the function’s name, use the tab completion in RStudio.

This is one advantage that RStudio has over R on its own: it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

Typing a ? before the name of a command will open the help page for that command. As well as providing a detailed description of the command and how it works, scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage. We’ll go through an example later.

Comparing things

We can also do comparison in R:

1 == 1  # equality (note two equals signs, read as "is equal to")
## [1] TRUE
1 != 2  # inequality (read as "is not equal to")
## [1] TRUE
1 < 2  # less than
## [1] TRUE
1 <= 1  # less than or equal to
## [1] TRUE
1 > 0  # greater than
## [1] TRUE
1 >= -9 # greater than or equal to
## [1] TRUE

Tip: Comparing Numbers

A word of warning about comparing numbers: you should never use == to compare two numbers unless they are integers (a data type which can specifically represent only whole numbers).

Computers may only represent decimal numbers with a certain degree of precision, so two numbers which look the same when printed out by R, may actually have different underlying representations and therefore be different by a small margin of error (called Machine numeric tolerance).

Instead you should use the all.equal function.

Further reading: http://floating-point-gui.de/
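
As a small illustration of the warning above (a sketch only; the names a and b are made up for this example), adding 0.1 and 0.2 does not give exactly 0.3 in floating point arithmetic, so == reports a difference while all.equal does not:

```r
a <- 0.1 + 0.2
b <- 0.3

a == b                    # FALSE: the two values differ by a tiny rounding error
all.equal(a, b)           # TRUE: equal within the default numeric tolerance
isTRUE(all.equal(a, b))   # a single TRUE/FALSE, safe to use inside if()
```

When the values really do differ, all.equal returns a description of the difference rather than FALSE, which is why wrapping it in isTRUE() is the usual pattern in conditions.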

Variables and assignment

We can store values in variables using the assignment operator <-, like this:

x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:

x
## [1] 0.025

More precisely, the stored value is a decimal approximation of this fraction called a floating point number.

Look for the Environment tab in one of the panes of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number:

log(x)
## [1] -3.688879

Notice also that variables can be reassigned:

x <- 100

x used to contain the value 0.025 and now it has the value 100.

Assignment values can contain the variable being assigned to:

x <- x + 1 #notice how RStudio updates its description of x on the top right tab
y <- x * 2

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.

Variable names can contain letters, numbers, underscores and periods. They cannot start with a number or an underscore, nor contain spaces at all. Different people use different conventions for long variable names; these include periods.between.words, underscores_between_words, and camelCaseToSeparateWords.

What you use is up to you, but be consistent.

It is also possible to use the = operator for assignment:

x = 1/40

But this is much less common among R users. The most important thing is to be consistent with the operator you use. There are occasionally places where it is less confusing to use <- than =, and it is the most common symbol used in the community. So the recommendation is to use <-.

Challenge 1

Which of the following are valid R variable names?

min_height
max.height
_age
.mass
MaxLength
min-length
2widths
celsius2kelvin

Solution to challenge 1

The following can be used as R variables:

min_height
max.height
MaxLength
celsius2kelvin

The following creates a hidden variable:

.mass

The following cannot be used to create a variable:

_age
min-length
2widths

Vectorization

One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. In contrast to physics and mathematics, a vector in R describes a set of values in a certain order of the same data type. For example

1:5
## [1] 1 2 3 4 5
2^(1:5)
## [1]  2  4  8 16 32
x <- 1:5
2^x
## [1]  2  4  8 16 32

This is incredibly powerful; we will discuss this further in an upcoming lesson.

Managing your environment

There are a few useful commands you can use to interact with the R session.

ls will list all of the variables and functions stored in the global environment (your working R session):

ls()
## [1] "x" "y"

Tip: hidden objects

Like in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.
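
A minimal sketch of this behaviour, using a separate environment so it does not depend on what is in your session (demo_env, visible_value and .hidden_value are hypothetical names chosen for the example):

```r
# Create an empty environment and put one ordinary and one "hidden" object in it
demo_env <- new.env()
assign("visible_value", 1, envir = demo_env)
assign(".hidden_value", 2, envir = demo_env)

ls(demo_env)                    # lists only "visible_value"
ls(demo_env, all.names = TRUE)  # also lists ".hidden_value"
```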

Note here that we didn’t give any arguments to ls, but we still needed to give the parentheses to tell R to call the function.

If we type ls by itself, R will print out the source code for that function!

ls
## function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
##     pattern, sorted = TRUE) 
## {
##     if (!missing(name)) {
##         pos <- tryCatch(name, error = function(e) e)
##         if (inherits(pos, "error")) {
##             name <- substitute(name)
##             if (!is.character(name)) 
##                 name <- deparse(name)
##             warning(gettextf("%s converted to character string", 
##                 sQuote(name)), domain = NA)
##             pos <- name
##         }
##     }
##     all.names <- .Internal(ls(envir, all.names, sorted))
##     if (!missing(pattern)) {
##         if ((ll <- length(grep("[", pattern, fixed = TRUE))) && 
##             ll != length(grep("]", pattern, fixed = TRUE))) {
##             if (pattern == "[") {
##                 pattern <- "\\["
##                 warning("replaced regular expression pattern '[' by  '\\\\['")
##             }
##             else if (length(grep("[^\\\\]\\[<-", pattern))) {
##                 pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
##                 warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
##             }
##         }
##         grep(pattern, all.names, value = TRUE)
##     }
##     else all.names
## }
## <bytecode: 0x7fe58608efc0>
## <environment: namespace:base>

You can use rm to delete objects you no longer need:

rm(x)

If you have lots of things in your environment and want to delete all of them, you can pass the results of ls to the rm function:

rm(list = ls())

In this case we’ve combined the two. Like the order of operations, anything inside the innermost parentheses is evaluated first, and so on.

Here we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!

If instead we use <-, there will be unintended side effects, or you may get an error message:

rm(list <- ls())
## Error in rm(list <- ls()): ... must contain names or character strings

Tip: Warnings vs. Errors

Pay attention when R does something unexpected! Errors, like above, are thrown when R cannot proceed with a calculation. Warnings on the other hand usually mean that the function has run, but it probably hasn’t worked as expected.

In both cases, the message that R prints out usually gives you clues on how to fix a problem.

R Packages

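It is possible to add functions to R by writing a package, or by obtaining a package written by someone else. As of this writing, there are over 10,000 packages available on CRAN (the comprehensive R archive network). R and RStudio have functionality for managing packages:

  • You can see what packages are installed by typing installed.packages()
  • You can install packages by typing install.packages("packagename"), where packagename is the package name, in quotes.
  • You can update installed packages by typing update.packages()
  • You can remove a package with remove.packages("packagename")
  • You can make a package available for use with library(packagename)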
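
As a rough sketch of how some of the base R package-management functions fit together, the snippet below checks whether a package is present before trying to install it. The helper install_if_missing is hypothetical (it is not part of R), and "stats" is used only because it ships with R, so nothing is downloaded:

```r
# Names of all installed packages
installed <- rownames(installed.packages())
"stats" %in% installed

# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(pkg)
}

install_if_missing("stats")   # already installed, so install.packages() is skipped
library(stats)                # attach the package so its functions can be used
```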

Challenge 2

What will be the value of each variable after each statement in the following program?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20

Solution to challenge 2

mass <- 47.5

This will give a value of 47.5 for the variable mass

age <- 122

This will give a value of 122 for the variable age

mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.

age <- age - 20

This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable age.

Challenge 3

Run the code from the previous challenge, and write a command to compare mass to age. Is mass larger than age?

Solution to challenge 3

One way of answering this question in R is to use the > to set up the following:

mass > age
## [1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater than 102.

Challenge 4

Clean up your working environment by deleting the mass and age variables.

Solution to challenge 4

We can use the rm command to accomplish this task

rm(age, mass)

Challenge 5

Install the following packages: ggplot2, plyr, gapminder

Solution to challenge 5

We can use the install.packages() command to install the required packages.

install.packages("ggplot2")
install.packages("plyr")
install.packages("gapminder")

Seeking Help

Reading Help files

R, and every package, provide help files for functions. The general syntax to search for help on any function, “function_name”, that is in a package loaded into your namespace (your interactive R session) is:

?function_name
help(function_name)

This will load up a help page in RStudio (or as plain text in R by itself).

Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values.
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the main ones you should be aware of.

Tip: Reading help files

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, the help files mean you don’t have to!

Special Operators

To seek help on special operators, use quotes:

?"<-"

Getting help on packages

Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.

If a package doesn’t have any vignettes, you can usually find help by typing help("package-name").

When you kind of remember the function

If you’re not sure what package a function is in, or how it’s specifically spelled you can do a fuzzy search:

??function_name

When you have no idea where to begin

If you don’t know what function or package you need to use, CRAN Task Views is a specially maintained list of packages grouped into fields. This can be a good starting point.

When your code doesn’t work: seeking help from your peers

If you’re having trouble using a function, 9 times out of 10 the question you have has already been answered on Stack Overflow. You can search using the [r] tag.

If you can’t find the answer, there are a few useful functions to help you ask a question from your peers:

?dput

dput() will dump the data you’re working with into a format that anyone else can copy and paste into their R session.
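
A minimal sketch of that round trip, assuming a small made-up data frame (small_df and the other names here are hypothetical): dput() prints R code that rebuilds the object, and evaluating that text gives back an equivalent object.

```r
small_df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))

# Print a text representation that can be pasted into a question
dput(small_df)

# Someone else can rebuild the object by evaluating that text
dumped_text <- paste(capture.output(dput(small_df)), collapse = "\n")
rebuilt_df <- eval(parse(text = dumped_text))
all.equal(small_df, rebuilt_df)
```

In practice you would simply paste the printed text into your question; the capture.output() step is only here to show that the text is enough to recreate the data.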

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.0  backports_1.1.1 magrittr_1.5    rprojroot_1.2  
##  [5] tools_3.4.0     htmltools_0.4.0 yaml_2.1.15     Rcpp_1.0.2     
##  [9] stringi_1.1.6   rmarkdown_1.8   knitr_1.18      stringr_1.2.0  
## [13] digest_0.6.12   rlang_0.4.1     evaluate_0.10.1

sessionInfo() will print out your current version of R, as well as any packages you have loaded. This can be useful for others to help reproduce and debug your issue.

Challenge 1

Look at the help for the c function. What kind of vector do you expect you will create if you evaluate the following:

c(1, 2, 3)
c('d', 'e', 'f')
c(1, 2, 'f')

Solution to Challenge 1

The c() function creates a vector, in which all elements are the same type. In the first case, the elements are numeric, in the second, they are characters, and in the third they are characters: the numeric values are “coerced” to be characters.

Challenge 2

Look at the help for the paste function. You’ll need to use this later. What is the difference between the sep and collapse arguments?

Solution to Challenge 2

To look at the help for the paste() function, use:

help("paste")
?paste

The difference between sep and collapse is a little tricky. The paste function accepts any number of arguments, each of which can be a vector of any length. The sep argument specifies the string used between concatenated terms — by default, a space. The result is a vector as long as the longest argument supplied to paste. In contrast, collapse specifies that after concatenation the elements are collapsed together using the given separator, the result being a single string. e.g.

paste(c("a","b"), "c")
## [1] "a c" "b c"
paste(c("a","b"), "c", sep = ",")
## [1] "a,c" "b,c"
paste(c("a","b"), "c", collapse = "|")
## [1] "a c|b c"
paste(c("a","b"), "c", sep = ",", collapse = "|")
## [1] "a,c|b,c"

(For more information, scroll to the bottom of the ?paste help page and look at the examples, or try example('paste').)

Challenge 3

Use help to find a function (and its associated parameters) that you could use to load data from a csv file in which columns are delimited with “\t” (tab) and the decimal point is a “.” (period). This check for decimal separator is important, especially if you are working with international colleagues, because different countries have different conventions for the decimal point (i.e. comma vs period). Hint: use ??csv to look up csv related functions.

Solution to Challenge 3

The standard R function for reading tab-delimited files with a period decimal separator is read.delim(). You can also do this with read.table(file, sep="\t") (the period is the default decimal separator for read.table()), although you may have to change the comment.char argument as well if your data file contains hash (#) characters.
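
To try both approaches on a concrete file, here is a small sketch that writes a made-up tab-delimited file to a temporary location and reads it back. The file contents and the names tsv_path, from_delim and from_table are assumptions for illustration only:

```r
# Write a tiny tab-delimited file to a temporary location (hypothetical data)
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c("sample\tmass", "A\t1.5", "B\t2.25"), tsv_path)

# read.delim() assumes a header row, tab separators and "." as the decimal point
from_delim <- read.delim(tsv_path)

# The same file read with read.table(), spelling the options out
from_table <- read.table(tsv_path, header = TRUE, sep = "\t", dec = ".")

from_delim$mass
```

For files that use a comma as the decimal separator, the dec argument is the one to change.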

Other ports of call

Vector

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the datatype, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

my_vector <- vector(length = 3)
my_vector
## [1] FALSE FALSE FALSE
another_vector <- vector(mode='character', length=3)
another_vector
## [1] "" "" ""

concatenate command c()

We can also make vectors with explicit contents with the combine function:

c("a","b","c")
## [1] "a" "b" "c"
c(1,2,3)
## [1] 1 2 3

Given what we’ve learned so far, what do you think the following will produce?

quiz_vector <- c(2,6,'3')

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here numeric and character) to be combined into a single vector, it will force them all to be the same type. Consider:

coercion_vector <- c('a', TRUE)
coercion_vector
## [1] "a"    "TRUE"
another_coercion_vector <- c(0, TRUE)
another_coercion_vector
## [1] 0 1

The coercion rules go: logical -> integer -> numeric -> complex -> character, where -> can be read as are transformed into. You can try to force coercion against this flow using the as. functions:

character_vector_example <- c('0','2','4')
character_vector_example
## [1] "0" "2" "4"
character_coerced_to_numeric <- as.numeric(character_vector_example)
character_coerced_to_numeric
## [1] 0 2 4
numeric_coerced_to_logical <- as.logical(character_coerced_to_numeric)
numeric_coerced_to_logical
## [1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!
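
A short sketch of two of these surprises (mixed_strings is a made-up name for this example): combining numbers with a string silently turns everything into character, and coercing back to numeric only works for strings that look like numbers.

```r
quiz_vector <- c(2, 6, '3')
quiz_vector          # every element is now a character string
class(quiz_vector)   # "character"

# Coercing back only works for strings that look like numbers
mixed_strings <- c('1', '2.5', 'three')
as.numeric(mixed_strings)   # 'three' becomes NA, and R issues a warning
```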

The combine function, c(), will also append things to an existing vector:

ab_vector <- c('a', 'b')
ab_vector
## [1] "a" "b"
combine_example <- c(ab_vector, 'SWC')
combine_example
## [1] "a"   "b"   "SWC"

You can also make series of numbers:

mySeries <- 1:10
mySeries
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(1,10, by=0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0

We can ask a few questions about vectors:

sequence_example <- seq(10)
head(sequence_example, n=2)
## [1] 1 2
tail(sequence_example, n=4)
## [1]  7  8  9 10
length(sequence_example)
## [1] 10
class(sequence_example)
## [1] "integer"
typeof(sequence_example)
## [1] "integer"

Finally, you can give names to elements in your vector:

my_example <- 5:8
names(my_example) <- c("a", "b", "c", "d")
my_example
## a b c d 
## 5 6 7 8
names(my_example)
## [1] "a" "b" "c" "d"

Vectors Subsetting

R has many powerful subset operators. Mastering them will allow you to easily perform complex operations on any kind of dataset.

There are six different ways we can subset any kind of object, and three different subsetting operators for the different data structures.

Let’s start with the workhorse of R: a simple numeric vector.

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Atomic vectors

In R, simple vectors containing character strings, numbers, or logical values are called atomic vectors because they can’t be further simplified.

So now that we’ve created a dummy vector to play with, how do we get at its contents?

Accessing elements using their indices

To extract elements of a vector we can give their corresponding index, starting from one:

x[1]
##   a 
## 5.4
x[4]
##   d 
## 4.8

It may look different, but the square brackets operator is a function. For vectors (and matrices), it means “get me the nth element”.
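
To see that the square brackets really are a function, you can call it by name using backticks. This is only a sketch with a made-up vector v; you would rarely write code this way in practice:

```r
v <- c(a = 5.4, b = 6.2, c = 7.1)

v[2]         # the usual bracket syntax
`[`(v, 2)    # the same operation written as an ordinary function call
identical(v[2], `[`(v, 2))
```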

We can ask for multiple elements at once:

x[c(1, 3)]
##   a   c 
## 5.4 7.1

Or slices of the vector:

x[1:4]
##   a   b   c   d 
## 5.4 6.2 7.1 4.8

The : operator creates a sequence of numbers from the left element to the right.

1:4
## [1] 1 2 3 4
c(1, 2, 3, 4)
## [1] 1 2 3 4

We can ask for the same element multiple times:

x[c(1,1,3)]
##   a   a   c 
## 5.4 5.4 7.1

If we ask for an index beyond the length of the vector, R will return a missing value:

x[6]
## <NA> 
##   NA

This is a vector of length one containing an NA, whose name is also NA.

If we ask for the 0th element, we get an empty vector:

x[0]
## named numeric(0)

Vector numbering in R starts at 1

In many programming languages (C and Python, for example), the first element of a vector has an index of 0. In R, the first element is 1.

Skipping and removing elements

If we use a negative number as the index of a vector, R will return every element except for the one specified:

x[-2]
##   a   c   d   e 
## 5.4 7.1 4.8 7.5

We can skip multiple elements:

x[c(-1, -5)]  # or x[-c(1,5)]
##   b   c   d 
## 6.2 7.1 4.8

Tip: Order of operations

A common trip up for novices occurs when trying to skip slices of a vector. It’s natural to try to negate a sequence like so:

x[-1:3]

This gives a somewhat cryptic error:

## Error in x[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a function. It takes its first argument as -1, and its second as 3, so generates the sequence of numbers: c(-1, 0, 1, 2, 3).

The correct solution is to wrap that function call in brackets, so that the - operator applies to the result:

x[-(1:3)]
##   d   e 
## 4.8 7.5

To remove elements from a vector, we need to assign the result back into the variable:

x <- x[-4]
x
##   a   b   c   e 
## 5.4 6.2 7.1 7.5

Challenge 1

Given the following code:

x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
names(x) <- c('a', 'b', 'c', 'd', 'e')
print(x)
##   a   b   c   d   e 
## 5.4 6.2 7.1 4.8 7.5

Come up with at least 2 different commands that will produce the following output:

##   b   c   d 
## 6.2 7.1 4.8

After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?

Solution to challenge 1

x[2:4]
##   b   c   d 
## 6.2 7.1 4.8
x[-c(1,5)]
##   b   c   d 
## 6.2 7.1 4.8
x[c(2,3,4)]
##   b   c   d 
## 6.2 7.1 4.8

Subsetting by name

We can extract elements by using their name, instead of extracting by index:

x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5) # we can name a vector 'on the fly'
x[c("a", "c")]
##   a   c 
## 5.4 7.1

This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!

Subsetting through other logical operations

We can also use any logical vector to subset:

x[c(FALSE, FALSE, TRUE, FALSE, TRUE)]
##   c   e 
## 7.1 7.5

Since comparison operators (e.g. >, <, ==) evaluate to logical vectors, we can also use them to succinctly subset vectors: the following statement gives the same result as the previous one.

x[x > 7]
##   c   e 
## 7.1 7.5

Breaking it down, this statement first evaluates x>7, generating a logical vector c(FALSE, FALSE, TRUE, FALSE, TRUE), and then selects the elements of x corresponding to the TRUE values.

We can use == to mimic the previous method of indexing by name (remember you have to use == rather than = for comparisons):

x[names(x) == "a"]
##   a 
## 5.4
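
The == comparison works well for a single name. To match several names at once, one option (sketched here as an extension of the example above, not something the text has introduced yet) is the %in% operator, which tests each name for membership in a set:

```r
x <- c(a=5.4, b=6.2, c=7.1, d=4.8, e=7.5)

names(x) %in% c("a", "c")       # logical vector, TRUE where the name is "a" or "c"
x[names(x) %in% c("a", "c")]    # same elements as x[c("a", "c")]
```

Using == with a vector of several names would instead compare element by element and recycle the shorter vector, which is rarely what you want.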

Data Frames

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting.

A data frame can be created by hand, but most commonly they are generated by the functions read.csv() or read.table(); in other words, when importing spreadsheets from your hard drive (or the web).

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

We can see this when inspecting the structure of a data frame with the function str(). Here we create a small data frame by hand, write it to a csv file (the data directory must already exist), and inspect its structure:

taxonomic_profiles <- data.frame(taxonomy = c("Otu00001", "Otu00002", "Otu00003", "Otu00004", "Otu00005", "Otu00006", "Otu00007"), 
                    Sample1 = c(21, 50, 32, 20, 10, 66, 100), 
                    Sample2 = c(31, 40, 10, 20, 46, 89, 23),
                    Sample3 = c(22, 100, 10, 55, 65, 93 ,56))
write.csv(x = taxonomic_profiles, file = "data/taxonomic_profiles.csv", row.names = FALSE)
str(taxonomic_profiles)
## 'data.frame':    7 obs. of  4 variables:
##  $ taxonomy: Factor w/ 7 levels "Otu00001","Otu00002",..: 1 2 3 4 5 6 7
##  $ Sample1 : num  21 50 32 20 10 66 100
##  $ Sample2 : num  31 40 10 20 46 89 23
##  $ Sample3 : num  22 100 10 55 65 93 56
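
The taxonomy column appears as a Factor in the output above because data.frame() converted strings to factors by default in R versions before 4.0.0. From R 4.0.0 onwards the default is stringsAsFactors = FALSE, so the same code would show a chr column instead. The sketch below uses a shortened, made-up version of the data to show both behaviours explicitly:

```r
otu_ids <- c("Otu00001", "Otu00002", "Otu00003")

# Ask for factors explicitly (the default before R 4.0.0)
as_factor_df <- data.frame(taxonomy = otu_ids, Sample1 = c(21, 50, 32),
                           stringsAsFactors = TRUE)

# Keep strings as character vectors (the default from R 4.0.0)
as_character_df <- data.frame(taxonomy = otu_ids, Sample1 = c(21, 50, 32),
                              stringsAsFactors = FALSE)

class(as_factor_df$taxonomy)      # "factor"
class(as_character_df$taxonomy)   # "character"
```

Setting the argument explicitly makes the result independent of the R version you are running.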

Inspecting data.frame Objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data. Let’s try them out!

  • Size:
    • dim(taxonomic_profiles) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
    • nrow(taxonomic_profiles) - returns the number of rows
    • ncol(taxonomic_profiles) - returns the number of columns
  • Content:
    • head(taxonomic_profiles) - shows the first 6 rows
    • tail(taxonomic_profiles) - shows the last 6 rows
  • Names:
    • names(taxonomic_profiles) - returns the column names (synonym of colnames() for data.frame objects)
    • rownames(taxonomic_profiles) - returns the row names
  • Summary:
    • str(taxonomic_profiles) - structure of the object and information about the class, length and content of each column
    • summary(taxonomic_profiles) - summary statistics for each column

Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame.

Challenge

Based on the output of str(taxonomic_profiles), can you answer the following questions?

  • What is the class of the object taxonomic_profiles?
  • How many rows and how many columns are in this object?
  • How many OTUs are recorded in taxonomic_profiles?

str(taxonomic_profiles)
## 'data.frame':  7 obs. of  4 variables:
##  $ taxonomy: Factor w/ 7 levels "Otu00001","Otu00002",..: 1 2 3 4 5 6 7
##  $ Sample1 : num  21 50 32 20 10 66 100
##  $ Sample2 : num  31 40 10 20 46 89 23
##  $ Sample3 : num  22 100 10 55 65 93 56
## * class: data frame
## * how many rows: 7,  how many columns: 4
## * how many OTUs: 7

Indexing and subsetting data frames

Our taxonomic_profiles data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.

# first element in the first column of the data frame (as a vector)
taxonomic_profiles[1, 1]   
## [1] Otu00001
## 7 Levels: Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 ... Otu00007
# first element in the 4th column (as a vector)
taxonomic_profiles[1, 4]   
## [1] 22
# first column of the data frame (as a vector)
taxonomic_profiles[, 1]    
## [1] Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007
## 7 Levels: Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 ... Otu00007
# first column of the data frame (as a data.frame)
taxonomic_profiles[1]      
##   taxonomy
## 1 Otu00001
## 2 Otu00002
## 3 Otu00003
## 4 Otu00004
## 5 Otu00005
## 6 Otu00006
## 7 Otu00007
# first three elements in the 3rd column (as a vector)
taxonomic_profiles[1:3, 3] 
## [1] 31 40 10
# the 3rd row of the data frame (as a data.frame)
taxonomic_profiles[3, ]    
##   taxonomy Sample1 Sample2 Sample3
## 3 Otu00003      32      10      10
# equivalent to head_taxonomic_profiles <- head(taxonomic_profiles)
head_taxonomic_profiles <- taxonomic_profiles[1:6, ] 

: is a special function that creates numeric vectors of integers in increasing or decreasing order; try 1:10 and 10:1, for instance.

You can also exclude certain indices of a data frame using the “-” sign:

taxonomic_profiles[, -1]          # The whole data frame, except the first column
##   Sample1 Sample2 Sample3
## 1      21      31      22
## 2      50      40     100
## 3      32      10      10
## 4      20      20      55
## 5      10      46      65
## 6      66      89      93
## 7     100      23      56
taxonomic_profiles[-(7:nrow(taxonomic_profiles)), ] # Equivalent to head(taxonomic_profiles)
##   taxonomy Sample1 Sample2 Sample3
## 1 Otu00001      21      31      22
## 2 Otu00002      50      40     100
## 3 Otu00003      32      10      10
## 4 Otu00004      20      20      55
## 5 Otu00005      10      46      65
## 6 Otu00006      66      89      93

Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:

taxonomic_profiles["taxonomy"]       # Result is a data.frame
taxonomic_profiles[, "taxonomy"]     # Result is a vector
taxonomic_profiles[["taxonomy"]]     # Result is a vector
taxonomic_profiles$taxonomy          # Result is a vector
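To see the difference between these four forms, it can help to ask R for the class of each result. This is a minimal sketch on a small stand-in data frame (`profiles` is a hypothetical name used only for this illustration):

```r
# A small stand-in data frame for illustration
profiles <- data.frame(
  taxonomy = c("Otu00001", "Otu00002", "Otu00003"),
  Sample1 = c(21, 50, 32)
)

class(profiles["Sample1"])     # single bracket, no comma: a data.frame
class(profiles[, "Sample1"])   # row/column form with one column: a vector
class(profiles[["Sample1"]])   # double bracket: a vector
class(profiles$Sample1)        # dollar sign: a vector
```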

In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.

Challenge

  1. Create a data.frame (taxonomic_profiles_2) containing only the data in row 2 of the taxonomic_profiles dataset.

  2. Notice how nrow() gave you the number of rows in a data.frame?

    • Use that number to pull out just that last row in the data frame.
    • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
    • Pull out that last row using nrow() instead of the row number.
    • Create a new data frame (taxonomic_profiles_last) from that last row.
  3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named taxonomic_profiles_middle.

  4. Combine nrow() with the - notation above to reproduce the behavior of head(taxonomic_profiles), keeping just the first through 6th rows of the taxonomic_profiles dataset.

## 1.
taxonomic_profiles_2 <- taxonomic_profiles[2, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(taxonomic_profiles)
taxonomic_profiles_last <- taxonomic_profiles[n_rows, ]
## 3.
# ceiling() gives a whole row number even when n_rows is odd
taxonomic_profiles_middle <- taxonomic_profiles[ceiling(n_rows / 2), ]
## 4.
taxonomic_profiles_head <- taxonomic_profiles[-(7:n_rows), ]

We can load a small example file, data/raw_taxonomic_profiles.csv, into R via the following:

raw_profiles <- read.csv(file = "data/raw_taxonomic_profiles.csv")
raw_profiles
##   taxonomy vd1 vd2
## 1 Otu00001 2.1   1
## 2 Otu00002 5.0   0
## 3 Otu00003 3.2   1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, such as CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience, R provides two other versions of read.table: read.csv for files where the data are separated with commas, and read.delim for files where the data are separated with tabs. Of these three functions, read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
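As an illustration of overriding the delimiter, the sketch below writes two small temporary files (the file contents and the names `tmp_csv` and `tmp_tsv` are made up for this example) and reads them back. The `sep` argument is the part that changes the delimiter.

```r
# A semicolon-separated file, written to a temporary location
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("taxonomy;vd1;vd2",
             "Otu00001;2.1;1",
             "Otu00002;5;0"), tmp_csv)

# read.csv expects commas by default; sep overrides that
semi <- read.csv(tmp_csv, sep = ";")

# The same file read with the more general read.table
semi_generic <- read.table(tmp_csv, header = TRUE, sep = ";")

# A tab-separated file, read with read.delim
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(c("taxonomy\tvd1\tvd2",
             "Otu00001\t2.1\t1",
             "Otu00002\t5\t0"), tmp_tsv)
tab <- read.delim(tmp_tsv)

semi
```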

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:

raw_profiles$vd1
## [1] 2.1 5.0 3.2
raw_profiles$taxonomy
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003

We can do other operations on the columns:

## Say we need to add 2 to every value in the vd1 column:
raw_profiles$vd1 + 2
## [1] 4.1 7.0 5.2
paste("My OTU is", raw_profiles$taxonomy)
## [1] "My OTU is Otu00001" "My OTU is Otu00002" "My OTU is Otu00003"

But what about

raw_profiles$vd1 + raw_profiles$taxonomy
## Warning in Ops.factor(raw_profiles$vd1, raw_profiles$taxonomy): '+' not
## meaningful for factors
## [1] NA NA NA

Understanding what happened here is key to successfully analyzing data in R.

Data Types

If you guessed that the last command will not give a sensible result because 2.1 plus "Otu00001" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data types. We can ask what type of data something is:

typeof(raw_profiles$vd1)
## [1] "double"

There are 5 main types: double, integer, complex, logical and character.

typeof(3.14)
## [1] "double"
typeof(1L) # The L suffix forces the number to be an integer, since by default R stores numbers as doubles
## [1] "integer"
typeof(1+1i)
## [1] "complex"
typeof(TRUE)
## [1] "logical"
typeof('banana')
## [1] "character"

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.

file.show("data/raw_taxonomic_profiles.csv")
"taxonomy","vd1","vd2"
"Otu00001",2.1,1
"Otu00002",5,0
"Otu00003",3.2,1

Load the raw_profiles data like before, and check what type of data we find in the taxonomy column:

raw_profiles <- read.csv(file="data/raw_taxonomic_profiles.csv")
typeof(raw_profiles$taxonomy)
## [1] "integer"

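The taxonomy column is reported as integer, not character: read.csv has stored it as a factor, and factors are integers under the hood (the output above assumes stringsAsFactors = TRUE, which was the default before R 4.0.0). The vd1 column is unaffected and is still a double, so the same math we did on it before still works: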

raw_profiles$vd1 + 2
## [1] 4.1 7.0 5.2

What happened? When R reads a csv file into one of these tables, it insists that everything in a column be the same basic type; if it can’t understand everything in the column as a double, then nobody in the column gets to be a double. The table that R loaded our raw_profiles data into is something called a data.frame, and it is our first example of something called a data structure - that is, a structure which R knows how to build out of the basic data types.

We can see that it is a data.frame by calling the class function on it:

class(raw_profiles)
## [1] "data.frame"

Factors

When we did str(taxonomic_profiles) we saw that several of the columns consist of numbers. The column taxonomy, however, is of a special class called factor. Factors are very useful and actually contribute to making R particularly well suited to working with data. So we are going to spend a little time introducing them.

Factors represent categorical data. They are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

sex <- factor(c("male", "female", "female", "male"))

R will assign 1 to the level "female" and 2 to the level "male" (because f comes before m, even though the first element in this vector is "male"). You can see this by using the function levels() and you can find the number of levels using nlevels():

levels(sex)
## [1] "female" "male"
nlevels(sex)
## [1] 2

Sometimes, the order of the factors does not matter, other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:

sex # current order
## [1] male   female female male  
## Levels: female male
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
## [1] male   female female male  
## Levels: male female

In R’s memory, these factors are represented by integers (1, 2, 3), but are more informative than integers because factors are self-describing: "female", "male" is more descriptive than 1, 2. Which one is “male”? You wouldn’t be able to tell just from the integer data. Factors, on the other hand, have this information built in. It is particularly helpful when there are many levels (like the OTU names in our example dataset).
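The integer codes behind a factor can be inspected directly, which also shows why factors need care when they are treated as strings or numbers. The following is a small sketch; `year_fct` is a hypothetical factor introduced only for this illustration.

```r
sex <- factor(c("male", "female", "female", "male"))

as.integer(sex)     # the underlying codes: "female" is 1, "male" is 2
as.character(sex)   # the labels, as an ordinary character vector

# A factor whose labels happen to look like numbers
year_fct <- factor(c("1990", "1983", "1977"))

as.numeric(year_fct)                # the level codes, not the years
as.numeric(as.character(year_fct))  # convert via character to get the years
```

Converting through as.character() first is the safer route whenever the labels, rather than the codes, are what you want.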

Factor subsetting

Now that we’ve explored the different ways to subset vectors, how do we subset the other data structures?

Factor subsetting works the same way as vector subsetting.

f <- factor(c("a", "a", "b", "c", "c", "d"))
f[f == "a"]
## [1] a a
## Levels: a b c d
f[f %in% c("b", "c")]
## [1] b c c
## Levels: a b c d
f[1:3]
## [1] a a b
## Levels: a b c d

Skipping elements will not remove the level even if no more of that category exists in the factor:

f[-3]
## [1] a a c c d
## Levels: a b c d
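If the unused level is not wanted, base R provides droplevels() to remove levels that no longer occur. A minimal sketch using the same factor (`f_subset` is a name chosen for this example):

```r
f <- factor(c("a", "a", "b", "c", "c", "d"))

f_subset <- f[-3]              # removes the only "b"
levels(f_subset)               # "b" is still listed as a level

f_dropped <- droplevels(f_subset)
levels(f_dropped)              # only the levels that are still in use
```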

Lists

Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it:

list_example <- list(1, "a", TRUE, 1+4i)
list_example
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list
## $title
## [1] "Numbers"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $data
## [1] TRUE

We can now understand something a bit surprising in our data.frame; what happens if we run:

typeof(raw_profiles)
## [1] "list"

We see that data.frames look like lists ‘under the hood’. This is because a data.frame is really a list of vectors and factors: to hold columns that are a mix of vectors and factors, the data.frame needs something a bit more flexible than a vector to put all the columns together into a familiar table. In other words, a data.frame is a special list in which all the vectors must have the same length.

In our example, we have a factor, a double and an integer variable. As we have seen already, each column of data.frame is a vector.

raw_profiles$taxonomy
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003
raw_profiles[,1]
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003
typeof(raw_profiles[,1])
## [1] "integer"
str(raw_profiles[,1])
##  Factor w/ 3 levels "Otu00001","Otu00002",..: 1 2 3

Each row is an observation of different variables, itself a data.frame, and thus can be composed of elements of different types.

raw_profiles[1,]
##   taxonomy vd1 vd2
## 1 Otu00001 2.1   1
typeof(raw_profiles[1,])
## [1] "list"
str(raw_profiles[1,])
## 'data.frame':    1 obs. of  3 variables:
##  $ taxonomy: Factor w/ 3 levels "Otu00001","Otu00002",..: 1
##  $ vd1     : num 2.1
##  $ vd2     : int 1
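The "list of equal-length columns" view can be checked column by column with sapply(). This sketch rebuilds raw_profiles by hand, assuming the column types shown in the str() output above, so that it does not depend on the CSV file:

```r
# Rebuilt by hand with the same column types as shown above
raw_profiles <- data.frame(
  taxonomy = factor(c("Otu00001", "Otu00002", "Otu00003")),
  vd1 = c(2.1, 5.0, 3.2),
  vd2 = c(1L, 0L, 1L)
)

is.list(raw_profiles)          # a data.frame is a list
length(raw_profiles)           # one list element per column

sapply(raw_profiles, class)    # class of each column
sapply(raw_profiles, typeof)   # underlying type of each column
sapply(raw_profiles, length)   # every column has the same length
```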

Challenge 3

There are several subtly different ways to call variables, observations and elements from data.frames:

  • raw_profiles[1]
  • raw_profiles[[1]]
  • raw_profiles$taxonomy
  • raw_profiles["taxonomy"]
  • raw_profiles[1, 1]
  • raw_profiles[, 1]
  • raw_profiles[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function typeof() to examine what is returned in each case.

Solution to Challenge 3

raw_profiles[1]
##   taxonomy
## 1 Otu00001
## 2 Otu00002
## 3 Otu00003

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

raw_profiles[[1]]
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type factor.

raw_profiles$taxonomy
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003

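This example uses the $ character to address items by name. taxonomy is the first column of the data frame, again a vector of type factor.

raw_profiles["taxonomy"]
##   taxonomy
## 1 Otu00001
## 2 Otu00002
## 3 Otu00003

Here we are using a single bracket ["taxonomy"], replacing the index number with the column name. Like example 1, the returned object is a list.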

raw_profiles[1, 1]
## [1] Otu00001
## Levels: Otu00001 Otu00002 Otu00003

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is an integer but because it is part of a vector of type factor, R displays the label “Otu00001” associated with the integer value.

raw_profiles[, 1]
## [1] Otu00001 Otu00002 Otu00003
## Levels: Otu00001 Otu00002 Otu00003

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column vector.

raw_profiles[1, ]
##   taxonomy vd1 vd2
## 1 Otu00001 2.1   1

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a list containing all the values in the first row.

List subsetting

Now we’ll introduce some new subsetting operators. There are three functions used to subset lists. We’ve already seen these when learning about atomic vectors and matrices: [, [[, and $.

Using [ will always return a list. If you want to subset a list, but not extract an element, then you will likely use [.

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris))
xlist[1]
## $a
## [1] "Software Carpentry"

This returns a list with one element.

We can subset elements of a list exactly the same way as atomic vectors using [. Comparison operations, however, won’t work as they’re not recursive: they will try to condition on the data structures in each element of the list, not the individual elements within those data structures.

xlist[1:2]
## $a
## [1] "Software Carpentry"
## 
## $b
##  [1]  1  2  3  4  5  6  7  8  9 10

To extract individual elements of a list, you need to use the double-square bracket function: [[.

xlist[[1]]
## [1] "Software Carpentry"

Notice that now the result is a vector, not a list.

You can’t extract more than one element at once:

xlist[[1:2]]
## Error in xlist[[1:2]]: subscript out of bounds

Nor use it to skip elements:

xlist[[-1]]
## Error in xlist[[-1]]: attempt to select more than one element in get1index <real>

But you can use names to both subset and extract elements:

xlist[["a"]]
## [1] "Software Carpentry"

The $ function is a shorthand way for extracting elements by name:

xlist$data
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Challenge 5

Given the following list:

xlist <- list(a = "Software Carpentry", b = 1:10, data = head(iris))

Using your knowledge of both list and vector subsetting, extract the number 2 from xlist. Hint: the number 2 is contained within the “b” item in the list.

Solution to challenge 5

xlist$b[2]
## [1] 2
xlist[[2]][2]
## [1] 2
xlist[["b"]][2]
## [1] 2


Challenge 6

Given a linear model:

mod <- aov(pop ~ lifeExp, data=gapminder)

Extract the residual degrees of freedom (hint: attributes() will help you)

Solution to challenge 6

attributes(mod) ## `df.residual` is one of the names of `mod`
mod$df.residual
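The gapminder data frame is not created anywhere in this section, so the model above will only run if that data has been loaded separately. As a stand-in, the same steps can be sketched with the built-in iris dataset (the choice of formula here is an assumption made purely for illustration):

```r
# A stand-in model on a dataset that ships with R
mod <- aov(Sepal.Length ~ Species, data = iris)

names(mod)        # the components of the fitted model, including "df.residual"
mod$df.residual   # 150 observations minus 3 group means
```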


Matrices

Last but not least is the matrix. We can declare a matrix full of zeros:

matrix_example <- matrix(0, ncol=6, nrow=3)
matrix_example
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0

And similar to other data structures, we can ask things about our matrix:

class(matrix_example)
## [1] "matrix"
typeof(matrix_example)
## [1] "double"
str(matrix_example)
##  num [1:3, 1:6] 0 0 0 0 0 0 0 0 0 0 ...
dim(matrix_example)
## [1] 3 6
nrow(matrix_example)
## [1] 3
ncol(matrix_example)
## [1] 6

Challenge 4

What do you think will be the result of length(matrix_example)? Try it. Were you right? Why / why not?

Solution to Challenge 4

What do you think will be the result of length(matrix_example)?

matrix_example <- matrix(0, ncol=6, nrow=3)
length(matrix_example)
## [1] 18

Because a matrix is a vector with added dimension attributes, length gives you the total number of elements in the matrix.

Challenge 5

Make another matrix, this time containing the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behaviour? See if you can figure out how to change this. (hint: read the documentation for matrix!)

Solution to Challenge 5

x <- matrix(1:50, ncol=5, nrow=10)
x <- matrix(1:50, ncol=5, nrow=10, byrow = TRUE) # to fill by row


Challenge 6

Create a list of length two containing a character vector for each of the sections in this part of the workshop:

  • Data types
  • Data structures

Populate each character vector with the names of the data types and data structures we’ve seen so far.

Solution to Challenge 6

dataTypes <- c('double', 'complex', 'integer', 'character', 'logical')
dataStructures <- c('data.frame', 'vector', 'factor', 'list', 'matrix')
answer <- list(dataTypes, dataStructures)

Note: it’s nice to make a list in big writing on the board or taped to the wall listing all of these types and structures - leave it up for the rest of the workshop to remind people of the importance of these basics.


Challenge 7

Consider the R output of the matrix below:

##      [,1] [,2]
## [1,]    4    1
## [2,]    9    5
## [3,]   10    7

What was the correct command used to write this matrix? Examine each command and try to figure out the correct one before typing them. Think about what matrices the other commands will produce.

  1. matrix(c(4, 1, 9, 5, 10, 7), nrow = 3)
  2. matrix(c(4, 9, 10, 1, 5, 7), ncol = 2, byrow = TRUE)
  3. matrix(c(4, 9, 10, 1, 5, 7), nrow = 2)
  4. matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)

Solution to Challenge 7

matrix(c(4, 1, 9, 5, 10, 7), ncol = 2, byrow = TRUE)


Matrix subsetting

Matrices are also subsetted using the [ function. In this case it takes two arguments: the first applying to the rows, the second to its columns:

set.seed(1)
m <- matrix(rnorm(6*4), ncol=4, nrow=6)
m[3:4, c(3,1)]
##             [,1]       [,2]
## [1,]  1.12493092 -0.8356286
## [2,] -0.04493361  1.5952808

You can leave the first or second arguments blank to retrieve all the rows or columns respectively:

m[, c(3,4)]
##             [,1]        [,2]
## [1,] -0.62124058  0.82122120
## [2,] -2.21469989  0.59390132
## [3,]  1.12493092  0.91897737
## [4,] -0.04493361  0.78213630
## [5,] -0.01619026  0.07456498
## [6,]  0.94383621 -1.98935170

If we only access one row or column, R will automatically convert the result to a vector:

m[3,]
## [1] -0.8356286  0.5757814  1.1249309  0.9189774

If you want to keep the output as a matrix, you need to specify a third argument; drop = FALSE:

m[3, , drop=FALSE]
##            [,1]      [,2]     [,3]      [,4]
## [1,] -0.8356286 0.5757814 1.124931 0.9189774

Unlike vectors, if we try to access a row or column outside of the matrix, R will throw an error:

m[, c(3,6)]
## Error in m[, c(3, 6)]: subscript out of bounds

Tip: Higher dimensional arrays

When dealing with multi-dimensional arrays, each argument to [ corresponds to a dimension. For example, for a 3D array, the first three arguments correspond to the rows, columns, and depth dimension.
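A small sketch of this, using a 2 x 3 x 4 array built with array() (the name `arr` and the dimensions are arbitrary choices for the example):

```r
# 24 values arranged as 2 rows, 3 columns, 4 "layers"
arr <- array(1:24, dim = c(2, 3, 4))

arr[2, 3, 4]      # row 2, column 3, layer 4
arr[1, 1, 2]      # row 1, column 1, layer 2
dim(arr[, , 1])   # leaving rows and columns blank keeps a whole 2 x 3 layer
```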


Because matrices are vectors, we can also subset using only one argument:

m[5]
## [1] 0.3295078

This usually isn’t useful, and is often confusing to read. However, it is useful to note that matrices are laid out in column-major format by default. That is, the elements of the vector are arranged column-wise:

matrix(1:6, nrow=2, ncol=3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

If you wish to populate the matrix by row, use byrow=TRUE:

matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Matrices can also be subsetted using their rownames and column names instead of their row and column indices.
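For example, a matrix can be given row and column names through the dimnames argument and then subset by those names. The names used below are made up for this sketch:

```r
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m

m["r2", "b"]          # one element, addressed by row and column name
m[, c("a", "c")]      # all rows, two named columns
m["r1", ]             # one named row, returned as a named vector
```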

Challenge 4

Given the following code:

m <- matrix(1:18, nrow=3, ncol=6)
print(m)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
  1. Which of the following commands will extract the values 11 and 14?

A. m[2,4,2,5]

B. m[2:5]

C. m[4:5,2]

D. m[2,c(4,5)]

Solution to challenge 4

D

Control Flow

Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.

There are several ways you can control flow in R. For conditional statements, the most commonly used approaches are the constructs:

# if
if (condition is true) {
  perform action
}

# if ... else
if (condition is true) {
  perform action
} else {  # that is, if the condition is false,
  perform alternative action
}

Say, for example, that we want R to print a message if a variable x has a particular value:

x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
}

x
## [1] 8

The print statement does not appear in the console because x is not greater than or equal to 10. To print a different message for numbers less than 10, we can add an else statement.

x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
} else {
  print("x is less than 10")
}
## [1] "x is less than 10"

You can also test multiple conditions by using else if.

x <- 8

if (x >= 10) {
  print("x is greater than or equal to 10")
} else if (x > 5) {
  print("x is greater than 5, but less than 10")
} else {
  print("x is less than or equal to 5")
}
## [1] "x is greater than 5, but less than 10"

Important: when R evaluates the condition inside if() statements, it is looking for a logical element, i.e., TRUE or FALSE. This can cause some headaches for beginners. For example:

x  <-  4 == 3
if (x) {
  "4 equals 3"
} else {
  "4 does not equal 3"          
}
## [1] "4 does not equal 3"

As we can see, the not equal message was printed because the vector x is FALSE.

x <- 4 == 3
x
## [1] FALSE

You may get a warning message like this if the condition is a vector with more than one element:

## Warning in if (raw_profiles$taxonomy == 2012) {: the condition has length >
## 1 and only the first element will be used

If your condition evaluates to a vector with more than one logical element, the function if() will still run, but will only evaluate the condition in the first element (in R 4.2.0 and later, this situation is an error rather than a warning). Here you need to make sure your condition is of length 1.

Tip: any() and all()

The any() function will return TRUE if at least one TRUE value is found within a vector, otherwise it will return FALSE. This can be used in a similar way to the %in% operator. The function all(), as the name suggests, will only return TRUE if all values in the vector are TRUE.
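One way to get a condition of length 1 from a vector comparison is to wrap it in any() or all(). A minimal sketch, using the Sample1 counts from earlier as a plain vector (`counts`, `above_50` and `msg` are names chosen for this example):

```r
counts <- c(21, 50, 32, 20, 10, 66, 100)

above_50 <- counts > 50   # a logical vector with one element per count
length(above_50)

# any() collapses the vector to a single TRUE or FALSE, which if() can use
if (any(above_50)) {
  msg <- "at least one count is above 50"
} else {
  msg <- "no count is above 50"
}
msg

all(counts > 5)    # TRUE only if every count is above 5
all(above_50)      # FALSE, because some counts are 50 or below
```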

Repeating operations

If you want to iterate over a set of values, when the order of iteration is important, and perform the same operation on each, a for() loop will do the job. We saw for() loops in the shell lessons earlier. This is the most flexible of looping operations, but therefore also the hardest to use correctly. Avoid using for() loops unless the order of iteration is important: i.e. the calculation at each iteration depends on the results of previous iterations.

The basic structure of a for() loop is:

for(iterator in set of values){
  do a thing
}

For example:

for(i in 1:10){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

The 1:10 bit creates a vector on the fly; you can iterate over any other vector as well.

We can use a for() loop nested within another for() loop to iterate over two things at once.

for(i in 1:5){
  for(j in c('a', 'b', 'c', 'd', 'e')){
    print(paste(i,j))
  }
}
## [1] "1 a"
## [1] "1 b"
## [1] "1 c"
## [1] "1 d"
## [1] "1 e"
## [1] "2 a"
## [1] "2 b"
## [1] "2 c"
## [1] "2 d"
## [1] "2 e"
## [1] "3 a"
## [1] "3 b"
## [1] "3 c"
## [1] "3 d"
## [1] "3 e"
## [1] "4 a"
## [1] "4 b"
## [1] "4 c"
## [1] "4 d"
## [1] "4 e"
## [1] "5 a"
## [1] "5 b"
## [1] "5 c"
## [1] "5 d"
## [1] "5 e"

Rather than printing the results, we could write the loop output to a new object.

output_vector <- c()
for(i in 1:5){
  for(j in c('a', 'b', 'c', 'd', 'e')){
    temp_output <- paste(i, j)
    output_vector <- c(output_vector, temp_output)
  }
}
output_vector
##  [1] "1 a" "1 b" "1 c" "1 d" "1 e" "2 a" "2 b" "2 c" "2 d" "2 e" "3 a"
## [12] "3 b" "3 c" "3 d" "3 e" "4 a" "4 b" "4 c" "4 d" "4 e" "5 a" "5 b"
## [23] "5 c" "5 d" "5 e"
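Growing output_vector with c() on every iteration works, but R has to build a new, longer vector each time, which can become slow for large loops. An alternative sketch, assuming the final length is known in advance, creates the vector once and fills it by position (`letters_used` and `k` are helper names introduced for this example):

```r
letters_used <- c("a", "b", "c", "d", "e")

# Create an empty character vector of the final size up front
output_vector <- character(5 * length(letters_used))

k <- 1   # position of the next element to fill
for (i in 1:5) {
  for (j in letters_used) {
    output_vector[k] <- paste(i, j)
    k <- k + 1
  }
}
output_vector
```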

Adding columns and rows in data frames

cbind

We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:

vd3 <- c(61, 40, 32)
vd3
## [1] 61 40 32
cbind(raw_profiles, vd3)
##   taxonomy vd1 vd2 vd3
## 1 Otu00001 2.1   1  61
## 2 Otu00002 5.0   0  40
## 3 Otu00003 3.2   1  32

rbind

Now how about adding rows? We already know that the rows of a data frame are lists:

newRow <- list("Otu00004", 3.3, 3)  # one value per column of raw_profiles
raw_profiles_new <- rbind(raw_profiles, newRow)
## Warning in `[<-.factor`(`*tmp*`, ri, value = "Otu00004"): invalid factor
## level, NA generated
raw_profiles_new
##   taxonomy vd1 vd2
## 1 Otu00001 2.1   1
## 2 Otu00002 5.0   0
## 3 Otu00003 3.2   1
## 4     <NA> 3.3   3
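The warning and the NA above arise because taxonomy is a factor and "Otu00004" is not one of its existing levels. One possible fix, sketched here on a hand-built copy of raw_profiles (assuming the column types shown earlier), is to add the new level before binding the row:

```r
# Hand-built copy of raw_profiles with taxonomy stored as a factor
raw_profiles <- data.frame(
  taxonomy = factor(c("Otu00001", "Otu00002", "Otu00003")),
  vd1 = c(2.1, 5.0, 3.2),
  vd2 = c(1L, 0L, 1L)
)

# Register the new level first, then add the row
levels(raw_profiles$taxonomy) <- c(levels(raw_profiles$taxonomy), "Otu00004")
raw_profiles_new <- rbind(raw_profiles, list("Otu00004", 3.3, 3))
raw_profiles_new
```

Another option is to store taxonomy as a character column, which has no fixed set of allowed values.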

merge

# merge two data frames by taxonomy
taxonomic_mapping <- read.csv(file = "data/taxonomic_mapping.csv")
taxonomic_mapping
##   taxonomy  Kingdom         Phylum                 Class             Order
## 1 Otu00001 Bacteria  Bacteroidetes           Bacteroidia     Bacteroidales
## 2 Otu00002 Bacteria Proteobacteria Epsilonproteobacteria Campylobacterales
## 3 Otu00003 Bacteria  Bacteroidetes           Bacteroidia     Bacteroidales
## 4 Otu00004 Bacteria  Bacteroidetes           Bacteroidia     Bacteroidales
## 5 Otu00005 Bacteria Proteobacteria Epsilonproteobacteria Campylobacterales
## 6 Otu00006 Bacteria  Bacteroidetes           Bacteroidia     Bacteroidales
## 7 Otu00007 Bacteria  Bacteroidetes           Bacteroidia     Bacteroidales
##              Family          Genus Species
## 1    Prevotellaceae Prevotellaceae      NA
## 2 Helicobacteraceae   Helicobacter      NA
## 3    Prevotellaceae Alloprevotella      NA
## 4    Bacteroidaceae    Bacteroides      NA
## 5 Helicobacteraceae   Helicobacter      NA
## 6     Bacteroidales  Bacteroidales      NA
## 7    Prevotellaceae Prevotellaceae      NA
taxonomic_profiles
##   taxonomy Sample1 Sample2 Sample3
## 1 Otu00001      21      31      22
## 2 Otu00002      50      40     100
## 3 Otu00003      32      10      10
## 4 Otu00004      20      20      55
## 5 Otu00005      10      46      65
## 6 Otu00006      66      89      93
## 7 Otu00007     100      23      56
merged_table <- merge(taxonomic_profiles,taxonomic_mapping,by="taxonomy")
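In the example above every OTU appears in both tables, so no rows are lost. When the key columns do not match completely, merge() keeps only the shared keys by default; the all, all.x and all.y arguments change that. A small sketch with made-up stand-in tables (`profiles` and `mapping` are hypothetical):

```r
profiles <- data.frame(
  taxonomy = c("Otu00001", "Otu00002", "Otu00003"),
  Sample1 = c(21, 50, 32)
)
mapping <- data.frame(
  taxonomy = c("Otu00001", "Otu00002", "Otu00004"),
  Phylum = c("Bacteroidetes", "Proteobacteria", "Bacteroidetes")
)

# Default: only OTUs present in both tables
inner <- merge(profiles, mapping, by = "taxonomy")

# Keep every row of the first table; unmatched Phylum values become NA
left <- merge(profiles, mapping, by = "taxonomy", all.x = TRUE)

# Keep every row of both tables
full <- merge(profiles, mapping, by = "taxonomy", all = TRUE)

nrow(inner)
nrow(left)
nrow(full)
```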

The dplyr package

The dplyr package provides a number of very useful functions for manipulating data frames in a way that will reduce repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (%>%) to combine them.

  • select()
  • filter()
  • group_by()
  • summarize()
  • mutate()

If you have not installed this package earlier, please do so:

install.packages('dplyr')

Now let’s load the package

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Using select()

If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will keep only the variables you select.

table_kingdom_phylum <- select(merged_table,taxonomy,Sample1,Kingdom, Phylum)

Using pipes

table_select <- merged_table %>% 
                          select(taxonomy,Sample1,Kingdom, Phylum)

Using filter()

filter() keeps only the rows that satisfy a condition, here the OTUs whose Kingdom is Bacteria:

table_select <- merged_table %>% 
                          filter(Kingdom=='Bacteria')
table_select
##   taxonomy Sample1 Sample2 Sample3  Kingdom         Phylum
## 1 Otu00001      21      31      22 Bacteria  Bacteroidetes
## 2 Otu00002      50      40     100 Bacteria Proteobacteria
## 3 Otu00003      32      10      10 Bacteria  Bacteroidetes
## 4 Otu00004      20      20      55 Bacteria  Bacteroidetes
## 5 Otu00005      10      46      65 Bacteria Proteobacteria
## 6 Otu00006      66      89      93 Bacteria  Bacteroidetes
## 7 Otu00007     100      23      56 Bacteria  Bacteroidetes
##                   Class             Order            Family          Genus
## 1           Bacteroidia     Bacteroidales    Prevotellaceae Prevotellaceae
## 2 Epsilonproteobacteria Campylobacterales Helicobacteraceae   Helicobacter
## 3           Bacteroidia     Bacteroidales    Prevotellaceae Alloprevotella
## 4           Bacteroidia     Bacteroidales    Bacteroidaceae    Bacteroides
## 5 Epsilonproteobacteria Campylobacterales Helicobacteraceae   Helicobacter
## 6           Bacteroidia     Bacteroidales     Bacteroidales  Bacteroidales
## 7           Bacteroidia     Bacteroidales    Prevotellaceae Prevotellaceae
##   Species
## 1      NA
## 2      NA
## 3      NA
## 4      NA
## 5      NA
## 6      NA
## 7      NA

Using select() and filter()

We can combine filter() and select() in a single pipeline:

table_select_filter <- merged_table %>% 
                          filter(Kingdom=='Bacteria') %>%
                          select(taxonomy, Sample1)

Using summarize()

The above was a bit on the uneventful side, but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the Phylum-specific data frames. That is to say, using the group_by() function, we split our original data frame into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

otu_groupby_phylum <- merged_table %>%
    group_by(Phylum) %>%
    summarize(mean_sample1=mean(Sample1))
otu_groupby_phylum
## # A tibble: 2 x 2
##   Phylum         mean_sample1
##   <fct>                 <dbl>
## 1 Bacteroidetes          47.8
## 2 Proteobacteria         30
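mutate() appears in the list of functions above but is not demonstrated in this section. As a rough sketch of how it might be used, the example below adds a new column computed from existing ones. It assumes dplyr is installed; `profiles` and the `total` column are hypothetical names for this illustration.

```r
library(dplyr)

profiles <- data.frame(
  taxonomy = c("Otu00001", "Otu00002", "Otu00003"),
  Sample1 = c(21, 50, 32),
  Sample2 = c(31, 40, 10)
)

# mutate() adds a column while keeping the existing ones
profiles_total <- profiles %>%
  mutate(total = Sample1 + Sample2)

profiles_total
```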

ggplot

Install ggplot2

install.packages("ggplot2")

Plot a simple figure

library(ggplot2)
ggplot(data = merged_table, aes(x = taxonomy, y = Sample3, color = Family)) + geom_point()
