Condensed R: 240-Minute R Tutorial¶
Goal¶
This tutorial is designed for learners with little to no programming or R experience and prepares them to learn more about R on their own. We’ll introduce some of the most useful features of R for the analysis of biological data.
Note
This tutorial was prepared for presentation with an experienced R instructor. Self-guided learners should also find this tutorial useful.
What is today’s goal?¶
In the course of this tutorial we will show you …
- How tabular data (spreadsheets) can be imported into R
- How to install and use almost any R function (and read its related documentation)
- How data can be cleaned and prepared for analysis
- How you can find help and learn more about R
What this tutorial will not teach¶
The single most important thing this tutorial will not show is the hours of frustration, Googling, and copying-and-pasting that go along with using R (and just about any other programming language).
If (and especially if) this is your first time programming and/or using R, it’s important to know that programming is a learned skill. You will get better the more you do it, so don’t expect that it will just “come naturally.” That said, you can get a lot of use out of R without having to use it every day/week/month. With some basic skills, you will be able to do many important bioinformatics-related tasks, hopefully a lot faster and more reproducibly than you are doing now.
Is this tutorial right for you?¶
There are some useful tips and tricks for everyone in this tutorial, but it was designed for people with little to no knowledge of programming and/or R. After years of teaching biologists, here are some typical statements from learners at this level:
- “I used R once/took a class once several years ago”
- “I’m afraid of R”
- “I use R, but there are a lot of things I do and don’t understand why I am doing them”
If this sounds like you, this tutorial was written just for you.
Fix or improve this documentation
- Search for an answer: CyVerse Learning Center
- Ask us for help: click the icon on the lower right-hand side of the page
- Report an issue or submit a change: Github Repo Link
- Send feedback: Tutorials@CyVerse.org
R and RStudio basics¶
Learning objectives
- “Create an RStudio project, and know the benefits of working within a project”
- “Be able to customize the RStudio layout”
- “Be able to locate and change the current working directory with getwd() and setwd()”
- “Compose an R script file containing comments and commands”
Create an RStudio project¶
One of the first benefits we will take advantage of in RStudio is something called an RStudio Project. An RStudio project allows you to more easily:
- Save data, files, variables, packages, etc. related to a specific analysis project
- Restart work where you left off
- Collaborate, especially if you are using version control such as git
To create a project, go to the File menu, and click New Project….
In the window that opens select New Directory, then New Project. For “Directory name:” enter r_learning. For “Create project as subdirectory of”, you may leave the default, which is your home directory “~”.
Finally, click Create Project. In the “Files” tab of your output pane (more about the RStudio layout in a moment), you should see an RStudio project file, r_learning.Rproj. All RStudio projects end with the “.Rproj” file extension.
Creating your first R script¶
Now that we are ready to start exploring R, we will want to keep a record of the commands we are using. To do this we can create an R script:
Click the File menu and select New File and then R Script. Before we go any further, save your script by clicking the save/disk icon that is in the bar above the first line in the script editor, or click the File menu and select Save. In the “Save File” window that opens, name your file “r_basics”. The new script r_basics.R should appear under “Files” in the output pane. By convention, R scripts end with the file extension .R.
Tip
A script is simply a file (usually a text file) that contains a set of instructions (code) to be run by the computer. We will be writing several lines of code and running these lines of code line-by-line as we learn. Once you have finished with a script, all the lines of code can be run (from top to bottom) automatically.
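As a rough illustration, the first lines of r_basics.R could look like the sketch below. The contents are an assumption made for this example (the tutorial does not prescribe them); lines starting with # are comments, which R ignores when the script runs. The path in the commented-out setwd() line is hypothetical.

```r
# r_basics.R: a first practice script (contents are illustrative only)

# Ask R which folder it is currently working in
current_dir <- getwd()

# Show the working directory in the console
print(current_dir)

# setwd() would change the working directory; the path below is hypothetical,
# so the line is left commented out
# setwd("~/r_learning")
```

Running the script line by line shows each command and its output in the Console pane.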
Overview and customization of the RStudio layout¶
Here are the major windows (or panes) of the RStudio environment:
- Source: This pane is where you will write/view R scripts. Some outputs (such as if you view a dataset using View()) will appear as a tab here.
- Console/Terminal: This is actually where you see the execution of commands. This is the same display you would see if you were using R at the command line without RStudio. You can work interactively (i.e. enter R commands here), but for the most part we will run a script (or lines in a script) in the source pane and watch their execution and output here. The “Terminal” tab gives you access to the BASH terminal (the Linux operating system, unrelated to R).
- Environment/History: Here, RStudio will show you what datasets and objects (variables) you have created and which are defined in memory. You can also see some properties of objects/datasets such as their type and dimensions. The “History” tab contains a history of the R commands you’ve executed in R.
- Files/Plots/Packages/Help: This multipurpose pane will show you the contents of directories on your computer. You can also use the “Files” tab to navigate and set the working directory. The “Plots” tab will show the output of any plots generated. In “Packages” you will see what packages are actively loaded, or you can attach installed packages. “Help” will display help files for R functions and packages.
Tip
Uploads and downloads in the cloud
In the “Files” tab you can select a file and download it from your cloud instance (click the “more” button) to your local computer. Uploads are also possible.
All of the panes in RStudio have configuration options. For example, you can minimize/maximize a pane, or by moving your mouse in the space between panes you can resize as needed. The most important customization options for pane layout are in the View menu. Other options such as font sizes, colors/themes, and more are in the Tools menu under Global Options.
Note
You are working with R
Although we won’t be working with R at the terminal, there are lots of reasons to. For example, once you have written an R script, you can run it at any Linux, Mac, or Windows terminal without the need to start up RStudio. We don’t want you to get confused: RStudio runs R, but R is not RStudio.
Understanding basic R functions¶
Learning objectives
- “Understand what an R function is”
- “Understand how to modify a function by altering its parameters”
- “Locate help for an R function using ?, ??, and args()”
Using functions in R, without needing to master them¶
A function in R (or any computing language) is a short program that takes some input and returns some output. Functions may seem like an advanced topic (and they are), but you have already used at least one function in R. getwd() is a function! The next sections will help you understand what is happening in any R script.
Question
Exercise: What do these functions do?
Try the following functions by writing them in your script. See if you can guess what they do, and make sure to add comments to your script about your assumed purpose.
- dir()
- sessionInfo()
- date()
- Sys.time()
Answer
- dir(): Lists files in the working directory
- sessionInfo(): Gives the version of R and additional information, including attached packages
- date(): Gives the current date and time as a character string
- Sys.time(): Gives the current date and time as a date-time object
Notice: Commands are case sensitive!
You have hopefully noticed a pattern - an R function has three key properties:
- Functions have a name (e.g. dir, getwd); note that functions are case sensitive!
- Following the name, functions have a pair of ()
- Inside the parentheses, a function may take 0 or more arguments
An argument may be a specific input for your function and/or may modify the function’s behavior. For example the function round() will round a number with a decimal:
# This will round a number to the nearest integer
round(3.14)
Getting help with function arguments¶
What if you wanted to round to a specific number of decimal places? round() can do this, but you may first need to read the help to find out how. To see the help page for a function, enter a ? in front of the function name:
?round()
The “Help” tab will show you information (often, too much information). You will slowly learn how to read and make sense of help files. Checking the “Usage” or “Examples” headings is often a good place to look first. If you look under “Arguments,” you will also see what arguments you can pass to this function to modify its behavior. You can also see a function’s arguments using the args() function:
args(round)
round() takes two arguments, x, which is the number to be rounded, and a digits argument. The = sign indicates that a default (in this case 0) is already set. Since x is not set, round() requires we provide it, in contrast to digits where R will use the default value 0 unless you explicitly provide a different value. We can explicitly set the digits parameter when we call the function:
round(3.14159, digits = 2)
R also accepts what we call “positional arguments”: if you pass a function arguments separated by commas without naming them, R assumes that they are in the order you saw when we used args(). In the case below, that means that x is 3.14159 and digits is 2.
round(3.14159, 2)
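A small check, offered as a sketch, that the named and positional forms of the call give the same answer, and that leaving out digits falls back to the default of 0. The object names are made up for this example.

```r
# Named and positional calls to round() give the same result
by_name <- round(3.14159, digits = 2)
by_position <- round(3.14159, 2)
same_result <- identical(by_name, by_position)
same_result

# Leaving out 'digits' uses the default value of 0
default_digits <- round(3.14159)
default_digits
```

Naming arguments is a little more typing, but it makes a script easier to read later, especially for functions with many arguments.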
Finally, what if you use ? to get help for a function in a package that is not loaded in your R session, such as when you are reading a script that has dependencies?
?geom_point()
will return an error:
Error in .helpForCall(topicExpr, parent.frame()) :
no methods for ‘geom_point’ and no documentation for it as a function
Use two question marks (i.e. ??geom_point()) and R will return results from a search of the documentation for packages you have installed on your computer in the “Help” tab. Finally, if you think there should be a function, for example a statistical test, but you aren’t sure what it is called in R, or what functions may be available, use the help.search() function.
Question
Searching for R functions
Use help.search() to find R functions for the following statistical functions. Remember to put your search query in quotes inside the function’s parentheses.
- Chi-Squared test
- Student-t test
- mixed linear model
Answer
While your search results may return several tests, we list a few you might find:
- Chi-Squared test: stats::Chisquare
- Student-t test: stats::TDist
- mixed linear model: stats::lm.glm
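The sketch below shows what such a search can look like. The matches returned by help.search() depend on which packages are installed, so no output is shown. Note that help topics such as stats::Chisquare and stats::TDist describe probability distributions; the functions that actually run the tests in base R's stats package are chisq.test() and t.test().

```r
# Search the documentation of installed packages; the query goes in quotes
chi_results <- help.search("chi-squared test")

# The result is a search object; printing it lists the matching help pages
class(chi_results)

# The test functions themselves ship with the 'stats' package
has_chisq_test <- is.function(stats::chisq.test)
has_t_test <- is.function(stats::t.test)
```

Once you know a function's name, ?chisq.test or ?t.test will take you straight to its help page.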
We will discuss more on where to look for the libraries and packages that contain functions you want to use. For now, be aware that two important ones are CRAN - the main repository for R, and Bioconductor - a popular repository for bioinformatics-related R packages.
RStudio contextual help¶
Here is one last bonus we will mention about RStudio. It’s difficult to remember all of the arguments and definitions associated with a given function. When you start typing the name of a function and hit the Tab key, RStudio will display matching functions and associated help.
Once you type a function, hitting Tab inside the parentheses will show you the function’s arguments and provide additional help for each of these arguments.
Fundamental objects in R¶
Learning objectives
- “Be able to create the most common R objects including vectors”
- “Understand that vectors have modes, which correspond to the type of data they contain”
- “Be able to use arithmetic operators on R objects”
- “Be able to retrieve (subset), name, or replace, values from a vector”
- “Be able to use logical operators in a subsetting operation”
Creating objects in R¶
What might be called a variable in many languages is called an object in R.
To create an object you need:
- a name (e.g. ‘a’)
- a value (e.g. ‘1’)
- the assignment operator (‘<-‘)
In your script, “r_basics.R”, using the R assignment operator ‘<-‘, assign ‘1’ to the object ‘a’ as shown. Remember to leave a comment in the line above (using the ‘#’) to explain what you are doing:
# this line creates the object 'a' and assigns it the value '1'
a <- 1
Next, run this line of code in your script. You can run a line of code by hitting the Run button that is just above the first line of your script in the header of the Source pane, or you can use the appropriate shortcut:
- Windows execution shortcut: Ctrl + Enter
- Mac execution shortcut: Cmd(⌘) + Enter
To run multiple lines of code, you can highlight all the lines you wish to run and then hit Run or use the shortcut key combo listed above.
In the RStudio ‘Console’ you should see:
a <- 1
The ‘Console’ will display lines of code run from a script and any outputs or status/warning/error messages (usually in red).
In the ‘Environment’ window you will also get a table:
Values | |
---|---|
a | 1 |
The ‘Environment’ window allows you to keep track of the objects you have created in R.
Question
Create some objects in R
Create the following objects; give each object an appropriate name (your best guess at what name to use is fine):
- Create an object that has the value of number of pairs of human chromosomes
- Create an object that has a value of your favorite gene name
- Create an object that has this URL as its value: “ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/”
- Create an object that has the value of the number of chromosomes in a diploid human cell
Answer
Here are some possible answers to the challenge:
human_chr_number <- 23
gene_name <- 'pten'
ensemble_url <- 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/'
human_diploid_chr_num <- 2 * human_chr_number
Naming objects in R¶
Here are some important details about naming objects in R.
- Avoid spaces and special characters: Object names cannot contain spaces or the minus sign (-). You can use ‘_’ to make names more readable. You should avoid using special characters in your object name (e.g. ! @ # . , etc.). Also, object names cannot begin with a number.
- Use short, easy-to-understand names: You should avoid naming your objects using single letters (e.g. ‘n’, ‘p’, etc.). This is mostly to encourage you to use names that would make sense to anyone reading your code (a colleague, or even yourself a year from now). Also, avoiding excessively long names will make your code more readable.
- Avoid commonly used names: Several names already have a definition in the R language (e.g. ‘mean’, ‘min’, ‘max’). One clue that a name already has a meaning: if you start typing it in RStudio and it gets a colored highlight, or RStudio suggests an autocompletion, the name is probably already in use.
- Use the recommended assignment operator: In R, we use ‘<-’ as the preferred assignment operator. ‘=’ works too, but is most commonly used in passing arguments to functions (more on functions later). There is a keyboard shortcut for the R assignment operator:
  - Windows shortcut: Alt + -
  - Mac shortcut: Option + -
There are a few more suggestions about naming and style you may want to learn more about as you write more R code. There are several “style guides” that have advice, and one to start with is the tidyverse R style guide.
Tip
Pay attention to warnings in the script console
If you enter a line of code in your script that contains an error, RStudio may give you an error message and underline this mistake. Sometimes these messages are easy to understand, but often the messages may need some figuring out. Paying attention to these warnings will help you avoid mistakes. In the example below, our object name has a space, which is not allowed in R. The error message does not say this directly, but R is “not sure” about how to assign the name to “human_ chr_number” when the object name we want is “human_chr_number”.
Reassigning object names or deleting objects¶
Once an object has a value, you can change that value by overwriting it. R will not give you a warning or error if you overwrite an object, which may or may not be a good thing depending on how you look at it.
# gene_name has the value 'pten' or whatever value you used in the challenge.
# We will now assign the new value 'tp53'
gene_name <- 'tp53'
You can also remove an object from R’s memory entirely. The rm() function will delete the object.
# delete the object 'gene_name'
rm(gene_name)
If you run a line of code that has only an object name, R will normally display the contents of that object. In this case, we are told the object no longer exists.
Error: object 'gene_name' not found
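To check whether an object is defined without triggering an error, base R provides exists(). The sketch below repeats the overwrite-and-remove steps with that check added; the helper names existed_before and exists_after are made up for this example.

```r
# Create the object, then silently overwrite it
gene_name <- "pten"
gene_name <- "tp53"
value_after_overwrite <- gene_name

# exists() takes the object's name in quotes and returns TRUE or FALSE
existed_before <- exists("gene_name")

# Remove the object, then check again
rm(gene_name)
exists_after <- exists("gene_name")
exists_after
```

Because R overwrites without warning, a quick look at the “Environment” tab (or a call to exists()) is a cheap way to confirm what is currently defined.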
Understanding object data types (modes)¶
In R, every object has two properties:
- Length: How many distinct values are held in that object
- Mode: What is the classification (type) of that object.
We will get to the “length” property later in the lesson. The “mode” property corresponds to the type of data an object represents. The most common modes you will encounter in R are:
Mode (abbreviation) | Type of data |
---|---|
Numeric (num) | Numbers such as floating point/decimals (1.0, 0.5, 3.14). There are also more specific numeric types (dbl - Double, int - Integer). These differences are not relevant for most beginners and pertain to how these values are stored in memory |
Character (chr) | A sequence of letters/numbers in single ('') or double ("") quotes |
Logical | Boolean values - TRUE or FALSE |
There are a few other modes (i.e. “complex”, “raw” etc.) but these are the three we will work with in this lesson.
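As a quick illustration of the table above, mode() can be called directly on values. The sketch also uses typeof(), a related base R function that reports the more specific storage type (double vs. integer); the values and object names are arbitrary examples.

```r
# mode() reports the broad category of a value
mode_decimal <- mode(3.14)      # "numeric"
mode_integer <- mode(2L)        # "numeric" (the L suffix makes an integer)
mode_text <- mode("chr02")      # "character"
mode_flag <- mode(TRUE)         # "logical"

# typeof() distinguishes the more specific numeric types
type_decimal <- typeof(3.14)    # "double"
type_integer <- typeof(2L)      # "integer"
```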
Data types are familiar in many programming languages, but also in natural language where we refer to them as the parts of speech, e.g. nouns, verbs, adverbs, etc. Once you know if a word - perhaps an unfamiliar one - is a noun, you can probably guess you can count it and make it plural if there is more than one (e.g. 1 Tuatara, or 2 Tuataras). If something is an adjective, you can usually change it into an adverb by adding “-ly” (e.g. jejune vs. jejunely). Depending on the context, you may need to decide if a word is in one category or another (e.g. “cut” may be a noun when it’s on your finger, or a verb when you are preparing vegetables). These concepts have important analogies when working with R objects.
Question
Exercise: Create objects and check their modes
Create the following objects in R, then use the mode() function to verify their modes. Try to guess what the mode will be before you look at the solution.
- chromosome_name <- ‘chr02’
- od_600_value <- 0.47
- chr_position <- ‘1001701’
- spock <- TRUE
- pilot <- Earhart
Answer
chromosome_name <- 'chr02'
od_600_value <- 0.47
chr_position <- '1001701'
spock <- TRUE
pilot <- Earhart
mode(chromosome_name)
mode(od_600_value)
mode(chr_position)
mode(spock)
mode(pilot)
Notice from the solution that even if a series of numbers is given as a value, R will consider it to be in the “character” mode if it is enclosed in single or double quotes. Also, notice that you cannot take a string of alphanumeric characters (e.g. Earhart) and assign it as a value for an object without quotes. In this case, R looks for an object named Earhart, but since there is no such object, no assignment can be made. If Earhart did exist, then the mode of pilot would be whatever the mode of Earhart was originally. If we want to create an object called pilot that holds the name “Earhart”, we need to enclose Earhart in quotation marks.
pilot <- "Earhart"
mode(pilot)
Mathematical and functional operations on objects¶
Once an object exists (which by definition also means it has a mode), R can appropriately manipulate that object. For example, objects of the numeric modes can be added, multiplied, divided, etc. R provides several mathematical (arithmetic) operators including:
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
a%%b | modulus (returns the remainder after division) |
These can be used with literal numbers:
(1 + (5 ** 0.5))/2
and importantly, can be used on any object that evaluates to (i.e. is interpreted by R as) a numeric object:
human_chr_number <- 23
# multiply the object 'human_chr_number' by 2
human_chr_number * 2
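The remaining operators from the table work the same way. A brief sketch using arbitrary numbers (the object names are made up for this example):

```r
human_chr_number <- 23

# multiplication: chromosomes in a diploid cell
diploid_count <- human_chr_number * 2      # 46

# modulus: remainder after dividing by 2 (a quick odd/even check)
remainder <- human_chr_number %% 2         # 1

# both exponentiation operators give the same answer
power_caret <- 2 ^ 3                       # 8
power_stars <- 2 ** 3                      # 8

# division returns a decimal when the result is not whole
half <- 7 / 2                              # 3.5
```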
Question
Compute the golden ratio
One approximation of the golden ratio (φ) can be found by taking the sum of 1 and the square root of 5, and dividing by 2 as in the example above. Compute the golden ratio to 3 digits of precision using the sqrt() and round() functions. Hint: remember the round() function can take 2 arguments.
Answer
round((1 + sqrt(5))/2, digits = 3)
Notice that you can place one function inside of another.
Vectors¶
Vectors are probably the most commonly used object type in R. A vector is a collection of values that are all of the same type (numbers, characters, etc.). One of the most common ways to create a vector is to use the c() function - the “concatenate” or “combine” function. Inside the function you may enter one or more values; for multiple values, separate each value with a comma:
# Create the SNP gene name vector
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")
Vectors always have a mode and a length. You can check these with the mode() and length() functions respectively. Another useful function that gives both of these pieces of information is the str() (structure) function.
# Check the mode, length, and structure of 'snp_genes'
mode(snp_genes)
length(snp_genes)
str(snp_genes)
Vectors are quite important in R. Another data type that we will work with later in this lesson, data frames, are collections of vectors. What we learn here about vectors will pay off even more when we start working with data frames.
Creating and subsetting vectors¶
Let’s create a few more vectors to play around with:
# Some interesting human SNPs
# while accuracy is important, typos in the data won't hurt you here
snps <- c('rs53576', 'rs1815739', 'rs6152', 'rs1799971')
snp_chromosomes <- c('3', '11', 'X', '6')
snp_positions <- c(8762685, 66560624, 67545785, 154039662)
Once we have vectors, one thing we may want to do is specifically retrieve one or more values from our vector. To do so, we use bracket notation. We type the name of the vector followed by square brackets, and inside the brackets we place the index (e.g. a number) of the value we want:
# get the 3rd value in the snp_genes vector
snp_genes[3]
In R, every item in your vector is indexed, starting from the first item (1) through to the final number of items in your vector. You can also retrieve a range of values:
# get the 1st through 3rd value in the snp_genes vector
snp_genes[1:3]
If you want to retrieve several (but not necessarily sequential) items from a vector, you pass a vector of indices; a vector that has the numbered positions you wish to retrieve.
# get the 1st, 3rd, and 4th value in the snp_genes vector
snp_genes[c(1, 3, 4)]
There are additional (and perhaps less commonly used) ways of subsetting a vector (see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). Also, several of these subsetting expressions can be combined:
# get the 1st through the 3rd value, and 4th value in the snp_genes vector
# yes, this is a little silly in a vector of only 4 values.
snp_genes[c(1:3,4)]
Adding to, removing, or replacing values in existing vectors¶
Once you have an existing vector, you may want to add a new item to it. To do so, you can use the c() function again to add your new value:
# add the genes 'CYP1A1' and 'APOA5' to our vector of snp genes
# this overwrites our existing vector
snp_genes <- c(snp_genes, "CYP1A1", "APOA5")
We can verify that “snp_genes” contains the new gene entries:
snp_genes
Using a negative index will return a version of a vector with that index’s value removed:
snp_genes[-6]
We can remove that value from our vector by overwriting it with this expression:
snp_genes <- snp_genes[-6]
snp_genes
We can also explicitly replace or add a value at a given index using bracket notation:
snp_genes[7]<- "APOA5"
snp_genes
Notice in the operation above that R inserts an NA value to extend our vector so that the gene “APOA5” is at index 7. This may be a good or not-so-good thing depending on how you use this.
Question
Examining and subsetting vectors
Answer the following questions to test your knowledge of vectors.
Which of the following are true of vectors in R?
- All vectors have a mode or a length
- All vectors have a mode and a length
- Vectors may have different lengths
- Items within a vector may be of different modes
- You can use c() to add one or more items to an existing vector
- You can use c() to add a vector to an existing vector
Answer
- False - Vectors have both of these properties
- True
- True
- False - Vectors have only one mode (e.g. numeric, character); all items in a vector must be of this mode.
- True
- True
Logical Subsetting¶
There is one last set of cool subsetting capabilities we want to introduce. It is possible within R to retrieve items in a vector based on a logical evaluation or numerical comparison. For example, let’s say we wanted to get all of the SNPs in our vector of SNP positions that were greater than 100,000,000. We could index using the ‘>’ (greater than) logical operator:
snp_positions[snp_positions > 100000000]
In the square brackets you place the name of the vector followed by the comparison operator and (in this case) a numeric value. Some of the most common logical operators you will use in R are:
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
!x | not x |
a \| b | a or b |
a & b | a and b |
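Several of these operators can be combined inside the brackets. The sketch below recreates two of the vectors from earlier so it can run on its own; the cutoffs are arbitrary values chosen for illustration, and the sketch assumes the two vectors line up position by position.

```r
snp_positions <- c(8762685, 66560624, 67545785, 154039662)
snp_chromosomes <- c('3', '11', 'X', '6')

# '&' (and): positions between 10 million and 100 million
mid_range <- snp_positions[snp_positions > 10000000 & snp_positions < 100000000]

# '|' (or): chromosomes that are either '3' or 'X'
three_or_x <- snp_chromosomes[snp_chromosomes == '3' | snp_chromosomes == 'X']

# '!=' (not equal): positions of SNPs that are not on chromosome 6
not_on_six <- snp_positions[snp_chromosomes != '6']
```

The last line uses a comparison on one vector to subset a different vector, which works because the two vectors have the same length and order.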
The magic of programming¶
The reason why the expression snp_positions[snp_positions > 100000000] works can be better understood if you examine what the expression “snp_positions > 100000000” evaluates to:
snp_positions > 100000000
The output above is a logical vector, the 4th element of which is TRUE. When you pass a logical vector as an index, R will return the values at the positions where the index is TRUE:
snp_positions[c(FALSE, FALSE, FALSE, TRUE)]
If you have never coded before, this type of situation starts to expose the “magic” of programming. We mentioned before that in the bracket notation you take your named vector followed by brackets which contain an index: named_vector[index]. The “magic” is that the index only needs to evaluate to something R can use to pick out positions. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long as R can evaluate it, we will get a result. That the index in snp_positions[snp_positions > 100000000] corresponds to a position number can be seen in the following situation: what if you wanted to know which index (1, 2, 3, or 4) in our vector of SNP positions was the one that was greater than 100,000,000?
We can use the which() function to return the indices of any item that evaluates as TRUE in our comparison:
which(snp_positions > 100000000)
Why this is important
Often in programming we will not know what inputs and values will be used when our code is executed. Rather than put in a pre-determined value (e.g 100000000) we can use an object that can take on whatever value we need. So for example:
snp_marker_cutoff <- 100000000
snp_positions[snp_positions > snp_marker_cutoff]
Ultimately, it’s putting together flexible, reusable code like this that gets at the “magic” of programming!
A few final vector tricks¶
Finally, there are a few other common retrieve or replace operations you may want to know about. First, you can check to see if any of the values of your vector are missing (i.e. are NA). Missing data will get a more detailed treatment later, but the is.na() function will return a logical vector, with TRUE for any NA value:
# current value of 'snp_genes':
# chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"
is.na(snp_genes)
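Because is.na() returns a logical vector, it can be combined with the logical subsetting and which() ideas from above. A minimal sketch, recreating the vector so it runs on its own (the object names missing_flags, missing_index, missing_count and known_genes are made up for this example):

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")

# TRUE wherever a value is missing
missing_flags <- is.na(snp_genes)

# which position is missing, and how many values are missing in total
missing_index <- which(missing_flags)
missing_count <- sum(missing_flags)

# '!' flips TRUE and FALSE, so this keeps only the non-missing values
known_genes <- snp_genes[!missing_flags]
known_genes
```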
Sometimes, you may wish to find out if a specific value (or several values) is present in a vector. You can do this using the operator %in%, which will return TRUE for any value in your collection that is in the vector you are searching:
# current value of 'snp_genes':
# chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"
# test to see if "ACTN3" or "APOA5" is in the snp_genes vector
# if you are looking for more than one value, you must pass this as a vector
c("ACTN3","APOA5") %in% snp_genes
Question
Review Exercise 1
- What data types/modes are the following vectors?
- snps
- snp_chromosomes
- snp_positions
Answer
typeof(snps)
typeof(snp_chromosomes)
typeof(snp_positions)
Question
Review Exercise 2
- Add the following values to the specified vectors:
- To the snps vector add: ‘rs662799’
- To the snp_chromosomes vector add: 11
- To the snp_positions vector add: 116792991
Answer
snps <- c(snps, 'rs662799')
snps
snp_chromosomes <- c(snp_chromosomes, "11") # did you use quotes?
snp_chromosomes
snp_positions <- c(snp_positions, 116792991)
snp_positions
Question
Review Exercise 3
Make the following change to the snp_genes vector:
Hint: Your vector should look like this in ‘Environment’:
chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"
If not, recreate the vector by running this expression:
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")
- Create a new version of snp_genes that does not contain CYP1A1 and then
- Add 2 NA values to the end of snp_genes
Answer
snp_genes <- snp_genes[-5]
snp_genes <- c(snp_genes, NA, NA)
snp_genes
Question
Review Exercise 4
Using indexing, create a new vector named combined that contains:
- The 1st value in snp_genes
- The 1st value in snps
- The 1st value in snp_chromosomes
- The 1st value in snp_positions
Answer
combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1])
combined
Question
Review Exercise 5
What type of data is combined?
Answer
typeof(combined)
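The result is a character vector even though one of the inputs was a number: a vector can hold only one mode, so c() converts (coerces) every value to a mode that can represent all of them. A small sketch using the first values of the four vectors written out literally:

```r
# three character values and one number go in...
combined <- c("OXTR", "rs53576", "3", 8762685)

# ...and everything comes out as character
combined_type <- typeof(combined)
combined_type

# the position is now text, so it must be converted back before doing math
position_text <- combined[4]
position_number <- as.numeric(position_text)
```

This kind of silent conversion is worth watching for when numbers and text are mixed in the same vector or column.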
Importing data and working with data frames¶
Learning objectives
- “Explain the basic principle of tidy datasets”
- “Be able to load a tabular dataset using base R functions”
- “Be able to determine the structure of a data frame including its dimensions and the datatypes of variables”
- “Be able to subset/retrieve values from a data frame”
- “Understand how R may coerce data into different modes”
- “Be able to change the mode of an object”
- “Understand that R uses factors to store and manipulate categorical data”
- “Be able to manipulate a factor, including subsetting and reordering”
- “Be able to apply an arithmetic function to a data frame”
- “Be able to coerce the class of an object (including variables in a data frame)”
- “Be able to import data from Excel”
- “Be able to save a data frame as a delimited file”
Working with spreadsheets (tabular data)¶
A substantial amount of the data we work with in genomics will be tabular data; this is data arranged in rows and columns, also known as spreadsheets. We could write a whole lesson on how to work with spreadsheets effectively (See this Data Carpentry Lesson). For our purposes, we want to remind you of a few principles before we work with our first set of example data:
1) Keep raw data separate from analyzed data
This is principle number one because if you can’t tell which files are the original raw data, you risk making some serious mistakes (e.g. drawing conclusions from data which have been manipulated in some unknown way).
2) Keep spreadsheet data Tidy
The simplest principle of Tidy data is that we have one row in our spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it’s very easily violated. Most data scientists agree that a significant amount of their time is spent tidying data for analysis.
3) Trust but verify
Finally, while you don’t need to be paranoid about data, you should have a plan for how you will prepare it for analysis. This a focus of this lesson. You probably already have a lot of intuition, expectations, assumptions about your data - the range of values you expect, how many values should have been recorded, etc. Of course, as the data get larger our human ability to keep track will start to fail (and yes, it can fail for small data sets too). R will help you to examine your data so that you can have greater confidence in your analysis, and its reproducibility.
Tip
Keeping you raw data separate
When you work with data in R, you are not changing the original file you loaded that data from. This is different than (for example) working with a spreadsheet program where changing the value of the cell leaves you one “save”-click away from overwriting the original file. You have to purposely use a writing function (e.g. write.csv()) to save data loaded into R. In that case, be sure to save the manipulated data into a new file. More on this later in the lesson.
Importing tabular data into R¶
There are several ways to import data into R. For our purpose here, we will focus on using the tools every R installation comes with (so called “base” R) to import a comma-delimited file containing the results of our variant calling workflow. We will need to load the sheet using a function called read.csv().
Question
Review the arguments of the `read.csv()` function
Before using the `read.csv()` function, use R’s help feature to answer the following questions.
Hint: Entering ‘?’ before the function name and then running that line will bring up the help documentation. Also, when reading this particular help be careful to pay attention to the ‘read.csv’ expression under the ‘Usage’ heading. Other answers will be in the ‘Arguments’ heading.
A) What is the default value of ‘header’ in the read.csv() function?
B) What argument would you have to change to read a file that was delimited by semicolons (;) rather than commas?
C) What argument would you have to change to read a file in which numbers used commas for decimal separation (i.e. 1,00)?
D) What argument would you have to change to read in only the first 10,000 rows of a very large file?
Answer
A) The read.csv() function has the argument ‘header’ set to TRUE by default. This means the function always assumes the first row is header information (i.e. column names).
B) The read.csv() function has the argument ‘sep’ set to ",". This means the function assumes commas are used as delimiters, as you would expect. Changing this parameter (e.g. sep=";") would make the function interpret semicolons as delimiters.
C) Although it is not listed in the read.csv() usage, read.csv() is a “version” of the function read.table() and accepts all its arguments. If you set dec="," you could change the decimal operator. We’d probably assume the delimiter is some other character.
D) You can set nrows to a numeric value (e.g. nrows=10000) to choose how many rows of a file you read in. This may be useful for very large files where not all the data is needed to test some data cleaning steps you are applying.
Hopefully, this exercise gets you thinking about using the provided help documentation in R. There are many arguments that exist, but which we won’t have time to cover. Look here to get familiar with functions you use frequently; you may be surprised at what you find they can do.
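To see the arguments from the exercise above in action without needing a file, here is a minimal sketch. The text in raw_text is made up for illustration (it is not the lesson's data), and the text argument is one of the read.table() arguments that read.csv() passes along.

```r
# Hypothetical semicolon-delimited text with decimal commas (not the lesson's file)
raw_text <- "sample;genome_size\nA;4,62\nB;4,63\nC;4,60"

# sep and dec are passed through to read.table(); nrows limits the rows read
first_two <- read.csv(text = raw_text, sep = ";", dec = ",", nrows = 2)

first_two        # only the first two data rows were read
str(first_two)   # genome_size was parsed as a number, not as text
```

Try removing dec = "," and running str() again to see how R stores the column when it cannot recognize the values as numbers.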
Now, let’s read in the file Ecoli_metadata.csv. Right-click to download this file to your local computer, then in the files pane, use the upload function to upload this file into your RStudio session.
Next, we need to import it into R and save it as a data frame object. We will use the read.csv function to do so.
## read in a CSV file and save it as 'metadata'
metadata <- read.csv("../r_learning/Ecoli_metadata.csv")
Alternatively, we can directly import it from a URL like this
metadata <- read.csv("https://de.cyverse.org/dl/d/48A1073E-88C9-42E6-B3D0-37D73810CC52/Ecoli_metadata.csv")
One of the first things you should notice is that in the Environment window, you have the metadata object, listed as 30 obs. (observations/rows) of 7 variables (columns). Double-clicking on the name of the object will open a view of the data in a new tab.
Summarizing and determining the structure of a data frame¶
A data frame is the standard way in R to store tabular data. A data frame could also be thought of as a collection of vectors, all of which have the same length. Using only two functions, we can learn a lot about our data frame, including some summary statistics as well as the “structure” of the data frame. Let’s examine what each of these functions can tell us:
## get summary statistics on a data frame
summary(metadata)
Our data frame has 7 variables, so we get 7 fields that summarize the data. The generation and genome_size variables are numerical data, so you get summary statistics on the min and max values for these columns, as well as the mean, median, and first and third quartiles. Other variables (e.g. clade) are treated as categorical data (which have special treatment in R - more on this in a bit). Note that from R 4.0.0 onward, read.csv() keeps text columns as character unless you pass stringsAsFactors = TRUE, so use that argument if you want to reproduce the factor behavior described below.
Before we operate on the data, we also need to know a little more about the data frame structure. To do that, we use the str() function:
## get the structure of a data frame
str(metadata)
OK, that’s a lot to unpack! Some things to notice:
- the object type data.frame is displayed in the first row along with its dimensions, in this case 30 observations (rows) and 7 variables (columns)
- Each variable (column) has a name (e.g. sample). This is followed by the object mode (e.g. factor, int, num, etc.). Notice that before each variable name there is a $ - this will be important later.
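If you want the individual pieces of information that str() reports, a few smaller functions return them one at a time. The sketch below uses a made-up three-row data frame called toy (an assumption for illustration, not the lesson's metadata).

```r
# Hypothetical miniature data frame standing in for the lesson's metadata
toy <- data.frame(
  sample = c("S1", "S2", "S3"),
  generation = c(0L, 2000L, 5000L),
  genome_size = c(4.62, 4.63, 4.60)
)

dim(toy)              # number of rows, then number of columns
names(toy)            # column names, the same ones str() lists after each $
sapply(toy, class)    # class of every column
toy$genome_size       # the $ operator pulls out a single column as a vector
```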
Introducing Factors¶
Factors are the final major data structure we will introduce in our R genomics lessons. Factors can be thought of as vectors which are specialized for categorical data. Given R’s specialization for statistics, this makes sense since categorical and continuous variables usually have different treatments. Sometimes you may want to have data treated as a factor, but in other cases, this may be undesirable.
Since some of the data in our data frame are factors, let’s see how factors work. First, we’ll extract one of the columns of our data frame to a new object, so that we don’t end up modifying the metadata object by mistake.
## extract the "sample" column to a new object
samples <- metadata$sample
Let’s look at the first few items in our factor using head():
head(samples)
What we get back are the items in our factor, and also something called “Levels”. Levels are the different categories contained in a factor. By default, R will organize the levels in a factor in alphabetical order, so the first level is the sample name that sorts first alphabetically, which is not necessarily the first item in the column (“REL606”).
Let’s look at the contents of a factor in a slightly different way using str():
str(samples)
For the sake of efficiency, R stores the content of a factor as a vector of integers, in which an integer is assigned to each of the possible levels. Recall that levels are assigned in alphabetical order. In this case, the first item in our “samples” object is “REL606”, which happens to be the 7th level of our factor, ordered alphabetically.
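The relationship between values, levels, and the underlying integers is easier to see on a tiny factor. The strain labels below are chosen only for illustration, and the last step sketches one way to reorder levels by calling factor() again with an explicit levels argument.

```r
# Hypothetical strain labels, chosen only for illustration
strains <- factor(c("REL606", "CZB152", "REL606", "ZDB16"))

levels(strains)       # "CZB152" "REL606" "ZDB16": sorted, not in order of appearance
as.integer(strains)   # 2 1 2 3: each value is stored as the index of its level

# Reorder the levels explicitly; the values themselves do not change
strains_reordered <- factor(strains, levels = c("REL606", "ZDB16", "CZB152"))
levels(strains_reordered)
as.integer(strains_reordered)   # 1 3 1 2
```

Reordering levels matters mostly for plots and statistical models, where the level order decides the display order or the reference category.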
Subsetting data frames¶
Next, we are going to talk about how you can get specific values from data frames, and where necessary, change the mode of a column of values.
The first thing to remember is that a data frame is two-dimensional (rows and columns). Therefore, to select a specific value we will once again use [] (bracket) notation, but we will specify more than one value (except in some cases where we are taking a range).
Question
Exercise: Subsetting a data frame
Try the following indices and functions and try to figure out what they return
- metadata[1,1]
- metadata[2,4]
- metadata[29,7]
- metadata[2, ]
- metadata[-1, ]
- metadata[1:4,1]
- metadata[1:10,c("generation","clade")]
- metadata[,c("sample")]
- head(metadata)
- tail(metadata)
- metadata$sample
- metadata[metadata$sample == "REL10979",]
The subsetting notation is very similar to what we learned for vectors. The key differences include:
- Typically provide two values separated by commas: data.frame[row, column]
- In cases where you are taking a continuous range of numbers use a colon between the numbers (start:stop, inclusive)
- For a non continuous set of numbers, pass a vector using c()
- Index using the name of a column(s) by passing them as vectors using c()
Finally, in all of the subsetting exercises above, we printed values to the screen. You can create a new data frame object by assigning them to a new object name:
# create a new data frame containing only observations from cit+ strains
cit_plus_strains <- metadata[metadata$cit == "plus",]
# check the dimension of the data frame
dim(cit_plus_strains)
# get a summary of the data frame
summary(cit_plus_strains)
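The subsetting rules listed above can be tried out on a small data frame where it is easy to check the answers by eye. The toy data frame and its values are assumptions made for this sketch.

```r
# Hypothetical four-row data frame for practicing the [row, column] notation
toy <- data.frame(
  sample = c("S1", "S2", "S3", "S4"),
  generation = c(0, 2000, 5000, 5000),
  cit = c("unknown", "minus", "plus", "plus")
)

toy[2, 3]                          # one value: row 2, column 3
toy[c(1, 4), ]                     # non-contiguous rows, all columns
toy[2:3, c("sample", "cit")]       # a range of rows, columns chosen by name
toy[toy$generation >= 5000, ]      # rows selected by a logical condition

cit_plus <- toy[toy$cit == "plus", ]   # assign the subset to keep it
nrow(cit_plus)                         # 2
```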
Coercing values in data frames¶
Tip
Coercion isn’t limited to data frames
While we are going to address coercion in the context of data frames, most of these methods apply to other data structures, such as vectors.
Sometimes, it is possible that R will misinterpret the type of data represented in a data frame, or store that data in a mode which prevents you from operating on the data the way you wish. For example, a long list of gene names isn’t usually thought of as a categorical variable, the way that your experimental condition (e.g. control, treatment) might be. More importantly, some R packages you use to analyze your data may expect characters as input, not factors. At other times (such as plotting or some statistical analyses) a factor may be more appropriate. Ultimately, you should know how to change the mode of an object.
First, it’s very important to recognize that coercion happens in R all the time. This can be a good thing when R gets it right, or a bad thing when the result is not what you expect. Consider:
snp_chromosomes <- c('3', '11', 'X', '6')
typeof(snp_chromosomes)
Although there are several numbers in our vector, they are all in quotes, so we have explicitly told R to consider them as characters. However, even if we removed the quotes from the numbers, R would coerce everything into a character:
snp_chromosomes_2 <- c(3, 11, 'X', 6)
typeof(snp_chromosomes_2)
snp_chromosomes_2[1]
We can use the as. functions to explicitly coerce values from one form into another. Consider the following vector of characters, which all happen to be valid numbers:
snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662")
typeof(snp_positions_2)
snp_positions_2[1]
Now we can coerce snp_positions_2 into a numeric type using as.numeric():
snp_positions_2 <- as.numeric(snp_positions_2)
typeof(snp_positions_2)
snp_positions_2[1]
Sometimes coercion is straightforward, but what would happen if we tried using as.numeric() on snp_chromosomes_2?
snp_chromosomes_2 <- as.numeric(snp_chromosomes_2)
If we check, we will see that an NA value (R’s default value for missing data) has been introduced.
snp_chromosomes_2
Trouble can really start when we try to coerce a factor. For example, when we try to coerce the sample column in our data frame into a numeric mode, look at the result:
as.numeric(metadata$sample)
Strangely, it works! Almost. Instead of giving an error message, R returns numeric values, which in this case are the integers assigned to the levels in this factor. This kind of behavior can lead to hard-to-find bugs, for example when we do have numbers in a factor, and we get numbers from a coercion. If we don’t look carefully, we may not notice a problem.
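The case where a factor really does contain numbers is worth seeing once. In this sketch the depth values are invented; the commonly used workaround is to convert the factor to character first and only then to numeric.

```r
# Hypothetical read depths that were accidentally stored as a factor
depth <- factor(c("10", "250", "10", "3"))

as.numeric(depth)                 # 1 2 1 3: the level codes, not the depths
as.numeric(as.character(depth))   # 10 250 10 3: the values we wanted
```

Both calls run without any warning, which is exactly why the first one is easy to miss.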
If you need to coerce an entire column you can overwrite it using an expression like this one:
# make the 'sample' column a character type column
metadata$sample <- as.character(metadata$sample)
# check the type of the column
typeof(metadata$sample)
stringsAsFactors = FALSE¶
Let’s summarize this section on coercion with a few take-home messages.
- When you explicitly coerce one data type into another (this is known as explicit coercion), be careful to check the result. Ideally, you should try to see if it’s possible to avoid steps in your analysis that force you to coerce.
- R will sometimes coerce without you asking for it. This is called (appropriately) implicit coercion. For example when we tried to create a vector with multiple data types, R chose one type through implicit coercion.
- Check the structure (str()) of your data frames before working with them!
Regarding the first bullet point, one way to avoid needless coercion when importing a data frame using any one of the read.table() functions such as read.csv() is to set the argument stringsAsFactors to FALSE. In R versions before 4.0.0 this argument defaults to TRUE; from R 4.0.0 onward the default is FALSE. Setting it to FALSE imports any non-numeric column as a character type. In the read.csv() documentation, you will also see that you can explicitly set the type of each column using the colClasses argument. Other R packages (such as the Tidyverse “readr”) don’t have this particular conversion issue, but many packages will still try to guess a data type.
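As a rough sketch of both options, the example below reads a small piece of made-up CSV text (the column names and values are assumptions) first with stringsAsFactors = FALSE and then with an explicit colClasses vector.

```r
# Hypothetical CSV content supplied as text instead of a file
csv_text <- "sample,clade,generation\nS1,A,0\nS2,B,2000"

# Keep text columns as character regardless of the R version's default
as_chars <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chars$clade)     # "character"

# Or declare every column type up front with colClasses
typed <- read.csv(
  text = csv_text,
  colClasses = c(sample = "character", clade = "factor", generation = "integer")
)
sapply(typed, class)
```

Declaring types up front takes a little more typing, but it means the import no longer depends on R's guesses or on which R version runs the script.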
Data frame bonus material: math, sorting, renaming¶
Here are a few operations that don’t need much explanation, but which are good to know.
There are lots of arithmetic functions you may want to apply to your data frame; covering those would be a course in itself. Here are some additional summary statistical functions.
You can use functions like mean(), min(), max() on an individual column. Let’s look at the generation column and find its largest value.
max(metadata$generation)
You can sort a data frame using the order() function:
metadata_sorted_by_genome_size <- metadata[order(metadata$genome_size), ]
head(metadata_sorted_by_genome_size$genome_size)
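It can help to look at what order() returns on its own before using it inside the brackets. The sketch below uses a made-up data frame, and it also shows one base R way to rename a column; the new name genome_size_mbp is purely hypothetical.

```r
# Hypothetical data frame for sorting and renaming
toy <- data.frame(
  sample = c("S1", "S2", "S3"),
  genome_size = c(4.62, 4.60, 4.63)
)

order(toy$genome_size)    # 2 1 3: row positions from smallest to largest

# decreasing = TRUE sorts from largest to smallest
largest_first <- toy[order(toy$genome_size, decreasing = TRUE), ]
largest_first$sample      # S3, S1, S2

# Renaming a column: assign into names() at the position of the old name
names(toy)[names(toy) == "genome_size"] <- "genome_size_mbp"
names(toy)
```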
Saving your data frame to a file¶
We can save data to a file. We will save our metadata_sorted_by_genome_size object to a .csv file using the write.csv() function:
write.csv(metadata_sorted_by_genome_size, file = "metadata_sorted_by_genome_size.csv")
The write.csv() function has some additional arguments listed in the help, but at a minimum you need to tell it what data frame to write to file, and give a path to a file name in quotes (if you only provide a file name, the file will be written in the current working directory).
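One of those additional arguments is row.names. By default write.csv() adds a first column containing the row names, which is often not wanted. The following sketch writes a made-up data frame to a temporary file (tempfile() is used here only so the example does not touch your project folder) and reads it back to check the result.

```r
# Hypothetical data frame to save
toy <- data.frame(sample = c("S1", "S2"), genome_size = c(4.62, 4.60))

# A temporary path; in your own work you would give a new, descriptive file name
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back to confirm what was written
round_trip <- read.csv(out_file)
names(round_trip)   # "sample" "genome_size"
```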
Importing data from Excel¶
Excel is one of the most common formats, so we need to discuss how to make these files play nicely with R. The simplest way to import data from Excel is to save your Excel file in .csv format. You can then import into R right away. Sometimes you may not be able to do this (imagine you have data in 300 Excel files, are you going to open and export all of them?).
One common R package (a set of code with features you can download and add to your R installation) is the readxl package which can open and import Excel files. Rather than addressing package installation this second (we’ll discuss this soon!), we can take advantage of RStudio’s import feature which integrates this package. (Note: this feature is available only in the latest versions of RStudio such as is installed on our cloud instance).
First, download this sample Excel file sequencing_results.metadata.xls to your local computer.
In the RStudio menu go to File, select Import Dataset, and choose From Excel… (notice there are several other options you can explore). Say yes to install the required packages.
Next, under File/Url: click the Browse button and navigate to the sequencing_results.metadata.xls file. You should now see a preview of the data to be imported.
Notice that you have the option to change the data type of each variable by clicking the arrow (drop-down menu) next to each column title. Under Import Options you may also rename the data, choose a different sheet to import, and choose how you will handle headers and skipped rows. Under Code Preview you can see the code that will be used to import this file. We could have written this code and imported the Excel file without the RStudio import function, but now you can choose your preference.
In this exercise, we will leave the title of the data frame as sequencing_results_metadata, and there are no other options we need to adjust. Click the Import button to import the data.
Finally, let’s check the first few lines of the sequencing_results_metadata data frame:
head(sequencing_results_metadata)
The type of this object is ‘tibble’, a type of data frame we will talk more about in the ‘dplyr’ section. If you needed a true R data frame you could coerce with as.data.frame().
Data cleaning with Tidyverse dplyr¶
Learning objectives
- Describe what the dplyr package in R is used for.
- Apply common dplyr functions to manipulate data in R.
- Use the ‘pipe’ operator to link together a sequence of functions.
- Use the ‘mutate’ function to apply other chosen functions to existing columns and create new columns of data.
- Use the ‘split-apply-combine’ concept to split the data into groups, apply analysis to each group, and combine the results.
dplyr is a package for making data manipulation easier.
Packages in R are basically sets of additional functions that let you do more stuff in R. The functions we’ve been using, like str(), come built into R; packages give you access to more functions. You need to install a package and then load it to be able to use it.
install.packages("dplyr") ## install
You might get asked to choose a CRAN mirror – this is basically asking you to choose a site to download the package from. The choice doesn’t matter too much; the RStudio mirror is recommended.
library("dplyr") ## load
You only need to install a package once per computer, but you need to load it every time you open a new R session and want to use that package.
What is dplyr?¶
The package dplyr tries to provide easy tools for the most common data manipulation tasks. It is built to work directly with data frames. The thinking behind it was largely inspired by the package plyr, which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database of many hundreds of GB, conduct queries on it directly and pull back just what you need for analysis in R.
Selecting columns and filtering rows¶
We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (metadata), and the subsequent arguments are the columns to keep.
select(metadata, sample, clade, cit, genome_size)
To choose rows, use filter():
filter(metadata, cit == "plus")
Pipes¶
But what if you wanted to select and filter? There are three ways to do this: use intermediate steps, nested functions, or pipes. With the intermediate steps, you essentially create a temporary data frame and use that as input to the next function. This can clutter up your workspace with lots of objects. You can also nest functions (i.e. one function inside of another). This is handy, but can be difficult to read if too many functions are nested, as they are processed from the inside out. The last option, pipes, are a fairly recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set. Pipes in R look like %>% and are made available via the magrittr package installed as part of dplyr.
metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
In the above we use the pipe to send the metadata data set first through filter, to keep rows where cit was equal to ‘plus’, and then through select to keep the sample and generation and clade columns. When the data frame is being passed to the filter() and select() functions through a pipe, we don’t need to include it as an argument to these functions anymore.
If we wanted to create a new object with this smaller version of the data we could do so by assigning it a new name:
meta_citplus <- metadata %>%
filter(cit == "plus") %>%
select(sample, generation, clade)
meta_citplus
Question
Using pipes, subset the data to include rows where the clade is ‘Cit+’. Retain columns: sample, cit, and genome_size.
answer
metadata %>%
filter(clade == "Cit+") %>%
select(sample, cit, genome_size)
Mutate¶
Frequently you’ll want to create new columns based on the values in existing columns, for example to do unit conversions or find the ratio of values in two columns. For this we’ll use mutate().
To create a new column of genome size in bp:
metadata %>%
mutate(genome_bp = genome_size *1e6)
If this runs off your screen and you just want to see the first few rows, you can use a pipe to view the head() of the data (pipes work with non-dplyr functions too, as long as the dplyr or magrittr packages are loaded).
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
head
One of the rows has an NA value for clade, so if we wanted to remove it we could insert a filter() in this chain:
metadata %>%
mutate(genome_bp = genome_size *1e6) %>%
filter(!is.na(clade)) %>%
head
is.na() is a function that determines whether something is or is not an NA. The ! symbol negates it, so we’re asking for everything that is not an NA.
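To see what is.na() and ! return before they are used inside filter(), you can try them on a short vector. The clade values below are made up for this sketch.

```r
# Hypothetical clade labels with two missing values
clade <- c("Cit+", NA, "UC", NA)

is.na(clade)           # FALSE TRUE FALSE TRUE
!is.na(clade)          # TRUE FALSE TRUE FALSE
clade[!is.na(clade)]   # "Cit+" "UC": only the values that are present
sum(is.na(clade))      # 2: a quick count of the missing values
```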
Split-apply-combine data analysis and the summarize() function¶
Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data into groups, apply some analysis to each group, and then combine the results. dplyr makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group. For example, if we wanted to group by citrate-using mutant status and find the number of rows of data for each status, we would do:
metadata %>%
group_by(cit) %>%
summarize(n())
Here the summary function used was n() to find the count for each group. We can also apply many other functions to individual columns to get other summary statistics. For example, in the R base package we can use built-in functions like mean, median, min, and max. By default, these functions return NA when the vector they operate on contains missing data. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm=TRUE (rm stands for remove).
So to view mean genome_size by mutant status:
metadata %>%
group_by(cit) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
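The effect of na.rm is easiest to see outside of a pipeline. This short sketch uses three invented genome sizes, one of them missing.

```r
# Hypothetical genome sizes with one missing value
sizes <- c(4.6, NA, 4.8)

mean(sizes)                 # NA: the missing value propagates to the result
mean(sizes, na.rm = TRUE)   # 4.7: the mean of the two observed values
```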
You can group by multiple columns too:
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE))
Looks like for one of these clones, the clade is missing. We could then discard those rows using filter():
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE)) %>%
filter(!is.na(clade))
All of a sudden this isn’t running off the screen anymore. That’s because dplyr has changed our data.frame to a tbl_df. This is a data structure that’s very similar to a data frame; for our purposes the only difference is that it won’t automatically show tons of data going off the screen.
You can also summarize multiple variables at the same time:
metadata %>%
group_by(cit, clade) %>%
summarize(mean_size = mean(genome_size, na.rm = TRUE),
min_generation = min(generation))
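If it helps to see the three steps of split-apply-combine separately, they can also be spelled out in base R. This is only an illustrative sketch on a made-up data frame, not how dplyr works internally.

```r
# Hypothetical data frame with a grouping column and a numeric column
toy <- data.frame(
  cit = c("plus", "minus", "plus", "minus", "unknown"),
  genome_size = c(4.6, 4.7, 4.8, 4.5, 4.6)
)

groups <- split(toy$genome_size, toy$cit)   # split: one vector per cit value
group_means <- sapply(groups, mean)         # apply: a mean for each group
group_means                                 # combine: sapply() returns one named vector

# aggregate() performs the same three steps in one call
aggregate(genome_size ~ cit, data = toy, FUN = mean)
```

The group_by() and summarize() pair does the same job, with the advantage that the result is a data frame you can keep piping into further steps.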
Visualization basics¶
- Create simple scatterplots, histograms, and boxplots in R.
- Compare the plotting features of base R and the ggplot2 package.
- Customize the aesthetics of an existing plot.
- Create plots from data in a data frame.
- Export plots from RStudio to standard graphical file formats.
Basic plots in R¶
The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers”, and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of R’s plotting packages.
When we are working with large sets of numbers it can be useful to display that information graphically. R has a number of built-in tools for basic graph types such as histograms, scatter plots, bar charts, boxplots and more. We’ll test a few of these out here on the genome_size vector from our metadata.
genome_size <- metadata$genome_size
Scatterplot¶
Let’s start with a scatterplot. A scatterplot provides a graphical view of the relationship between two sets of numbers. We don’t have a variable in our metadata that is a continuous variable, so there is nothing to plot it against, but we can plot the values against their index values just to demonstrate the function.
plot(genome_size)
Each point represents a clone and the value on the x-axis is the clone index in the file, where the values on the y-axis correspond to the genome size for the clone. For any plot you can customize many features of your graphs (fonts, colors, axes, titles) through graphic options. For example, we can change the shape of the data point using pch.
plot(genome_size, pch=8)
We can add a title to the plot by assigning a string to main:
plot(genome_size, pch=8, main="Scatter plot of genome sizes")
Histogram¶
Another way to visualize the distribution of genome sizes is to use a histogram. We can do this by using the hist function:
hist(genome_size)
Boxplot¶
Using additional information from our metadata, we can use plots to compare values between the different citrate mutant statuses using a boxplot. A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set.
# Boxplot
boxplot(genome_size ~ cit, metadata)
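To connect the picture to the numbers, you can ask boxplot() for the values it would draw instead of the plot itself. The sizes vector here is invented for the sketch.

```r
# Hypothetical genome sizes
sizes <- c(4.5, 4.6, 4.6, 4.7, 4.8)

# plot = FALSE returns the numbers a boxplot would draw without opening a plot
stats <- boxplot(sizes, plot = FALSE)$stats
stats   # lower whisker, lower hinge, median, upper hinge, upper whisker

median(sizes)   # matches the middle value in stats
```

With the default settings, points lying more than 1.5 times the height of the box beyond its edges are drawn as separate outlier points, so the whiskers do not always reach the minimum and maximum.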
Similar to the scatterplots above, we can pass in arguments to add in extras like plot title, axis labels and colors.
boxplot(genome_size ~ cit, metadata, col=c("pink","purple", "darkgrey"),
        main="Genome size by citrate mutant status", ylab="Genome size")
Advanced figures (ggplot2)¶
More recently, R users have moved away from base graphic options and towards a plotting package called ggplot2 that adds a lot of functionality to the basic plots seen above. The syntax takes some getting used to but it’s extremely powerful and flexible. We can start by re-creating some of the above plots but using ggplot functions to get a feel for the syntax.
ggplot2 is best used on data in the data.frame form, so we will work with metadata for the following figures. Let’s start by installing and loading the ggplot2 library.
install.packages("ggplot2")
library(ggplot2)
The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot, and add them together using the + operator.
We will start with a blank plot and will add layers as we go along.
ggplot(metadata)
Geometric objects are the actual marks we put on a plot. Examples include:
- points (geom_point, for scatter plots, dot plots, etc)
- lines (geom_line, for time series, trend lines, etc)
- boxplot (geom_boxplot, for, well, boxplots!)
A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator
ggplot(metadata) +
geom_point() # note what happens here
Each type of geom usually has a required set of aesthetics to be set, and usually accepts only a subset of all aesthetics –refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function. Examples include:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- linetype
- size
To start, we will add position for the x- and y-axis since geom_point requires mappings for x and y, all others are optional.
ggplot(metadata) +
geom_point(aes(x = sample, y= genome_size))
The labels on the x-axis are quite hard to read. To fix this we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:
- Axis labels
- Plot background
- Facet label background
- Legend appearance
There are built-in themes we can use, or we can adjust specific elements. For our figure we will change the x-axis labels to be plotted on a 45 degree angle with a small horizontal shift to avoid overlap. We will also add some additional aesthetics by mapping them to other variables in our data frame. For example, the color of the points will reflect the number of generations and the shape will reflect citrate mutant status. The size of the points can be adjusted within geom_point but does not need to be included in aes() since the value is not mapped to a variable.
ggplot(metadata) +
geom_point(aes(x = sample, y= genome_size, color = generation, shape = cit), size = rel(3.0)) +
theme(axis.text.x = element_text(angle=45, hjust=1))
Histogram¶
To plot a histogram we require another geometric object, geom_bar, which requires a statistical transformation. Some plot types (such as scatterplots) do not require transformations: each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. need the data to be transformed, and each usually has a default statistic that can be changed via the stat argument.
ggplot(metadata) +
geom_bar(aes(x = genome_size))
Try plotting with the default value and compare it to the plot using the binwidth values. How do they differ?
ggplot(metadata) +
geom_bar(aes(x = genome_size), stat = "bin", binwidth=0.05)
Boxplot¶
Now that we have all the required information, let’s try plotting a boxplot similar to the one we made using the base plot functions at the start of this lesson. We can add some additional layers to include a plot title and change the axis labels. Explore the code below and all the different layers that we have added to understand what each layer contributes to the final graphic.
ggplot(metadata) +
geom_boxplot(aes(x = cit, y = genome_size, fill = cit)) +
ggtitle('Boxplot of genome size by citrate mutant type') +
xlab('citrate mutant') +
ylab('genome size') +
theme(panel.grid.major = element_line(size = .5, color = "grey"),
axis.text.x = element_text(angle=45, hjust=1),
axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.25)))
Writing figures to file¶
There are two ways in which figures and plots can be output to a file (rather than simply displaying on screen). The first (and easiest) is to export directly from the RStudio ‘Plots’ panel, by clicking on Export when the image is plotted. This will give you the option of png or pdf and selecting the directory to which you wish to save it to.
The second option is to use R functions in the console, allowing you the flexibility to specify parameters to dictate the size and resolution of the output image. Some of the more popular formats include pdf(), png(), which are functions that initialize a plot that will be written directly to a file in the pdf or png format, respectively. Within the function you will need to specify a name for your image in quotes and the width and height. Specifying the width and height is optional, but can be very useful if you are using the figure in a paper or presentation and need it to have a particular resolution. Note that the default units for image dimensions are either pixels (for png) or inches (for pdf). To save a plot to a file you need to:
- Initialize the plot using the function that corresponds to the type of file you want to make: pdf("filename")
- Write the code that makes the plot.
- Close the connection to the new file (with your plot) using dev.off().
# Template only: replace each ... with your own values
pdf("figure/boxplot.pdf")                  # 1. open the pdf file
ggplot(example_data) +                     # 2. make the plot
  geom_boxplot(aes(x = cit, y = ...)) +
  ggtitle(...) +
  xlab(...) +
  ylab(...) +
  theme(panel.grid.major = element_line(...),
        axis.text.x = element_text(...),
        axis.title = element_text(...),
        axis.text = element_text(...))
dev.off()                                  # 3. close the file
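The three steps above can be sketched as a small, complete example. This is only an illustration under some assumptions: it uses the iris dataset that ships with R and a base-R boxplot() so that it runs without any extra packages, and the file name and dimensions are made up for the example.

```r
# Hypothetical output location: a file in R's temporary directory
out_file <- file.path(tempdir(), "iris_boxplot.pdf")

# 1. Initialize the plot file; pdf() measures width and height in inches
pdf(out_file, width = 7, height = 5)

# 2. Write the code that makes the plot (it is drawn into the file, not on screen)
boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species",
        xlab = "species", ylab = "sepal length")

# 3. Close the connection to the file so the plot is written to disk
dev.off()

# Confirm that the file now exists
file.exists(out_file)
```

One thing to keep in mind: a ggplot typed at the console is drawn automatically, but inside a function or a loop you need to wrap it in print() for it to reach the open file.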
Resources¶
We have only scratched the surface here. To learn more, see the ggplot2 reference site.
Getting help with R¶
Learning objectives
- Be able to ask effective questions when searching for help on forums or using web searches
- Check the version of R and display session info
- Know important learning resources
Getting help with R¶
No matter how much experience you have with R, you will find yourself needing help. There is no shame in researching how to do something in R, and most people will find themselves looking up how to do the same things that they “should know how to do” over and over again. Here are some tips to make this process as helpful and efficient as possible.
Tip
“Never memorize something that you can look up” - A. Einstein
Finding help on Stack Overflow and Biostars¶
Two popular websites will be of great help with many R problems. For general R questions, Stack Overflow is probably the most popular online community for developers. If you phrase your web search as “How to do X in R”, results from Stack Overflow are usually near the top of the list. For bioinformatics-specific questions, Biostars is a popular online forum.
Tip
Asking for help using online forums
- When searching for R help, look for answers with the r tag
- Get an account; it is not required to view answers, but it is required to post
- Put in effort to check thoroughly before you post a question; folks get annoyed if you ask a very common question that has been answered multiple times
- Be careful. While forums are very helpful, you can’t know for sure if the advice you are getting is correct
- See the How to ask for R help blog post for more useful tips
Help people help you¶
Often, in order to duplicate the issue you are having, someone may need to see the data you are working with or verify the versions of R or R packages you are using. The following R functions will help with this:
You can check the version of R you are working with using the sessionInfo() function. It is also good practice to save this information as part of your notes on any analysis you are doing. When a script that has worked fine a dozen times before suddenly fails, looking back at these notes may remind you that you upgraded R and forgot to check your script.
sessionInfo()
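As a minimal sketch of keeping this information with your notes, the lines below print the R version, check the version of a single package, and write the full sessionInfo() output to a text file. The file name is hypothetical, and the stats package is used only because it always ships with R.

```r
# Just the R version, as a single string
R.version.string

# Version of one package ("stats" is installed with every copy of R)
packageVersion("stats")

# Save the full session details to a text file kept with your analysis notes
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)
```

The saved text can also be pasted into a forum post so others can see which versions you are using.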
Many times, there may be some issues with your data and the way it is formatted. In that case, you may want to share that data with someone else. However, you may not need to share the whole dataset; looking at all of your 50,000 row, 10,000 column data frame may be TMI (too much information)! Instead, you can take an object you have in memory, such as a data frame (if you don’t know what this means yet, we will get to it!), and print a text version of it that can be pasted into a message or saved to a file. In our example we will use the dput() function on the iris data frame, which is an example dataset that is installed with R:
dput(head(iris)) # iris is an example data.frame that comes with R
# the `head()` function just takes the first 6 rows of the iris dataset
This generates some output which you will be better able to interpret after covering the other R lessons. This output shows exactly how the data is formatted and may reveal problematic issues.
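To see why dput() output is useful to share, here is a small sketch of the round trip: the text that dput() prints is itself R code, so the person helping you can run it to rebuild the same object. The object names here are made up for the illustration.

```r
# A small piece of the data to share
small <- head(iris, 3)

# Capture the text that dput() would print to the console
txt <- capture.output(dput(small))

# Someone who receives that text can turn it back into a data frame
rebuilt <- eval(parse(text = txt))

# Compare the rebuilt copy with the original
all.equal(small, rebuilt)
```

In practice the helper would simply paste the text into their own console and assign it to a name.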
Alternatively, you can save an object in R memory to a file with the saveRDS() function by specifying the name of the object, in this case the iris data frame, and passing a filename to the file= argument.
saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension
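A file written this way can be read back in with readRDS(). The sketch below writes to R's temporary directory (an assumption made so the example does not leave files in your project) and checks that the copy matches the original.

```r
# Hypothetical location for the saved object
rds_file <- file.path(tempdir(), "iris.rds")

# Save the data frame to disk
saveRDS(iris, file = rds_file)

# Read it back; the result is assigned to whatever name you choose
iris_copy <- readRDS(rds_file)

# Check that nothing changed on the way to disk and back
identical(iris, iris_copy)
```

Because readRDS() returns the object rather than restoring it under its old name, the person you share the file with can call it anything they like.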
Final FAQs on R¶
Finally, here are a few pieces of introductory R knowledge that are too good to pass up. While we won’t return to them in this course, we put them here because they come up commonly:
Do I need to click Run every time I want to run a script?
No. A keyboard shortcut runs the current line (or any lines of the script that are highlighted):
- Windows execution shortcut: Ctrl + Enter
- Mac execution shortcut: Cmd(⌘) + Enter
To see a complete list of shortcuts, click on the Tools menu and select Keyboard Shortcuts Help.
What’s with the brackets in R console output?
R returns an index with your result. When your result contains multiple values, the number in brackets tells you the position of the first value on that line, for example:
1:101 # generates the sequence of numbers from 1 to 101
In the output, a label such as [81] indicates that the first value on that line is the 81st item in your result. The exact labels you see depend on the width of your console.
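If you would like to check this for yourself, the sketch below captures the printed lines and compares each bracketed label with the value that starts the line. It uses the sequence 101:201 (an arbitrary choice) so that positions and values are different numbers.

```r
x <- 101:201                      # 101 values; positions run from 1 to 101
out <- capture.output(print(x))   # the lines R would print to the console

# Pull the bracketed position and the first value out of each printed line
position    <- as.integer(sub("^ *\\[([0-9]+)\\].*$", "\\1", out))
first_value <- as.integer(sub("^ *\\[[0-9]+\\] +([0-9]+).*$", "\\1", out))

# Indexing x by each bracketed position gives the first value on that line
all(x[position] == first_value)
```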
Can I run my R script without RStudio?
- Yes, remember - RStudio is running R. You get to use lots of the enhancements RStudio provides, but R works independently of RStudio. See these tips for running your commands at the command line
Where else can I learn about RStudio?
- Check out the Help menu, especially the “Cheatsheets” section
What are some other important resources for learning R?
- Data Carpentry has lessons and you can request or help organize a workshop
- R for Data Science is an excellent introduction to R and the Tidyverse
Prerequisites¶
Downloads, access, and services¶
In order to complete this tutorial you will need access to the following services/software:
Prerequisite | Preparation/Notes | Link/Download |
---|---|---|
CyVerse account | We will use CyVerse VICE to complete this tutorial. This will eliminate the time-consuming step of having each learner install R on their own computer. | CyVerse User Portal |
[Optional] Install R | If not using the CyVerse VICE app, you will need to install R on your own computer. You can follow instructions at this link to install R on your own computer. | Project R |
[Optional] RStudio | If not using the CyVerse VICE app, you will need to install RStudio on your own computer. You can follow instructions at this link to install RStudio on your own computer. (Choose the free “RStudio Desktop”) | RStudio |
Platform(s)¶
We will use the following CyVerse platform(s):
Platform | Interface | Link | Platform Documentation | Quick Start |
---|---|---|---|---|
Data Store | GUI/Command line | Data Store | Data Store Manual | Data Store Guide |
Discovery Environment | Web/Point-and-click | Discovery Environment | DE Manual | Discovery Environment Guide |
Application(s) used¶
Discovery Environment App(s):
App name | Version | Description |
---|---|---|
rstudio-3.5.0 | 3.5.0 | RStudio |
Credits and attributions¶
Material for this tutorial is credited to the Data Carpentry Genomics R lessons (https://datacarpentry.org/genomics-r-intro/index.html), as well as materials from Jeff Hollister’s R Tutorial (http://usepa.github.io/introR/2015/01/14/03-Clean/).