In this lesson we will begin to work with objects in R, understand conventions around their use, learn about types of objects and how to alter and otherwise manipulate them. R has many different types of objects, and we will not cover all of them in depth. Instead, we will take care to introduce and work with the most common types (vectors and data frames), while introducing briefly types we won’t encounter much in this workshop (matrices, arrays, and lists).
A word of encouragement
Before we begin this lesson, we want to make clear that learning a programming language takes time. At first, things may seem unfamiliar and you may feel like you don’t have the tools to do what you want to do. Sometimes it’s easier to say in words what the intended result is, than to code it out. And that’s okay. As you spend time in R (or in other languages), you will find that you have more functions at your recall, and you will more quickly be able to implement your goals in code.
Reminder
At this point you should be coding along in the “r_basics.R” script we created in the last episode. Writing your commands and comments in the script will make it easier to record what you did and why. We should also be regularly saving the file by clicking the single floppy disk icon or typing Ctrl + S.
Let’s start working with R by creating our first object. Many other programming languages refer to an object as a variable, and it’s okay to use the two interchangeably.
To create an object you need:
In your script, “r_basics.R”, using the R assignment operator ‘<-’, assign ‘1’ to the object ‘a’ as shown. Remember to leave a comment in the line above (using the ‘#’) to explain what you are doing:
# this line creates the object 'a' and assigns it the value '1'
a <- 1
Next, run this line of code in your script. In the RStudio ‘Console’ you should see:
> a <- 1
>
The ‘Console’ will display lines of code run from a script and any outputs or status/warning/error messages (usually in red).
In the ‘Environment’ window you will also get a table:
Values | |
---|---|
a | 1 |
The ‘Environment’ window allows you to keep track of the objects you have created in R.
Exercise: Create some objects
Create the following objects:
- Create an object called
human_chr_number
that has the value of number of pairs of human chromosomes- Create an object called
gene_name
that has the value of your favorite gene name
Here as some possible answers to the challenge:
human_chr_number <- 23
gene_name <- 'pten'
Comment on the assignment operator
Many people use
=
as their preferred assignment operator in R instead of<-
, and this is just as well. Here is a blog post discussing the differences.
The following recommendations on variable naming should keep our code readable and interpretable:
-
). You can use
’_’ to make names more readable. You should avoid using special
characters in your object name (e.g. ! @ # , etc.). Also, object names
cannot begin with a number.There are a few more suggestions about naming and style you may want to learn more about as you write more R code. There are several “style guides” that have advice, and one to start with is the tidyverse R style guide.
Tip: Pay attention to warnings in the script console
If you enter a line of code in your script that contains an error, RStudio may give you an error message and underline this mistake. Sometimes these messages are easy to understand, but often the messages may need some figuring out. Paying attention to these warnings will help you avoid mistakes. In the example below, our object name has a space, which is not allowed in R. The error message does not say this directly, but R is “not sure” about how to assign the name to “human_ chr_number” when the object name we want is “human_chr_number”.
Once an object has a value, you can change that value by overwriting it. R will not give you a warning or error if you overwriting an object, which may or may not be a good thing depending on how you look at it.
# gene_name has the value 'pten' or whatever value you used in the challenge.
# We will now assign the new value 'TP53'
gene_name <- 'TP53'
You can also remove an object from R’s memory entirely. The
rm()
function will delete the object.
# delete the object 'gene_name'
rm(gene_name)
If you run a line of code that has only an object name, R will normally display the contents of that object. In this case, we are told the object no longer exists.
Error: object 'gene_name' not found
In R, every object has two properties:
We will get to the “length” property later in the lesson. The “mode” property **corresponds to the type of data an object represents. The most common modes you will encounter in R are:
Mode (abbreviation) | Type of data |
---|---|
Numeric (num) | Numbers such floating point/decimals (1.0, 0.5, 3.14), there are also more specific numeric types (dbl - Double, int - Integer). These differences are not relevant for most beginners and pertain to how these values are stored in memory |
Character (chr) | A sequence of letters/numbers in single ’’ or double ” ” quotes |
Logical | Boolean values - TRUE or FALSE |
There are a few other modes (i.e. “complex”, “raw” etc.) but these are the three we will work with in this lesson.
Data types are familiar in many programming languages, but also in natural language where we refer to them as the parts of speech, e.g. nouns, verbs, adverbs, etc. Once you know if a word - perhaps an unfamiliar one - is a noun, you can probably guess you can count it and make it plural if there is more than one (e.g. 1 parrot, or 2 parrots). If something is an adjective, you can usually change it into an adverb by adding “-ly” (e.g. slow and slowly). Depending on the context, you may need to decide if a word is in one category or another (e.g “cut” may be a noun when it’s on your finger, or a verb when you are preparing vegetables). These concepts have important analogies when working with R objects.
Exercise: Create objects and check their modes
Create the following objects in R, then use the
mode()
function to verify their modes. Try to guess what the mode will be before you look at the solution
chr_name <- 'chr2'
scale_factor <- 0.47
chr_position <- '1001701'
is_circular <- TRUE
pilot <- Earhart
chr_name <- 'chr2'
scale_factor <- 0.5
chr_position <- '1001701'
is_circular <- TRUE
pilot <- Earhart
Error in eval(expr, envir, enclos): object 'Earhart' not found
mode(chr_name)
[1] "character"
mode(scale_factor)
[1] "numeric"
mode(chr_position)
[1] "character"
mode(is_circular)
[1] "logical"
mode(pilot)
Error in mode(pilot): object 'pilot' not found
Notice that R interprets a series of numbers wrapped in quotes as
being in “character” mode. Also notice that
pilot <- Earhart
gave an error. This is because, without
the quotes, R expected Earhart
to be an object, but there
is no such named object yet. If Earhart
did exist, then the
mode of pilot
would be whatever the mode of
Earhart
was originally. If we did indeed want
pilot
to be assigned the name “Earhart” we would do:
pilot <- 'Earhart'
mode(pilot)
[1] "character"
Once an object exists (and by definition has a mode), R can appropriately manipulate that object. For example, objects of the numeric modes can be added, multiplied, divided, etc. R provides several mathematical (arithmetic) operators including:
Operator | Description |
---|---|
+ | addition |
- | subtraction |
* | multiplication |
/ | division |
^ or ** | exponentiation |
a%/%b | integer division (division where the remainder is discarded) |
a%%b | modulus (returns the remainder after division) |
These can be used with literal numbers:
(1 + (5^0.5))/2
[1] 1.618034
and importantly, can be used on any object that evaluates to (i.e. interpreted by R) a numeric object:
# multiply the object 'human_chr_number' by 2
human_chr_number * 2
[1] 46
Vectors are probably the most commonly used object type in R.
A vector is a collection of values that are all of the same type
(numbers, characters, etc.). One of the most common ways to
create a vector is to use the c()
function - the
“concatenate” or “combine” function. Inside the function you may enter
one or more values; for multiple values, separate each value with a
comma:
# Create a vector of country names
countries <- c('Thailand', 'Ghana', 'Suriname', 'Canada')
# Continent of country
continents <- c('Asia', 'Africa', 'South America', 'North America')
# Population of country
populations <- c(69950850, 30792608, 575990, 38436447) # Source: Wikipedia
# Logical vector of Asian countries
in_asia <- c(TRUE, FALSE, FALSE, FALSE)
Vectors always have a mode and a
length. You can check these with the
mode()
and length()
functions respectively.
Another useful function that gives both of these pieces of information
is the str()
(structure) function.
# Check the mode, length, and structure of 'countries'
mode(countries)
[1] "character"
length(countries)
[1] 4
str(countries)
chr [1:4] "Thailand" "Ghana" "Suriname" "Canada"
str(continents)
chr [1:4] "Asia" "Africa" "South America" "North America"
str(populations)
num [1:4] 69950850 30792608 575990 38436447
str(in_asia)
logi [1:4] TRUE FALSE FALSE FALSE
Vectors are quite important in R. Another data type that we will work with later in this lesson, data frames, are collections of vectors. What we learn here about vectors will pay off even more when we start working with data frames.
Once we have vectors, it is common to want to need to extract one or more values from them. We bracket notation to do this. We type the name of the vector followed by square brackets. In those square brackets we place the index (e.g. a number) in that bracket as follows:
# get the 3rd value in the countries vector
countries[3]
[1] "Suriname"
In R, every item in a vector is indexed, starting from the first item (1) through the final number of items in your vector. You can also retrieve a range of numbers:
# get the 1st through 3rd value in the countries vector
countries[1:3]
[1] "Thailand" "Ghana" "Suriname"
If you want to retrieve several (but not necessarily sequential) items from a vector, you pass a vector of indices; a vector that has the numbered positions you wish to retrieve.
# get the 1st, 3rd, and 4th value in the countries vector
countries[c(1, 3, 4)]
[1] "Thailand" "Suriname" "Canada"
If you want to use a logical vector to retrieve only the Asian countries you can do the following. We will see other ways of logically subsetting vectors in a bit.
# Get the Asian countries
countries[in_asia]
[1] "Thailand"
There are additional (and perhaps less commonly used) ways of subsetting a vector (see these examples).
Sorting a vector can be done in a couple ways. The
sort()
function will sort a vector in
place and return a sorted vectore. The order()
function will sort a vector and return the indices
representing the sorted vector. A couple examples to illustrate the
difference:
sort(countries)
[1] "Canada" "Ghana" "Suriname" "Thailand"
Notice that the returned vector is the character vector of countries with the entries in alphabetical order.
order(countries)
[1] 4 2 3 1
Notice this is giving us the indices, with respect to the original
countries vector, that will give the countries in alphabetical order. So
to get to the same spot as sort(countries)
we would do:
countries[order(countries)]
[1] "Canada" "Ghana" "Suriname" "Thailand"
The usefulness of the order()
function over
sort()
will become apparent when we deal with data frames
later.
Once you have an existing vector, you may want to add a new item to
it. To do so, you can use the c()
function again to add
your new value:
# add the countries 'Bolivia' and 'Tonga' to our list of countries
# this overwrites our existing vector
more_countries <- c(countries, 'Bolivia', 'Tonga')
We can verify that “countries” contains the new entries
more_countries
[1] "Thailand" "Ghana" "Suriname" "Canada" "Bolivia" "Tonga"
Using a negative index will return a version of a vector with that index’s value removed:
more_countries[-6]
[1] "Thailand" "Ghana" "Suriname" "Canada" "Bolivia"
We can remove that value from our vector by overwriting it with this expression:
more_countries <- more_countries[-6]
more_countries
[1] "Thailand" "Ghana" "Suriname" "Canada" "Bolivia"
We can also explicitly rename or add a value to our index using bracket notation:
more_countries[7] <- "Estonia"
more_countries
[1] "Thailand" "Ghana" "Suriname" "Canada" "Bolivia" NA "Estonia"
Notice in the operation above that R inserts an NA
value
to extend our vector so that the country “Estonia” is in index 7. This
may be a good or not-so-good thing depending on how you use this.
Exercise: Working with vectors
- How do we extract the third element of the
countries
vector?- How do we get the length of the
countries
vector?- What function concatenates other values onto a vector?
- How could we multiply all the populations by 2?
countries[3]
length(countries)
c()
populations * 2
We saw with countries[in_asia]
that we could use a
literal logical vector in the bracket notation to extract the
TRUE
indices of a vector. But we can also use a logical
expression. Let’s say we wanted get all of the countries in our vector
of country populations that were less than 10,000,000. We could index
using the ‘<’ (less than) logical operator:
# Subset populations to those less than 10 million
populations[populations < 10000000]
[1] 575990
In the square brackets we essentially write an expression for a
logical test of the vector. In this case
populations < 10000000
, which R will then evaluate into
a logical vector. Some of the most common logical operators you will use
in R are:
Operator | Description | Example |
---|---|---|
< | less than | populations < 575990 |
<= | less than or equal to | populations <= 575990 |
> | greater than | populations > 1000000 |
>= | greater than or equal to | populations >= 38000000 |
== | exactly equal to | continents == 'Africa' |
!= | not equal to | continents != 'Asia' |
!x | not x | !in_asia |
a | b | a or b | in_asia | !in_asia |
a & b | a and b | in_asia & !in_asia |
There is a related way of doing this which returns the indices which
are TRUE
, using the which()
function.
which(populations < 10000000)
[1] 3
And then subsetting populations with the above expression in the square brackets gives the same result:
# Subset populations to those less than 10 million
populations[which(populations < 10000000)]
[1] 575990
Finally, a useful idea is to replace the hard-coded 10,000,000 with a variable. Often we won’t know what inputs and values will be used when our code is executed, so we could also do something like:
pop_thresh = 10000000
populations[populations < pop_thresh]
[1] 575990
And if we were to use that pop_thresh
multiple times in
our code, and wanted to change it, we’d only have to change one location
rather than many, thus reducing errors.
There are a few other common retrieve or replace operations you may
want to know about. First, you can check to see if any of the values of
your vector are missing (i.e. are NA
). The
is.na()
function will return a logical vector, with TRUE
for any NA
value:
more_countries
[1] "Thailand" "Ghana" "Suriname" "Canada" "Bolivia" NA "Estonia"
is.na(more_countries)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
This will be a useful function for when you read in data and want to see if there are any missing entries.
Sometimes, you may want to see if a specific value (or several
values) is present a vector. You can do this using the comparison
operator %in%
, which will return TRUE for any value in your
collection that is in the vector you are searching:
# test to see if "Ghana" or "Canada" is in the countries vector
# if you are looking for more than one value, you must pass this as a vector
c('Ghana', 'Canada') %in% countries
[1] TRUE TRUE
c('Ghana', 'Canada') %in% more_countries
[1] TRUE TRUE
Review Exercise 1
Add ‘South America’, NA, and ‘Europe’ to the
continents
vector, and assign it the namemore_continents
.
more_continents <- c(continents, c('South America', NA, 'Europe')) # did you use quotes?
more_continents
[1] "Asia" "Africa" "South America" "North America" "South America" NA "Europe"
Review Exercise 2
Using indexing, create a new vector named
combined
that contains:
- The 1st value in
countries
- The 1st value in
continents
- The 1st value in
populations
combined <- c(countries[1], continents[1], populations[1])
combined
[1] "Thailand" "Asia" "69950850"
Review Exercise 3
What type of data is
combined
? Why?
mode(combined)
[1] "character"
str(combined)
chr [1:3] "Thailand" "Asia" "69950850"
Before we continue, let’s reset some vectors so we’re all on the same page for the next lesson:
# Create a vector of country names
countries <- c("Thailand", "Ghana", "Suriname", "Canada")
# Continent of country
continents <- c('Asia', 'Africa', 'South America', 'North America')
# Population of country
populations <- c(69950850, 2108328, 575990, 38436447) # Source: Wikipedia
# Logical vector of Asian countries
in_asia <- c(TRUE, FALSE, FALSE, FALSE)
A matrix stores values in a two-dimensional array, as in a matrix
from linear algebra. A matrix must store data that is all the same type.
We can create a matrix from a vector by using the matrix()
function and specifying the number of rows (or columns), as with:
country_matrix = matrix(countries, nrow = 2)
country_matrix
[,1] [,2]
[1,] "Thailand" "Suriname"
[2,] "Ghana" "Canada"
mode(country_matrix)
[1] "character"
class(country_matrix)
[1] "matrix" "array"
dim(country_matrix)
[1] 2 2
Notice that since a matrix has two-dimensions, it’s “length” should
really be thought of as dimensions, i.e. the number of rows and columns
(in that order with dim()
). To access entires of a matrix,
we also use bracket notation. Some examples of accessing specific
entries:
# Return the element in the first row, second column
country_matrix[1, 2]
[1] "Suriname"
# Return the element in the second row, second column
country_matrix[2, 2]
[1] "Canada"
If we wanted to return a whole row, or a whole column, we would leave the row or column entry blank, respectively, as in:
# Return the first row
country_matrix[1, ]
[1] "Thailand" "Suriname"
# Return the second column
country_matrix[, 2]
[1] "Suriname" "Canada"
Lists are like vectors in that they group data into a one-dimensional set, but lists group together other R objects, such as vectors or other lists. The best way to see this is with an example:
country_list = list(countries, continents, populations)
country_list
[[1]]
[1] "Thailand" "Ghana" "Suriname" "Canada"
[[2]]
[1] "Asia" "Africa" "South America" "North America"
[[3]]
[1] 69950850 2108328 575990 38436447
mode(country_list)
[1] "list"
length(country_list)
[1] 3
Notice that the list contains three vectors, the types of the vectors don’t have to be the same. In fact, the length of the vectors don’t even have to be the same.
To access the first element of country_list
and
return a list, we would use the single-bracket
notation:
country_list[1]
[[1]]
[1] "Thailand" "Ghana" "Suriname" "Canada"
mode(country_list[1])
[1] "list"
If we wanted to access the first element and return an atomic vector, we would use double-bracket notation:
country_list[[1]]
[1] "Thailand" "Ghana" "Suriname" "Canada"
mode(country_list[[1]])
[1] "character"
Tip: Where to learn more
The following are good resources for learning more about R. Some of them can be quite technical, but if you are a regular R user you may ultimately need this technical knowledge.
- R for Beginners: By Emmanuel Paradis and a great starting point
- The R Manuals: Maintained by the R project
- R contributed documentation: Also linked to the R project; importantly there are materials available in several languages
- R for Data Science: A wonderful collection by noted R educators and developers Garrett Grolemund and Hadley Wickham
- Practical Data Science for Stats: Not exclusively about R usage, but a nice collection of pre-prints on data science and applications for R
- Programming in R Software Carpentry lesson: There are several Software Carpentry lessons in R to choose from