Objectives

In this lesson we will introduce two more data types: factors and data frames. Factors are the data type R uses to store categorical data. Data frames are probably the most commonly used data type in R, and is most similar to Excel spreadsheets. We will learn how to import data into R as a data frame, and in subsequent lessons how to manipulate, modify, summarize, and plot data contained in data frames.

Factors

Factors are how R stores categorical information, like the continents, among our examples. It is like a character vector that can only have a finite number of values. To make a factor, we can pass an atomic vector into the factor() function. In the background, R recodes the data as integers and stores the results in an integer vector. It also adds a levels attribute to the integer vector enumerating the set of labels for displaying the factor values. The easiest way to see this is with an example:

continents_factor = factor(continents)
continents_factor
[1] Asia          Africa        South America North America
Levels: Africa Asia North America South America
typeof(continents_factor)
[1] "integer"
class(continents_factor)
[1] "factor"
as.integer(continents_factor)
[1] 2 1 4 3

We observe a few things with this string of commands:

  1. It looks like the type of continents_factor is an integer vector.
  2. But there is a class of vectors, called factors. R will often take into account the class of an object to decide how a function will act on it.
  3. By default the levels of a factor are the unique elements of the character vector in alphabetical order.

You can specify an order by explicitly defining the levels as in:

continents_factor = factor(continents, levels = c('South America', 'Asia', 'Africa', 'North America'))
continents_factor
[1] Asia          Africa        South America North America
Levels: South America Asia Africa North America
as.integer(continents_factor)
[1] 2 3 1 4

Notice that the integer representation of the vector changed order, because the levels changed order. This will be useful later in the RNA-seq Demystified lessons when we discuss testing for differential expression and the assumptions DESeq2 makes about what factor level is the “reference”.

Tip: Treating objects as categories without changing their mode

You don’t have to make an object a factor to get the benefits of treating an object as a factor. See what happens when you use the as.factor() function on continents. To generate a tally, you can sometimes also use the table() function; though sometimes you may need to combine both (i.e. table(as.factor(continents)))

Data Frames

Data frames group vectors together into a two-dimensional table, where each vector becomes a column of the table. The data types within each column must be the same (they are vectors after all), but the columns can be of different data types. Let’s construct our first data frame using some vectors we have laying around. Here we will include column names from the start:

countries_df = data.frame(country = countries, continent = continents, population = populations)
countries_df
   country     continent population
1 Thailand          Asia   69950850
2    Ghana        Africa    2108328
3 Suriname South America     575990
4   Canada North America   38436447
dim(countries_df)
[1] 4 3

Since data frames are two-dimensional, we have to specify both the row and column to extract specific entries, as in:

# Return the element in the second row, third column
countries_df[2, 3]
[1] 2108328

Since the columns of the data frame are named, we can also use those instead:

# Return the element in the Ghana row of the population column
countries_df[2, 'population']
[1] 2108328

To access an entire row or column, we specify just the row or column index, as with:

# Access the third row by index
countries_df[3, ]
   country     continent population
3 Suriname South America     575990

And it is the same with columns:

# Access the second column by index
countries_df[, 2]
[1] "Asia"          "Africa"        "South America" "North America"
# By name
countries_df[, 'continent']
[1] "Asia"          "Africa"        "South America" "North America"

Finally, data frames have another way to access columns using “dollar-sign notation”:

countries_df$continent
[1] "Asia"          "Africa"        "South America" "North America"

Working with spreadsheets

A substantial amount of data is tabular, that is data arranged in rows and columns - also known as spreadsheets. We could write a whole lesson on how to work with spreadsheets effectively (actually, Data Carpentry has). For our purposes, we want to remind you of a few principles before we work with our first set of example data:

1. Keep raw data separate from analyzed data

This is principle number one because if you can’t tell which files are the original raw data, you risk making some serious mistakes (e.g. drawing conclusion from data which have been manipulated in some unknown way).

2. Keep spreadsheet data Tidy

The simplest principle of Tidy data is that we have one row in our spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it’s very easily violated. Most data scientists agree that significant amounts of their time is spent tidying data for analysis. Read more about data organization in this lesson and in this paper.

3. Trust but verify

Finally, you don’t need to be paranoid about data, but you should have a plan for how you will prepare it for analysis. You probably already have a lot of intuition, expectations, assumptions about your data - the range of values you expect, how many values should have been recorded, etc. Of course, as the data get larger our human ability to keep track will start to fail (it can fail for small data sets too). R can help you to examine your data so that you can have greater confidence in your analysis, and its reproducibility.

Importing tabular data into R

There are several ways to import data into R. For our purpose here, we will focus on using the tools every R installation comes with (so called “base” R) to import a comma-delimited file containing the results of our variant calling workflow. We will need to load the sheet using a function called read.csv().

Exercise: Review the arguments of the read.csv() function

Before using the read.csv() function, use R’s help feature to answer the following questions.

Hint: Entering ‘?’ before the function name and then running that line will bring up the help documentation. Also, when reading this particular help be careful to pay attention to the ‘read.csv’ expression under the ‘Usage’ heading. Other answers will be in the ‘Arguments’ heading.

  1. What is the default parameter for ‘header’ in the read.csv() function?
  2. What argument would you have to change to read a file that was delimited by semicolons (;) rather than commas?
  3. What argument would you have to change to read file in which numbers used commas for decimal separation (i.e. 1,00)?
  4. What argument would you have to change to read in only the first 10,000 rows of a very large file?
Solution
  1. The read.csv() function has the argument ‘header’ set to TRUE by default, this means the function always assumes the first row is header information, (i.e. column names)
  2. The read.csv() function has the argument ‘sep’ set to “,”. This means the function assumes commas are used as delimiters, as you would expect. Changing this parameter (e.g. sep=";") would now interpret semicolons as delimiters.
  3. Although it is not listed in the read.csv() usage, read.csv() is a “version” of the function read.table() and accepts all its arguments. If you set dec="," you could change the decimal operator. We’d probably assume the delimiter is some other character.
  4. You can set nrow to a numeric value (e.g. nrow=10000) to choose how many rows of a file you read in. This may be useful for very large files where not all the data is needed to test some data cleaning steps you are applying.
Hopefully, this exercise gets you thinking about using the provided help documentation in R. There are many arguments that exist, but which we wont have time to cover. Look here to get familiar with functions you use frequently, you may be surprised at what you find they can do.


Now, let’s read in the file data/gapminder_data.csv and call this data gapminder. The first argument to pass to our read.csv() function is the file path for our data. The file path must be in quotes and now is a good time to remember to use tab autocompletion. If you use tab autocompletion you avoid typos and errors in file paths. Use it!

## read in a CSV file and save it as 'gapminder'
gapminder <- read.csv("data/gapminder_data.csv")

One of the first things you should notice is that in the Environment window, you have the gapminder object, listed as 1704 obs. (observations/rows) of 6 variables (columns). Double-clicking on the name of the object will open a view of the data in a new tab.

rstudio data frame view

Tip: Changes in R don’t overwrite the original file

When you work with data in R, you are not changing the original file you loaded that data from. This is different than (for example) working with a spreadsheet program where changing the value of the cell leaves you one “save”-click away from overwriting the original file. You have to purposely use a writing function (e.g. write.csv()) to save data loaded into R. In that case, be sure to save the manipulated data into a new file. More on this later in the lesson.

Summarizing, subsetting, and reordering a data frame

A data frame is the standard way in R to store tabular data. A data frame is a collection of vectors of the same length. Let’s use the str() (structure) function to look a little more closely at how data frames work:

## get the structure of a data frame
str(gapminder)
'data.frame':   1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Some things to notice:

  • the object type data.frame is displayed in the first row along with its dimensions, in this case 1704 observations (rows) and 6 variables (columns)
  • Each variable (column) has a name (e.g. country). This is followed by the object mode (e.g. chr, int, etc.). Notice that before each variable name there is a $ - which is a hint as to how one can access these columns, as we saw when we first introduced data frames.

Summarizing

We can get a birds eye view of a data frame by using the summary() function. Depending on the type of data in the columns, summary() will do particular things:

## get summary statistics on a data frame
summary(gapminder)
   country               year           pop             continent            lifeExp        gdpPercap       
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704        Min.   :23.60   Min.   :   241.2  
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character   1st Qu.:48.20   1st Qu.:  1202.1  
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character   Median :60.71   Median :  3531.8  
                    Mean   :1980   Mean   :2.960e+07                      Mean   :59.47   Mean   :  7215.3  
                    3rd Qu.:1993   3rd Qu.:1.959e+07                      3rd Qu.:70.85   3rd Qu.:  9325.5  
                    Max.   :2007   Max.   :1.319e+09                      Max.   :82.60   Max.   :113523.1  

Our data frame has 6 variables, so we get 6 fields that summarize the data. The year, pop, lifeExp and gdpPercap variables are numerical data and so you get summary statistics on the min and max values for these columns, as well as mean, median, and interquartile ranges. The other variables, country and continent, are treated as characters data (more on this in a bit).

Subsetting

We saw how to subset a data frame in the previous section, but in all those examples we printed values to the screen. You can create a new data frame object by assigning them to a new object name:

# create a new data frame containing only observations from India
india_subset <- gapminder[gapminder$country == 'India',]
india_subset
    country year        pop continent lifeExp gdpPercap
697   India 1952  372000000      Asia  37.373  546.5657
698   India 1957  409000000      Asia  40.249  590.0620
699   India 1962  454000000      Asia  43.605  658.3472
700   India 1967  506000000      Asia  47.193  700.7706
701   India 1972  567000000      Asia  50.651  724.0325
702   India 1977  634000000      Asia  54.208  813.3373
703   India 1982  708000000      Asia  56.596  855.7235
704   India 1987  788000000      Asia  58.553  976.5127
705   India 1992  872000000      Asia  60.223 1164.4068
706   India 1997  959000000      Asia  61.765 1458.8174
707   India 2002 1034172547      Asia  62.879 1746.7695
708   India 2007 1110396331      Asia  64.698 2452.2104
# check the dimension of the data frame
dim(india_subset)
[1] 12  6
# get a summary of the data frame
summary(india_subset)
   country               year           pop             continent            lifeExp        gdpPercap     
 Length:12          Min.   :1952   Min.   :3.720e+08   Length:12          Min.   :37.37   Min.   : 546.6  
 Class :character   1st Qu.:1966   1st Qu.:4.930e+08   Class :character   1st Qu.:46.30   1st Qu.: 690.2  
 Mode  :character   Median :1980   Median :6.710e+08   Mode  :character   Median :55.40   Median : 834.5  
                    Mean   :1980   Mean   :7.011e+08                      Mean   :53.17   Mean   :1057.3  
                    3rd Qu.:1993   3rd Qu.:8.938e+08                      3rd Qu.:60.61   3rd Qu.:1238.0  
                    Max.   :2007   Max.   :1.110e+09                      Max.   :64.70   Max.   :2452.2  

Reordering

With vectors we saw the sort() function returned the reordered vector, and that the order() function returned the indices giving the order of the reordered vector. Let’s use the order() function to change the ordering of india_subset so that the most recent years are at the top of the table.

First look at the help for ?sort and ?order and note the decreasing parameter. If we want the years to be decreasing, we should set this parameter to TRUE.

sort(india_subset$year, decreasing = TRUE)
 [1] 2007 2002 1997 1992 1987 1982 1977 1972 1967 1962 1957 1952

This is the right ordering we want, but but we want the indices so that we can reorder the rows of india_subset. Using order() to give the row indices in the correct order gets us where we want to go:

india_subset_decreasing = india_subset[order(india_subset$year, decreasing = TRUE), ]
india_subset_decreasing
    country year        pop continent lifeExp gdpPercap
708   India 2007 1110396331      Asia  64.698 2452.2104
707   India 2002 1034172547      Asia  62.879 1746.7695
706   India 1997  959000000      Asia  61.765 1458.8174
705   India 1992  872000000      Asia  60.223 1164.4068
704   India 1987  788000000      Asia  58.553  976.5127
703   India 1982  708000000      Asia  56.596  855.7235
702   India 1977  634000000      Asia  54.208  813.3373
701   India 1972  567000000      Asia  50.651  724.0325
700   India 1967  506000000      Asia  47.193  700.7706
699   India 1962  454000000      Asia  43.605  658.3472
698   India 1957  409000000      Asia  40.249  590.0620
697   India 1952  372000000      Asia  37.373  546.5657

Coercing data

When we looked at summary(gapminder) and str(gapminder), we noticed the type for continent was character, but perhaps a factor is more appropriate here because there are only a small number of possible values. We can “coerce” the continent column of gapminder to be a factor using the factor() function.

# Coerce the continent column to a factor
gapminder$continent <- factor(gapminder$continent)

And let’s see how the result of summary() and str() change in response:

summary(gapminder)
   country               year           pop               continent      lifeExp        gdpPercap       
 Length:1704        Min.   :1952   Min.   :6.001e+04   Africa  :624   Min.   :23.60   Min.   :   241.2  
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Americas:300   1st Qu.:48.20   1st Qu.:  1202.1  
 Mode  :character   Median :1980   Median :7.024e+06   Asia    :396   Median :60.71   Median :  3531.8  
                    Mean   :1980   Mean   :2.960e+07   Europe  :360   Mean   :59.47   Mean   :  7215.3  
                    3rd Qu.:1993   3rd Qu.:1.959e+07   Oceania : 24   3rd Qu.:70.85   3rd Qu.:  9325.5  
                    Max.   :2007   Max.   :1.319e+09                  Max.   :82.60   Max.   :113523.1  
str(gapminder)
'data.frame':   1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Notice that we got some additionally useful information from summary() from the coercion. Namely, we get the number of entries in gapminder that are on each continent.

Note: StringsAsFactors

There are explicit forms of coercion, as in our factor example above. But there are also implicit coercions. The most famous example of this in R is the stringsAsFactors parameter of read.table(). Prior to R 4.0, when importing a data frame the stringsAsFactors argument was TRUE by default, which caused all character columns to be factors by default. This wasn’t a good default behavior, and the default is now FALSE.

Saving your data frame to a file

We can save data to a file. We will save our india_subset object to a .csv file using the write.csv() function:

write.csv(india_subset, file = "results/india_subset.csv")

The write.csv() function has some additional arguments listed in the help, but at a minimum you need to tell it what data frame to write to file, and give a path to a file name in quotes (if you only provide a file name, the file will be written in the current working directory).

Data tips

At the beginning of the lesson we suggested three things when dealing with data in R:

  1. Keep raw data separate from analyzed data
  2. Keep spreadsheet data Tidy
  3. Trust but verify

As we all know, data can be messy–errors can be made when entering it–and at some point those errors need to be corrected. To elaborate on “Trust but verify,” consider the fact that in the gapminder data set we had many countries over many continents, and any misspelling could cause problems when summarizing the data. For instance, “North America” could accidentally be entered as “NorthAmerica” or “North america”. When reading in data, it’s a good idea to check that there aren’t any odd things happening like this.

For character / categorical data, simply looking at the unique elements of a column can help detect errors quickly:

unique(gapminder$continent)
[1] Asia     Europe   Africa   Americas Oceania 
Levels: Africa Americas Asia Europe Oceania

In this case, we don’t see any obvious misspellings that could cause problems. For numeric data, looking at a summary of the column to see the min/max values, as well as the mean/median, can help determine if any obvious errors are present.

summary(gapminder$pop)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
6.001e+04 2.794e+06 7.024e+06 2.960e+07 1.959e+07 1.319e+09 

In this case, we don’t notice any values that might be out of the ordinary. For example, we know that there are countries with over a billion people, so seeing a maximum of 1,319,000,000 is not strange. At the same time, we know there are very small countries, so seeing a minimum of 60,000 seems reasonable too. What would be unreasonable would be a negative population, for example.