R Basics continued - factors and data frames

Working with spreadsheets (tabular data)

A substantial amount of data is tabular, that is data arranged in rows and columns - also known as spreadsheets. We could write a whole lesson on how to work with spreadsheets effectively (actually, Data Carpentry has). For our purposes, we want to remind you of a few principles before we work with our first set of example data:

1) Keep raw data separate from analyzed data

This is principle number one because if you can’t tell which files are the original raw data, you risk making some serious mistakes (e.g. drawing conclusion from data which have been manipulated in some unknown way).

2) Keep spreadsheet data Tidy

The simplest principle of Tidy data is that we have one row in our spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it’s very easily violated. Most data scientists agree that significant amounts of their time is spent tidying data for analysis. Read more about data organization in this lesson and in this paper.

3) Trust but verify

Finally, while you don’t need to be paranoid about data, you should have a plan for how you will prepare it for analysis. This a focus of this lesson. You probably already have a lot of intuition, expectations, assumptions about your data - the range of values you expect, how many values should have been recorded, etc. Of course, as the data get larger our human ability to keep track will start to fail (and yes, it can fail for small data sets too). R will help you to examine your data so that you can have greater confidence in your analysis, and its reproducibility.

Tip: Keep raw data separate

When you work with data in R, you are not changing the original file you loaded that data from. This is different than (for example) working with a spreadsheet program where changing the value of the cell leaves you one “save”-click away from overwriting the original file. You have to purposely use a writing function (e.g. write.csv()) to save data loaded into R. In that case, be sure to save the manipulated data into a new file. More on this later in the lesson.

Importing tabular data into R

There are several ways to import data into R. For our purpose here, we will focus on using the tools every R installation comes with (so called “base” R) to import a comma-delimited file containing the results of our variant calling workflow. We will need to load the sheet using a function called read.csv().

Exercise: Review the arguments of the read.csv() function

Before using the read.csv() function, use R’s help feature to answer the following questions.

Hint: Entering ‘?’ before the function name and then running that line will bring up the help documentation. Also, when reading this particular help be careful to pay attention to the ‘read.csv’ expression under the ‘Usage’ heading. Other answers will be in the ‘Arguments’ heading.

What is the default parameter for ‘header’ in the read.csv() function?

What argument would you have to change to read a file that was delimited by semicolons (;) rather than commas?

What argument would you have to change to read file in which numbers used commas for decimal separation (i.e. 1,00)?

What argument would you have to change to read in only the first 10,000 rows of a very large file?

Solution

The read.csv() function has the argument ‘header’ set to TRUE by default, this means the function always assumes the first row is header information, (i.e. column names)
The read.csv() function has the argument ‘sep’ set to “,”. This means the function assumes commas are used as delimiters, as you would expect. Changing this parameter (e.g. sep=";") would now interpret semicolons as delimiters.
Although it is not listed in the read.csv() usage, read.csv() is a “version” of the function read.table() and accepts all its arguments. If you set dec="," you could change the decimal operator. We’d probably assume the delimiter is some other character.
You can set nrow to a numeric value (e.g. nrow=10000) to choose how many rows of a file you read in. This may be useful for very large files where not all the data is needed to test some data cleaning steps you are applying.

Hopefully, this exercise gets you thinking about using the provided help documentation in R. There are many arguments that exist, but which we wont have time to cover. Look here to get familiar with functions you use frequently, you may be surprised at what you find they can do.

Now, let’s read in the file data/gapminder_data.csv which will be located in /home/workshop/user/CF_R/data/gapminder_data.csv. Call this data gapminder. The first argument to pass to our read.csv() function is the file path for our data. The file path must be in quotes and now is a good time to remember to use tab autocompletion. If you use tab autocompletion you avoid typos and errors in file paths. Use it!

## read in a CSV file and save it as 'gapminder'

gapminder <- read.csv("data/gapminder_data.csv")

One of the first things you should notice is that in the Environment window, you have the gapminder object, listed as 1704 obs. (observations/rows) of 6 variables (columns). Double-clicking on the name of the object will open a view of the data in a new tab.

rstudio data frame view

Summarizing, subsetting, and determining the structure of a data frame.

A data frame is the standard way in R to store tabular data. A data fame could also be thought of as a collection of vectors, all of which have the same length. Using only two functions, we can learn a lot about out data frame including some summary statistics as well as well as the “structure” of the data frame. Let’s examine what each of these functions can tell us:

## get summary statistics on a data frame

summary(gapminder)

   country               year           pop             continent        
 Length:1704        Min.   :1952   Min.   :6.001e+04   Length:1704       
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Class :character  
 Mode  :character   Median :1980   Median :7.024e+06   Mode  :character  
                    Mean   :1980   Mean   :2.960e+07                     
                    3rd Qu.:1993   3rd Qu.:1.959e+07                     
                    Max.   :2007   Max.   :1.319e+09                     
    lifeExp        gdpPercap       
 Min.   :23.60   Min.   :   241.2  
 1st Qu.:48.20   1st Qu.:  1202.1  
 Median :60.71   Median :  3531.8  
 Mean   :59.47   Mean   :  7215.3  
 3rd Qu.:70.85   3rd Qu.:  9325.5  
 Max.   :82.60   Max.   :113523.1

Our data frame has 6 variables, so we get 6 fields that summarize the data. The year, pop, lifeExp and gdpPercap variables are numerical data and so you get summary statistics on the min and max values for these columns, as well as mean, median, and interquartile ranges. The other variables, country and continent, are treated as characters data (more on this in a bit).

Now let’s use the str() (structure) function to look a little more closely at how data frames work:

## get the structure of a data frame
str(gapminder)

'data.frame':   1704 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ gdpPercap: num  779 821 853 836 740 ...

Some things to notice.

the object type data.frame is displayed in the first row along with its dimensions, in this case 1704 observations (rows) and 6 variables (columns)
Each variable (column) has a name (e.g. country). This is followed by the object mode (e.g. chr, int, etc.). Notice that before each variable name there is a $ - this will be important later.

Introducing Factors

Factors are the final major data structure we will introduce in our R lessons. Factors can be thought of as vectors which are specialized for categorical data. Given R’s specialization for statistics, this make sense since categorial and continuous variables are usually treated differently. Sometimes you may want to have data treated as a factor, but in other cases, this may be undesirable.

Let’s see the value of treating variables, some of which are categorical in nature, as factors. Let’s take a look at just the continents.

## extract the "continent" column to a new object
continents <- gapminder$continent

Notice the use of the $, which corresponds to the output of str() above. Let’s take a look at continents and see what we get:

head(continents, 30)

 [1] "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"   "Asia"  
 [9] "Asia"   "Asia"   "Asia"   "Asia"   "Europe" "Europe" "Europe" "Europe"
[17] "Europe" "Europe" "Europe" "Europe" "Europe" "Europe" "Europe" "Europe"
[25] "Africa" "Africa" "Africa" "Africa" "Africa" "Africa"

There are 1704 continents (one for each row), and it looks like a lot of them are repeated. Let’s determine what the unique values are.

unique(continents)

[1] "Asia"     "Europe"   "Africa"   "Americas" "Oceania"

This leaves us with just 5 continents. Right now, they are being treated as characters, but we could treat them as categories (after all, the number and names of the continents is fixed, for the time being). Doing this will have some nice downstream effects. Let’s try to generate a plot of the continents character vector as it is right now:

plot(continents)

Warning in xy.coords(x, y, xlabel, ylabel, log): NAs introduced by coercion

Warning in min(x): no non-missing arguments to min; returning Inf

Warning in max(x): no non-missing arguments to max; returning -Inf

Error in plot.window(...): need finite 'ylim' values

Nope! Though the plot() function will do its best to give us a quick plot, it is unable to do so here. One way to fix this it to tell R to treat the continents as categories (i.e. a factor vector); we will create a new object to avoid confusion using the factor() function:

factor_continents <- factor(continents)

Let’s learn a little more about this new type of vector:

str(factor_continents)

 Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

The categories (“Africa”, “Americas”, etc.) of the factor are returned; these are called “Levels”. Levels are the different categories contained in a factor. By default, R will organize the levels in a factor in alphabetical order. So the first level in this factor is “Africa”.

For the sake of efficiency, R stores the content of a factor as a vector of integers, where an integer is assigned to each of the possible levels. We can see that by the output:

Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...

But why does str() display 3 repeated? Let’s take a look at the head of factor_continents:

head(factor_continents)

[1] Asia Asia Asia Asia Asia Asia
Levels: Africa Americas Asia Europe Oceania

We see repeated values of “Asia”. Notice the levels of the factor are also displayed, and “Asia” is the third in the list, hence the 3.

We can see how many items in our vector are in each category:

summary(factor_continents)

  Africa Americas     Asia   Europe  Oceania 
     624      300      396      360       24

As you can imagine, this is useful when you want to generate a tally.

Tip: treating objects as categories without changing their mode

You don’t have to make an object a factor to get the benefits of treating an object as a factor. See what happens when you use the as.factor() function on continents. To generate a tally, you can sometimes also use the table() function; though sometimes you may need to combine both (i.e. table(as.factor(object)))

Plotting and ordering factors

One of the most common uses for factors will be when you plot categorical values. For example, suppose we want to know how many of our gapminder had each possible SNP we could generate a plot:

plot(factor_continents)

This isn’t a particularly spectacular example of a plot but it works, and it recapitulates the table from summary(). We’ll be learning much more about creating nice, publication-quality graphics later in the workshop.

If you recall, factors are ordered alphabetically. That might make sense, but some categories (e.g., “red”, “blue”, “green”) often do not have an intrinsic order. What if we wanted to order our plot according to the numerical value (i.e., in ascending order of continent frequency)? We can enforce an order on our factors:

ordered_factor_continents <- factor(factor_continents, levels = names(sort(table(factor_continents))))

# And let's quickly look at the result
head(ordered_factor_continents, 20)

 [1] Asia   Asia   Asia   Asia   Asia   Asia   Asia   Asia   Asia   Asia  
[11] Asia   Asia   Europe Europe Europe Europe Europe Europe Europe Europe
Levels: Oceania Americas Europe Asia Africa

Notice here that the order of the elements in the factor didn’t change, but the order of the levels did.

Let’s deconstruct this from the inside out (you can try each of these commands to see why this works):

We create a table of factor_continents to get the frequency of each continent: table(factor_continents)
We sort this table: sort(table(factor_continents)); use the decreasing = parameter for this function if you wanted to change from the default of FALSE
Using the names function gives us just the character names of the table sorted by frequencies:names(sort(table(factor_continents)))
The factor function is what allows us to create a factor. We give it the factor_continents object as input, and use the levels= parameter to enforce the ordering of the levels.

Now we see our plot has be reordered:

plot(ordered_factor_continents)

Factors come in handy in many places when using R. Even using more sophisticated plotting packages such as ggplot2 will sometimes require you to understand how to manipulate factors.

Subsetting data frames

Next, we are going to talk about how you can get specific values from data frames, and where necessary, change the mode of a column of values.

The first thing to remember is that a data frame is two-dimensional (rows and columns). Therefore, to select a specific value we will will once again use [] (bracket) notation, but we will specify more than one value (except in some cases where we are taking a range).

Exercise: Subsetting a data frame

Try the following indices and functions and try to figure out what they return.

gapminder[1,1]

gapminder[2,4]

gapminder[801,29]

gapminder[2, ]

gapminder[-1, ]

gapminder[1:4,1]

gapminder[1:10, c("country","continent")]

gapminder[,c("country")]

head(gapminder)

tail(gapminder)

gapminder$country

gapminder[gapminder$country == "Italy",]

Solution

gapminder[1,1]

[1] "Afghanistan"

gapminder[2,4]

[1] "Asia"

gapminder[801,4]

[1] "Asia"

gapminder[2, ]

      country year     pop continent lifeExp gdpPercap
2 Afghanistan 1957 9240934      Asia  30.332   820.853

gapminder[-1, ]

      country year      pop continent lifeExp gdpPercap
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134
7 Afghanistan 1982 12881816      Asia  39.854  978.0114

gapminder[1:4,1]

[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"

gapminder[1:10, c("country","continent")]

       country continent
1  Afghanistan      Asia
2  Afghanistan      Asia
3  Afghanistan      Asia
4  Afghanistan      Asia
5  Afghanistan      Asia
6  Afghanistan      Asia
7  Afghanistan      Asia
8  Afghanistan      Asia
9  Afghanistan      Asia
10 Afghanistan      Asia

gapminder[,c("country")]

[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
[6] "Afghanistan"

head(gapminder)

      country year      pop continent lifeExp gdpPercap
1 Afghanistan 1952  8425333      Asia  28.801  779.4453
2 Afghanistan 1957  9240934      Asia  30.332  820.8530
3 Afghanistan 1962 10267083      Asia  31.997  853.1007
4 Afghanistan 1967 11537966      Asia  34.020  836.1971
5 Afghanistan 1972 13079460      Asia  36.088  739.9811
6 Afghanistan 1977 14880372      Asia  38.438  786.1134

tail(gapminder)

      country year      pop continent lifeExp gdpPercap
1699 Zimbabwe 1982  7636524    Africa  60.363  788.8550
1700 Zimbabwe 1987  9216418    Africa  62.351  706.1573
1701 Zimbabwe 1992 10704340    Africa  60.377  693.4208
1702 Zimbabwe 1997 11404948    Africa  46.809  792.4500
1703 Zimbabwe 2002 11926563    Africa  39.989  672.0386
1704 Zimbabwe 2007 12311143    Africa  43.487  469.7093

11..

gapminder$country

[1] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan"
[6] "Afghanistan"

gapminder[gapminder$country == "Italy",]

    country year      pop continent lifeExp gdpPercap
769   Italy 1952 47666000    Europe  65.940  4931.404
770   Italy 1957 49182000    Europe  67.810  6248.656
771   Italy 1962 50843200    Europe  69.240  8243.582
772   Italy 1967 52667100    Europe  71.060 10022.401
773   Italy 1972 54365564    Europe  72.190 12269.274
774   Italy 1977 56059245    Europe  73.480 14255.985
775   Italy 1982 56535636    Europe  74.980 16537.483
776   Italy 1987 56729703    Europe  76.420 19207.235
777   Italy 1992 56840847    Europe  77.440 22013.645
778   Italy 1997 57479469    Europe  78.820 24675.024
779   Italy 2002 57926999    Europe  80.240 27968.098
780   Italy 2007 58147733    Europe  80.546 28569.720

The subsetting notation is very similar to what we learned for vectors. The key differences include:

We typically provide two values separated by commas: data.frame[row, column]
In case of a continuous range of numbers use a colon between the numbers (start:stop, inclusive)
For a non continuous set of numbers, pass a vector using c()
Index using the name of a column(s) by passing them as vectors using c()

Finally, in all of the subsetting exercises above, we printed values to the screen. You can create a new data frame object by assigning them to a new object name:

# create a new data frame containing only observations from India
india_subset <- gapminder[gapminder$country == "India",]

# check the dimension of the data frame
dim(india_subset)

[1] 12  6

# get a summary of the data frame
summary(india_subset)

   country               year           pop             continent        
 Length:12          Min.   :1952   Min.   :3.720e+08   Length:12         
 Class :character   1st Qu.:1966   1st Qu.:4.930e+08   Class :character  
 Mode  :character   Median :1980   Median :6.710e+08   Mode  :character  
                    Mean   :1980   Mean   :7.011e+08                     
                    3rd Qu.:1993   3rd Qu.:8.938e+08                     
                    Max.   :2007   Max.   :1.110e+09                     
    lifeExp        gdpPercap     
 Min.   :37.37   Min.   : 546.6  
 1st Qu.:46.30   1st Qu.: 690.2  
 Median :55.40   Median : 834.5  
 Mean   :53.17   Mean   :1057.3  
 3rd Qu.:60.61   3rd Qu.:1238.0  
 Max.   :64.70   Max.   :2452.2

Coercing values to a specific mode

Sometimes, it is possible that R will misinterpret the type of data represented in a data frame, or store that data in a mode which prevents you from operating on the data the way you wish. For example, a long list of gene names isn’t usually thought of as a categorical variable, the way that your experimental condition (e.g. control, treatment) might be. More importantly, some R packages you use to analyze your data may expect characters as input, not factors (the results can be beguiling). At other times (such as plotting or some statistical analyses) a factor may be more appropriate (as we saw above). Ultimately, you should know how to change the mode of an object.

First, its very important to recognize that coercion happens in R all the time. This can be a good thing when R gets it right, or a bad thing when the result is not what you expect. Consider:

chromosomes <- c('3', '11', 'X', '6')
typeof(chromosomes)

[1] "character"

Although there are several numbers in our vector, they are all in quotes, so we have explicitly told R to consider them as characters. However, even if we removed the quotes from the numbers, R would coerce everything into a character:

chromosomes_2 <- c(3, 11, 'X', 6)
typeof(chromosomes_2)

[1] "character"

chromosomes_2[1]

[1] "3"

We can use the as. functions to explicitly coerce values from one form into another. Consider the following vector of characters, which all happen to be valid numbers:

positions <- c("8762685", "66560624", "67545785", "154039662")
typeof(positions)

[1] "character"

positions[1]

[1] "8762685"

Now we can coerce positions into a numeric type using as.numeric():

positions <- as.numeric(positions)
typeof(positions)

[1] "double"

positions[1]

[1] 8762685

Sometimes coercion is straight-forward, but what would happen if we tried using as.numeric() on chromosomes_2

chromosomes_2 <- as.numeric(chromosomes_2)

Warning: NAs introduced by coercion

If we check, we will see that an NA value (R’s default value for missing data) has been introduced.

chromosomes_2

[1]  3 11 NA  6

Trouble can really start when we try to coerce a factor. For example, when we try to coerce the factor_continents factor from above into a numeric mode look at the result:

head(as.numeric(factor_continents))

[1] 3 3 3 3 3 3

Strangely, it works! Almost. Instead of giving an error message, R returns numeric values, which in this case are the integers assigned to the levels in this factor. This kind of behavior can lead to hard-to-find bugs, for example when we do have numbers in a factor, and we get numbers from a coercion. If we don’t look carefully, we may not notice a problem.

Coercing in Data Frames

If you need to coerce an entire column you can overwrite it using an expression like this one:

# make the 'continent' column a factor

gapminder$continent <- as.factor(gapminder$continent)

# check the type of the column
typeof(gapminder$continent)

[1] "integer"

# look at summary() again
summary(gapminder)

   country               year           pop               continent  
 Length:1704        Min.   :1952   Min.   :6.001e+04   Africa  :624  
 Class :character   1st Qu.:1966   1st Qu.:2.794e+06   Americas:300  
 Mode  :character   Median :1980   Median :7.024e+06   Asia    :396  
                    Mean   :1980   Mean   :2.960e+07   Europe  :360  
                    3rd Qu.:1993   3rd Qu.:1.959e+07   Oceania : 24  
                    Max.   :2007   Max.   :1.319e+09                 
    lifeExp        gdpPercap       
 Min.   :23.60   Min.   :   241.2  
 1st Qu.:48.20   1st Qu.:  1202.1  
 Median :60.71   Median :  3531.8  
 Mean   :59.47   Mean   :  7215.3  
 3rd Qu.:70.85   3rd Qu.:  9325.5  
 Max.   :82.60   Max.   :113523.1

Notice that the continent column is summarized more informatively now that it is a factor.

StringsAsFactors = ?

Lets summarize this section on coercion with a few take home messages.

When you explicitly coerce one data type into another (this is known as explicit coercion), be careful to check the result. Ideally, you should try to see if its possible to avoid steps in your analysis that force you to coerce.
R will sometimes coerce without you asking for it. This is called (appropriately) implicit coercion. For example when we tried to create a vector with multiple data types, R chose one type through implicit coercion.
Check the structure (str()) of your data frames before working with them!

Tip: coercion isn’t limited to data frames

Prior to R 4.0 when importing a data frame using any one of the read.table() functions such as read.csv() , the argument StringsAsFactors was by default set to true TRUE. Setting it to FALSE will treat any non-numeric column to a character type. read.csv() documentation, you will also see you can explicitly type your columns using the colClasses argument. Other R packages (such as the Tidyverse “readr”) don’t have this particular conversion issue, but many packages will still try to guess a data type.

Data frame bonus material: math, sorting, renaming

Here are a few operations that don’t need much explanation, but which are good to know.

There are lots of arithmetic functions you may want to apply to your data frame, covering those would be a course in itself (there is some starting material here). Our lessons will cover some additional summary statistical functions in a subsequent lesson, but overall we will focus on data cleaning and visualization.

You can use functions like mean(), min(), max() on an individual column. Let’s look at the “year” or filtered depth. This value shows the number of filtered reads that support each of the reported gapminder.

max(gapminder$year)

[1] 2007

You can sort a data frame using the order() function:

sorted_by_year <- gapminder[order(gapminder$year), ]
head(sorted_by_year$year)

[1] 1952 1952 1952 1952 1952 1952

Exercise

The order() function lists values in increasing order by default. Look at the documentation for this function and change sorted_by_year to start with the latest year at the top.

Solution

sorted_by_year <- gapminder[order(gapminder$year, decreasing = TRUE), ]
head(sorted_by_year$year)

[1] 2007 2007 2007 2007 2007 2007

You can rename columns:

colnames(gapminder)[colnames(gapminder) == "lifeExp"] <- "life_exp"
colnames(gapminder)[colnames(gapminder) == "gdpPercap"] <- "gdp_per_cap"

# check the column name (hint names are returned as a vector)
colnames(gapminder)

[1] "country"     "year"        "pop"         "continent"   "life_exp"   
[6] "gdp_per_cap"

Saving your data frame to a file

We can save data to a file. We will save our india_subset object to a .csv file using the write.csv() function:

write.csv(india_subset, file = "results/india_subset.csv")

The write.csv() function has some additional arguments listed in the help, but at a minimum you need to tell it what data frame to write to file, and give a path to a file name in quotes (if you only provide a file name, the file will be written in the current working directory).

R Basics continued - factors and data frames

Working with spreadsheets (tabular data)

Tip: Keep raw data separate

Importing tabular data into R

Exercise: Review the arguments of the read.csv() function

Summarizing, subsetting, and determining the structure of a data frame.

Introducing Factors

Tip: treating objects as categories without changing their mode

Plotting and ordering factors

Subsetting data frames

Exercise: Subsetting a data frame

Coercing values to a specific mode

Coercing in Data Frames

StringsAsFactors = ?

Tip: coercion isn’t limited to data frames

Data frame bonus material: math, sorting, renaming

Exercise

Saving your data frame to a file

Exercise: Review the arguments of the `read.csv()` function