Basic R
We have three main goals for this week’s lab session:
- Getting familiar with R functions,
- Learning how to handle vectors, and
- Learning how to handle matrices.
(You might feel that what we do in the lab sessions is not closely related to what we learn in the lectures. The connection between the two will become more evident after the first three weeks. For this week and the next, we need to go through some basics of R to get you started, so bear with me.)
1. Using functions
So far we have written very short pieces of code to perform simple calculations and operations. We will now learn how to carry out much more complex mathematical operations on our data, draw graphs, and run statistical analyses by writing longer pieces of code.
R comes with a variety of ready-made pieces of code that do things like managing data, drawing graphs, and calculating statistics. These are called functions.
A Google search for “R cheat sheet” will turn up reference sheets that list many of the R functions we frequently use.
When we use a function, we write its name followed by a pair of brackets ().
Each function can take arguments. You put them in the brackets. For example, there is a function that makes R display the list of objects currently stored in the memory.
Let’s execute the following line:
# ls()
The output it returns reads: character(0). This means that currently there is nothing in R’s memory. This is because we have not created any object since we started an R session. Let’s create several objects and run this function again.
Execute the following
x1 <- 0.3
x2 <- "Hello"
x3 <- TRUE
ls()
## [1] "x1" "x2" "x3"
Now, R says that there are three objects in memory: “x1” “x2” “x3”. The ls function only tells you the names of the objects stored in memory, but doesn’t tell you what they are.
Recall that, to find out what’s inside these objects, we call their names
x1
## [1] 0.3
x2
## [1] "Hello"
x3
## [1] TRUE
Notice that x1 has a numeric value (0.3), but x2 is a word enclosed in double quotation marks (“Hello”), and x3 is a word without any quotation marks. Each of these three objects has a different property.
- x1 is called a numeric object.
- x2 is called a character object.
- x3 is called a logical object (TRUE or FALSE).
There are functions that tell us the property of a given object. The is.numeric function tells us whether or not a given object is a numeric one. To use this function, we write:
is.numeric(x1)
## [1] TRUE
R gives you an answer by saying TRUE or FALSE. Since x1 is indeed a numeric object, we obtained an answer TRUE. However, if we ask:
is.numeric(x2); class(x2)
## [1] FALSE
## [1] "character"
is.numeric(x3)
## [1] FALSE
For both operations, the answers are FALSE. There are functions to find out if an object is a character object (is.character) or if an object is a logical object (is.logical).
is.character(x1)
## [1] FALSE
is.character(x2)
## [1] TRUE
is.character(x3)
## [1] FALSE
is.logical(x1)
## [1] FALSE
is.logical(x2)
## [1] FALSE
is.logical(x3)
## [1] TRUE
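The class function, used once above on x2, answers all of these questions in one step by reporting the type of an object directly. A minimal sketch, re-creating the same three objects so that it runs on its own:

```r
# Re-create the three objects from earlier in this section
x1 <- 0.3
x2 <- "Hello"
x3 <- TRUE

# class() returns the type of each object as a character string
class(x1)   # "numeric"
class(x2)   # "character"
class(x3)   # "logical"
```

The is.numeric, is.character, and is.logical functions ask a yes/no question about one particular type, whereas class tells you which type the object has.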
What we provided inside () is called an argument. We provided x1, x2, and x3 as arguments when running the above functions.
Some functions require that you provide one or more arguments inside (), whereas others don’t. For example, the ls function does not require any argument inside ().
ls()
## [1] "x1" "x2" "x3"
But, the is.numeric function returns an error if you try to use it without providing any argument:
# is.numeric()
R says Error in is.numeric(): 0 arguments passed to ‘is.numeric’ which requires 1. This means that this function needs you to specify one argument.
Let’s try using some other functions:
rm
The rm function deletes (removes) an object from the memory. For obvious reasons, this function requires an argument; otherwise R cannot know which object you want removed.
Let’s remove x1 from the memory.
rm(x1)
Let’s check if x1 was indeed removed.
# x1
R should give you an error message (Error: object ‘x1’ not found)
Or, you can do
ls()
## [1] "x2" "x3"
to see that x1 is not among the list of objects currently stored.
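A third way to check is the exists function, which takes the name of an object in quotation marks and returns TRUE or FALSE. The sketch below uses a throwaway object, tmp.value (a hypothetical name chosen only for this illustration), so that it does not interfere with the objects created above:

```r
# tmp.value is a hypothetical object used only for this illustration
tmp.value <- 42

# The object is in memory, so exists() returns TRUE
exists("tmp.value")

# Remove the object, then ask again
rm(tmp.value)
exists("tmp.value")   # FALSE: the object has been removed
```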
Math functions
There are functions to perform mathematical operations. Take the square root of a number:
sqrt(400)
## [1] 20
We could put a numerical object instead of a number
x4 <- 400
sqrt(x4)
## [1] 20
Take the absolute value of a number (i.e. unsigned number)
abs(-8)
## [1] 8
We can nest one function inside another function
sqrt( abs(-400) )
## [1] 20
Here, we ask R to return the absolute value of -400, then to return the square root of it.
Take the natural log of a number
log(25)
## [1] 3.218876
Exponentiate a number
exp(3.218876)
## [1] 25
As you may know, exp is the inverse function of log:
log(exp(1.234))
## [1] 1.234
exp(log(5.678))
## [1] 5.678
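The same round trip can be checked without reading the printed output by eye. The sketch below is only an illustration: it stores the number in an object (start.value and round.trip are names made up here) and compares the two with the all.equal function, which allows for the tiny rounding differences that arise when computers store decimal numbers.

```r
# start.value and round.trip are hypothetical names used for illustration
start.value <- 5.678

# Take the log, then exponentiate the result
round.trip <- exp(log(start.value))

# all.equal() returns TRUE when two numbers agree up to a tiny tolerance
all.equal(round.trip, start.value)   # TRUE
```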
Functions with multiple arguments
Some functions can take multiple arguments. For example, the sum function returns the sum of all arguments:
sum(1,2,3,4,5)
## [1] 15
This is equivalent to 1 + 2 + 3 + 4 + 5.
Sometimes you provide arguments to control what the function does. For example, the round function rounds off a number to a certain number of decimal places. In using this function,
- You must tell R what number you want rounded (the first argument)
- You can also provide an optional argument that specifies how many decimal places to round. This is done by providing a second argument with a comma separating it from the first argument
round(3.14159265, 4)
## [1] 3.1416
round(3.14159265, 3)
## [1] 3.142
round(3.14159265, 2)
## [1] 3.14
The second argument we provided controls how many decimal places the output should have. As noted above, this second argument is optional. If we don’t provide it, R assumes that we want the number rounded to an integer:
round(3.14159265)
## [1] 3
which is equivalent to specifying 0 as a second argument
round(3.14159265, digits = 0)
## [1] 3
In providing multiple arguments like this, it is highly recommended that you specify the name of the argument explicitly. In this particular case, the name of the second argument is “digits”.
round(3.14159265, digits = 4)
## [1] 3.1416
round(3.14159265, 4)
## [1] 3.1416
Although the two lines above give the same answer, the first is much more intuitive and informative for readers.
To find out what arguments you can (or you must) provide for a given function, and what value is used as a default value when you don’t provide any argument, we can take a look at a help file.
To take a look at the help file for a given function, we put a question mark in front of a function, as follows:
# ?class
# ?round
This opens up the help file in the Help pane. We can see that the help file explains five different functions that do similar operations, including ceiling, floor, trunc, round, and signif. Read the description for the round function. We can see that the default value of “digits” is 0. That is why the following two give us the same answers:
round(3.1415925, digits = 0)
## [1] 3
round(3.1415925)
## [1] 3
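Besides the help file, the args function gives a quick look at a function's arguments and their default values in the console. A minimal sketch (the exact text that args prints may differ slightly between R versions, so no output is shown for it):

```r
# args() displays the arguments of a function and any default values
args(round)

# Because digits defaults to 0, these two calls are the same
round(2.71828)               # 3
round(2.71828, digits = 0)   # 3
```

This is handy when you only need a reminder of an argument's name, whereas the help file explains what each argument does.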
2. Vectors
So far we have worked with single numbers. For example, we have seen how to multiply a single number 3 by another single number 23 (i.e., 3 * 23).
In statistical analysis, we will be working with sequences and tables of numbers rather than single numbers. They are called vectors and matrices.
A vector is nothing but a collection of numbers. Vectors do not have to be sequential or ordered.
The easiest way to create a vector is to use the c function (short for concatenate or combine).
c(2, -1, 0, 9)
## [1] 2 -1 0 9
The above is a vector that binds together four numbers, 2, -1, 0, 9. We can also store this vector into an object.
vec.1 <- c(2, -1, 0, 9)
vec.1
## [1] 2 -1 0 9
Another way to create a vector is to use the seq function (short for sequence). As the name implies, this function creates a sequence from one number to another. It takes at least two arguments, from and to:
seq(from = 0, to = 10)
## [1] 0 1 2 3 4 5 6 7 8 9 10
seq(from = 5, to = 1)
## [1] 5 4 3 2 1
There are some additional arguments you may provide: by or length. The by argument specifies the increment.
For example, if we specify by = 0.5, R will create a sequence with an increment of 0.5.
seq(from = 0, to = 5, by = 0.5)
## [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
The length argument determines the number of elements in the resulting sequence (its full name is length.out, but R accepts the abbreviation length). For example, if we specify “length = 10”, R will create a sequence that has 10 elements (numbers).
seq(from = 0, to = 2, length = 10)
## [1] 0.0000000 0.2222222 0.4444444 0.6666667 0.8888889 1.1111111 1.3333333
## [8] 1.5555556 1.7777778 2.0000000
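Because from, to, and by together already determine how many elements the sequence has, we cannot use by and length simultaneously. The following line produces an error: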
# seq(from = 0, to = 5, by = 0.5, length = 10)
When we just want to create a simple sequence with an increment of 1, we can use the : operator instead.
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
This is the same as
seq(from = 1, to = 10)
## [1] 1 2 3 4 5 6 7 8 9 10
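The relationship between by and length can be seen by building the same sequence both ways. This is only a sketch; the object names by.steps and by.count are made up for the illustration, and length.out is the full name of the argument abbreviated as length above.

```r
# by fixes the step size; the number of elements follows from it
by.steps <- seq(from = 0, to = 1, by = 0.25)
length(by.steps)                  # 5

# length.out fixes the number of elements; the step size follows from it
by.count <- seq(from = 0, to = 1, length.out = 5)

# Both approaches produce 0.00 0.25 0.50 0.75 1.00
all.equal(by.steps, by.count)     # TRUE

# The : operator and seq() give the same sequence of whole numbers
all(1:10 == seq(from = 1, to = 10))   # TRUE
```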
Vector operations
Let’s create several vectors and perform some operations on them.
x.vec <- seq(from = 2, to = 6)
y.vec <- c(3, 0, 1, -2, 4.3)
x.vec
## [1] 2 3 4 5 6
y.vec
## [1] 3.0 0.0 1.0 -2.0 4.3
We can do arithmetic operations with vectors. For example
x.vec
## [1] 2 3 4 5 6
x.vec + 3
## [1] 5 6 7 8 9
adds three to every number in the vector.
y.vec
## [1] 3.0 0.0 1.0 -2.0 4.3
y.vec * -2
## [1] -6.0 0.0 -2.0 4.0 -8.6
multiplies every number in the vector by minus 2.
Since the number of elements is the same for x.vec and y.vec, we can do arithmetic operations with them.
x.vec
## [1] 2 3 4 5 6
y.vec
## [1] 3.0 0.0 1.0 -2.0 4.3
x.vec + y.vec
## [1] 5.0 3.0 5.0 3.0 10.3
Notice that the first number (5.0) in the output is the sum of the first number in x.vec (2) and the first number in y.vec (3.0), the second number (3.0) in the output is the sum of the second numbers of x.vec and y.vec, and so on.
x.vec * y.vec
## [1] 6.0 0.0 4.0 -10.0 25.8
y.vec / x.vec
## [1] 1.5000000 0.0000000 0.2500000 -0.4000000 0.7166667
Functions work on vectors as well.
x.vec
## [1] 2 3 4 5 6
sqrt(x.vec)
## [1] 1.414214 1.732051 2.000000 2.236068 2.449490
What would happen if we try to combine two vectors of different lengths? It turns out that R will still return a result, but the output may not be what you’d expect. Moreover, R gives you a warning.
z.vec <- c(1,2,3)
x.vec
## [1] 2 3 4 5 6
x.vec + z.vec
## Warning in x.vec + z.vec: longer object length is not a multiple of shorter
## object length
## [1] 3 5 7 6 8
Here, x.vec is of length 5, whereas z.vec is of length 3. R reuses the elements of z.vec from the beginning (1, 2, 3, 1, 2) to match the length of x.vec, and warns you that the longer object length (5) is not a multiple of the shorter object length (3).
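To see where the result 3 5 7 6 8 comes from, we can stretch the shorter vector by hand before adding. This is a minimal sketch: the rep function and the name z.recycled are introduced here only for illustration.

```r
# The two vectors from the example above
x.vec <- seq(from = 2, to = 6)
z.vec <- c(1, 2, 3)

# Repeat z.vec from the start until it has as many elements as x.vec
z.recycled <- rep(z.vec, length.out = length(x.vec))
z.recycled            # 1 2 3 1 2

# Adding the stretched vector gives 3 5 7 6 8, this time without a warning
x.vec + z.recycled
```

Writing the repetition out explicitly like this makes the intention clear to readers and avoids relying on behavior that R itself flags with a warning.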
Matrices
When we have a collection of numbers that are arranged in two dimensions rather than one, we say we have a matrix. We can create one using the matrix function.
mat.1 <- matrix(data = seq(from = 1, to = 12), nrow = 3, ncol = 4)
mat.1
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Let’s see what the code above does.
The first argument data = ... specifies the contents (numbers) that are stored in the matrix called mat.1. In this particular case, we tell R to create a sequence of numbers from 1 to 12 (1,2,3,4,…,11,12).
The second argument nrow = ... specifies the number of rows. The third argument ncol = ... specifies the number of columns.
Obviously, if you specify nrow, you don’t really need to specify ncol because it’s implied (since we have 12 elements, nrow = 3 implies that ncol is 4).
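A quick way to confirm this is to build the matrix with and without ncol and compare the results. In the sketch below, mat.a and mat.b are hypothetical names used only for the comparison; the dim function reports the number of rows and columns.

```r
# With both nrow and ncol
mat.a <- matrix(data = seq(from = 1, to = 12), nrow = 3, ncol = 4)

# With nrow only; R works out that 12 elements in 3 rows need 4 columns
mat.b <- matrix(data = seq(from = 1, to = 12), nrow = 3)

# dim() returns the number of rows, then the number of columns
dim(mat.b)               # 3 4

# The two matrices are exactly the same
identical(mat.a, mat.b)  # TRUE
```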
When using functions that take multiple arguments like this, it is advisable to put the arguments on separate lines, as follows.
Select the following four lines and execute them:
mat.1 <- matrix(data = seq(from = 1, to = 12),
nrow = 3,
ncol = 4)
mat.1
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
This is to improve readability of your code. Readers can easily see that you are providing nrow and ncol as arguments for the matrix function, not for the seq function.
Notice that a matrix is nothing but a collection of vectors (and a vector is nothing but a collection of numbers). In other words, we can break down any matrix into vectors.
Let’s take a look at the mat.1 matrix again
mat.1
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
There are two ways to break down this matrix. First, we can think of this matrix as being made up of three row vectors. The first row vector is 1 4 7 10, the second is 2 5 8 11, and the third is 3 6 9 12.
To extract (row or column) vectors, we use square brackets []. For example, to extract the second row vector, we write
mat.1[2, ]
## [1] 2 5 8 11
Notice that we need to put a comma (,) after the number 2. This is to tell R that you want the 2nd row, not the 2nd element of the matrix.
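The difference the comma makes can be seen by running both forms. A minimal sketch, re-creating mat.1 so that it runs on its own:

```r
# Re-create the matrix from above
mat.1 <- matrix(data = seq(from = 1, to = 12), nrow = 3, ncol = 4)

# With the comma: the entire second row
mat.1[2, ]   # 2 5 8 11

# Without the comma: a single element, counted down the columns
mat.1[2]     # 2
mat.1[5]     # 5 (the second element of the second column)
```

Without a comma, R treats the matrix as one long vector whose elements are ordered column by column.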
The second way to break down this matrix is to think of it as being made up of four column vectors. The first column vector is 1 2 3, the second is 4 5 6, the third is 7 8 9, and the fourth is 10 11 12. To extract column vectors from a matrix, we also use square brackets [], but we put the number after the comma.
For example, to extract the third column vector, we write:
mat.1[ , 3]
## [1] 7 8 9
We can extract a cell entry (single number) by combining these two techniques that we just learned. That is, to extract the element stored in the i-th row, j-th column, we write [i, j]. For example, to find out the number in the 3rd row, 4th column, we write
mat.1[3, 4]
## [1] 12
If we don’t specify any row or column numbers inside the square brackets, R returns the entire matrix.
mat.1[ , ]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
mat.1[]
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
We can also specify multiple rows and/or columns at the same time.
mat.1[, c(2,3)] # This gives us the second and the third columns
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
## [3,] 6 9
mat.1[1:2, ] # This gives us the first and the second rows.
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
We can transpose a matrix
t(mat.1)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
Instead of using the matrix function, we can create a matrix by combining multiple vectors. Recall that we have created two vectors, x.vec and y.vec above. We can create new matrices by binding these two together.
The rbind function binds multiple vectors, treating them as row vectors.
rbind(x.vec, y.vec)
## [,1] [,2] [,3] [,4] [,5]
## x.vec 2 3 4 5 6.0
## y.vec 3 0 1 -2 4.3
Similarly, the cbind function binds multiple vectors treating them as column vectors.
cbind(x.vec, y.vec)
## x.vec y.vec
## [1,] 2 3.0
## [2,] 3 0.0
## [3,] 4 1.0
## [4,] 5 -2.0
## [5,] 6 4.3
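The two results are transposes of each other, which we can check with the dim and t functions. This is only a sketch; by.row and by.col are names made up for the illustration.

```r
# The two vectors from above
x.vec <- seq(from = 2, to = 6)
y.vec <- c(3, 0, 1, -2, 4.3)

# Bind them as rows, then as columns
by.row <- rbind(x.vec, y.vec)
by.col <- cbind(x.vec, y.vec)

dim(by.row)   # 2 5 (two rows, five columns)
dim(by.col)   # 5 2 (five rows, two columns)

# Transposing the row-bound matrix gives the column-bound one
all(t(by.row) == by.col)   # TRUE
```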
3. Data frame
In practice, we work with a data set that contains several different types of data (numeric / logical / character). These data sets are stored in a format called “data frame.”
Let’s construct a mini data set stored in the data frame format.
We assume that we have collected information on a few variables for 7 countries. The data set will look like this:
# Country_ID Country_Name Regime_Type GDP_PC EU_Member
# 1 United States Democracy 51163 FALSE
# 2 United Kingdom Democracy 39367 TRUE
# 3 Japan Democracy 46838 FALSE
# 4 China Dictatorship 6070 FALSE
# 5 Brazil Democracy 11347 FALSE
# 6 Germany Democracy 41376 TRUE
# 7 Egypt Dictatorship 3115 FALSE
The data set above has five variables (columns), of which
- two are numeric variables (Country_ID and GDP_PC);
- two are character variables (Country_Name and Regime_Type); and
- one is a logical variable (EU_Member).
Let’s now construct this data set “by hand” and store it as an R object.
It is useful to know how to construct a data set by hand, in case we come across any errors or problems in handling data sets. It is also useful to practice your data management skills with smaller data sets first, before working with bigger data sets.
Anyhow, one way to view the data set above is to think of it as a collection of (column) vectors (just like we viewed matrices as a collection of vectors).
So, we have a vector called Country_ID that contains 1, 2, 3, 4, 5, 6, 7 as values, another vector called Country_Name that contains “United States”, “United Kingdom”, … etc. as values, etc.
So, let’s first create 5 vectors, as follows
first column
Country_ID <- seq(from = 1, to = 7)
second column
Country_Name <- c("United States", "United Kingdom", "Japan", "China", "Brazil", "Germany", "Egypt")
The line above is a bit too long. We can split the statement across two lines instead and get the same result (you have to execute the two lines together).
Country_Name <- c("United States", "United Kingdom", "Japan",
"China", "Brazil", "Germany", "Egypt")
We can insert a line break at other places as well. For example, the following will also work
Country_Name <- c(
"United States", "United Kingdom", "Japan",
"China", "Brazil", "Germany", "Egypt")
However, you MUST NOT insert a line break within quotation marks, like the following
Bad_Vector <- c("United States", "United
Kingdom", "Japan", "China", "Brazil", "Germany", "Egypt")
This is not only difficult to read, but also incorrect. R will interpret it to mean that the second element is “United + (line break) + Kingdom”. So, if we take a look at it
# Bad_Vector
The symbol that shows up in the second element, \n, is how R displays a line break.
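The problem can be reproduced with a single string. In this minimal sketch, bad.name is a hypothetical object that contains a line break typed inside the quotation marks (the two lines must be executed together):

```r
# A string typed with a line break inside the quotation marks
bad.name <- "United
Kingdom"

# Printing the object shows the line break as \n
bad.name

# R does not treat it as the same text as the intended name
bad.name == "United Kingdom"   # FALSE
```

A mismatch like this is easy to miss, and it would cause trouble later if we tried to look up the country by name.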
Third, fourth, and fifth columns
Regime_Type <- c("Democracy", "Democracy", "Democracy", "Dictatorship",
"Democracy", "Democracy", "Dictatorship")
GDP_PC <- c(51163, 39367, 46838, 6070, 11347, 41376, 3115)
EU_Member <- c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
Notice that we don’t put quotation marks around TRUE or FALSE because these are special words that R recognizes as values. Notice also that TRUE and FALSE are shown in a different color (probably light blue / pale purple), because, again, R recognizes that they are special.
Now that we have all five columns, let’s bind them together and create a data set. We use the data.frame() function.
data.frame(Country_ID, Country_Name, Regime_Type, GDP_PC, EU_Member)
## Country_ID Country_Name Regime_Type GDP_PC EU_Member
## 1 1 United States Democracy 51163 FALSE
## 2 2 United Kingdom Democracy 39367 TRUE
## 3 3 Japan Democracy 46838 FALSE
## 4 4 China Dictatorship 6070 FALSE
## 5 5 Brazil Democracy 11347 FALSE
## 6 6 Germany Democracy 41376 TRUE
## 7 7 Egypt Dictatorship 3115 FALSE
# Of course, we could (and should) store this as an object. Let's call it
# my.data.
my.data <- data.frame(Country_ID, Country_Name, Regime_Type, GDP_PC, EU_Member)
# As the name of the function data.frame() implies, the object my.data
# is in the data.frame format.
class(my.data)
## [1] "data.frame"
# A nice thing about the data.frame object is that we can access each column
# by calling the column name. For example,
my.data
## Country_ID Country_Name Regime_Type GDP_PC EU_Member
## 1 1 United States Democracy 51163 FALSE
## 2 2 United Kingdom Democracy 39367 TRUE
## 3 3 Japan Democracy 46838 FALSE
## 4 4 China Dictatorship 6070 FALSE
## 5 5 Brazil Democracy 11347 FALSE
## 6 6 Germany Democracy 41376 TRUE
## 7 7 Egypt Dictatorship 3115 FALSE
# To access the 4th column (the GDP_PC variable), we type the data frame name
# (my.data), followed by a $ symbol and the column name (GDP_PC).
my.data $ GDP_PC
## [1] 51163 39367 46838 6070 11347 41376 3115
# Alternatively, we can use the square bracket (just like we did with matrices)
my.data[, 4]
## [1] 51163 39367 46838 6070 11347 41376 3115
# That said, calling the variable directly is better than using the [i,j]
# expression for the sake of readability of your code.
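The square brackets also accept a column name in quotation marks, which keeps the [i, j] form readable. The sketch below checks that the different ways of accessing a column return the same values; it uses mini.data, a hypothetical cut-down data frame with three of the seven countries, so that it runs on its own.

```r
# mini.data is a smaller stand-in for my.data, used only for illustration
mini.data <- data.frame(
  Country_Name = c("Japan", "China", "Brazil"),
  GDP_PC       = c(46838, 6070, 11347),
  EU_Member    = c(FALSE, FALSE, FALSE)
)

# Three ways to extract the GDP_PC column
by.dollar   <- mini.data$GDP_PC        # data frame name, $, column name
by.position <- mini.data[, 2]          # column number
by.name     <- mini.data[, "GDP_PC"]   # column name in quotation marks

# All three return the same vector
identical(by.dollar, by.position)   # TRUE
identical(by.dollar, by.name)       # TRUE
```

Referring to a column by name has a practical advantage over the column number: the code keeps working if the columns are later reordered.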
# We can apply various functions on data frames.
# For example, the summary function returns summaries of all the variables
# included in a data frame object.
summary(my.data)
## Country_ID Country_Name Regime_Type GDP_PC
## Min. :1.0 Length:7 Length:7 Min. : 3115
## 1st Qu.:2.5 Class :character Class :character 1st Qu.: 8708
## Median :4.0 Mode :character Mode :character Median :39367
## Mean :4.0 Mean :28468
## 3rd Qu.:5.5 3rd Qu.:44107
## Max. :7.0 Max. :51163
## EU_Member
## Mode :logical
## FALSE:5
## TRUE :2
##
##
##
# If we want a summary of just one variable in the data set, we pass that
# variable to the summary function:
summary(my.data $ GDP_PC)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3115 8708 39367 28468 44107 51163
# We see things like Min., 1st Qu, Median, Mean, etc.
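Each of these numbers can also be obtained on its own with a dedicated function. A minimal sketch, re-creating the GDP_PC vector from above so that it runs without my.data:

```r
# The GDP per capita values from the data set above
GDP_PC <- c(51163, 39367, 46838, 6070, 11347, 41376, 3115)

min(GDP_PC)      # 3115
median(GDP_PC)   # 39367
mean(GDP_PC)     # 28468
max(GDP_PC)      # 51163
```

These values match the Min., Median, Mean, and Max. entries in the summary output.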
This concludes this week’s joint exercise. Now, go to individual exercises!