Topic 2: An Introduction to R

About

This notebook is a first introduction to the R language for data analysis. Learn about calculation, variables, scalars, vectors, data frames and how to work with each in R.

An Introduction to R

Through these workbooks you’ll be interacting with data using the R programming language. Alongside Python, R is one of the most widely used tools for statistics and data science. This workbook is a first introduction to working with data using R.

Why R?

You might be asking why you need to use R rather than other software such as Minitab, SPSS, or SAS. Consider the following four points.

  • R is open source, meaning it is free to use and continually developed by a large community of users. You are not dependent on a paid license or a specific institution for access.
  • Because R is a programming language rather than a point-and-click interface, you are not limited to a fixed collection of menu options. While there is certainly a learning curve, R gives you tremendous flexibility in what you can analyze, visualize, and automate.
  • Analyses written in R are transparent and reproducible. Rather than trying to remember which buttons were clicked, you can save and rerun your exact analysis later, modify it easily, and share it with others.
  • We live in a world increasingly shaped by technology and data. Learning to read and write code is a valuable and transferable skill that extends far beyond statistics.

This workbook will introduce you to interacting with R through executable code cells. Whenever you see a code cell, feel free to play around with any code that appears in the cell and see how your changes impact the results of the code. You can execute code cells by hitting the Run button, using ctrl+Enter, or cmd+Enter. The majority of the code cells that you encounter are exercises that can give you feedback. Running the code will result in the corresponding output, along with an indicator of whether you were right or not. Don’t worry if you get anything wrong, you can always try again!

Try editing the following cell to execute 2 + 3 rather than 2 + 2, as is preset.

Hint 1

Replace the second two with a three.

2 + ___
Hint 2 (Solved)

Replace the second two with a three.

2 + 3

A few warnings

Coding can be frustrating – you are certain to run into errors (even very experienced programmers do!). Here are a few common things to watch out for.

  • Spelling counts! If you are getting an error, check to see that you’ve spelled everything correctly, including both function and object names.
  • Capitalization counts! Again, check to see that everything is typed exactly as it should be.
  • Be sure to ask for help when you need it.
    • For most functionality, R has built-in help documentation. You can find help on a function, package, or dataset by typing a question mark (?) followed immediately by its name and executing that line. For example, running ?max will open the help page for R’s max() function. Unfortunately the help documentation won’t populate within these interactive workbooks, but if you go back to RStudio, then running these commands from the prompt (>) in the console (the lower left pane) will work as described.
    • If you are struggling with troubleshooting your code, ask a teacher, mentor, or friend. Speaking from experience, it is really hard to identify small issues like a misplaced comma in code. A fresh set of eyes often does the trick.
    • Google is your friend here. If you are having an issue, chances are that someone else has had the same issue. Try a descriptive Google search, such as "how do I find the mean of a list of numbers in R". Stack Overflow (part of the Stack Exchange network) is a question and answer site that will commonly be listed in your search results. Check that the question at the top of the page is relevant to what you asked, and then scroll down to the answer with the green check mark next to it. This has been marked as the accepted solution.
      • While Google and the wider web can be great resources, they can also lead you to very complex solutions. The interactive code blocks in these notebooks will either ask you to implement functionality that has been previously introduced, or to make small edits to existing code. Web-based solutions that look quite complex are almost surely not the intended solutions to any code block task.
  • If you find yourself smashing keys and getting frustrated, it is time for a break. Those fresh eyes I mentioned earlier don’t always need to be somebody else’s. Coding is hard work and takes time, but with patience and practice you’ll improve quickly.

Hello World!

If you’ve never coded before (or even if you have), type print("Hello World!") in the interactive R cell below and run it by hitting ctrl+Enter (or cmd+Enter for Mac users).

Hint 1

Use the print() function to print out text.

print(___)
Hint 2

Use the print() function to print out text. Your text must go inside of quotes (single-quotes or double-quotes work, as long as you are consistent).

print("___")
Hint 3 (Solved)

Print "Hello World!"

print("Hello World!")

An R Primer

The following subsections constitute a primer for working in R. You’ll be exposed to R as a simple calculator and also as a tool for interfacing with data frames. Think of data frames (sometimes called data matrices or data arrays) as tables of data, just like Excel Spreadsheets. We’ll start by working with R as a calculator though, which you’ve actually done already in this workbook – when you changed the code cell to evaluate 2 + 3 instead of 2 + 2.

R can be used as a Calculator

You can (and will) use R to perform basic calculation tasks. You will not need the use of an additional calculator for this course – R is capable of everything your old calculator was and also much more.

Use the code block below to experiment.

Use the code block below to evaluate the expression 4^3 - 16.

Hint 1

Type in the expression.

Hint 2

Type in the expression.

___^___ - ___
Hint 3 (Solved)

Type in the expression.

4^3 - 16

R has built-in functions to handle operations like the square root (sqrt()), the logarithm (log()), and many others. Google is your friend here – if you need to carry out an operation, do a quick internet search to discover how you can call it in R.
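
As a quick illustration of calling built-in functions, here is a minimal sketch. The object names (root, nat_log, log_10, abs_val) are made up for this example and do not appear elsewhere in the workbook.

```r
# A few built-in math functions (object names are illustrative only)
root    <- sqrt(16)              # square root: 4
nat_log <- log(exp(1))           # log() is the natural logarithm by default: 1
log_10  <- log(100, base = 10)   # choose another base with the base argument: 2
abs_val <- abs(-7)               # absolute value: 7

root
nat_log
log_10
abs_val
```

Note that log() computes the natural logarithm unless you supply a base, which is a common source of surprise.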

We can also piece together more complicated expressions in R by making use of parentheses () for grouping. Beware that none of the other brackets (square brackets or curly braces) may be used for grouping since they have special meaning in R – we’ll see them in action later in our course. For now, use the code block below to evaluate \(\displaystyle{\frac{54 - 50}{8/\sqrt{20}}}\).

Hint 1

Replace each blank with an appropriate value or expression.

(___)/(___)
Hint 2

Replace each blank with an appropriate value or expression.

(___ - ___)/(___/___)
Hint 3

Replace each blank with an appropriate value or expression.

(___ - ___)/(___/sqrt(___))
Hint 4 (Solved)

Replace each blank with an appropriate value or expression.

(54 - 50)/(8/sqrt(20))

Add parentheses to the expression below so that it evaluates to 1.

Hint 1

Start by grouping the 8 - 6

(8 - 6)/2 + 5*3*4 - 12
Hint 2

Notice (8 - 6)/2 is 1; can you add another set of parentheses to “zero-out” all but this first term?

(8 - 6)/2 + 5*3*4 - 12
Hint 3 (Solved)

Notice (8 - 6)/2 is 1; can you add another set of parentheses to “zero-out” all but this first term?

(8 - 6)/2 + 5*(3*4 - 12)

Were you able to get the code to evaluate to 1 just by introducing parentheses? If not, check the solution code for a possible answer.
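
To see how much grouping matters, the sketch below evaluates the same numbers with zero, one, and two sets of parentheses. The unparenthesized starting expression is an assumption reconstructed from the hints above, and the object names are made up for this example.

```r
# Same numbers, different grouping (starting expression is assumed)
no_parens  <- 8 - 6/2 + 5*3*4 - 12        # 8 - 3 + 60 - 12 = 53
one_group  <- (8 - 6)/2 + 5*3*4 - 12      # 1 + 60 - 12 = 49
two_groups <- (8 - 6)/2 + 5*(3*4 - 12)    # 1 + 5*0 = 1

no_parens
one_group
two_groups
```

Without parentheses, R follows the usual order of operations: exponents first, then multiplication and division, then addition and subtraction.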

Defining Objects in R

In R we can define objects via the <- operator. For example, we can assign the value 4 to the container x using x <- 4. Use the code block below to change the value stored in x from 4 to -22.

Hint 1

Replace the 4 with the new value for x.

x <- ___
Hint 2 (Solved)

Replace the 4 with the new value for x.

x <- -22

Use the code block below to store the value 5 in the container y and then compute the quantity x*y - y.

Hint 1

The variable x is already defined, start by defining y as 5.

Hint 2

The variable x is already defined, start by defining y as 5.

y <- ___
Hint 3

The variable x is already defined, start by defining y as 5. Now compute x*y - y.

y <- 5
___
Hint 4 (Solved)

The variable x is already defined, start by defining y as 5. Now compute x*y - y.

y <- 5
x*y - y
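
Putting the two exercises together, here is a short sketch of how assignment behaves. The name result is introduced only for this illustration.

```r
x <- 4              # assignment stores the value and prints nothing
x                   # typing the name prints the stored value: 4

x <- -22            # assigning again overwrites the old value
y <- 5
result <- x*y - y   # (-22)(5) - 5 = -115
result
```

Storing a computed value in its own object, as with result here, lets you reuse it later without retyping the expression.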

In R, it is helpful to think of objects as either vectors (lists of entries) or data frames (tables), because that’s how R treats its objects. We can create an object called numVec which is a vector of six numerical values (1, 8, -2, 2, -99, 43) using numVec <- c(1, 8, -2, 2, -99, 43). Any time we wish to reference or provide a list of items in R, we use the c() function to combine those items. Once we have a vector, we can operate with it.

numVec <- c(1, 8, -2, 2, -99, 43)
numVec + c(1, 1, 1, 0, -2, 4)
[1]    2    9   -1    2 -101   47
1 + numVec
[1]   2   9  -1   3 -98  44
2*numVec
[1]    2   16   -4    4 -198   86

Notice that we can add two vectors of the same length to one another and the addition is completed coordinatewise. We can also add a single value to a vector and that single value is recycled and added to each component of the vector (each of the values). The same works for multiplication or division (and subtraction too).
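
The same vector can be used to check subtraction, division, and multiplication of two vectors. This is a small sketch; the names shifted, halved, and squared are made up for this example.

```r
numVec <- c(1, 8, -2, 2, -99, 43)

shifted <- numVec - 1        # 0  7  -3  1  -100  42
halved  <- numVec / 2        # 0.5  4.0  -1.0  1.0  -49.5  21.5
squared <- numVec * numVec   # 1  64  4  4  9801  1849

shifted
halved
squared
```

In each case the operation is applied to every entry of the vector, so no loop is needed.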

Pitfalls Associated with Recycling

While this recycling behavior may seem convenient (and it is), it does require that we be aware of a few possible pitfalls.

  • If we try to add two vectors that do not have the same length, R will indeed perform the operation but will warn us if the shorter vector is not recycled a full number of times while the operation is completed.
  • If the recycling works without issue, then R does not report a warning to us. This can be problematic, because typically we don’t want to add vectors of different sizes – we need to pay close attention.

numVec1 <- c(1, 2, 9, 7, 3, 4)
numVec2 <- c(2, 9, -2, 4)
numVec3 <- c(1, 7)

numVec1 + numVec2
Warning in numVec1 + numVec2: longer object length is not a multiple of shorter
object length
[1]  3 11  7 11  5 13
numVec1 + numVec3
[1]  2  9 10 14  4 11

While we are unlikely to encounter these issues in our course, they are worth calling out here.
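
One way to guard against silent recycling is to compare lengths before combining vectors. The sketch below is only a suggestion, not something the workbook requires; same_length and recycled are made-up names.

```r
numVec1 <- c(1, 2, 9, 7, 3, 4)
numVec3 <- c(1, 7)

# Check the lengths before adding
same_length <- length(numVec1) == length(numVec3)
same_length   # FALSE, so any addition will rely on recycling

# What R silently does with the shorter vector: repeat it to length 6
recycled <- rep(numVec3, length.out = length(numVec1))
recycled             # 1 7 1 7 1 7
numVec1 + recycled   # 2 9 10 14 4 11, the same as numVec1 + numVec3
```

Seeing the recycled vector written out makes it clear why no warning appears here: 2 divides 6 evenly, so the shorter vector is repeated exactly three times.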

Data Frames

We typically aren’t just working with scalars (single values) or a handful of vectors. Usually a statistician, analyst, or data scientist will be working with a table (or many tables) filled with raw data. We’ll call these tables data frames, and you can see the first six rows of the diamonds data frame below. You may remember the diamonds dataset from our first notebook.

carat  cut        color  clarity  depth  table  price  x     y     z
0.23   Ideal      E      SI2      61.5   55     326    3.95  3.98  2.43
0.21   Premium    E      SI1      59.8   61     326    3.89  3.84  2.31
0.23   Good       E      VS1      56.9   65     327    4.05  4.07  2.31
0.29   Premium    I      VS2      62.4   58     334    4.20  4.23  2.63
0.31   Good       J      SI2      63.3   58     335    4.34  4.35  2.75
0.24   Very Good  J      VVS2     62.8   57     336    3.94  3.96  2.48

We can call a column of a data frame in several different ways. For example, diamonds$cut, diamonds[ , "cut"], diamonds[ , 2], and diamonds |> select(cut) all access the cut column of the diamonds data frame (cut is the second column, as the table above shows). The first method returns a vector of values, while the others return a data frame. The last option makes use of a special operator (|>), called the pipe. We’ll see that pipes make data manipulation really convenient and readable. Explore each method of selecting a column in the code cell below. After you are done exploring and comparing the results, alter the original code so that it uses the piping method to access the carat column instead of the cut column, returning the result as a data frame.

Hint 1

Replace cut with carat inside of select().

diamonds |>
  select(cut)
Hint 2

Replace cut with carat inside of select().

diamonds |>
  select(___)
Hint 3 (Solved)

Replace cut with carat inside of select().

diamonds |>
  select(carat)
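
To compare what each access method returns without needing any packages, here is a sketch on a small hand-built table. The data frame toy is a made-up stand-in holding a few values from the rows shown earlier, and subset() is used in place of select() so that the example runs in base R (version 4.1 or later for the pipe). One caveat is flagged in the comments: toy is a base R data frame, which behaves differently from diamonds under [ , "cut"], assuming diamonds is loaded as a tibble as it is in the ggplot2 package.

```r
# Toy stand-in for a few rows of diamonds (a base R data frame, not a tibble)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.23),
  cut   = c("Ideal", "Premium", "Good"),
  price = c(326, 326, 327)
)

cut_vec <- toy$cut                       # a vector of values
cut_df  <- toy["cut"]                    # a one-column data frame
cut_sub <- toy |> subset(select = cut)   # piping, also a one-column data frame

is.data.frame(cut_vec)   # FALSE
is.data.frame(cut_df)    # TRUE
is.data.frame(cut_sub)   # TRUE

# Caveat: on a base data frame, [ , "cut"] drops down to a vector
# unless drop = FALSE is supplied; a tibble keeps the one-column table.
is.data.frame(toy[ , "cut"])                 # FALSE
is.data.frame(toy[ , "cut", drop = FALSE])   # TRUE
```

If a later step expects a table rather than a vector, checking with is.data.frame() like this is a quick way to confirm what you have.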

Submit

If you are part of a course with an instructor who is grading your work on these activities, please copy and submit the exercise hash below using the method your instructor has requested (there is only an exercise hash for this activity, no question hash).

Question Hash

Since there were no multiple choice or checkbox exercises in this activity, there is no question hash to generate. You’ll see both question and exercise hashes in the majority of future activities.

Exercise Hash

Click the button below to generate your exercise submission code. This hash encodes your work on the graded code exercises in this activity.

You must have attempted the graded exercises before clicking — clicking generates a snapshot of your current results. If you have completed the activity over multiple sessions, please go back through and hit the Run Code button on each graded exercise before generating the hash below, to ensure your most recent results are recorded.

Summary

Main Takeaways

  • R can be used as a simple calculator — in the same way as any hand-held calculator you’ve used in your previous coursework.
  • When doing multiplication in R, we must include the asterisk (*), otherwise R will throw an error. For example, 5(4) results in an error, but 5*(4) results in 20, as expected.
  • The first real advantage with R comes in interacting with large datasets called data frames.
    • A data frame is just like a single tab of an Excel or Google Sheets file — rows are observations and columns are variables.
    • We can access columns of a data frame using the $ operator (returns a vector) or by piping (|>) the data frame into select() (returns a data frame).
  • The second main advantage to working in a coding environment like R is that your entire analysis is fully documented and transparent in a way that analyses done using point-and-click software are not.
  • We live in a world where we are constantly interacting with software — learning R is learning a transferable skill in addition to learning statistics.
  • Coding is hard work. Stumbling is expected. Be patient with yourself, ask for help when you need it, and take breaks when you’re frustrated.

Looking Ahead

The next activity introduces descriptive statistics — computing and interpreting numerical summaries for both numerical and categorical data. You’ll use the tools introduced here (basic calculation, vectors, data frames, and the pipe) to calculate means, medians, standard deviations, construct frequency tables, and more with R.

Remember that you are not expected to have any prior programming background to be successful here. We’ll build everything we need from the ground up.