In this lesson we will introduce ways to get data from external files into R, how R works with data, and how to use functions to explore your data frame.
After completing this lesson learners will be able to
Math130/notes
folder.In this lesson we are only going to explore reading files that exist on your computer into R from three most commonly used data sources: A tab-delimited text file, A CSV file and an Excel file.
The three different files we will be using have different file types, or extensions.
email.txt
is a .txt
or “text” file.NCBirths.csv
is a .csv
or “comma separated values” file.fatal-police-shootings-data.xlsx
is a .xlsx
or Excel file.Each of these file types differ in the type of delimiter used. The delimiter is a character or symbol that separates columns of data from each other when stored in an external file. Recall back to the earlier lesson on data frames and matrices. Each column in the matrix represented data on a specific variable. Something had to tell R how to distinguish which values went with which variable.
There are two main types of delimiters we will consider in this class; comma and tab. That does not mean that data can’t be stored in other ways, these are just the two most common.
Each of these different data types requires a different function or mechanism to import the data into R. If you use the wrong mechanism, the data may not be read in correctly if at all.
To import data into R, we have to tell the program explicitly where to find the files you just downloaded. To do that we need to find the file’s path. The path is a programmatic way to direct your computer to the location of a file. It’s like the file’s address, or, where it lives on your computer.
In all the examples below, the path shown (data/
) is the path on MY computer. You will have to update this path to YOUR path on YOUR machine that points to your MATH130 folder.
You can find the path of a file by navigating to the desired file in your finder or browser window. Right click and copy the file to the clipboard, then right click and pasting into a R markdown file. This will show a path that looks something like this:
:///C:/Math 130/Data/Challenger.txt file
You will need to remove the file:///
part at the beginning before using this path. You will have three chances in this lab to practice this.
Source: https://stackoverflow.com/questions/52695546/how-to-copy-path-of-a-file-in-mac-os
The first thing you should always do after importing a data file is look at the raw data and ask yourself the following things:
There are many ways this can be done. Here are three:
str()
to show you the structure of the data frame.head()
to show you the top 6 rowsdata[1:x, 1:x]
- when the data frame is too large for the above two to work well, we can use matrix notation to view only the first X columns, and first X rows.Next we will read in three different data sets, each with unique file types, and look at each one to make sure it was read in correctly.
Text files are very simple files that have a .txt
file extension. Common delimiters include a space, a comma (,) or a tab. Uncommon delimiters could include a % or even a semi-colon. By opening the file directly on your computer (not in R) you can see what delimiter is being used.
We will use the read.table()
function that is in base R to read in any type of delimited file. A tab-delimited text file can be read in using "\t"
as the deliminator character. In this class you ALWAYS want to include header=TRUE
to signify that the data in the first row contains our column names.
<- read.table("../data/email.txt", header=TRUE, sep="\t") email
Here we call the str()
or structure function to examine the data that was imported.
str(email)
## 'data.frame': 3921 obs. of 21 variables:
## $ spam : int 0 0 0 0 0 0 0 0 0 0 ...
## $ to_multiple : int 0 0 0 0 0 0 1 1 0 0 ...
## $ from : int 1 1 1 1 1 1 1 1 1 1 ...
## $ cc : int 0 0 0 0 0 0 0 1 0 0 ...
## $ sent_email : int 0 0 0 0 0 0 1 1 0 0 ...
## $ time : chr "2011-12-31 22:16:41" "2011-12-31 23:03:59" "2012-01-01 08:00:32" "2012-01-01 01:09:49" ...
## $ image : int 0 0 0 0 0 0 0 1 0 0 ...
## $ attach : int 0 0 0 0 0 0 0 1 0 0 ...
## $ dollar : int 0 0 4 0 0 0 0 0 0 0 ...
## $ winner : chr "no" "no" "no" "no" ...
## $ inherit : int 0 0 1 0 0 0 0 0 0 0 ...
## $ viagra : int 0 0 0 0 0 0 0 0 0 0 ...
## $ password : int 0 0 0 0 2 2 0 0 0 0 ...
## $ num_char : num 11.37 10.5 7.77 13.26 1.23 ...
## $ line_breaks : int 202 202 192 255 29 25 193 237 69 68 ...
## $ format : int 1 1 1 1 0 0 1 1 0 1 ...
## $ re_subj : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_subj: int 0 0 0 0 0 0 0 0 0 0 ...
## $ urgent_subj : int 0 0 0 0 0 0 0 0 0 0 ...
## $ exclaim_mess: int 0 1 6 48 1 1 1 18 1 0 ...
## $ number : chr "big" "small" "small" "small" ...
email
data set has 3921 observations, and 21 variables.$
.time
. If we wanted to examine a time trend, we would need to convert the class into a datetime
What happens if you forget to include the arguments for
header
orsep
? Try that now and discuss what happened.
Forgetting both arguments
<- read.table("../data/email.txt") email
## Error in read.table("../data/email.txt"): duplicate 'row.names' are not allowed
Forgetting the header
<- read.table("../data/email.txt", sep="\t")
email head(email)
Forgetting the separator
<- read.table("../data/email.txt", header=TRUE)
email head(email)
CSV is a fancy way of saying a text file with comma-separated values (i.e. CSV). This file type can also open in Excel–the icons even look similar. We could use read.table()
to import this data file, but read.csv()
is optimized to read in CSV files.
<- read.csv("../data/NCbirths.csv", header=TRUE)
NCbirths head(NCbirths)
## fage mage mature weeks premie visits marital gained weight
## 1 NA 13 younger mom 39 full term 10 married 38 7.63
## 2 NA 14 younger mom 42 full term 15 married 20 7.88
## 3 19 15 younger mom 37 full term 11 married 38 6.63
## 4 21 15 younger mom 41 full term 6 married 34 8.00
## 5 NA 15 younger mom 39 full term 9 married 27 6.38
## 6 NA 15 younger mom 38 full term 19 married 22 5.38
## lowbirthweight gender habit whitemom
## 1 not low male nonsmoker not white
## 2 not low male nonsmoker not white
## 3 not low female nonsmoker white
## 4 not low male nonsmoker white
## 5 not low female nonsmoker not white
## 6 low male nonsmoker not white
NCbirths
data set has 1000 rows and 13 columns.NA
values for the variable fage
(fathers age).The best method to read in Excel files is from the readxl
package by Hadley Wickham. This package need to be installed first, and then can be simply loaded using the library()
function each time you start an R
session where you will be reading in this type of data.
The read_excel()
function is what we are going to use. Note the use of the underscore _
instead of a period .
between read
and excel
.
library(readxl)
<- read_excel("../data/fatal-police-shootings-data.xlsx", sheet=1, col_names=TRUE)
police 1:10,1:5] police[
## # A tibble: 10 × 5
## id name date manner_of_death armed
## <dbl> <chr> <dttm> <chr> <chr>
## 1 3 Tim Elliot 2015-01-02 00:00:00 shot gun
## 2 4 Lewis Lee Lembke 2015-01-02 00:00:00 shot gun
## 3 5 John Paul Quintero 2015-01-03 00:00:00 shot and Tasered unarmed
## 4 8 Matthew Hoffman 2015-01-04 00:00:00 shot toy weapon
## 5 9 Michael Rodriguez 2015-01-04 00:00:00 shot nail gun
## 6 11 Kenneth Joe Brown 2015-01-04 00:00:00 shot gun
## 7 13 Kenneth Arnold Buck 2015-01-05 00:00:00 shot gun
## 8 15 Brock Nichols 2015-01-06 00:00:00 shot gun
## 9 16 Autumn Steele 2015-01-06 00:00:00 shot unarmed
## 10 17 Leslie Sapp III 2015-01-06 00:00:00 shot toy weapon
date
is a dttm
or date-time variable. This means R
recognizes it directly as a date, not some string of numbers.manner_of_death
and city
are read in as character
instead of factor. We may or may not want to change that later.age
are of type dbl
(double). This is similar to integer
or numeric
, so we are fine.