readr
read_delim()
and read_csv()
from the readr
package
# example for character delimited: read_delim(file = "file.txt", delim = "\t")
# comma delimited: read_csv("file.csv")
https://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv
# From URL ufo <- read_csv( "https://sisbid.github.io/Data-Wrangling/data/ufo/ufo_data_complete.csv" ) # From your 'data-wrangling' directory ufo <- read_csv("ufo_data_complete.csv")
(Warning message: One or more parsing issues, call 'problems()'
– more on this later)
The read_delim()
and related functions return a “tibble” is a data.frame
with special printing, which is the primary data format for most data cleaning and analyses.
class(ufo)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Check to make sure you see the new object in the Environment pane.
There are also data importing functions provided in base R (rather than the readr
package), like read.delim
and read.csv
.
These functions have slightly different syntax for reading in data, like header
and as.is
.
However, while many online resources use the base R tools, recent versions of RStudio switched to use these new readr
data import tools, so we will use them here. They are also up to two times faster for reading in large datasets, and have a progress bar which is nice.
readr
read_table()
from the readr
package, allows any number of whitespace characters between columns, and the lines can be of different lengths.
# example for whitespace delimited : read_table(file = "file.txt")
The argument trim_ws
removes trailing and leading spaces around your data.
# example: read_csv(file = "file.txt", trim_ws = TRUE)
What if your file is in the “Home” directory?
Backtrack using the relative path with ../
like:
ufo <- read_csv("../ufo_data_complete.csv.gz")
Or, read in from a subfolder:
ufo <- read_csv("data/ufo/ufo_data_complete.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.: dat <- vroom(...) problems(dat)
Rows: 88875 Columns: 11 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (10): datetime, city, state, country, shape, duration (hours/min), comme... dbl (1): duration (seconds) ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head()
shows first few rowstail()
shows the last few rowsView()
shows the data as a spreadsheetspec()
gives specification of column typesstr()
gives the column types and specsglimpse()
similar to str
(dplyr package)nrow()
displays the number of rows of a data framencol()
displays the number of columnsdim()
displays a vector of length 2: # rows, # columnsnrow(ufo)
[1] 88875
ncol(ufo)
[1] 11
dim(ufo)
[1] 88875 11
colnames()
displays the column namescolnames(ufo)
[1] "datetime" "city" "state" [4] "country" "shape" "duration (seconds)" [7] "duration (hours/min)" "comments" "date posted" [10] "latitude" "longitude"
glimpse()
glimpse(ufo)
Rows: 88,875 Columns: 11 $ datetime <chr> "10/10/1949 20:30", "10/10/1949 21:00", "10/10/… $ city <chr> "san marcos", "lackland afb", "chester (uk/engl… $ state <chr> "tx", "tx", NA, "tx", "hi", "tn", NA, "ct", "al… $ country <chr> "us", NA, "gb", "us", "us", "us", "gb", "us", "… $ shape <chr> "cylinder", "light", "circle", "circle", "light… $ `duration (seconds)` <dbl> 2700, 7200, 20, 20, 900, 300, 180, 1200, 180, 1… $ `duration (hours/min)` <chr> "45 minutes", "1-2 hrs", "20 seconds", "1/2 hou… $ comments <chr> "This event took place in early fall around 194… $ `date posted` <chr> "4/27/2004", "12/16/2005", "1/21/2008", "1/17/2… $ latitude <chr> "29.8830556", "29.38421", "53.2", "28.9783333",… $ longitude <chr> "-97.9411111", "-98.581082", "-2.916667", "-96.…
head(problems(ufo))
# A tibble: 6 × 5 row col expected actual file <int> <int> <chr> <chr> <chr> 1 878 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl… 2 1713 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl… 3 1815 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl… 4 2858 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl… 5 3734 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl… 6 4756 12 11 columns 12 columns /Users/avahoffman/Dropbox/JHSPH/Data-Wrangl…
dim(problems(ufo))
[1] 199 5
read_excel()
function from readxl
packagexlsx
, openxlsx
haven
package has functions to read SAS, SPSS, Stata formatsUse the sheet
argument to indicate which sheet to pull from. It can refer to the sheet’s index or name.
# example: read_excel(path = "file.xlsx", sheet = 2) read_excel(path = "file.xlsx", sheet = "data")
While its nice to be able to read in a variety of data formats, it’s equally important to be able to output data somewhere.
write_delim()
: Write a data frame to a delimited file write_csv()
: Write a data frame to a comma-delimited file
This is about twice as fast as write.csv()
, and never writes row names.
For example, we can write back out just the first 100 lines of the ufo
dataset:
first_100 <- ufo[1:100,] write_delim(first_100, file = "ufo_first100.csv", delim = ",") write_csv(first_100, file = "ufo_first100.csv")
write_rds
If you want to save one object, you can use readr::write_rds
to save to a compressed rds
file:
write_rds(ufo, file = "ufo_dataset.rds", compress = "xz")
Read it back in:
ufo_new <- read_rds(file = "ufo_dataset.rds")
save
The save
command can save a set of R
objects into an “R data file”, with the extension .rda
or .RData
.
x = 5 save(ufo, x, file = "ufo_data.rda")
The opposite of save
is load
.
load(file = "ufo_data.rda")
read_delim()
, read_csv()
, read_table()
for common data typestrim_ws
and na
arguments!read_excel()
has the sheet
argument for reading from different sheets of the Excel filestr()
, View()
, and glimpse()
can help you understand your data betterwrite_delim()
and write_csv()
https://sisbid.github.io/Data-Wrangling/labs/data-io-lab-part2.Rmd