Agenda

  • Thanks to Prof. Cosma Shalizi (CMU Statistics) for this material
  • What databases are, and why
  • SQL
  • Interfacing R and SQL

Databases

  • A record is a collection of fields
  • A table is a collection of records which all have the same fields (with different values)
  • A database is a collection of tables

Databases vs. Dataframes

  • R's dataframes are actually tables
R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
bunch of related dataframes database

Why Do We Need Database Software?

  • Size
    • R keeps its dataframes in memory
    • Industrial databases can be much bigger
    • Work with selected subsets
  • Speed
    • Clever people have worked very hard on getting just what you want fast
  • Concurrency
    • Many users accessing the same database simultaneously
    • Lots of potential for trouble (two users want to change the same record at once)

The Client-Server Model

  • Databases live on a server, which manages them
  • Users interact with the server through a client program
  • Lets multiple users access the same database simultaneously

SQL

  • SQL (structured query language) is the standard for database software
  • Mostly about queries, which are like doing a selection in R
debt[debt$Country=="France",c("growth","ratio")]
with(debt,debt[Country=="France",c("growth","ratio")])
subset(x=debt,subset=(Country=="France"),select=c("growth","ratio"))
  • Let's look at how SQL does stuff like this

SELECT

SELECT columns or computations
  FROM table
  WHERE condition
  GROUP BY columns
  HAVING condition
  ORDER BY column [ASC|DESC]
  LIMIT offset,count;
  • SELECT is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables

  • The final semi-colon is obligatory

SELECT

SELECT PlayerID,yearID,AB,H FROM Batting;

Four columns from table Batting

SELECT * FROM Salaries;

All columns from table Salaries

SELECT * FROM Salaries ORDER BY Salary;

As above, but by ascending value of Salary

SELECT * FROM Salaries ORDER BY Salary DESC;

Descending order

SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10;

top 10 salaries

SELECT

Picking out rows meeting a condition

SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0;

vs.

Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")]

Calculated Columns

  • SQL knows about some simple summary statistics:
SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting;
  • It can do arithmetic
SELECT AB,H,H/CAST(AB AS REAL) FROM Batting;

Because AB and H are integers, and it won't give you a fractional part by default

  • Calculated columns can get names:
SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting
  ORDER BY BattingAvg DESC LIMIT 10;

Aggregating

We can do calculations on value-grouped subsets, like in aggregate or d*ply

SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID

Selecting Again

  • First cut of records is with WHERE
  • Aggregation of recordw with GROUP BY
  • Post-aggregation selection with HAVING
SELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID
  HAVING totalSalary > 200000000

JOIN

  • So far FROM has just been one table
  • Sometimes we need to combine information from many tables
patient_last patient_first physician_id complaint
Morgan Dexter 37010 insomnia
Soprano Anthony 79676 malaise
Swearengen Albert NA healthy as a goddam horse
Garrett Alma 90091 nerves
Holmes Sherlock 43675 nicotine-patch addiction
physician_last physician_first physicianID plan
Meridian Emmett 37010 UPMC
Melfi Jennifer 79676 BCBS
Cochran Amos 90091 UPMC
Watson John 43675 VA

JOIN

  • Suppose we want to know which doctors are treating patients for insomnia
  • Complaints are in one table
  • Physicians are in the other
  • In R, we'd use merge to link the tables up by physicianID
  • Here, physician_id or physicianID is acting as the key or unique identifier

JOIN

  • SQL doesn't have merge, it has JOIN as a modifier to FROM
SELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia"

Creates a (virtual) table linking records where physician_id in one table matches physicianID in the other

  • If the names were the same in the two tables, we could write (e.g.)
SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting
  USING(playerID);

INNER JOIN ... USING links records with the same value of playerID

  • There are some syntax variants here; see the handout

JOIN

  • LEFT OUTER JOIN includes records from the first table which don't match any record in the 2nd
    • The "extra" records get NA in the 2nd table's fields
  • RIGHT OUTER JOIN is just what you'd think
    • so is FULL OUTER JOIN

Updated Translation Table

R jargon Database jargon
column field
row record
dataframe table
types of the columns table schema
bunch of dataframes database
selections, subset SELECT ... FROM ... WHERE ... HAVING
aggregate, d*ply GROUP BY
merge JOIN
order ORDER BY

Connecting R to SQL

  • SQL is a language; database management systems (DMBS) actually implement it and do the work
    • MySQL, SQLite, etc., etc.
  • They all have somewhat different conventions
  • The R package DBI is a unified interface to them
  • Need a separate "driver" for each DBMS

Connecting R to SQL

install.packages("DBI", dependencies = TRUE) # Install DBI
install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite
library(RSQLite)
drv <- dbDriver('SQLite')
con <- dbConnect(drv, dbname="baseball.db")

con is now a persistent connection to the database baseball.db

Connecting R to SQL

dbListTables(con)         # Get tables in the database (returns vector)
dbListFields(con, name)  # List fields in a table
dbReadTable(con, name)   # Import a table as a data frame

Connecting R to SQL

dbGetQuery(conn, statement)
df <- dbGetQuery(con, paste(
  "SELECT nameLast,nameFirst,yearID,salary",
  "FROM Master NATURAL JOIN Salaries"))

Connecting R to SQL

Usual workflow:

  • Load the driver, connect to the right database
  • R sends an SQL query to the DBMS
  • SQL executes the query, sending back a manageably small dataframe
  • R does the actual statistics
  • Close the connection when you're done

Going the Other Way

  • The sqldf package lets you use SQL commands on dataframes
  • Mostly useful if you already know SQL better than R…

Summary

  • A database is basically a way of dealing efficiently with lots of potentially huge dataframes
  • SQL is the standard language for telling databases what to do, especially what queries to run
  • Everything in an SQL query is something we've practiced already in R
    • subsetting/selection, aggregation, merging, ordering
  • Connect R to the database, send it an SQL query, analyse the returned dataframe
  • More information at [http://www.stat.cmu.edu/~cshalizi/statcomp/14/]