- Thanks to Prof. Cosma Shalizi (CMU Statistics) for this material
- What databases are, and why
- SQL
- Interfacing R and SQL
| R jargon | Database jargon |
|---|---|
| column | field |
| row | record |
| dataframe | table |
| types of the columns | table schema |
| bunch of related dataframes | database |
debt[debt$Country=="France",c("growth","ratio")]
with(debt,debt[Country=="France",c("growth","ratio")])
subset(x=debt,subset=(Country=="France"),select=c("growth","ratio"))
SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC|DESC] LIMIT offset,count;
SELECT is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables
The final semi-colon is obligatory
SELECT PlayerID,yearID,AB,H FROM Batting;
Four columns from table Batting
SELECT * FROM Salaries;
All columns from table Salaries
SELECT * FROM Salaries ORDER BY Salary;
As above, but by ascending value of Salary
SELECT * FROM Salaries ORDER BY Salary DESC;
Descending order
SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10;
top 10 salaries
Picking out rows meeting a condition
SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0;
vs.
Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")]
SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting;
SELECT AB,H,H/CAST(AB AS REAL) FROM Batting;
Because AB and H are integers, and it won't give you a fractional part by default
SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting ORDER BY BattingAvg DESC LIMIT 10;
We can do calculations on value-grouped subsets, like in aggregate or d*ply
SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID
WHEREGROUP BYHAVINGSELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID HAVING totalSalary > 200000000
FROM has just been one tablepatient_last |
patient_first |
physician_id |
complaint |
|---|---|---|---|
| Morgan | Dexter | 37010 | insomnia |
| Soprano | Anthony | 79676 | malaise |
| Swearengen | Albert | NA | healthy as a goddam horse |
| Garrett | Alma | 90091 | nerves |
| Holmes | Sherlock | 43675 | nicotine-patch addiction |
physician_last |
physician_first |
physicianID |
plan |
|---|---|---|---|
| Meridian | Emmett | 37010 | UPMC |
| Melfi | Jennifer | 79676 | BCBS |
| Cochran | Amos | 90091 | UPMC |
| Watson | John | 43675 | VA |
merge to link the tables up by physicianIDphysician_id or physicianID is acting as the key or unique identifiermerge, it has JOIN as a modifier to FROMSELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia"
Creates a (virtual) table linking records where physician_id in one table matches physicianID in the other
SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting USING(playerID);
INNER JOIN ... USING links records with the same value of playerID
LEFT OUTER JOIN includes records from the first table which don't match any record in the 2nd
NA in the 2nd table's fieldsRIGHT OUTER JOIN is just what you'd think
FULL OUTER JOIN| R jargon | Database jargon |
|---|---|
| column | field |
| row | record |
| dataframe | table |
| types of the columns | table schema |
| bunch of dataframes | database |
selections, subset |
SELECT ... FROM ... WHERE ... HAVING |
aggregate, d*ply |
GROUP BY |
merge |
JOIN |
order |
ORDER BY |
DBI is a unified interface to theminstall.packages("DBI", dependencies = TRUE) # Install DBI
install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite
library(RSQLite)
drv <- dbDriver('SQLite')
con <- dbConnect(drv, dbname="baseball.db")
con is now a persistent connection to the database baseball.db
dbListTables(con) # Get tables in the database (returns vector) dbListFields(con, name) # List fields in a table dbReadTable(con, name) # Import a table as a data frame
dbGetQuery(conn, statement) df <- dbGetQuery(con, paste( "SELECT nameLast,nameFirst,yearID,salary", "FROM Master NATURAL JOIN Salaries"))
Usual workflow:
sqldf package lets you use SQL commands on dataframes