- Thanks to Prof. Cosma Shalizi (CMU Statistics) for this material
- What databases are, and why
- SQL
- Interfacing R and SQL
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
bunch of related dataframes | database |
debt[debt$Country=="France",c("growth","ratio")] with(debt,debt[Country=="France",c("growth","ratio")]) subset(x=debt,subset=(Country=="France"),select=c("growth","ratio"))
SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC|DESC] LIMIT offset,count;
SELECT
is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables
The final semi-colon is obligatory
SELECT PlayerID,yearID,AB,H FROM Batting;
Four columns from table Batting
SELECT * FROM Salaries;
All columns from table Salaries
SELECT * FROM Salaries ORDER BY Salary;
As above, but by ascending value of Salary
SELECT * FROM Salaries ORDER BY Salary DESC;
Descending order
SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10;
top 10 salaries
Picking out rows meeting a condition
SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0;
vs.
Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")]
SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting;
SELECT AB,H,H/CAST(AB AS REAL) FROM Batting;
Because AB
and H
are integers, and it won't give you a fractional part by default
SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting ORDER BY BattingAvg DESC LIMIT 10;
We can do calculations on value-grouped subsets, like in aggregate
or d*ply
SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID
WHERE
GROUP BY
HAVING
SELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID HAVING totalSalary > 200000000
FROM
has just been one tablepatient_last |
patient_first |
physician_id |
complaint |
---|---|---|---|
Morgan | Dexter | 37010 | insomnia |
Soprano | Anthony | 79676 | malaise |
Swearengen | Albert | NA | healthy as a goddam horse |
Garrett | Alma | 90091 | nerves |
Holmes | Sherlock | 43675 | nicotine-patch addiction |
physician_last |
physician_first |
physicianID |
plan |
---|---|---|---|
Meridian | Emmett | 37010 | UPMC |
Melfi | Jennifer | 79676 | BCBS |
Cochran | Amos | 90091 | UPMC |
Watson | John | 43675 | VA |
merge
to link the tables up by physicianID
physician_id
or physicianID
is acting as the key or unique identifiermerge
, it has JOIN
as a modifier to FROM
SELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia"
Creates a (virtual) table linking records where physician_id
in one table matches physicianID
in the other
SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting USING(playerID);
INNER JOIN ... USING
links records with the same value of playerID
LEFT OUTER JOIN
includes records from the first table which don't match any record in the 2nd
NA
in the 2nd table's fieldsRIGHT OUTER JOIN
is just what you'd think
FULL OUTER JOIN
R jargon | Database jargon |
---|---|
column | field |
row | record |
dataframe | table |
types of the columns | table schema |
bunch of dataframes | database |
selections, subset |
SELECT ... FROM ... WHERE ... HAVING |
aggregate , d*ply |
GROUP BY |
merge |
JOIN |
order |
ORDER BY |
DBI
is a unified interface to theminstall.packages("DBI", dependencies = TRUE) # Install DBI install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite library(RSQLite) drv <- dbDriver('SQLite') con <- dbConnect(drv, dbname="baseball.db")
con
is now a persistent connection to the database baseball.db
dbListTables(con) # Get tables in the database (returns vector) dbListFields(con, name) # List fields in a table dbReadTable(con, name) # Import a table as a data frame
dbGetQuery(conn, statement) df <- dbGetQuery(con, paste( "SELECT nameLast,nameFirst,yearID,salary", "FROM Master NATURAL JOIN Salaries"))
Usual workflow:
sqldf
package lets you use SQL commands on dataframes