--- title: 'Databases' author: "James M. Flegal" output: ioslides_presentation: smaller: yes --- ## Agenda - Thanks to Prof. Cosma Shalizi (CMU Statistics) for this material - What databases are, and why - SQL - Interfacing R and SQL ## Databases - A **record** is a collection of **fields** - A **table** is a collection of records which all have the same fields (with different values) - A **database** is a collection of tables ## Databases vs. Dataframes - R's dataframes are actually tables R jargon | Database jargon --------|---------------- column | field row | record dataframe | table types of the columns | table **schema** bunch of related dataframes | database ## Why Do We Need Database Software? - _Size_ + R keeps its dataframes in memory + Industrial databases can be much bigger + Work with selected subsets - _Speed_ + Clever people have worked very hard on getting just what you want fast - _Concurrency_ + Many users accessing the same database simultaneously + Lots of potential for trouble (two users want to change the same record at once) ## The Client-Server Model - Databases live on a **server**, which manages them - Users interact with the server through a **client** program - Lets multiple users access the same database simultaneously ## SQL - **SQL** (**structured query language**) is the standard for database software - Mostly about **queries**, which are like doing a selection in R ``` debt[debt$Country=="France",c("growth","ratio")] with(debt,debt[Country=="France",c("growth","ratio")]) subset(x=debt,subset=(Country=="France"),select=c("growth","ratio")) ``` - Let's look at how SQL does stuff like this ## SELECT ``` SELECT columns or computations FROM table WHERE condition GROUP BY columns HAVING condition ORDER BY column [ASC|DESC] LIMIT offset,count; ``` - `SELECT` is the first word of a query, then modifiers say which fields/columns to use, and what conditions records/rows must meet, from which tables - The final semi-colon is obligatory ## SELECT ``` SELECT PlayerID,yearID,AB,H FROM Batting; ``` Four columns from table `Batting` ``` SELECT * FROM Salaries; ``` All columns from table `Salaries` ``` SELECT * FROM Salaries ORDER BY Salary; ``` As above, but by ascending value of `Salary` ``` SELECT * FROM Salaries ORDER BY Salary DESC; ``` Descending order ``` SELECT * FROM Salaries ORDER BY Salary DESC LIMIT 10; ``` top 10 salaries ## SELECT Picking out rows meeting a condition ``` SELECT PlayerID,yearID,AB,H FROM Batting WHERE AB > 100 AND H > 0; ``` vs. ``` Batting[Batting$AB>100 & Batting$H > 0, c("PlayerID","yearID","AB","H")] ``` ## Calculated Columns - SQL knows about some simple summary statistics: ``` SELECT MIN(AB), AVG(AB), MAX(AB) FROM Batting; ``` - It can do arithmetic ``` SELECT AB,H,H/CAST(AB AS REAL) FROM Batting; ``` Because `AB` and `H` are integers, and it won't give you a fractional part by default - Calculated columns can get names: ``` SELECT PlayerID,yearID,H/CAST(AB AS REAL) AS BattingAvg FROM Batting ORDER BY BattingAvg DESC LIMIT 10; ``` ## Aggregating We can do calculations on value-grouped subsets, like in `aggregate` or `d*ply` ``` SELECT playerID, SUM(salary) FROM Salaries GROUP BY playerID ``` ## Selecting Again - First cut of records is with `WHERE` - Aggregation of recordw with `GROUP BY` - Post-aggregation selection with `HAVING` ``` SELECT playerID, SUM(salary) AS totalSalary FROM Salaries GROUP BY playerID HAVING totalSalary > 200000000 ``` ## JOIN - So far `FROM` has just been one table - Sometimes we need to combine information from many tables `patient_last` | `patient_first` | `physician_id` | complaint ---------------|-----------------|---------------|---------- Morgan | Dexter | 37010 | insomnia Soprano | Anthony | 79676 | malaise Swearengen | Albert | NA | healthy as a goddam horse Garrett | Alma | 90091 | nerves Holmes | Sherlock | 43675 | nicotine-patch addiction `physician_last` | `physician_first` | `physicianID` | `plan` ------------------|-------------------|---------------|------ Meridian | Emmett | 37010 | UPMC Melfi | Jennifer | 79676 | BCBS Cochran | Amos | 90091 | UPMC Watson | John | 43675 | VA ## JOIN - Suppose we want to know which doctors are treating patients for insomnia + Complaints are in one table + Physicians are in the other - In R, we'd use `merge` to link the tables up by `physicianID` - Here, `physician_id` or `physicianID` is acting as the **key** or **unique identifier** ## JOIN - SQL doesn't have `merge`, it has `JOIN` as a modifier to `FROM` ``` SELECT physician_first, physician_last FROM patients INNER JOIN physicians ON patients.physician_id == physicians.physicianID WHERE condition=="insomnia" ``` Creates a (virtual) table linking records where `physician_id` in one table matches `physicianID` in the other - If the names were the same in the two tables, we could write (e.g.) ``` SELECT nameLast,nameFirst,yearID,AB,H FROM Master INNER JOIN Batting USING(playerID); ``` `INNER JOIN ... USING` links records with the same value of `playerID` - There are some syntax variants here; see the handout ## JOIN - `LEFT OUTER JOIN` includes records from the first table which don't match any record in the 2nd + The "extra" records get `NA` in the 2nd table's fields - `RIGHT OUTER JOIN` is just what you'd think + so is `FULL OUTER JOIN` ## Updated Translation Table R jargon | Database jargon --------|---------------- column | field row | record dataframe | table types of the columns | table schema bunch of dataframes | database selections, `subset` | `SELECT ... FROM ... WHERE ... HAVING` `aggregate`, `d*ply` | `GROUP BY` `merge` | `JOIN` `order` | `ORDER BY` ## Connecting R to SQL - SQL is a language; database management systems (DMBS) actually implement it and do the work + MySQL, SQLite, etc., etc. - They all have somewhat different conventions - The R package `DBI` is a unified interface to them - Need a separate "driver" for each DBMS ## Connecting R to SQL ``` install.packages("DBI", dependencies = TRUE) # Install DBI install.packages("RSQLite", dependencies = TRUE) # Install driver for SQLite library(RSQLite) drv <- dbDriver('SQLite') con <- dbConnect(drv, dbname="baseball.db") ``` `con` is now a persistent connection to the database `baseball.db` ## Connecting R to SQL ``` dbListTables(con) # Get tables in the database (returns vector) dbListFields(con, name) # List fields in a table dbReadTable(con, name) # Import a table as a data frame ``` ## Connecting R to SQL ``` dbGetQuery(conn, statement) df <- dbGetQuery(con, paste( "SELECT nameLast,nameFirst,yearID,salary", "FROM Master NATURAL JOIN Salaries")) ``` ## Connecting R to SQL Usual workflow: - Load the driver, connect to the right database - R sends an SQL query to the DBMS - SQL **executes the query**, sending back a manageably small dataframe - R does the actual statistics - Close the connection when you're done ## Going the Other Way - The `sqldf` package lets you use SQL commands on dataframes - Mostly useful if you already know SQL better than R... ## Summary - A database is basically a way of dealing efficiently with lots of potentially huge dataframes - SQL is the standard language for telling databases what to do, especially what queries to run - Everything in an SQL query is something we've practiced already in R + subsetting/selection, aggregation, merging, ordering - Connect R to the database, send it an SQL query, analyse the returned dataframe - More information at [http://www.stat.cmu.edu/~cshalizi/statcomp/14/]