Tutorial 5
This tutorial is a little more advanced, and a little more "real world." Please copy the commands below into Sequel Pro (or alternatively, run mysql from the command line on our AWS server after ssh'ing in). It won't get much more complicated than this in our use of SQL.
NOTE: In this tutorial, the username is ubuntu. Replace references to ubuntu with your <sunet-id>.
List all the databases. (You may not see some of the databases below because you have been given access to only a subset of them, and you may see others that you made or that we made for you.)
show databases
Database |
---|
sql_intro |
ubuntu |
ubuntu_album |
ubuntu_test |
ubuntu_world |
The database sql_intro contains our tables for this tutorial. Enter the database.
use sql_intro
List the tables.
show tables
Tables_in_sql_intro |
---|
counties |
tweets |
Copy Table
We will use both tables, counties and tweets, in this tutorial. Because we will be modifying them during the tutorial, let us make copies.
First enter our database, where we will create the tables. Then capture the output of select into a table, which effectively copies the table completely. If we wanted to create a new table with a subset of columns, we would specify that in place of *. Note the use of ';' to separate multiple statements.
use ubuntu;
create table tweets as (select * from sql_intro.tweets);
create table counties as (select * from sql_intro.counties);
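Side note: create table ... as select copies the data but not indexes, keys, or column attributes such as auto_increment. If you ever need an exact structural copy, here is a sketch of the standard two-step MySQL idiom (the name tweets_exact is just for illustration; we don't need this here):

-- Copy the structure, including any indexes and keys, then copy the data.
create table tweets_exact like sql_intro.tweets;
insert into tweets_exact select * from sql_intro.tweets;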
Check the schema/structure of the table.
describe tweets
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None |
View Sample Of Records
Now that we have copied the tables, let's look at a sample of tweets. Instead of looking at consecutive records, let's look at a random sample. Run the query multiple times to see different samples. Can you see how this prints random records?
select * from tweets order by rand() limit 5
coordinates | created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|---|
| | 2020-05-11 11:08:34 | 51065 | 1259802574917521411 | | | Amid This Pandemic, Women CEOs Share That Leadership And Culture Matter Most https://t.co/W6kC3M6XBM https://t.co/5gU97cppsw | Thu Jun 02 22:57:19 +0000 2011 | 309942480 | 62998 |
| | 2020-05-10 16:01:06 | 36029 | 1259513805270405124 | | | Supporting #FirstResponders + #NursesAreHeroes Movement // Please share with your network // The more support we can get, the more help we can deploy to #HealthCareWorkers // #COVID19 // #UnitedWeStand // #Justhanks >> Powered by the #THANKSBLOCK CHAIN... https://t.co/Tu1CsQ7vws | Sun Mar 31 17:32:21 +0000 2019 | 1112407295386963969 | 2398 |
| | 2020-05-13 09:05:37 | 06037 | 1260496410497163264 | | | Can you imagine, where are the PPE’s https://t.co/7TKsfPRUIw | Mon Apr 14 22:26:20 +0000 2014 | 2444536730 | 357 |
| | 2020-05-16 17:09:01 | 06037 | 1261705224260972544 | | | A great @ProjectLincoln ad would be one that draws a parallel between Trump and his obsession over ONE person, Obama, and him not giving a solitary shit about the almost 90000 people who have died in this pandemic. What say you @TheRickWilson and @gtconway3d? | Mon Dec 15 00:05:05 +0000 2008 | 18125906 | 12783 |
| | 2020-05-07 20:05:15 | 12101 | 1258488084779450368 | | | May 7, 2020 COVID-19 Update - https://t.co/QUBKS3hZr2 https://t.co/Q5Ffr82iL9 | Thu Sep 01 04:06:28 +0000 2016 | 771197505836986369 | 765 |
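By the way, if you want a "random" sample that is reproducible (handy when debugging), rand() optionally takes a seed; a sketch, with an arbitrary seed of 42:

-- With a constant seed, the ordering is the same on every run.
select * from tweets order by rand(42) limit 5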
Drop Column
Let's check if the column coordinates contains anything of relevance.
select count(*) as n_tweets from tweets where length(coordinates) > 0
n_tweets |
---|
0 |
So the coordinates column is empty for the entire table. We might as well drop the column altogether.
alter table tweets drop column coordinates
Remove Duplicates
Check if there are any duplicates in the data by finding the number of distinct tweet ids.
select count(distinct(id)) as n_tweets, count(*) as total from tweets
n_tweets | total |
---|---|
2283613 | 2370923 |
So there are duplicate tweets in this data. Let us make a table n_duplicates with the number of duplicates for every tweet id, restricted to the tweets that actually have duplicates.
create table n_duplicates as (select id, count(*) as n_duplicates from tweets group by id having count(*) > 1)
Print a few entries from the duplicates table.
select * from n_duplicates limit 10 offset 100
id | n_duplicates |
---|---|
1257490235446841344 | 2 |
1257490235664949249 | 2 |
1257490237091127299 | 2 |
1257490238051663873 | 2 |
1257490238399799296 | 2 |
1257490239037362176 | 2 |
1257490241289687042 | 2 |
1257490242141126657 | 2 |
1257490242359234561 | 2 |
1257490242543779842 | 2 |
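To get a feel for how widespread the duplication is, we can count how many ids occur twice, three times, and so on; a quick sketch:

-- Distribution of duplicate counts across tweet ids.
select n_duplicates, count(*) as n_ids from n_duplicates
group by n_duplicates order by n_duplicates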
There are indeed duplicate tweets in this data. Print the full tweets for some of these ids. Can you see why order by is needed here?
select * from tweets where id in (select id from n_duplicates)
order by id limit 8 offset 150
created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|
2020-05-05 02:00:04 | 27123 | 1257490213846355968 | | | 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L | Wed Dec 01 16:21:42 +0000 2010 | 221787532 | 7330 |
2020-05-05 02:00:04 | 27123 | 1257490213846355968 | | | 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L | Wed Dec 01 16:21:42 +0000 2010 | 221787532 | 7330 |
2020-05-05 02:00:04 | 06067 | 1257490214202732545 | US | Sacramento, CA | The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews | Fri Jul 21 15:16:36 +0000 2017 | 888417429583417345 | 16148 |
2020-05-05 02:00:04 | 06067 | 1257490214202732545 | US | Sacramento, CA | The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews | Fri Jul 21 15:16:36 +0000 2017 | 888417429583417345 | 16148 |
2020-05-05 02:00:05 | 31055 | 1257490215784124417 | | | From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 | Mon Mar 09 15:26:09 +0000 2009 | 23447418 | 92650 |
2020-05-05 02:00:05 | 31055 | 1257490215784124417 | | | From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 | Mon Mar 09 15:26:09 +0000 2009 | 23447418 | 92650 |
2020-05-05 02:00:05 | 42043 | 1257490216065150978 | | | As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH | Tue Jun 17 12:36:05 +0000 2008 | 15145034 | 164394 |
2020-05-05 02:00:05 | 42043 | 1257490216065150978 | | | As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH | Tue Jun 17 12:36:05 +0000 2008 | 15145034 | 164394 |
We need to remove these duplicates. To do so, we first assign each row a unique number by creating a new column row_num. Then we create a new table unique_tweets that, for each group of duplicates, retains the row with the smallest row_num, thereby keeping exactly one tweet among each set of duplicates. We could instead delete rows from the tweets table directly, but that is slower than creating a new table and dropping the old one. This approach works when we have enough space for the new table.
The use of auto_increment allows us to fill the column automatically with a sequence.
alter table tweets add column row_num INT auto_increment primary key
We create a table containing the row_num of the tweets we want to keep. Would max(row_num) as row_num also work for what we are trying to do?
create table unique_tweets_row_num as
(select min(row_num) as row_num from tweets group by id)
We create a new table unique_tweets containing only the rows whose row_num appears in unique_tweets_row_num.
create table unique_tweets as
(select * from tweets where row_num in
(select row_num from unique_tweets_row_num))
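For reference, the in-place alternative mentioned above, which deletes duplicates directly from the tweets table, would be a self-join delete; a sketch, shown for illustration only since it is slower on large tables:

-- For every pair of rows sharing an id, delete the one with the larger row_num.
delete t1 from tweets t1
join tweets t2 on t1.id = t2.id and t1.row_num > t2.row_num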
Let us double-check if duplicates are gone.
select count(distinct(id)) as n_tweets, count(*) as total from unique_tweets
n_tweets | total |
---|---|
2283613 | 2283613 |
We can drop the tables we created for temporary use.
drop table n_duplicates, tweets, unique_tweets_row_num
List the tables to confirm that only counties and unique_tweets remain.
show tables
Tables_in_ubuntu |
---|
counties |
unique_tweets |
Rename the table unique_tweets to tweets now that duplicates are removed.
rename table unique_tweets to tweets
Print the schema of our table tweets.
describe tweets
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None | ||
row_num | int(11) | NO | 0 |
We don’t need the column row_num either. We can drop it.
alter table tweets drop column row_num
Working with dates
Find the date/time range of tweets.
select min(created_at), max(created_at) from tweets
min(created_at) | max(created_at) |
---|---|
2020-05-05 00:00:00 | 2020-05-18 23:59:59 |
This is 2 weeks of data. Get the number of tweets per day (date).
select date(created_at) as date, count(*) as n_tweets from tweets
group by date order by date
date | n_tweets |
---|---|
2020-05-05 | 158795 |
2020-05-06 | 165513 |
2020-05-07 | 157250 |
2020-05-08 | 144608 |
2020-05-09 | 139884 |
2020-05-10 | 118675 |
2020-05-11 | 161456 |
2020-05-12 | 187807 |
2020-05-13 | 204860 |
2020-05-14 | 205149 |
2020-05-15 | 200291 |
2020-05-16 | 135812 |
2020-05-17 | 122249 |
2020-05-18 | 181264 |
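Since the data covers exactly two full weeks, we can also aggregate by day of week using dayname(); a sketch:

-- Each weekday appears exactly twice in the two-week range.
select dayname(created_at) as weekday, count(*) as n_tweets from tweets
group by weekday order by n_tweets desc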
Get the number of tweets per (date-)hour.
select date_format(created_at,'%Y-%m-%d %H:00') as hour, count(*) as n_tweets from tweets
group by hour order by hour limit 10 offset 50
hour | n_tweets |
---|---|
2020-05-07 06:00 | 2427 |
2020-05-07 07:00 | 1582 |
2020-05-07 08:00 | 1029 |
2020-05-07 09:00 | 1172 |
2020-05-07 10:00 | 2146 |
2020-05-07 11:00 | 4017 |
2020-05-07 12:00 | 4849 |
2020-05-07 13:00 | 5644 |
2020-05-07 14:00 | 6807 |
2020-05-07 15:00 | 2253 |
Let's say we want to know how many tweets are sent, on average, during every hour of the day. By average we mean that we want to average across dates.
We first create a table to hold the number of tweets every hour for a given date.
create table hourly as
(select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date, hour)
select * from hourly
limit 10 offset 50
date | hour | n_tweets |
---|---|---|
2020-05-07 | 6 | 2427 |
2020-05-07 | 7 | 1582 |
2020-05-07 | 8 | 1029 |
2020-05-07 | 9 | 1170 |
2020-05-07 | 10 | 2145 |
2020-05-07 | 11 | 4017 |
2020-05-07 | 12 | 4848 |
2020-05-07 | 13 | 5644 |
2020-05-07 | 14 | 6807 |
2020-05-07 | 15 | 2253 |
Now that we have the per-date, per-hour counts, we can collapse across dates to get the average over all dates for every hour.
select hour, avg(n_tweets) as avg_n_tweets from hourly
group by hour
hour | avg_n_tweets |
---|---|
0 | 7216.6667 |
1 | 6810.3077 |
2 | 6507.5000 |
3 | 5618.9286 |
4 | 4602.3571 |
5 | 3402.6429 |
6 | 2604.7143 |
7 | 2635.2143 |
8 | 1767.0714 |
9 | 1971.9286 |
10 | 2760.2857 |
11 | 4372.0714 |
12 | 6599.0714 |
13 | 9132.2857 |
14 | 10620.2143 |
15 | 11015.3571 |
16 | 10912.5385 |
17 | 11129.5714 |
18 | 11575.4286 |
19 | 11492.0000 |
20 | 10513.3846 |
21 | 9824.4615 |
22 | 8441.4167 |
23 | 8309.6364 |
So that works. But let's see if we can do this more simply! We can do it in one command, without creating the temporary table hourly, by nesting SQL queries. Look at the queries above to convince yourself that the query below is functionally equivalent.
select hour, avg(n_tweets) as avg_n_tweets from (
select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date,hour) as tmp
group by hour
hour | avg_n_tweets |
---|---|
0 | 7216.6667 |
1 | 6810.3077 |
2 | 6507.5000 |
3 | 5618.9286 |
4 | 4602.3571 |
5 | 3402.6429 |
6 | 2604.7143 |
7 | 2635.2143 |
8 | 1767.0714 |
9 | 1971.9286 |
10 | 2760.2857 |
11 | 4372.0714 |
12 | 6599.0714 |
13 | 9132.2857 |
14 | 10620.2143 |
15 | 11015.3571 |
16 | 10912.5385 |
17 | 11129.5714 |
18 | 11575.4286 |
19 | 11492.0000 |
20 | 10513.3846 |
21 | 9824.4615 |
22 | 8441.4167 |
23 | 8309.6364 |
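If the server runs MySQL 8.0 or later, the same nesting can also be written as a common table expression (a with clause), which many find easier to read; a sketch (the name hourly_tmp is arbitrary):

-- The CTE plays the role of the nested subquery above.
with hourly_tmp as (
select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets
from tweets group by date, hour)
select hour, avg(n_tweets) as avg_n_tweets from hourly_tmp group by hour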
Alright, let’s drop our temp table!
drop table hourly
Distinct and counting
Get the number of distinct users in the data.
select count(distinct(user_id)) as n_users from tweets
n_users |
---|
797173 |
Get the number of tweets per user, with the results in descending order of number of tweets.
select user_id, count(*) as n_tweets from tweets
group by user_id order by n_tweets desc limit 10 offset 100
user_id | n_tweets |
---|---|
701476526596997120 | 410 |
166488367 | 406 |
2876963253 | 403 |
9020932 | 399 |
14437914 | 399 |
19733334 | 393 |
34553489 | 391 |
44959878 | 391 |
126424795 | 389 |
22707692 | 389 |
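If we only care about prolific users, having lets us filter groups after aggregation; a sketch counting users with more than 100 tweets (the threshold is arbitrary):

-- The derived table holds one row per user with over 100 tweets.
select count(*) as n_prolific_users from (
select user_id from tweets group by user_id having count(*) > 100) as prolific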
Searching using wildcards
We can find tweets that contain URLs. A simple and basic way to do that is using wildcards.
A wildcard (like * in the shell) is a symbol that matches some number of arbitrary characters. In SQL, the two important ones to know are
%
- matches any number of characters, including no character at all. This means that '%bla' matches any string ending in 'bla', including 'bla' itself.
_
- matches exactly one arbitrary character: '_bla' matches 'bbla' and 'hbla'.
We use them with the like operator in a where clause, like so:
select count(*) as n_tweets_with_url from tweets where text like '%https://%'
n_tweets_with_url |
---|
1661586 |
select * from tweets where text like '%https://%'
limit 5 offset 100
created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|
2020-05-05 00:00:06 | 39061 | 1257460021941678080 | | | A Remote Workforce Makes Security Training More Critical | #coronavirus #covid19 #cybersecurity #privacy #security #compliance #databreach #biometrics #fingerprint #facialrecognition #smarthome #phishing #ransomware #CCPA #BIPA #IoT https://t.co/0d2IDtxu1r https://t.co/fN9I3AAf56 | Fri Aug 20 19:12:14 +0000 2010 | 180888396 | 8759 |
2020-05-05 00:00:06 | 41059 | 1257460022205984771 | | | Stay home. Don’t hang out with people who don’t live in your house unless you’re social distancing. https://t.co/ZPOxHuPxlG | Wed May 17 00:23:52 +0000 2017 | 864637555618304001 | 499 |
2020-05-05 00:00:06 | 11001 | 1257460023057436682 | | | From snacking to sibling rivalry, kids give us their honest opinions about #quarantinelife. Spoiler alert:they *REALLY* miss the playground. #flattenthecurve https://t.co/hML6bJp3zs https://t.co/KHuJzTS0yl | Thu Aug 27 17:53:15 +0000 2009 | 69356392 | 10098 |
2020-05-05 00:00:06 | 21111 | 1257460023539781633 | | | '60 Minutes' correspondent Lesley Stahl says she was 'scared' and hospitalized for a week fighting COVID-19 https://t.co/R7pLfvXfhP https://t.co/cQsisv6czN | Mon Oct 03 16:09:47 +0000 2011 | 384392137 | 10879 |
2020-05-05 00:00:06 | 34013 | 1257460023925641217 | | | Visit the Reeve Foundation COVID-19 Information Center for information about the #coronavirus and SCI. #TodaysCare https://t.co/8xFect3qZk | Thu Apr 16 15:12:43 +0000 2009 | 31877693 | 40221 |
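The queries above only use %. As a quick sketch of the single-character wildcard _, recall that the fips codes in tweets are all 5 characters and California's codes start with 06, so '06___' matches exactly the California counties:

-- Count tweets whose fips is '06' followed by exactly three characters.
select count(*) as n_ca_tweets from tweets where fips like '06___'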
Merge/Join county, state
In the table tweets, the column fips is the FIPS code of the US county from which the tweet likely originated.
FYI: There are about 3000 US counties, and every county has a unique FIPS code. The first two digits of the FIPS code designate a state (state "01" being Alabama), and the last three digits specify a county within the state. For example, Santa Clara County, CA is 6085; San Mateo County, CA is 6081; and 48269 is King County, TX.
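Since the first two digits identify the state, we can already count tweets per state code before doing any join; a sketch using substring():

-- The first two characters of fips are the state code.
select substring(fips, 1, 2) as state_fips, count(*) as n_tweets from tweets
group by state_fips order by n_tweets desc limit 5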
Let us now add columns county and state, corresponding to each fips in the tweets table; that information lives in the counties table. This means we need to join these two tables on fips.
Print the schema of the table counties.
describe counties
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
fips | varchar(5) | YES | None | ||
county | varchar(40) | YES | None | ||
state | varchar(20) | YES | None |
View a few records.
select * from counties limit 10 offset 100
fips | county | state |
---|---|---|
6061 | Placer County | California |
26161 | Washtenaw County | Michigan |
22071 | Orleans Parish | Louisiana |
48355 | Nueces County | Texas |
27003 | Anoka County | Minnesota |
12081 | Manatee County | Florida |
12021 | Collier County | Florida |
42079 | Luzerne County | Pennsylvania |
29510 | St. Louis city | Missouri |
48039 | Brazoria County | Texas |
The fips in the table tweets are all of length 5, but the fips in the table counties are of length 4 or 5.
select distinct(length(fips)) from tweets
(length(fips)) |
---|
5 |
select distinct(length(fips)) from counties
(length(fips)) |
---|
4 |
5 |
Before we merge, we need to make sure the fips in the table counties have the same length, with zero padding, as in the table tweets. Here we left-pad the fips with '0' to make sure the length is 5. You can double-check by running the previous select distinct(length(fips)) statement again.
update counties set fips = lpad(fips,5,'0')
We intend to bring the county and state columns from counties into tweets for every tweet. But are all the fips in tweets present in the table counties?
select count(distinct(fips)) as n_fips from tweets where fips not in (
select fips from counties)
n_fips |
---|
28 |
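As an aside, not in with a subquery can be slow, and it behaves surprisingly if the subquery ever returns NULLs. An equivalent anti-join sketch:

-- Keep only tweets whose fips finds no match in counties, then count distinct fips.
select count(distinct t.fips) as n_fips
from tweets t left join counties c on t.fips = c.fips
where c.fips is null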
So there are fips in tweets for which we have no record in counties. When we join, we have two options:
- Left join: For every fips in tweets that has an entry in counties, the result table will have county and state from counties. For fips in tweets with no entry in counties, the result table will fill in NULL for county and state. (A sketch of this variant follows the list.)
- Inner join: For every fips in tweets that has an entry in counties, the result will have county and state from counties. For fips in tweets with no entry in counties, the result table will drop the record.
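For illustration, the left-join variant (which we won't run) would look like this; a sketch:

-- Tweets with an unmatched fips get NULL for county and state.
select tweets.*, counties.county, counties.state
from tweets left join counties on tweets.fips = counties.fips
limit 5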
We will perform an inner join for this tutorial, thereby keeping only the records in tweets that have an entry in counties. Can you see why we need to name the columns as tweets.*, counties.county, counties.state?
create table tweets_counties as (
select tweets.*, counties.county, counties.state from tweets inner join counties on tweets.fips = counties.fips)
Print the schema of the merged table.
describe tweets_counties
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None | ||
county | varchar(40) | YES | None | ||
state | varchar(20) | YES | None |
We can drop all tables except this new one as we now have everything in one table.
drop table tweets, counties
Rename table to tweets.
rename table tweets_counties to tweets
Let us find the number of distinct users per state.
select state, count(distinct(user_id)) as n_users from tweets group by state
limit 10
state | n_users |
---|---|
Alabama | 6581 |
Alaska | 3718 |
Arizona | 14715 |
Arkansas | 2942 |
California | 144352 |
Colorado | 14636 |
Connecticut | 6131 |
Delaware | 1260 |
District of Columbia | 28195 |
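Building on this, we can combine two aggregates to get the average number of tweets per user in each state; a sketch:

-- Total tweets divided by distinct users, per state.
select state, count(*) / count(distinct user_id) as tweets_per_user
from tweets group by state order by tweets_per_user desc limit 10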
Finally, remove the table tweets.
drop table tweets
Well done! If you understand everything up to here, you are in good SQL shape! BTW, don't hesitate to google any questions; Stack Overflow and the MySQL online reference are your friends. Everybody does that, and those who say they don't are lying.