Tutorial 5
This tutorial is a little more advanced, and a little more "real world." Please copy the commands below into Sequel Pro (or alternatively, run mysql from the command line on our AWS server after ssh'ing in). It won't get much more complicated than this in our use of SQL.
NOTE: In this tutorial, the username is ubuntu. Replace references to ubuntu with your <sunet-id>.
List all the databases. (You may not see some of the databases below because you have been given access to only a subset of them, and you may see others that you made or that we made for you.)
show databases
Database |
---|
sql_intro |
ubuntu |
ubuntu_album |
ubuntu_test |
ubuntu_world |
The database sql_intro contains our tables for this tutorial. Enter the database.
use sql_intro
List the tables.
show tables
Tables_in_sql_intro |
---|
counties |
tweets |
Copy Table
We will use both tables, counties and tweets, in this tutorial. Because we will be modifying them during the tutorial, let us make copies.
First enter our database, where we will create the tables. Then capture the output of select into a table, which effectively copies the table completely. If we wanted to create a new table with a subset of columns, we would specify that in place of *. Note the use of ';' to separate multiple statements.
use ubuntu;
create table tweets as (select * from sql_intro.tweets);
create table counties as (select * from sql_intro.counties);
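Side note: create table ... as select copies the data but not indexes, keys, or column attributes such as auto_increment. If you ever need an exact structural copy, here is a sketch of the standard two-step MySQL idiom (the name tweets_exact is just for illustration; we don't need this here):

-- Copy the structure, including any indexes and keys, then copy the data.
create table tweets_exact like sql_intro.tweets;
insert into tweets_exact select * from sql_intro.tweets;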
Check the schema/structure of the table.
describe tweets
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None |
View Sample Of Records
Now that we have copied the tables, let's look at a sample of tweets. Instead of looking at consecutive records, let's look at a random sample. Run the query multiple times to see different samples. Can you see how this prints random records?
select * from tweets order by rand() limit 5
coordinates | created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|---|
| | 2020-05-11 11:08:34 | 51065 | 1259802574917521411 | | | Amid This Pandemic, Women CEOs Share That Leadership And Culture Matter Most https://t.co/W6kC3M6XBM https://t.co/5gU97cppsw | Thu Jun 02 22:57:19 +0000 2011 | 309942480 | 62998 |
| | 2020-05-10 16:01:06 | 36029 | 1259513805270405124 | | | Supporting #FirstResponders + #NursesAreHeroes Movement // Please share with your network // The more support we can get, the more help we can deploy to #HealthCareWorkers // #COVID19 // #UnitedWeStand // #Justhanks >> Powered by the #THANKSBLOCK CHAIN... https://t.co/Tu1CsQ7vws | Sun Mar 31 17:32:21 +0000 2019 | 1112407295386963969 | 2398 |
| | 2020-05-13 09:05:37 | 06037 | 1260496410497163264 | | | Can you imagine, where are the PPE’s https://t.co/7TKsfPRUIw | Mon Apr 14 22:26:20 +0000 2014 | 2444536730 | 357 |
| | 2020-05-16 17:09:01 | 06037 | 1261705224260972544 | | | A great @ProjectLincoln ad would be one that draws a parallel between Trump and his obsession over ONE person, Obama, and him not giving a solitary shit about the almost 90000 people who have died in this pandemic. What say you @TheRickWilson and @gtconway3d? | Mon Dec 15 00:05:05 +0000 2008 | 18125906 | 12783 |
| | 2020-05-07 20:05:15 | 12101 | 1258488084779450368 | | | May 7, 2020 COVID-19 Update - https://t.co/QUBKS3hZr2 https://t.co/Q5Ffr82iL9 | Thu Sep 01 04:06:28 +0000 2016 | 771197505836986369 | 765 |
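By the way, if you want a "random" sample that is reproducible (handy when debugging), rand() optionally takes a seed; a sketch, with an arbitrary seed of 42:

-- With a constant seed, the ordering is the same on every run.
select * from tweets order by rand(42) limit 5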
Drop Column
Let's check if the column coordinates contains anything of relevance.
select count(*) as n_tweets from tweets where length(coordinates) > 0
n_tweets |
---|
0 |
So the coordinates column is empty for the entire table. We might as well drop the column altogether.
alter table tweets drop column coordinates
Remove Duplicates
Check if there are any duplicates in the data by finding the number of distinct tweet ids.
select count(distinct(id)) as n_tweets, count(*) as total from tweets
n_tweets | total |
---|---|
2283613 | 2370923 |
So there are duplicate tweets in this data. Let us make a table n_duplicates with the number of duplicates for every tweet id, restricted to the tweets that actually have duplicates.
create table n_duplicates as (select id, count(*) as n_duplicates from tweets group by id having count(*) > 1)
Print a few entries from the duplicates table.
select * from n_duplicates limit 10 offset 100
id | n_duplicates |
---|---|
1257490235446841344 | 2 |
1257490235664949249 | 2 |
1257490237091127299 | 2 |
1257490238051663873 | 2 |
1257490238399799296 | 2 |
1257490239037362176 | 2 |
1257490241289687042 | 2 |
1257490242141126657 | 2 |
1257490242359234561 | 2 |
1257490242543779842 | 2 |
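To get a feel for how widespread the duplication is, we can count how many ids occur twice, three times, and so on; a quick sketch:

-- Distribution of duplicate counts across tweet ids.
select n_duplicates, count(*) as n_ids from n_duplicates
group by n_duplicates order by n_duplicates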
There are indeed duplicate tweets in this data. Print the full tweets for some of these ids. Can you see why order by is needed here?
select * from tweets where id in (select id from n_duplicates)
order by id limit 8 offset 150
created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|
2020-05-05 02:00:04 | 27123 | 1257490213846355968 | | | 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L | Wed Dec 01 16:21:42 +0000 2010 | 221787532 | 7330 |
2020-05-05 02:00:04 | 27123 | 1257490213846355968 | | | 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L | Wed Dec 01 16:21:42 +0000 2010 | 221787532 | 7330 |
2020-05-05 02:00:04 | 06067 | 1257490214202732545 | US | Sacramento, CA | The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews | Fri Jul 21 15:16:36 +0000 2017 | 888417429583417345 | 16148 |
2020-05-05 02:00:04 | 06067 | 1257490214202732545 | US | Sacramento, CA | The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews | Fri Jul 21 15:16:36 +0000 2017 | 888417429583417345 | 16148 |
2020-05-05 02:00:05 | 31055 | 1257490215784124417 | | | From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 | Mon Mar 09 15:26:09 +0000 2009 | 23447418 | 92650 |
2020-05-05 02:00:05 | 31055 | 1257490215784124417 | | | From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 | Mon Mar 09 15:26:09 +0000 2009 | 23447418 | 92650 |
2020-05-05 02:00:05 | 42043 | 1257490216065150978 | | | As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH | Tue Jun 17 12:36:05 +0000 2008 | 15145034 | 164394 |
2020-05-05 02:00:05 | 42043 | 1257490216065150978 | | | As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH | Tue Jun 17 12:36:05 +0000 2008 | 15145034 | 164394 |
We need to remove these duplicates. To do so, we first assign each row a unique number by creating a new column row_num. Then we create a new table unique_tweets that, for each group of duplicates, retains the row with the smallest row_num, thereby keeping exactly one tweet among each set of duplicates. We could instead delete rows from the tweets table directly, but that is slower than creating a new table and dropping the old one. This approach works when we have enough space for the new table.
The use of auto_increment allows us to fill the column automatically with a sequence.
alter table tweets add column row_num INT auto_increment primary key
We create a table containing the row_num of the tweets we want to keep. Would max(row_num) as row_num also work for what we are trying to do?
create table unique_tweets_row_num as
(select min(row_num) as row_num from tweets group by id)
We create a new table unique_tweets containing only the rows whose row_num appears in unique_tweets_row_num.
create table unique_tweets as
(select * from tweets where row_num in
(select row_num from unique_tweets_row_num))
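For reference, the in-place alternative mentioned above, which deletes duplicates directly from the tweets table, would be a self-join delete; a sketch, shown for illustration only since it is slower on large tables:

-- For every pair of rows sharing an id, delete the one with the larger row_num.
delete t1 from tweets t1
join tweets t2 on t1.id = t2.id and t1.row_num > t2.row_num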
Let us double-check if duplicates are gone.
select count(distinct(id)) as n_tweets, count(*) as total from unique_tweets
n_tweets | total |
---|---|
2283613 | 2283613 |
We can drop the tables we created for temporary use.
drop table n_duplicates, tweets, unique_tweets_row_num
List the tables to confirm that only counties and unique_tweets remain.
show tables
Tables_in_ubuntu |
---|
counties |
unique_tweets |
Rename the table unique_tweets to tweets now that duplicates are removed.
rename table unique_tweets to tweets
Print the schema of our table tweets.
describe tweets
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None | ||
row_num | int(11) | NO | 0 |
We don’t need the column row_num either. We can drop it.
alter table tweets drop column row_num
Working with dates
Find the date/time range of tweets.
select min(created_at), max(created_at) from tweets
min(created_at) | max(created_at) |
---|---|
2020-05-05 00:00:00 | 2020-05-18 23:59:59 |
This is 2 weeks of data. Get the number of tweets per day (date).
select date(created_at) as date, count(*) as n_tweets from tweets
group by date order by date
date | n_tweets |
---|---|
2020-05-05 | 158795 |
2020-05-06 | 165513 |
2020-05-07 | 157250 |
2020-05-08 | 144608 |
2020-05-09 | 139884 |
2020-05-10 | 118675 |
2020-05-11 | 161456 |
2020-05-12 | 187807 |
2020-05-13 | 204860 |
2020-05-14 | 205149 |
2020-05-15 | 200291 |
2020-05-16 | 135812 |
2020-05-17 | 122249 |
2020-05-18 | 181264 |
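Since the data covers exactly two full weeks, we can also aggregate by day of week using dayname(); a sketch:

-- Each weekday appears exactly twice in the two-week range.
select dayname(created_at) as weekday, count(*) as n_tweets from tweets
group by weekday order by n_tweets desc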
Get the number of tweets per (date-)hour.
select date_format(created_at,'%Y-%m-%d %H:00') as hour, count(*) as n_tweets from tweets
group by hour order by hour limit 10 offset 50
hour | n_tweets |
---|---|
2020-05-07 06:00 | 2427 |
2020-05-07 07:00 | 1582 |
2020-05-07 08:00 | 1029 |
2020-05-07 09:00 | 1172 |
2020-05-07 10:00 | 2146 |
2020-05-07 11:00 | 4017 |
2020-05-07 12:00 | 4849 |
2020-05-07 13:00 | 5644 |
2020-05-07 14:00 | 6807 |
2020-05-07 15:00 | 2253 |
Let's say we want to know how many tweets are sent, on average, during every hour of the day. By average we mean that we want to average across dates.
We first create a table to hold the number of tweets every hour for a given date.
create table hourly as
(select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date, hour)
select * from hourly
limit 10 offset 50
date | hour | n_tweets |
---|---|---|
2020-05-07 | 6 | 2427 |
2020-05-07 | 7 | 1582 |
2020-05-07 | 8 | 1029 |
2020-05-07 | 9 | 1170 |
2020-05-07 | 10 | 2145 |
2020-05-07 | 11 | 4017 |
2020-05-07 | 12 | 4848 |
2020-05-07 | 13 | 5644 |
2020-05-07 | 14 | 6807 |
2020-05-07 | 15 | 2253 |
Now that we have the per-date, per-hour counts, we can collapse across dates to get the average over all dates for every hour.
select hour, avg(n_tweets) as avg_n_tweets from hourly
group by hour
hour | avg_n_tweets |
---|---|
0 | 7216.6667 |
1 | 6810.3077 |
2 | 6507.5000 |
3 | 5618.9286 |
4 | 4602.3571 |
5 | 3402.6429 |
6 | 2604.7143 |
7 | 2635.2143 |
8 | 1767.0714 |
9 | 1971.9286 |
10 | 2760.2857 |
11 | 4372.0714 |
12 | 6599.0714 |
13 | 9132.2857 |
14 | 10620.2143 |
15 | 11015.3571 |
16 | 10912.5385 |
17 | 11129.5714 |
18 | 11575.4286 |
19 | 11492.0000 |
20 | 10513.3846 |
21 | 9824.4615 |
22 | 8441.4167 |
23 | 8309.6364 |
So that works. But let's see if we can do this more simply! We can do it in one command, without creating the temporary table hourly, by nesting SQL queries. Look at the queries above to convince yourself that the query below is functionally equivalent.
select hour, avg(n_tweets) as avg_n_tweets from (
select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date,hour) as tmp
group by hour
hour | avg_n_tweets |
---|---|
0 | 7216.6667 |
1 | 6810.3077 |
2 | 6507.5000 |
3 | 5618.9286 |
4 | 4602.3571 |
5 | 3402.6429 |
6 | 2604.7143 |
7 | 2635.2143 |
8 | 1767.0714 |
9 | 1971.9286 |
10 | 2760.2857 |
11 | 4372.0714 |
12 | 6599.0714 |
13 | 9132.2857 |
14 | 10620.2143 |
15 | 11015.3571 |
16 | 10912.5385 |
17 | 11129.5714 |
18 | 11575.4286 |
19 | 11492.0000 |
20 | 10513.3846 |
21 | 9824.4615 |
22 | 8441.4167 |
23 | 8309.6364 |
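If the server runs MySQL 8.0 or later, the same nesting can also be written as a common table expression (a with clause), which many find easier to read; a sketch (the name hourly_tmp is arbitrary):

-- The CTE plays the role of the nested subquery above.
with hourly_tmp as (
select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets
from tweets group by date, hour)
select hour, avg(n_tweets) as avg_n_tweets from hourly_tmp group by hour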
Alright, let’s drop our temp table!
drop table hourly
Distinct and counting
Get the number of distinct users in the data.
select count(distinct(user_id)) as n_users from tweets
n_users |
---|
797173 |
Get the number of tweets per user, with the results in descending order of number of tweets.
select user_id, count(*) as n_tweets from tweets
group by user_id order by n_tweets desc limit 10 offset 100
user_id | n_tweets |
---|---|
701476526596997120 | 410 |
166488367 | 406 |
2876963253 | 403 |
9020932 | 399 |
14437914 | 399 |
19733334 | 393 |
34553489 | 391 |
44959878 | 391 |
126424795 | 389 |
22707692 | 389 |
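If we only care about prolific users, having lets us filter groups after aggregation; a sketch counting users with more than 100 tweets (the threshold is arbitrary):

-- The derived table holds one row per user with over 100 tweets.
select count(*) as n_prolific_users from (
select user_id from tweets group by user_id having count(*) > 100) as prolific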
Searching using wildcards
We can find tweets that contain URLs. A simple and basic way to do that is using wildcards.
A wildcard (like * in the shell) is a symbol that matches some number of arbitrary characters. In SQL, the two important ones to know are
%
- matches any number of characters, including no character at all. This means that '%bla' matches any string ending in 'bla', including 'bla' itself.
_
- matches exactly one arbitrary character: '_bla' matches 'bbla' and 'hbla'.
We use them with the like operator in a where clause, like so:
select count(*) as n_tweets_with_url from tweets where text like '%https://%'
n_tweets_with_url |
---|
1661586 |
select * from tweets where text like '%https://%'
limit 5 offset 100
created_at | fips | id | place_country_code | place_full_name | text | user_created_at | user_id | user_statuses_count |
---|---|---|---|---|---|---|---|---|
2020-05-05 00:00:06 | 39061 | 1257460021941678080 | | | A Remote Workforce Makes Security Training More Critical | #coronavirus #covid19 #cybersecurity #privacy #security #compliance #databreach #biometrics #fingerprint #facialrecognition #smarthome #phishing #ransomware #CCPA #BIPA #IoT https://t.co/0d2IDtxu1r https://t.co/fN9I3AAf56 | Fri Aug 20 19:12:14 +0000 2010 | 180888396 | 8759 |
2020-05-05 00:00:06 | 41059 | 1257460022205984771 | | | Stay home. Don’t hang out with people who don’t live in your house unless you’re social distancing. https://t.co/ZPOxHuPxlG | Wed May 17 00:23:52 +0000 2017 | 864637555618304001 | 499 |
2020-05-05 00:00:06 | 11001 | 1257460023057436682 | | | From snacking to sibling rivalry, kids give us their honest opinions about #quarantinelife. Spoiler alert:they *REALLY* miss the playground. #flattenthecurve https://t.co/hML6bJp3zs https://t.co/KHuJzTS0yl | Thu Aug 27 17:53:15 +0000 2009 | 69356392 | 10098 |
2020-05-05 00:00:06 | 21111 | 1257460023539781633 | | | '60 Minutes' correspondent Lesley Stahl says she was 'scared' and hospitalized for a week fighting COVID-19 https://t.co/R7pLfvXfhP https://t.co/cQsisv6czN | Mon Oct 03 16:09:47 +0000 2011 | 384392137 | 10879 |
2020-05-05 00:00:06 | 34013 | 1257460023925641217 | | | Visit the Reeve Foundation COVID-19 Information Center for information about the #coronavirus and SCI. #TodaysCare https://t.co/8xFect3qZk | Thu Apr 16 15:12:43 +0000 2009 | 31877693 | 40221 |
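The queries above only use %. As a quick sketch of the single-character wildcard _, recall that the fips codes in tweets are all 5 characters and California's codes start with 06, so '06___' matches exactly the California counties:

-- Count tweets whose fips is '06' followed by exactly three characters.
select count(*) as n_ca_tweets from tweets where fips like '06___'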
Merge/Join county, state
In the table tweets, the column fips is the FIPS code of the US county from which the tweet likely originated.
FYI: There are about 3000 US counties, and every county has a unique FIPS code. The first two digits of the FIPS code designate a state (state "01" being Alabama), and the last three digits specify a county within the state. For example, Santa Clara County, CA is 6085; San Mateo County, CA is 6081; and 48269 is King County, TX.
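Since the first two digits identify the state, we can already count tweets per state code before doing any join; a sketch using substring():

-- The first two characters of fips are the state code.
select substring(fips, 1, 2) as state_fips, count(*) as n_tweets from tweets
group by state_fips order by n_tweets desc limit 5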
Let us now add columns county and state, corresponding to each fips in the tweets table; that information lives in the counties table. This means we need to join these two tables on fips.
Print the schema of the table counties.
describe counties
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
fips | varchar(5) | YES | None | ||
county | varchar(40) | YES | None | ||
state | varchar(20) | YES | None |
View a few records.
select * from counties limit 10 offset 100
fips | county | state |
---|---|---|
6061 | Placer County | California |
26161 | Washtenaw County | Michigan |
22071 | Orleans Parish | Louisiana |
48355 | Nueces County | Texas |
27003 | Anoka County | Minnesota |
12081 | Manatee County | Florida |
12021 | Collier County | Florida |
42079 | Luzerne County | Pennsylvania |
29510 | St. Louis city | Missouri |
48039 | Brazoria County | Texas |
The fips in the table tweets are all of length 5, but the fips in the table counties are of length 4 or 5.
select distinct(length(fips)) from tweets
(length(fips)) |
---|
5 |
select distinct(length(fips)) from counties
(length(fips)) |
---|
4 |
5 |
Before we merge, we need to make sure the fips in the table counties have the same length, with zero padding, as in the table tweets. Here we left-pad the fips with '0' to make sure the length is 5. You can double-check by running the previous select distinct(length(fips)) statement again.
update counties set fips = lpad(fips,5,'0')
We intend to bring the county and state columns from counties into tweets for every tweet. But are all the fips in tweets present in the table counties?
select count(distinct(fips)) as n_fips from tweets where fips not in (
select fips from counties)
n_fips |
---|
28 |
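As an aside, not in with a subquery can be slow, and it behaves surprisingly if the subquery ever returns NULLs. An equivalent anti-join sketch:

-- Keep only tweets whose fips finds no match in counties, then count distinct fips.
select count(distinct t.fips) as n_fips
from tweets t left join counties c on t.fips = c.fips
where c.fips is null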
So there are fips in tweets for which we have no record in counties. When we join, we have two options:
- Left join: For every fips in tweets that has an entry in counties, the result table will have county and state from counties. For fips in tweets with no entry in counties, the result table will fill in NULL for county and state. (A sketch of this variant follows the list.)
- Inner join: For every fips in tweets that has an entry in counties, the result will have county and state from counties. For fips in tweets with no entry in counties, the result table will drop the record.
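For illustration, the left-join variant (which we won't run) would look like this; a sketch:

-- Tweets with an unmatched fips get NULL for county and state.
select tweets.*, counties.county, counties.state
from tweets left join counties on tweets.fips = counties.fips
limit 5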
We will perform an inner join for this tutorial, thereby keeping only the records in tweets that have an entry in counties. Can you see why we need to name the columns as tweets.*, counties.county, counties.state?
create table tweets_counties as (
select tweets.*, counties.county, counties.state from tweets inner join counties on tweets.fips = counties.fips)
Print the schema of the merged table.
describe tweets_counties
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
created_at | datetime | YES | None | ||
fips | varchar(5) | YES | None | ||
id | varchar(20) | YES | None | ||
place_country_code | varchar(2) | YES | None | ||
place_full_name | varchar(110) | YES | None | ||
text | text | YES | None | ||
user_created_at | varchar(30) | YES | None | ||
user_id | varchar(20) | YES | None | ||
user_statuses_count | int(11) | YES | None | ||
county | varchar(40) | YES | None | ||
state | varchar(20) | YES | None |
We can drop all tables except this new one as we now have everything in one table.
drop table tweets, counties
Rename table to tweets.
rename table tweets_counties to tweets
Let us find the number of distinct users per state.
select state, count(distinct(user_id)) as n_users from tweets group by state
limit 10
state | n_users |
---|---|
Alabama | 6581 |
Alaska | 3718 |
Arizona | 14715 |
Arkansas | 2942 |
California | 144352 |
Colorado | 14636 |
Connecticut | 6131 |
Delaware | 1260 |
District of Columbia | 28195 |
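Building on this, we can combine two aggregates to get the average number of tweets per user in each state; a sketch:

-- Total tweets divided by distinct users, per state.
select state, count(*) / count(distinct user_id) as tweets_per_user
from tweets group by state order by tweets_per_user desc limit 10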
Finally, remove the table tweets.
drop table tweets
Well done! If you understand everything up to here, you are in good SQL shape! BTW, don't hesitate to google any questions; Stack Overflow and the MySQL online reference are your friends. Everybody does that, and those who say they don't are lying.