Stanford PSYCH 290

Natural Language Processing & Text-Based Machine Learning in the Social Sciences

Tutorial 5

This tutorial is a little more advanced. Please copy the commmands below into Sequel Pro (or alternatively, run mysql from the command line on our AWS server after ssh’ing in). This tutorial is a little more “real world.” It won’t get much more complicated than this in our use of SQL.

NOTE: In this tutorial, the username is ubuntu. Replace references to ubuntu with your <sunet-id>.

List all the databases. (You may not see some databases below because you have been given access to subset of them, or see some other ones you made or we made for you.)

show databases

Database
sql_intro
ubuntu
ubuntu_album
ubuntu_test
ubuntu_world

The database sql_intro contains our tables for this tutorial. Enter the database.

use sql_intro

List the tables.

show tables

Tables_in_sql_intro
counties
tweets

Copy Table

We will use both the tables counties and tweets in this tutorial. Because we will be modifying them during the tutorial, let us make a copy.

First enter our database where we will create the tables. Then capture the output of select into a table, which effectively copies the table completely. If we wanted to create a new table with a subset of columns, we would specify that in place of *. Note the use of ‘;’ to separate multiple statements.

use ubuntu;
create table tweets as (select * from sql_intro.tweets);
create table counties as (select * from sql_intro.counties);

Check the schema/structure of the table.

describe tweets

Field Type Null Key Default Extra
created_at datetime YES None
fips varchar(5) YES None
id varchar(20) YES None
place_country_code varchar(2) YES None
place_full_name varchar(110) YES None
text text YES None
user_created_at varchar(30) YES None
user_id varchar(20) YES None
user_statuses_count int(11) YES None

View Sample Of Records

Now that we have copied the tables, lets look at a sample of tweets. Instead of looking at consecutive samples, lets look at random sample. Run it multiple times to see different samples. Can you see how this is printing random records ?

select * from tweets order by rand() limit 5

coordinates created_at fips id place_country_code place_full_name text user_created_at user_id user_statuses_count
2020-05-11 11:08:34 51065 1259802574917521411 Amid This Pandemic, Women CEOs Share That Leadership And Culture Matter Most https://t.co/W6kC3M6XBM https://t.co/5gU97cppsw Thu Jun 02 22:57:19 +0000 2011 309942480 62998
2020-05-10 16:01:06 36029 1259513805270405124 Supporting #FirstResponders + #NursesAreHeroes Movement // Please share with your network // The more support we can get, the more help we can deploy to #HealthCareWorkers // #COVID19 // #UnitedWeStand // #Justhanks &gt;&gt; Powered by the #THANKSBLOCK CHAIN... https://t.co/Tu1CsQ7vws Sun Mar 31 17:32:21 +0000 2019 1112407295386963969 2398
2020-05-13 09:05:37 06037 1260496410497163264 Can you imagine, where are the PPE’s https://t.co/7TKsfPRUIw Mon Apr 14 22:26:20 +0000 2014 2444536730 357
2020-05-16 17:09:01 06037 1261705224260972544 A great @ProjectLincoln ad would be one that draws a parallel between Trump and his obsession over ONE person, Obama, and him not giving a solitary shit about the almost 90000 people who have died in this pandemic. What say you @TheRickWilson and @gtconway3d? Mon Dec 15 00:05:05 +0000 2008 18125906 12783
2020-05-07 20:05:15 12101 1258488084779450368 May 7, 2020 COVID-19 Update - https://t.co/QUBKS3hZr2 https://t.co/Q5Ffr82iL9 Thu Sep 01 04:06:28 +0000 2016 771197505836986369 765

Drop Column

Lets check if the column coordinates has anything of relevance.

select count(*) as n_tweets from tweets where length(coordinates) > 0

n_tweets
0

So the coordinates columns is empty for the table. We might as well drop the column altogether.

alter table tweets drop column coordinates

Remove Duplicates

Check if there are any duplicates in the data by finding number of distinct tweet ids.

select count(distinct(id)) as n_tweets, count(*) as total from tweets

n_tweets total
2283613 2370923

So there are duplicate tweets in this data. Let us make a table n_duplicates with number of duplicates for every tweet id for only the tweets that have duplicates.

create table n_duplicates as (select id, count(*) as n_duplicates from tweets group by id having count(*) > 1)

Print a few entries from the duplicates table.

select * from n_duplicates limit 10 offset 100

id n_duplicates
1257490235446841344 2
1257490235664949249 2
1257490237091127299 2
1257490238051663873 2
1257490238399799296 2
1257490239037362176 2
1257490241289687042 2
1257490242141126657 2
1257490242359234561 2
1257490242543779842 2

There are indeed duplicate tweets in this data. Print the full tweets for some of these ids. Can you see why order by is needed here ?

select * from tweets where id in (select id from n_duplicates) 
order by id limit 8 offset 150

created_at fips id place_country_code place_full_name text user_created_at user_id user_statuses_count
2020-05-05 02:00:04 27123 1257490213846355968 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L Wed Dec 01 16:21:42 +0000 2010 221787532 7330
2020-05-05 02:00:04 27123 1257490213846355968 82% of survey respondents approve of Walz’s response to the coronavirus pandemic here in Minnesota. (Via KSTP) https://t.co/CGwmasTk9L Wed Dec 01 16:21:42 +0000 2010 221787532 7330
2020-05-05 02:00:04 06067 1257490214202732545 US Sacramento, CA The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews Fri Jul 21 15:16:36 +0000 2017 888417429583417345 16148
2020-05-05 02:00:04 06067 1257490214202732545 US Sacramento, CA The Guardian:Five Eyes network contradicts theory Covid-19 leaked from lab. https://t.co/B6VohC3H0f via @GoogleNews Fri Jul 21 15:16:36 +0000 2017 888417429583417345 16148
2020-05-05 02:00:05 31055 1257490215784124417 From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 Mon Mar 09 15:26:09 +0000 2009 23447418 92650
2020-05-05 02:00:05 31055 1257490215784124417 From the death of a security guard in Michigan who was enforcing mask rules to NRA layoffs and pay cuts, here is a summary of today's COVID-19 developments. https://t.co/PduodDyq08 Mon Mar 09 15:26:09 +0000 2009 23447418 92650
2020-05-05 02:00:05 42043 1257490216065150978 As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH Tue Jun 17 12:36:05 +0000 2008 15145034 164394
2020-05-05 02:00:05 42043 1257490216065150978 As the state continues to gather information regarding cases of Covid-19, there are some who notice a trend. https://t.co/6tSvRYfVPH Tue Jun 17 12:36:05 +0000 2008 15145034 164394

We need to remove these duplicates. To remove the duplicates, we first assign each row a unique number by creating a new column row_num. Then we creat a new table unique_tweets that for each group of duplicates, retains row with smallest row_num, therefore retaining one tweet among duplicates. We can instead drop rows from tweets table but that is slower than creating a new table and dropping the old table. This works when we have enough space for the new table.

The use of auto_increment allows us to fill up the column automatically with a sequence.

alter table tweets add column row_num INT auto_increment primary key

We create a table containing row_num of tweets we want to keep. Will max(row_num) as row_num work for what we are tying to do ?

create table unique_tweets_row_num as 
(select min(row_num) as row_num from tweets group by id)

We create a new table unique_tweets with row_num in unique_tweets_row_num.

create table unique_tweets as 
(select * from tweets where row_num in 
(select row_num from unique_tweets_row_num))

Let us double-check if duplicates are gone.

select count(distinct(id)) as n_tweets, count(*) as total from unique_tweets

n_tweets total
2283613 2283613

We can drop the tables we created for temporary use.

show tables

Tables_in_ubuntu1
counties
unique_tweets
drop table n_duplicates, tweets, unique_tweets_row_num

Rename the table unique_tweets to tweets now that duplicates are removed.

rename table unique_tweets to tweets

Print the schema of our table tweets.

describe tweets

Field Type Null Key Default Extra
created_at datetime YES None
fips varchar(5) YES None
id varchar(20) YES None
place_country_code varchar(2) YES None
place_full_name varchar(110) YES None
text text YES None
user_created_at varchar(30) YES None
user_id varchar(20) YES None
user_statuses_count int(11) YES None
row_num int(11) NO 0

We don’t need the column row_num either. We can drop it.

alter table tweets drop column row_num

Working with dates

Find the date/time range of tweets.

select min(created_at), max(created_at) from tweets

min(created_at) max(created_at)
2020-05-05 00:00:00 2020-05-18 23:59:59

This is 2 weeks of data. Get the number tweets per day (date).

select date(created_at) as date, count(*) as n_tweets from tweets 
group by date order by date

date n_tweets
2020-05-05 158795
2020-05-06 165513
2020-05-07 157250
2020-05-08 144608
2020-05-09 139884
2020-05-10 118675
2020-05-11 161456
2020-05-12 187807
2020-05-13 204860
2020-05-14 205149
2020-05-15 200291
2020-05-16 135812
2020-05-17 122249
2020-05-18 181264

Get number of tweets per (date-)hour.

select date_format(created_at,'%Y-%m-%d %H:00') as hour, count(*) as n_tweets from tweets 
group by hour order by hour limit 10 offset 50

hour n_tweets
2020-05-07 06:00 2427
2020-05-07 07:00 1582
2020-05-07 08:00 1029
2020-05-07 09:00 1172
2020-05-07 10:00 2146
2020-05-07 11:00 4017
2020-05-07 12:00 4849
2020-05-07 13:00 5644
2020-05-07 14:00 6807
2020-05-07 15:00 2253

Let’s say we want to know how many tweets–on average–are sent every hour of the day. By average we mean that we want to average across dates.

We first create a table to hold the number of tweets every hour for a given date.

create table hourly as 
(select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date, hour)
select * from hourly 
limit 10 offset 50

date hour n_tweets
2020-05-07 6 2427
2020-05-07 7 1582
2020-05-07 8 1029
2020-05-07 9 1170
2020-05-07 10 2145
2020-05-07 11 4017
2020-05-07 12 4848
2020-05-07 13 5644
2020-05-07 14 6807
2020-05-07 15 2253

Now that we have that much information, we can collapse across days, to get the average over all days for every hour.

select hour, avg(n_tweets) as avg_n_tweets from hourly
group by hour

hour avg_n_tweets
0 7216.6667
1 6810.3077
2 6507.5000
3 5618.9286
4 4602.3571
5 3402.6429
6 2604.7143
7 2635.2143
8 1767.0714
9 1971.9286
10 2760.2857
11 4372.0714
12 6599.0714
13 9132.2857
14 10620.2143
15 11015.3571
16 10912.5385
17 11129.5714
18 11575.4286
19 11492.0000
20 10513.3846
21 9824.4615
22 8441.4167
23 8309.6364

So that works. But let’s see if we can do this simpler!

We can do this in one command without creating a temporary table hourly by nesting SQL queries, rather than making ourselves these temporary tables. Look at the queries above to convince yourself that the query below is functionally equivalent.

select hour, avg(n_tweets) as avg_n_tweets from (
    select date(created_at) as date, hour(created_at) as hour, count(*) as n_tweets from tweets group by date,hour) as tmp
group by hour

hour avg_n_tweets
0 7216.6667
1 6810.3077
2 6507.5000
3 5618.9286
4 4602.3571
5 3402.6429
6 2604.7143
7 2635.2143
8 1767.0714
9 1971.9286
10 2760.2857
11 4372.0714
12 6599.0714
13 9132.2857
14 10620.2143
15 11015.3571
16 10912.5385
17 11129.5714
18 11575.4286
19 11492.0000
20 10513.3846
21 9824.4615
22 8441.4167
23 8309.6364

Alright, let’s drop our temp table!

drop table hourly

Distinct and counting

Get the number of distinct users in the data.

select count(distinct(user_id)) as n_users from tweets

n_users
797173

Get number of tweets per user and have the results in descending order of number of tweets.

select user_id, count(*) as n_tweets from tweets 
group by user_id order by n_tweets desc limit 10 offset 100

user_id n_tweets
701476526596997120 410
166488367 406
2876963253 403
9020932 399
14437914 399
19733334 393
34553489 391
44959878 391
126424795 389
22707692 389

Searching using wildcards

We can find tweets that contain urls. A simple and basic version of that is using wildcards.

A wildcard (* the shell) is a symbol that matches any number of characters, including 0 characters. In SQL, the importance two to know are

% - matches any number of characters, including no character. Which means that ‘%bla’ also matches strings that just start with ‘bla’. ? - matches one and exactly one arbitrary character. ‘?bla’ matches ‘bbla’ and ‘hbla’.

We use it with the SELECT LIKE command, like so:

select count(*) as n_tweets_with_url from tweets where text like '%https://%'

n_tweets_with_url
1661586
select * from tweets where text like '%https://%' 
limit 5 offset 100

created_at fips id place_country_code place_full_name text user_created_at user_id user_statuses_count
2020-05-05 00:00:06 39061 1257460021941678080 A Remote Workforce Makes Security Training More Critical | #coronavirus #covid19 #cybersecurity #privacy #security #compliance #databreach #biometrics #fingerprint #facialrecognition #smarthome #phishing #ransomware #CCPA #BIPA #IoT https://t.co/0d2IDtxu1r https://t.co/fN9I3AAf56 Fri Aug 20 19:12:14 +0000 2010 180888396 8759
2020-05-05 00:00:06 41059 1257460022205984771 Stay home. Don’t hang out with people who don’t live in your house unless you’re social distancing. https://t.co/ZPOxHuPxlG Wed May 17 00:23:52 +0000 2017 864637555618304001 499
2020-05-05 00:00:06 11001 1257460023057436682 From snacking to sibling rivalry, kids give us their honest opinions about #quarantinelife. Spoiler alert:they *REALLY* miss the playground. #flattenthecurve https://t.co/hML6bJp3zs https://t.co/KHuJzTS0yl Thu Aug 27 17:53:15 +0000 2009 69356392 10098
2020-05-05 00:00:06 21111 1257460023539781633 '60 Minutes' correspondent Lesley Stahl says she was 'scared' and hospitalized for a week fighting COVID-19 https://t.co/R7pLfvXfhP https://t.co/cQsisv6czN Mon Oct 03 16:09:47 +0000 2011 384392137 10879
2020-05-05 00:00:06 34013 1257460023925641217 Visit the Reeve Foundation COVID-19 Information Center for information about the #coronavirus and SCI. #TodaysCare https://t.co/8xFect3qZk Thu Apr 16 15:12:43 +0000 2009 31877693 40221

Merge/Join county, state

In table tweets the column fips is the fips code of the US county from which tweet likely originated.

FYI: There are about 3000 counties, every US county has a unique fips code. The first two digits in the FIPS code designate a state (state “1” being Alabama), and the last three digits specify a county within the state. For example, Santa Clara county is 6085 in CA, San Mateo is 6081, and 48269 is King county in TX.

Let us now add columns county , state corresponding to the fips in the tweets table, information for which is in the counties table. This means we need to join these two tables on fips.

Print the schema of the table counties.

describe counties

Field Type Null Key Default Extra
fips varchar(5) YES None
county varchar(40) YES None
state varchar(20) YES None

View a few records.

select * from counties limit 10 offset 100

fips county state
6061 Placer County California
26161 Washtenaw County Michigan
22071 Orleans Parish Louisiana
48355 Nueces County Texas
27003 Anoka County Minnesota
12081 Manatee County Florida
12021 Collier County Florida
42079 Luzerne County Pennsylvania
29510 St. Louis city Missouri
48039 Brazoria County Texas

The fips in table tweets are all of length 5 but fips in table counties are of length 4 & 5.

select distinct(length(fips)) from tweets

(length(fips))
5
select distinct(length(fips)) from counties

(length(fips))
4
5

Before we merge, we need to make sure the fips in table counties is of same length, with zero padding, as in table tweets. Here we left-pad the fips with 0 to make sure the length is 5. You can double-check this by running previous statement again.

update counties set fips = lpad(fips,5,'0')

We intend to bring the county , state columns from counties into tweets for every tweet. But are all the fips in tweets present in table counties ?

select count(distinct(fips)) as n_fips from tweets where fips not in (
    select fips from counties)

n_fips
28

So there are fips in tweets for which we have no record in counties. When we join, we have two options:

We will perform a inner-join for this tutorial, thereby keeping records in tweets_that have an entry in _counties. Can you see why we need to name the columns like tweets.*, counties.county, counties.state ?

create table tweets_counties as (
    select tweets.*, counties.county, counties.state from tweets inner join counties on tweets.fips = counties.fips)

Print the schema of the merged table.

describe tweets_counties

Field Type Null Key Default Extra
created_at datetime YES None
fips varchar(5) YES None
id varchar(20) YES None
place_country_code varchar(2) YES None
place_full_name varchar(110) YES None
text text YES None
user_created_at varchar(30) YES None
user_id varchar(20) YES None
user_statuses_count int(11) YES None
county varchar(40) YES None
state varchar(20) YES None

We can drop all tables except this new one as we now have everything in one table.

drop table tweets, counties

Rename table to tweets.

rename table tweets_counties to tweets

Let us find the average number of users per state.

select state, count(distinct(user_id)) as n_users from tweets group by state
limit 10

state n_users
Alabama 6581
Alaska 3718
Arizona 14715
Arkansas 2942
California 144352
Colorado 14636
Connecticut 6131
Delaware 1260
District of Columbia 28195

Finally, remove the table tweets.

drop table tweets

Well done! If you understand everything up to here you are in good SQL-shape! BTW, don’t hesitate to google any questions, stackoverflow and the mySQL online reference are your friend. Everybody does that, and those who say they don’t are lying.