Stanford PSYCH 290

Natural Language Processing & Text-Based Machine Learning in the Social Sciences

Jupyter

Start a Jupyter session by following the instructions here.

Within Jupyter, the following commands can be useful:

cmd description
%load_ext sql
Load the SQL extension in Jupyter.
%sql mysql://
Connect to the MySQL server running locally. The username, password, host and port are assumed and need not be provided.
%sql select * from msgs limit 3
Run the command select ... from Jupyter using the SQL extension. Every SQL command needs to be prefixed with %sql or %%sql.
%%sql
use ubuntu;
select * from msgs limit 3;
Run multiple commands separated by ';' from Jupyter using the SQL extension. Same as %sql, but %%sql allows running multiple commands in one cell.

DLATK

cmd description
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--add_ngrams -n 1
Extract 1-grams, grouped by the field user_id. Produces table feat$1gram$msgs$user_id$16to16.
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--add_ngrams -n 1 \
--feat_occ_filter \
--set_p_occ 0.05
Extract 1-grams, grouped by the field user_id, then apply an occurrence filter with threshold 0.05. Produces table feat$1gram$msgs$user_id$16to16$0_05.
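The idea behind the occurrence filter can be sketched in a few lines of Python. This is an illustration only, assuming that --set_p_occ is the minimum fraction of groups (here, users) in which a feature must appear; the function name, the toy data and the use of >= at the boundary are assumptions, not DLATK's implementation.

```python
def occurrence_filter(group_feats, p_occ):
    """Keep features that appear in at least a fraction p_occ of all groups."""
    n_groups = len(group_feats)
    counts = {}
    for feats in group_feats.values():
        # Count each feature once per group, however often the group used it.
        for feat in set(feats):
            counts[feat] = counts.get(feat, 0) + 1
    return {feat for feat, c in counts.items() if c / n_groups >= p_occ}


# Toy data (hypothetical): 4 users and the 1-grams each one used.
users = {
    "u1": ["the", "cat", "sat"],
    "u2": ["the", "dog"],
    "u3": ["the", "cat"],
    "u4": ["the", "zebra"],
}
kept = occurrence_filter(users, p_occ=0.5)
print(sorted(kept))  # ['cat', 'the']
```

With p_occ=0.5, "the" (4 of 4 users) and "cat" (2 of 4) survive, while words used by a single user are dropped.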
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--feat_table '{feat_1gram_table}' \
--feat_occ_filter \
--set_p_occ 0.05
Same as above, but applies the occurrence filter to an existing 1gram table instead of generating the table first.
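The {database}, {msgs_table} and {feat_1gram_table} placeholders are Python variables: IPython expands {name} inside a line that starts with !. Feature table names contain $, which the shell would otherwise try to expand, hence the single quotes around the table name. A minimal sketch of the same thing in plain Python, assuming the database ubuntu and table msgs from the SQL examples; the variable values and the use of shlex.quote are illustrative, not part of the course setup.

```python
import shlex

# Assumed values, taken from the SQL examples (use ubuntu; select ... from msgs).
database = "ubuntu"
msgs_table = "msgs"
feat_1gram_table = "feat$1gram$msgs$user_id$16to16"

# shlex.quote wraps the name in single quotes because it contains '$',
# which the shell would otherwise treat as the start of a variable.
quoted_table = shlex.quote(feat_1gram_table)

# The string that the equivalent `!dlatkInterface.py ...` line hands to the shell.
cmd = (
    f"dlatkInterface.py --corpdb {database} --corptable {msgs_table} "
    f"--correl_field user_id --feat_table {quoted_table} "
    "--feat_occ_filter --set_p_occ 0.05"
)
print(cmd)
```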
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--feat_table '{feat_2gram_occ05_table}' \
--feat_colloc_filter \
--set_pmi_threshold 3
Apply a PMI threshold to an existing 2gram table. Produces table feat$2gram$msgs$user_id$16to16$0_05$pmi3_0.
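For reference, pointwise mutual information (PMI) compares how often two words occur together with how often they would co-occur if they were independent. The sketch below uses the textbook definition with natural logs and a single shared total; the counts are made up, and DLATK's exact variant (log base, scaling for longer n-grams) may differ.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Textbook PMI of the bigram (x, y): log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


# Hypothetical counts out of 1000 tokens.
pmi_collocation = pmi(count_xy=10, count_x=20, count_y=10, total=1000)
pmi_independent = pmi(count_xy=20, count_x=100, count_y=200, total=1000)

threshold = 3
print(pmi_collocation > threshold)   # strongly associated pair passes
print(pmi_independent > threshold)   # pair at chance level does not
```

A pair that co-occurs 50 times more often than chance has PMI ln(50), which is above 3; a pair that co-occurs exactly at chance has PMI 0.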
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--add_ngrams -n 1 2 3 \
--combine_feat_tables 1to3gram \
--feat_occ_filter --set_p_occ 0.05 \
--feat_colloc_filter \
--set_pmi_threshold 3
Extract 1- to 3-grams individually and produce a combined table, then apply the occurrence threshold and the PMI threshold. Produces 6 tables: 1gram, 2gram, 3gram, 1to3gram, 1to3gram-with-occ, 1to3gram-with-occ-pmi.
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--add_lex_table \
-l mini_LIWC2015
Extract dictionary features using the lexicon dlatk_lexica.mini_LIWC2015, grouped by the field user_id. Produces table feat$cat_mini_LIWC2015$msgs$user_id$1gra.
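Conceptually, dictionary extraction turns each group's words into one score per lexicon category. A minimal sketch, assuming the score is the fraction of a user's tokens that fall in the category; the lexicon below is a hypothetical stand-in (not the real mini_LIWC2015 contents), and wildcard entries are not handled.

```python
def category_usage(tokens, lexicon):
    """Return, per category, the fraction of tokens that belong to it."""
    total = len(tokens)
    usage = {}
    for category, words in lexicon.items():
        hits = sum(1 for tok in tokens if tok in words)
        usage[category] = hits / total if total else 0.0
    return usage


# Hypothetical two-category lexicon, for illustration only.
toy_lexicon = {"POSEMO": {"happy", "love"}, "NEGEMO": {"sad"}}
tokens = "i love my happy dog but today i am sad".split()

usage = category_usage(tokens, toy_lexicon)
print(usage)
```

Here 2 of the 10 tokens are in POSEMO and 1 is in NEGEMO.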
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate --rmatrix --csv --sort \
--feat_table '{feat_1gram_occ05_table}' \
--outcome_table blog_outcomes \
--outcomes age gender \
--tagcloud --make_wordclouds \
--output_name {out_d}/{out_name}
Correlate 1grams with the integer outcomes age and gender (0/1) and produce word clouds.
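What "correlate" means for a single feature can be sketched as a Pearson correlation between that feature's per-user relative frequency and the outcome. This assumes plain Pearson r with no controls, which may not match what DLATK computes internally; the numbers are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical data: relative frequency of one word for 4 users, and their ages.
word_freq = [0.01, 0.02, 0.03, 0.04]
age = [20, 25, 30, 35]

r_pos = pearson_r(word_freq, age)        # frequency rises with age
r_neg = pearson_r(word_freq, age[::-1])  # frequency falls with age
print(round(r_pos, 3), round(r_neg, 3))
```

DLATK repeats this kind of test for every feature in the table, which is why multiple-comparison correction matters.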
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate --rmatrix --csv --sort \
--feat_table '{feat_1to3gram_occ05_pmi3_table}' \
--outcome_table {outcomes_table} \
--outcomes occu \
--categories_to_binary occu \
--tagcloud --make_wordclouds \
--output_name {out_d}/{out_name}
Same as above, but the outcome (occu) is categorical.
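A categorical outcome cannot be correlated directly, so it is first turned into one 0/1 indicator per category. A minimal sketch of that conversion; the occupation values and the column naming scheme are hypothetical, and DLATK's own naming may differ.

```python
def categories_to_binary(name, values):
    """Build one 0/1 indicator column per distinct category value."""
    columns = {}
    for category in sorted(set(values)):
        # Column naming here is an assumption made for this example.
        columns[f"{name}__{category}"] = [int(v == category) for v in values]
    return columns


# Hypothetical occupation labels for 4 users.
occu = ["student", "arts", "student", "tech"]
binary = categories_to_binary("occu", occu)
for column, indicator in binary.items():
    print(column, indicator)
```

Each indicator column can then be correlated with the features just like a 0/1 outcome such as gender.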
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate --rmatrix --csv --sort \
--p_correction bonferroni \
--feat_table '{feat_1to3gram_occ05_pmi3_table}' \
--outcome_table {outcomes_table} \
--categories_to_binary occu \
--outcomes occu \
--tagcloud --make_wordclouds \
--output_name {out_d}/{out_name}
Same as above but now apply Bonferroni correction.
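Bonferroni correction multiplies each p-value by the number of tests performed, capped at 1. A short sketch with made-up p-values; the function name is illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical p-values for three features.
raw = [0.01, 0.04, 0.5]
corrected = bonferroni(raw)
significant = [p < 0.05 for p in corrected]
print(corrected, significant)
```

With three tests, only the first feature stays below 0.05 after correction; with thousands of n-gram features the correction is far stricter.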
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate --csv \
--feat_table '{feat_1gram_occ05_table}' \
--outcome_table {outcomes_table} \
--outcomes age gender \
--tagcloud --make_wordclouds \
--whitelist --lex_table LIWC2015 --categories 'POSEMO' \
--output_name {out_d}/{out_name}
Correlate 1grams with the outcomes age and gender, whitelisting only words in the LIWC2015 category POSEMO, and produce word clouds.
!dlatkInterface.py \
--topic_lexicon {topics_freq_table} \
--make_all_topic_wordclouds \
--tagcloud_colorscheme blue \
--output 'output'
Generate topic word clouds.
!dlatkInterface.py \
--corpdb {database} \
--corptable msgs \
--correl_field user_id \
--add_lex_table -l {topics_cp_table} --weighted_lexicon
Extract topics using the specified topic table as weighted lexicon.
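With a weighted lexicon, each word contributes to a topic in proportion to its weight instead of simply counting as a hit. A minimal sketch, assuming a topic's score is the sum over its words of weight times the word's relative frequency for that user; the topic names and weights are invented for the example.

```python
from collections import Counter


def topic_usage(tokens, weighted_lexicon):
    """Score each topic as the sum of weight * relative frequency of its words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    usage = {}
    for topic, weights in weighted_lexicon.items():
        # Counter returns 0 for words the user never wrote.
        usage[topic] = sum(w * counts[word] / total for word, w in weights.items())
    return usage


# Hypothetical topics with per-word weights.
toy_topics = {
    "topic_1": {"game": 0.6, "team": 0.3},
    "topic_2": {"food": 0.8},
}
tokens = "the team won the game".split()

scores = topic_usage(tokens, toy_topics)
print(scores)
```

For this user, topic_1 scores 0.6 * 1/5 + 0.3 * 1/5 = 0.18 and topic_2 scores 0.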
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate --rmatrix --csv --sort \
--outcome_table {outcomes_table} \
--outcomes age gender \
--feat_table '{feat_topics_table}' \
--topic_tagcloud --make_topic_wordclouds \
--topic_lexicon topics_fb2k_freq \
--tagcloud_colorscheme blue \
--output_name {out_d}/{out_name}
Correlate previously extracted topics with outcomes and generate word clouds.
!dlatkInterface.py \
--corpdb {database} \
--corptable msgs \
--correl_field user_id \
--add_lex_table \
-l {w2v_table} \
--weighted_lexicon
Extract word2vec features, using the word2vec table as a weighted lexicon.
!dlatkInterface.py \
--corpdb {database} \
--corptable msgs \
--correl_field user_id \
--correlate --rmatrix --csv --sort \
--outcome_table {outcomes_table} \
--outcomes age gender \
--feat_table '{feat_w2v_table}' \
--topic_tagcloud --make_topic_wordclouds \
--topic_lexicon {w2v_table} \
--tagcloud_colorscheme blue \
--output_name {out_d}/{out_name}
Correlate word2vec with outcomes and generate word clouds.
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--outcome_table {outcomes_table} \
--outcomes age \
--group_freq_thresh 500 \
--feat_table '{feat_topics_table}' \
--combo_test_reg --model ridgecv --folds 10 \
--output_name {out_d}/{out_name} \
--csv
Predict a continuous outcome (age) using topics.
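The --folds 10 option refers to 10-fold cross-validation: users are split into 10 parts, and each part is held out once for testing while the model is trained on the rest. The sketch below shows only the splitting step, assuming simple contiguous folds; how DLATK assigns users to folds is not specified here, and the ridge model itself is omitted.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


n_users = 25  # hypothetical number of users
folds = k_fold_indices(n_users, 10)

splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(n_users) if i not in held_out]
    # A model would be fit on train_idx and scored on test_idx here.
    splits.append((train_idx, test_idx))

print([len(test) for _, test in splits])
```

Every user lands in exactly one test fold, so each prediction is made by a model that never saw that user during training.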
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--outcome_table {outcomes_table} \
--outcomes gender \
--group_freq_thresh 500 \
--feat_table '{feat_topics_table}' \
--combo_test_classifiers --model lr --folds 10 \
--output_name {out_d}/{out_name} \
--csv
Predict a categorical outcome (gender) using topics.
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--outcome_table {outcomes_table} \
--outcomes age \
--group_freq_thresh 500 \
--feat_table '{feat_topics_table}' \
--output_name {out_d}/{out_name} \
--combo_test_reg --model ridgecv --folds 10 \
--feature_selection magic_sauce \
--csv
Predict a continuous variable (age) using topics with a feature selection pipeline.
!dlatkInterface.py \
--corpdb {database} \
--corptable {msgs_table} \
--correl_field user_id \
--outcome_table {outcomes_table} \
--outcomes gender \
--group_freq_thresh 500 \
--feat_table '{feat_topics_table}' \
--combo_test_classifiers --model lr --folds 10 \
--output_name {out_d}/{out_name} \
--feature_selection magic_sauce \
--csv
Predict a categorical variable (gender) using topics with a feature selection pipeline.