Stanford PSYCH 290

Natural Language Processing & Text-Based Machine Learning in the Social Sciences

Jupyter

Start a Jupyter session with the instructions from here.

Within Jupyter, the following commands can be useful:

cmd	description
%load_ext sql	Load the SQL extension on Jupyter.
%sql mysql://	Connect to MySQL server running locally. The username, password, host and port are assumed and need not be provided.
%sql select * from msgs limit 3	Run command `select ...` from Jupyter using the SQL extension. Every SQL command needs to be prefixed with `%sql` or `%%sql`.
%%sql use ubuntu; select * from msgs limit 3;	Run multiple commands separated by `;` from Jupyter using the SQL extension. Same as `%sql` but `%%sql` allows us to run multiple commands in one cell.
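
Outside of a notebook cell, the same kind of query can be run from plain Python. The following is only a minimal sketch: it uses an in-memory SQLite database as a stand-in for the MySQL server, and the `msgs` columns, the sample rows, and the helper `run_script` are hypothetical names made up for illustration. The helper mimics the idea of running several `;`-separated statements in one go by naively splitting on `;`, which would break on semicolons inside string literals.

```python
import sqlite3

# Stand-in database: in-memory SQLite instead of the course MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical msgs table and rows, for illustration only.
conn.execute("create table msgs (message_id integer, user_id integer, message text)")
conn.executemany(
    "insert into msgs values (?, ?, ?)",
    [
        (1, 10, "hello world"),
        (2, 10, "second post"),
        (3, 11, "another user"),
        (4, 12, "one more"),
    ],
)


def run_script(conn, script):
    """Run ';'-separated statements in order and return the rows of the last one.

    Naive split: does not handle semicolons inside quoted strings.
    """
    rows = []
    for stmt in script.split(";"):
        if stmt.strip():
            rows = conn.execute(stmt).fetchall()
    return rows


# Comparable to the single-statement example in the table above.
first_three = run_script(conn, "select * from msgs limit 3")

# Comparable to a multi-statement cell: two statements, last result kept.
count_then_rows = run_script(conn, "select count(*) from msgs; select * from msgs limit 3;")

print(first_three)
```

Returning only the last statement's rows is a choice made for this sketch, not a statement about how the SQL extension behaves.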

DLATK

cmd	description
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1	Extract 1-grams with group by field user_id. Produces table `feat$1gram$msgs$user_id$16to16`.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1 --feat_occ_filter --set_p_occ 0.05	Extract 1-grams with group by field user_id, then apply an occurrence filter with threshold 0.05. Produces table `feat$1gram$msgs$user_id$16to16$0_05`.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --feat_table '{feat_1gram_table}' --feat_occ_filter --set_p_occ 0.05	Same as above, but applies the occurrence filter to an existing 1gram table instead of generating the table first.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --feat_table '{feat_2gram_occ05_table}' --feat_colloc_filter --set_pmi_threshold 3	Apply a PMI threshold to an existing 2gram feature table. Produces table `feat$2gram$msgs$user_id$16to16$0_05$pmi3_0`.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram --feat_occ_filter --set_p_occ 0.05 --feat_colloc_filter --set_pmi_threshold 3	Extract 1- to 3-grams individually and produce a combined table. Apply occurrence threshold and PMI threshold. Produces 6 tables: 1gram, 2gram, 3gram, 1to3gram, 1to3gram-with-occ, 1to3gram-with-occ-pmi.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_lex_table -l mini_LIWC2015	Extract dictionary features using the lexicon `dlatk_lexica.mini_LIWC2015` and group by the field user_id. Produces table `feat$cat_mini_LIWC2015$msgs$user_id$1gra`.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --feat_table '{feat_1gram_occ05_table}' --outcome_table blog_outcomes --outcomes age gender --tagcloud --make_wordclouds --output_name {out_d}/{out_name}	Correlate 1grams with the integer outcomes age and gender (0/1) and produce word clouds.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --feat_table '{feat_1to3gram_occ05_pmi3_table}' --outcome_table {outcomes_table} --outcomes occu --categories_to_binary occu --tagcloud --make_wordclouds --output_name {out_d}/{out_name}	Same as above but outcome is categorical.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --p_correction bonferroni --feat_table '{feat_1to3gram_occ05_pmi3_table}' --outcome_table {outcomes_table} --categories_to_binary occu --outcomes occu --tagcloud --make_wordclouds --output_name {out_d}/{out_name}	Same as above but now apply Bonferroni correction.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --csv --feat_table '{feat_1gram_occ05_table}' --outcome_table {outcomes_table} --outcomes age gender --tagcloud --make_wordclouds --whitelist --lex_table LIWC2015 --categories 'POSEMO' --output_name {out_d}/{out_name}	Correlate 1grams with age and gender, whitelisting words from the LIWC2015 category POSEMO, and produce word clouds.
!dlatkInterface.py --topic_lexicon {topics_freq_table} --make_all_topic_wordclouds --tagcloud_colorscheme blue --output 'output'	Generate topic word clouds.
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --add_lex_table -l {topics_cp_table} --weighted_lexicon	Extract topics using the specified topic table as weighted lexicon.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --outcome_table {outcomes_table} --outcomes age gender --feat_table '{feat_topics_table}' --topic_tagcloud --make_topic_wordclouds --topic_lexicon topics_fb2k_freq --tagcloud_colorscheme blue --output_name {out_d}/{out_name}	Correlate previously extracted topics with outcomes and generate word clouds.
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --add_lex_table -l {w2v_table} --weighted_lexicon	Extract word2vec.
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --correlate --rmatrix --csv --sort --outcome_table {outcomes_table} --outcomes age gender --feat_table '{feat_w2v_table}' --topic_tagcloud --make_topic_wordclouds --topic_lexicon {w2v_table} --tagcloud_colorscheme blue --output_name {out_d}/{out_name}	Correlate word2vec with outcomes and generate word clouds.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes age --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_reg --model ridgecv --folds 10 --output_name {out_d}/{out_name} --csv	Predict a continuous outcome (age) using topics.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes gender --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_classifiers --model lr --folds 10 --output_name {out_d}/{out_name} --csv	Predict a categorical outcome (gender) using topics.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes age --group_freq_thresh 500 --feat_table '{feat_topics_table}' --output_name {out_d}/{out_name} --combo_test_reg --model ridgecv --folds 10 --feature_selection magic_sauce --csv	Predict a continuous variable (age) using topics with feature selection pipeline.
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes gender --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_classifiers --model lr --folds 10 --output_name {out_d}/{out_name} --feature_selection magic_sauce --csv	Predict a categorical variable (gender) using topics with feature selection pipeline.
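
The commands above rely on notebook variables such as `{database}` and `{feat_1gram_table}` being defined in earlier cells. A possible way to set them up is sketched below. The helper names `feat_table_name` and `ngram_command` are hypothetical, the database name `psych290` is a placeholder, and the naming pattern is inferred only from the three table names quoted in the table above (the `16to16` part is copied verbatim, not computed), so it may not hold for other DLATK settings.

```python
import shlex


def feat_table_name(n, corptable, correl_field, p_occ=None, pmi=None):
    """Guess an n-gram feature table name from the examples in this cheat sheet.

    Assumption: the pattern is feat$<n>gram$<corptable>$<correl_field>$16to16,
    optionally followed by $<p_occ> and $pmi<threshold> with '.' written as '_'.
    """
    name = f"feat${n}gram${corptable}${correl_field}$16to16"
    if p_occ is not None:
        name += "$" + str(float(p_occ)).replace(".", "_")
    if pmi is not None:
        name += "$pmi" + str(float(pmi)).replace(".", "_")
    return name


def ngram_command(database, corptable, correl_field, n):
    """Build the n-gram extraction command line using flags shown in the table above."""
    args = [
        "dlatkInterface.py",
        "--corpdb", database,
        "--corptable", corptable,
        "--correl_field", correl_field,
        "--add_ngrams", "-n", str(n),
    ]
    # shlex.join quotes any argument that needs it for the shell.
    return shlex.join(args)


# Placeholder values; substitute the real database and table names.
database = "psych290"
msgs_table = "msgs"

feat_1gram_table = feat_table_name(1, msgs_table, "user_id")
feat_1gram_occ05_table = feat_table_name(1, msgs_table, "user_id", p_occ=0.05)
feat_2gram_occ05_pmi3_table = feat_table_name(2, msgs_table, "user_id", p_occ=0.05, pmi=3)

print(feat_1gram_table)
print(ngram_command(database, msgs_table, "user_id", 1))
```

Because the table names contain `$`, they need to stay inside single quotes when passed to the shell, as in the `--feat_table '{feat_1gram_table}'` arguments above.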