Jupyter
Start a Jupyter session using the instructions from here.
Within Jupyter, the following commands can be useful:
cmd | description |
---|---|
%load_ext sql | Load the SQL extension on Jupyter. |
%sql mysql:// | Connect to MySQL server running locally. The username, password, host and port are assumed and need not be provided. |
%sql select * from msgs limit 3 | Run command select ... from Jupyter using the SQL extension. Every SQL command needs to be prefixed with %sql or %%sql . |
%%sql use ubuntu; select * from msgs limit 3; | Run multiple commands separated by ';' from Jupyter using the SQL extension. Same as %sql, but %%sql allows running multiple commands in one cell. |
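
When the defaults assumed by `%sql mysql://` do not apply, the connection details can be spelled out in the URL. The sketch below builds such a URL in the common SQLAlchemy-style form `mysql://user:password@host:port/database`. The helper name `mysql_url` and the example credentials are hypothetical, and whether the SQL extension accepts every variant depends on the installed version and driver.

```python
from urllib.parse import quote_plus


def mysql_url(user="", password="", host="", port=None, database=""):
    """Build a SQLAlchemy-style MySQL URL; parts left empty are omitted."""
    # Percent-encode credentials so characters such as '@' do not break the URL.
    auth = quote_plus(user)
    if password:
        auth += ":" + quote_plus(password)
    netloc = host
    if port is not None:
        netloc += f":{port}"
    if auth:
        netloc = f"{auth}@{netloc}"
    url = f"mysql://{netloc}"
    if database:
        url += f"/{database}"
    return url


# With no arguments this reduces to the bare URL from the table above.
print(mysql_url())
# Hypothetical credentials, for illustration only.
print(mysql_url("alice", "p@ss", "localhost", 3306, "ubuntu"))
```

The printed URL can be pasted after `%sql` in a cell to connect with explicit settings.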
DLATK
cmd | description |
---|---|
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1 | Extract 1-grams grouped by the field user_id. Produces table feat$1gram$msgs$user_id$16to16. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1 --feat_occ_filter --set_p_occ 0.05 | Extract 1-grams grouped by the field user_id and apply an occurrence filter of 0.05. Produces table feat$1gram$msgs$user_id$16to16$0_05. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --feat_table '{feat_1gram_table}' --feat_occ_filter --set_p_occ 0.05 | Same as above, but operates on an existing 1gram table instead of generating it first and then applying the occurrence filter. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --feat_table '{feat_2gram_occ05_table}' --feat_colloc_filter --set_pmi_threshold 3 | Apply a PMI threshold to an existing 2gram table. Produces table feat$2gram$msgs$user_id$16to16$0_05$pmi3_0. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_ngrams -n 1 2 3 --combine_feat_tables 1to3gram --feat_occ_filter --set_p_occ 0.05 --feat_colloc_filter --set_pmi_threshold 3 | Extract 1- to 3-grams individually and produce a combined table. Apply the occurrence threshold and the PMI threshold. Produces 6 tables: 1gram, 2gram, 3gram, 1to3gram, 1to3gram-with-occ, 1to3gram-with-occ-pmi. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --add_lex_table -l mini_LIWC2015 | Extract dictionary features using the lexicon dlatk_lexica.mini_LIWC2015, grouped by the field user_id. Produces table feat$cat_mini_LIWC2015$msgs$user_id$1gra. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --feat_table '{feat_1gram_occ05_table}' --outcome_table blog_outcomes --outcomes age gender --tagcloud --make_wordclouds --output_name {out_d}/{out_name} | Correlate 1grams with the integer outcomes age and gender (0/1) and produce word clouds. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --feat_table '{feat_1to3gram_occ05_pmi3_table}' --outcome_table {outcomes_table} --outcomes occu --categories_to_binary occu --tagcloud --make_wordclouds --output_name {out_d}/{out_name} | Same as above, but uses the 1to3gram table and a categorical outcome (occu), which is converted to binary outcomes. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --p_correction bonferroni --feat_table '{feat_1to3gram_occ05_pmi3_table}' --outcome_table {outcomes_table} --categories_to_binary occu --outcomes occu --tagcloud --make_wordclouds --output_name {out_d}/{out_name} | Same as above, but now apply the Bonferroni correction. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --csv --feat_table '{feat_1gram_occ05_table}' --outcome_table {outcomes_table} --outcomes age gender --tagcloud --make_wordclouds --whitelist --lex_table LIWC2015 --categories 'POSEMO' --output_name {out_d}/{out_name} | Correlate 1grams with outcomes, whitelisting the words in the LIWC2015 category POSEMO, and produce word clouds. |
!dlatkInterface.py --topic_lexicon {topics_freq_table} --make_all_topic_wordclouds --tagcloud_colorscheme blue --output 'output' | Generate topic word clouds. |
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --add_lex_table -l {topics_cp_table} --weighted_lexicon | Extract topics using the specified topic table as a weighted lexicon. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --correlate --rmatrix --csv --sort --outcome_table {outcomes_table} --outcomes age gender --feat_table '{feat_topics_table}' --topic_tagcloud --make_topic_wordclouds --topic_lexicon topics_fb2k_freq --tagcloud_colorscheme blue --output_name {out_d}/{out_name} | Correlate previously extracted topics with outcomes and generate word clouds. |
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --add_lex_table -l {w2v_table} --weighted_lexicon | Extract word2vec features using the specified word2vec table as a weighted lexicon. |
!dlatkInterface.py --corpdb {database} --corptable msgs --correl_field user_id --correlate --rmatrix --csv --sort --outcome_table {outcomes_table} --outcomes age gender --feat_table '{feat_w2v_table}' --topic_tagcloud --make_topic_wordclouds --topic_lexicon {w2v_table} --tagcloud_colorscheme blue --output_name {out_d}/{out_name} | Correlate word2vec features with outcomes and generate word clouds. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes age --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_reg --model ridgecv --folds 10 --output_name {out_d}/{out_name} --csv | Predict a continuous outcome (age) using topics. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes gender --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_classifiers --model lr --folds 10 --output_name {out_d}/{out_name} --csv | Predict a categorical outcome (gender) using topics. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes age --group_freq_thresh 500 --feat_table '{feat_topics_table}' --output_name {out_d}/{out_name} --combo_test_reg --model ridgecv --folds 10 --feature_selection magic_sauce --csv | Predict a continuous variable (age) using topics with a feature selection pipeline. |
!dlatkInterface.py --corpdb {database} --corptable {msgs_table} --correl_field user_id --outcome_table {outcomes_table} --outcomes gender --group_freq_thresh 500 --feat_table '{feat_topics_table}' --combo_test_classifiers --model lr --folds 10 --output_name {out_d}/{out_name} --feature_selection magic_sauce --csv | Predict a categorical variable (gender) using topics with a feature selection pipeline. |
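
The commands above use placeholders such as {database}, {msgs_table} and {feat_1gram_table}, which in a Jupyter cell are presumably filled in from Python variables of the same names. The sketch below shows one way those variables might be defined, following the table-name patterns listed in the descriptions above. The helper `ngram_table` and the database name `my_database` are hypothetical, and the naming scheme is inferred only from the examples in this table, so it may differ between DLATK versions.

```python
import shlex


def ngram_table(n, msgs_table="msgs", group_field="user_id", p_occ=None, pmi=None):
    """Build an n-gram feature-table name following the patterns shown above."""
    name = f"feat${n}gram${msgs_table}${group_field}$16to16"
    if p_occ is not None:
        # 0.05 becomes the suffix $0_05
        name += "$" + str(p_occ).replace(".", "_")
    if pmi is not None:
        # 3 becomes the suffix $pmi3_0
        name += "$pmi" + str(float(pmi)).replace(".", "_")
    return name


database = "my_database"  # hypothetical database name
msgs_table = "msgs"
feat_1gram_table = ngram_table(1, msgs_table)
feat_1gram_occ05_table = ngram_table(1, msgs_table, p_occ=0.05)
feat_2gram_occ05_pmi3_table = ngram_table(2, msgs_table, p_occ=0.05, pmi=3)

# Assemble the occurrence-filter command from the table above as an argument list.
cmd = [
    "dlatkInterface.py",
    "--corpdb", database,
    "--corptable", msgs_table,
    "--correl_field", "user_id",
    "--feat_table", feat_1gram_table,
    "--feat_occ_filter",
    "--set_p_occ", "0.05",
]

# shlex.join quotes the table name so the shell does not expand its '$' signs.
print(shlex.join(cmd))
```

Quoting matters here because the feature-table names contain `$`, which a shell would otherwise try to expand. This is also why the table names in the commands above are wrapped in single quotes.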