Stanford PSYCH 290

Natural Language Processing & Text-Based Machine Learning in the Social Sciences

HW4 – working with DLATK feature tables in R

You may submit your answers as a plain R script, as R Markdown, or as an R Jupyter notebook. For all of the questions below, please limit your analyses to the 978 authors with at least 500 tokens (see the end of Jupyter tutorial 11 for a reminder on adding word counts to the outcomes table). See tutorials 12 and 13 for how to connect your local RStudio to the SQL servers in the cloud (through the SSH tunnel).

  1. Please make an (author-level) cross-correlation table between the LIWC (POSEMO), LabMT (valence), and NRC (sent) features, and author age and gender. R commands you will need: dbGetQuery, importFeat (from the R source script; see tutorial 10), merge, cor() – or something similar. A sketch follows below.
    LIWC is a theory-based model, LabMT an annotation-based one, and NRC a machine-learning-based one. Which of these three doesn’t correlate highly with the other two?
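
    A minimal sketch, assuming a DBI connection con through the SSH tunnel; the table names (feat_liwc, feat_labmt, feat_nrc, outcomes) and the wordcount column are hypothetical placeholders – substitute your actual DLATK feature tables, or use the course's importFeat() helper instead of the raw queries.

      library(DBI)

      # Pull one feature per table (DLATK long format: group_id, feat, group_norm)
      liwc  <- dbGetQuery(con, "SELECT group_id, group_norm AS posemo
                                FROM feat_liwc WHERE feat = 'POSEMO'")
      labmt <- dbGetQuery(con, "SELECT group_id, group_norm AS valence
                                FROM feat_labmt")
      nrc   <- dbGetQuery(con, "SELECT group_id, group_norm AS sent
                                FROM feat_nrc")
      out   <- dbGetQuery(con, "SELECT user_id AS group_id, age, gender
                                FROM outcomes WHERE wordcount >= 500")

      # Merge all tables on the author id, then cross-correlate
      d <- Reduce(function(x, y) merge(x, y, by = "group_id"),
                  list(liwc, labmt, nrc, out))
      round(cor(d[, c("posemo", "valence", "sent", "age", "gender")],
                use = "pairwise.complete.obs"), 2)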

  2. Please calculate a Cohen’s d effect size (the difference in means, expressed in pooled-standard-deviation units) for how genders 0 and 1 use LIWC POSEMO and NRC sent. FYI: R commands – cohen.d(…), e.g. from the effsize package; a sketch follows below.
    Which dictionary distinguishes better between the genders?
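
    A minimal sketch using effsize::cohen.d(), reusing the merged data frame d from question 1 (posemo, sent, and gender are assumed column names):

      library(effsize)

      d$gender_f <- factor(d$gender)              # two-level grouping factor
      cohen.d(posemo ~ gender_f, data = d)        # LIWC POSEMO by gender
      cohen.d(sent   ~ gender_f, data = d)        # NRC sent by gender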

  3. Following on from the previous question, please produce 3 plots of LIWC (POSEMO), LabMT (valence), and NRC (sent) against author age, and add a line of best fit (either linear or LOESS).
    R commands you will need: dbGetQuery, importFeat (from the R source script), merge, qplot/ggplot, + geom_smooth() or + geom_smooth(method = "lm"), and + theme_Publication() (from the R source script). A sketch follows below.
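
    A minimal sketch with ggplot2, reusing the merged data frame d from question 1; theme_Publication() comes from the course's R source script:

      library(ggplot2)

      plot_feature <- function(df, feature) {
        ggplot(df, aes(x = age, y = .data[[feature]])) +
          geom_point(alpha = 0.3) +
          geom_smooth(method = "loess") +  # or method = "lm" for a straight line
          theme_Publication()
      }

      plot_feature(d, "posemo")
      plot_feature(d, "valence")
      plot_feature(d, "sent")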

  4. Can you regress LabMT (valence) and NRC (sent) against age, controlling for gender? Please provide standardized coefficients with p-values.
    R commands you will need: dbGetQuery, importFeat (from the R source script), merge, lm(scale(language) ~ scale(outcome1) + scale(outcome2)) – scale() standardizes your variables (mean = 0, SD = 1) and thereby yields betas. A sketch follows below.
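
    A minimal sketch of the standardized regressions, reusing the merged data frame d from question 1:

      # scale() standardizes each variable, so the coefficients are betas
      m_labmt <- lm(scale(valence) ~ scale(age) + scale(gender), data = d)
      m_nrc   <- lm(scale(sent)    ~ scale(age) + scale(gender), data = d)
      summary(m_labmt)  # standardized coefficients with p-values
      summary(m_nrc)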

  5. Please extract the data-driven dictionary for age and gender (dlatk_lexica.dd_emnlp15_ageGender). Please make one histogram each of the predicted age and of the age in the outcomes table. A sketch follows below.
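
    A minimal sketch; the lexicon table name comes from the question, but feat_ageGender and the feature names 'AGE' and 'GENDER' are hypothetical placeholders for whatever feature table your dictionary extraction produced:

      lex <- dbGetQuery(con, "SELECT * FROM dlatk_lexica.dd_emnlp15_ageGender")

      pred_age    <- dbGetQuery(con, "SELECT group_id, group_norm AS pred_age
                                      FROM feat_ageGender WHERE feat = 'AGE'")
      pred_gender <- dbGetQuery(con, "SELECT group_id, group_norm AS pred_gender
                                      FROM feat_ageGender WHERE feat = 'GENDER'")

      # out is the outcomes data frame from question 1
      d2 <- Reduce(function(x, y) merge(x, y, by = "group_id"),
                   list(pred_age, pred_gender, out))

      hist(d2$pred_age, main = "Dictionary-predicted age")
      hist(d2$age,      main = "Age in the outcomes table")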

  6. How well does the age dictionary work? Please plot author age against dictionary-predicted age, and add a line of best fit (either linear or LOESS).
    R commands you will need: dbGetQuery, importFeat (from the R source script), merge, qplot/ggplot, + geom_smooth() or + geom_smooth(method = "lm"), and + theme_Publication() (from the R source script). A sketch follows below.
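
    A minimal sketch, reusing the merged data frame d2 from question 5:

      ggplot(d2, aes(x = age, y = pred_age)) +
        geom_point(alpha = 0.3) +
        geom_smooth(method = "lm") +  # or method = "loess"
        theme_Publication()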

  7. Repeat the above, but produce two lines of best fit, one for each author gender (see the sketch below).
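
    A minimal sketch: mapping gender to colour gives one fitted line per gender.

      ggplot(d2, aes(x = age, y = pred_age, colour = factor(gender))) +
        geom_point(alpha = 0.3) +
        geom_smooth(method = "lm") +
        theme_Publication()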

  8. How many years is the dictionary ‘prediction’ off, on average? That is, please compute the mean absolute error (MAE) of the dictionary age ‘predictions’. FYI: MAE = mean(abs(var1 - var2)). A sketch follows below.
    Does this seem like a lot to you?
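
    A minimal sketch, reusing d2 from question 5:

      mae <- mean(abs(d2$age - d2$pred_age))
      mae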

  9. Let’s investigate how well the gender-prediction dictionary works. Please make a histogram of the predicted gender values and group it by the true genders, such that you see overlapping histograms of predicted values, one for each (true) gender. What would be a good threshold on the predicted gender value for distinguishing well between the genders? Please draw this threshold as a vertical line in your combined histogram.
    FYI: R: ggplot2’s + geom_vline(). A sketch follows below.
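
    A minimal sketch; pred_gender is the assumed column holding the continuous gender score (cf. question 5), and the xintercept of 0 is a placeholder for the threshold you pick:

      ggplot(d2, aes(x = pred_gender, fill = factor(gender))) +
        geom_histogram(alpha = 0.5, position = "identity", bins = 50) +
        geom_vline(xintercept = 0) +  # replace 0 with your chosen threshold
        theme_Publication()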

  10. Using the reasonable threshold on the continuous gender prediction that you’ve picked, compute a confusion matrix. You do this by first “declaring” the users above or below your threshold to have been classified as 1 or 0. Now that you’ve turned the continuous gender score into a “gender classification”, you can compare these predicted genders against the true genders.
    Please report accuracy, precision, and recall for your choice of threshold, as well as the F1 score, which combines precision and recall conservatively as their harmonic mean (F1 = 2 / (1/recall + 1/precision)).
    R hints: df$a_bin <- ifelse(df$a < X, 1, 0), confusionMatrix(…, reference = ground_truth) from the caret package. A sketch follows below.
    Does this seem like satisfactory prediction performance to you?
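
    A minimal sketch using caret::confusionMatrix(); the threshold of 0 is a placeholder for whatever cutoff you chose in question 9:

      library(caret)

      # Turn the continuous score into a 0/1 classification at the threshold
      d2$gender_bin <- ifelse(d2$pred_gender > 0, 1, 0)

      cm <- confusionMatrix(factor(d2$gender_bin, levels = c(0, 1)),
                            reference = factor(d2$gender, levels = c(0, 1)),
                            positive = "1")
      cm$overall["Accuracy"]
      cm$byClass[c("Precision", "Recall", "F1")]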