Stanford PSYCH 290

Natural Language Processing & Text-Based Machine Learning in the Social Sciences

PSYCH 290 will be offered in the Spring 2025 quarter. The course application is HERE.

For context, last year we had about 28 applicants and could admit roughly 12 students into the class.
Decisions will be made by March 28th, 2025. Prioritization is generally by seniority.

Basic Course Info

Instructor: Johannes Eichstaedt (eich) (he/him)
Teaching Assistant: Cedric Lim (limch)
Class: Tue / Thu – 12:00 PM - 1:20 PM PT (80 mins)
Office hours:
Johannes – TBD – Room 136 in Bldg. 420; available via Zoom after a quick email.
Cedric – TBD
Textbook: None. We will use papers/PDFs.
Prerequisites: Decent ability to code in R. Familiarity with multivariate regression and basic statistics of the social sciences. NOT required but helpful: Python, SSH, SQL (we will teach you what you need to know). Biggest requirement: knowing what science is, and wanting to learn.

All-in-One DLATK Colab: Feature Extraction, Correlation, Topic Modeling

The 2023 syllabus is here. If you want to do something to prepare for the course, read Eichstaedt et al., 2020 from the readings folder. Ethics advice for the course was previously contributed by Kathleen Creel.

Course Road Map


Week 1 - Intro to the course & SQL (Block 1)

Tuesday, 4/1 - Lecture 1 - Intro to course, why DLATK, intro to computing infrastructure
Thursday, 4/3 - Lecture 2 - Getting started with SQL workshop – PLEASE BRING YOUR LAPTOP TO CLASS

Tutorials:

Jupyter Tutorials:

These Jupyter notebook tutorials will be linked here or on Canvas.

Homework:

Readings:

The readings folder is here.


Tutorials and homeworks are always released by Thursday after class at the latest, and are due the following Thursday before class.


Week 2 - Intro to NLP (Block 1, 2)

Tuesday, 4/8 - Lecture 3 (W2.1) - The field of NLP, different kinds of language analyses
Thursday, 4/10 - Lecture 4 (W2.2) - Meet DLATK & feature extraction (intro to new tutorials)

Tutorials:

Homework:

Readings: (“due” by W3.1)


Week 3 - Dictionaries: GI, DICTION, LIWC (Block 2)

Tuesday, 4/15 - Lecture 5 - Dictionary evaluation and history
Thursday, 4/17 - Lecture 6 - DLATK lexicon extraction, GI, DICTION

Tutorials:

Homework:

Readings: (“due” by Thursday, 4/17)


Week 4 - LIWC, annotation-based and sentiment Dictionaries (ANEW, LABMT, NRC) (Block 2)

Tuesday, 4/22 - Lecture 7 - LIWC, word-annotation based dictionaries ANEW, LabMT
Thursday, 4/24 - Lecture 8 - DLATK lexicon correlations, sentiment dicts NRC

Tutorials:

Homework:

Readings:


Week 5 - Sentiment dictionaries and R (Block 2)

Tuesday, 4/29 - Lecture 9 - Types of Science with NLP, Intro to Open Vocab, power calculations
Thursday, 5/1 - Lecture 10 - Data import, R and DLATK

Tutorials:

Homework:

Homework 5 files:
Messages CSV
Outcomes CSV

Readings:


Week 6 - Introduction to Open Vocab (Block 3)

Tuesday, 5/6 - Lecture 11 - Embeddings and Topics
Thursday, 5/8 - Lecture 12 - DLATK: 1to3gram feature extraction with occurrence filtering and PMI

Tutorials:

Homework:

Download the word cloud PowerPoint template

Readings:


Week 7 - Embeddings and Topic Modeling (Block 4)

Tuesday, 5/13 - Lecture 13 – LDA topics, and discussion of good studies
Thursday, 5/15 - Lecture 14 – DLATK for topics, and topic conceptual review

Tutorials:

Homework:

Readings: (“due” by Thursday, 5/15)


Week 8 - Intro to ML (Block 5)

Tuesday, 5/20 - Lecture 15 - Intro to Machine learning
Thursday, 5/22 - Lecture 16 - Final Projects Intro, Reddit Scraping, More Machine Learning

Tutorials:

Homework:

Readings:


Week 9 - ML: deep learning & pre-presentations (Block 5)

Tuesday, 5/27 - Lecture 17 - Deep Learning – Guest lecture by Andy Schwartz
Thursday, 5/29 - Lecture 18 - Final Project Pre-Presentations (please add to shared slide deck).

Tutorials:

Homework:

Readings: Read what’s relevant for your final projects!


Week 10 - Guest Lecture & LLMs

Tuesday, 6/3 - Lecture 19 - Overview of Generative Large Language Models and their Applications
Thursday, 6/5 - Lecture 20 - Project presentations

Tutorials:

Homework:

Readings:


Week 11 (Finals Week) - Course summary and project presentations

Tuesday, 6/10 - Lecture 21 - Project presentations

Tutorials:

Homework:

Readings: (read what’s relevant for your final projects)


Command logs for VTutorials


Working on a Linux cloud server

If you need to work on our Linux cloud server later in the course (for big datasets, say more than 2,000 rows of “messages”/text samples), you will need to learn how to SSH into a cloud server, tunnel ports, connect SQL graphical user interfaces, and tunnel Jupyter into your local browser. This is explained in a sequence of tutorials (an R sketch of the database-connection step follows after the tutorial list):

overview of Block 1 tutorials

MAC:

WINDOWS:

less urgent:
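
To give a flavor of what the SQL end looks like once a tunnel is open, here is a minimal R sketch. It assumes you have already forwarded the MySQL port with SSH; the hostname, credentials, database, and table names are placeholders, not the actual class setup.

```r
# Minimal sketch: querying a class-style MySQL database from R over an SSH tunnel.
# Assumes a tunnel is already open in a terminal, e.g.:
#   ssh -L 3306:localhost:3306 your_sunetid@cloud-server.example.edu   # placeholder host
library(DBI)
library(RMariaDB)

con <- dbConnect(
  RMariaDB::MariaDB(),
  host     = "127.0.0.1",                            # local end of the tunnel
  port     = 3306,
  user     = "your_sunetid",                         # placeholder user
  password = Sys.getenv("PSYCH290_DB_PASSWORD"),     # or however you store credentials
  dbname   = "psych290"                              # placeholder database name
)

# Peek at a (placeholder) messages table
dbGetQuery(con, "SELECT message_id, user_id, message FROM msgs LIMIT 5")

dbDisconnect(con)
```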


Basic logistics

This site will be kept up-to-date.

Readings are prioritized (A > B > C), and are in our readings Google Drive folder.
VTutorials: Video tutorials are on the unlisted class YouTube channel. Jupyter worksheets will be in your home folder. Homeworks will be there or posted here.
Lecture slides are linked from Canvas.
Communication will happen via our Slack channel – please access it via Canvas the first time to set it up.


Course background and scope

What is this course?

This is an applied course with an emphasis on the practical ability to deploy computational text analysis over data sets ranging from hundreds to millions of text samples, and to mine them for patterns and psychological insight. These text samples can be social media posts, essays, or any other form of autobiographical writing. The goal is to practice these methods in guided tutorials and project-based work so that students can apply them to their own research contexts. The course will provide best practices, as well as access to and familiarity with a Linux-based server environment to process text, including the extraction of words and phrases, topics, and psychological dictionaries. We will also practice basic machine learning, using these text features to estimate survey scores associated with the text samples.
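To make the last step concrete, the sketch below shows the general shape of such a prediction in R with cross-validated ridge regression (the course itself runs these models through DLATK on the server). The feature matrix and survey scores here are simulated stand-ins, not course data.

```r
# Minimal sketch (not the course's DLATK workflow): predicting a survey score
# from per-user language features with cross-validated ridge regression.
# X stands in for a users-by-features matrix (e.g., relative word/topic
# frequencies); y stands in for the matching survey scores.
library(glmnet)

set.seed(290)
X <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)   # simulated feature matrix
y <- 0.5 * X[, 1] + rnorm(200)                        # simulated survey score

fit   <- cv.glmnet(X, y, alpha = 0)                   # alpha = 0 gives the ridge penalty
preds <- predict(fit, newx = X, s = "lambda.min")
cor(as.numeric(preds), y)                             # in-sample check; use held-out data in practice
```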

In addition, we will practice further processing and visualizing language variables in R for secondary analyses, with training on how to pull these variables directly into R from the database and server environment.
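As an illustration, such a pull might look roughly as follows in R, assuming an open DBI connection `con` (as in the tunnel sketch above) and a feature table in DLATK's usual `feat$...` layout; the specific table, column, and outcome names are placeholders.

```r
# Minimal sketch: pulling a DLATK-style feature table into R and reshaping it
# for secondary analysis. Table/column/outcome names below are placeholders.
library(DBI)
library(dplyr)
library(tidyr)

feats <- dbGetQuery(
  con,
  "SELECT group_id, feat, group_norm FROM `feat$1gram$msgs$user_id`"
)

# Long format (one row per user x feature) -> wide format (one row per user)
wide <- feats %>%
  pivot_wider(id_cols = group_id, names_from = feat,
              values_from = group_norm, values_fill = 0)

# Join with survey outcomes (placeholder table and column)
outcomes <- dbGetQuery(con, "SELECT user_id, survey_score FROM outcomes")
merged   <- left_join(wide, outcomes, by = c("group_id" = "user_id"))
```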

Taken together, the course aims to provide training in a complete, state-of-the-art pipeline of computational text analysis, from text as input to final data visualization and secondary analysis in R.

It will not focus on the mathematical theory behind these analyses or expect students to code their own implementation of text analyses. Familiarity with Python is helpful but not required. Basic familiarity with R is expected.

What, concretely, will we do in this course?

The course will heavily rely on a Python codebase (see dlatk.wwbp.org) that serves as a (fairly) user-friendly Linux-based front end to a large variety of Python-based NLP and machine learning libraries (including NLTK and scikit-learn). The course will cover:

Who is this for?

What would we like you to learn?

The goal of the course is to empower students to carry out a variety of different text analysis methods independently, and to write them up for peer review. At the end of the course, the student should:


In weeks 1-8:

We will give class lectures on Zoom on Tuesdays and Thursdays (12:00-1:20pm). In addition, every week we have pre-recorded video tutorials for you to work through on your own time; in these, we ask you to follow along as we walk you through running analyses.

More or less every week there will be a homework set based on what’s shown in the tutorials. Please submit these homeworks; we will grade them. They will not be particularly hard.

In weeks 9 & 10:

There will still be lectures, and maybe tutorials.

We will split into teams of 3-4 students. We will either give you a data set, or work with you to obtain one around a particular interest or research question. You will go through the pipeline of methods you practiced in the course, and work together to write a final report in the form of a mock research paper: minimal introduction + methods + results with figures + discussion + supplement.

Homeworks: Assignments are given out on Thursday and are due the following Thursday before class.

Reading Types:

I know sometimes there are hard trade-offs you have to make with your time between courses and life. I’ve kept the reading list as short as I can for that reason.

So I’m using the following system to demarcate how critical a reading is:

A – This is essential reading, giving you the intellectual scaffold to understand the main points of the course. Without reading these, you may miss entire sets of concepts and insights.
B – This is very helpful reading; if you skip these, you may miss individual concepts or insights.
C – These readings build out your understanding. If you skip these, you may miss details.

Grading: