A7: Analyzing textual data

Assignment 7 (optional) due 7/18

Overview

The goal of this assignment is to practice the fundamentals of text analysis using the R tidyverse.

Accessing the A7 repository

  • Go to this link to accept and create your private A7 repository on GitHub. Your repository will be built in a few seconds and follows the naming convention a7-<USERNAME>.

  • Once your repository has been created, click on the link you see, which will take you to your repository.

  • Finally, clone the repository to your computer (or R workbench) following the process below.

Notice the repo you clone for this assignment is empty: add your data and code, and push them to your GitHub repo.

Cloning your a7 repository

After you have accessed the a7 repository (see above), follow the same steps you completed for a1 to clone the repository.

General workflow

Your general workflow will be as follows:

  • Accept the repo and clone it (see above)
  • Make changes locally to the files in RStudio
  • Save your changes
  • Stage-Commit-Push: stage and commit your changes to your local Git repo, then push them online to GitHub. You can complete these steps using the Git GUI integrated into RStudio. Do not modify your online GitHub repo directly (if you do, remember to pull first); instead, modify your local Git repo, then stage-commit-push your changes up to your online GitHub repo.

Assignment description

Overview & Context

This assignment gives you a chance to work with messy, real-world text data using tools from the tidyverse and tidytext. Your goal is to:

  • Choose a text source from a short list of vetted, functional sources
  • Clean and structure the data
  • Explore it visually or descriptively
  • Then apply either sentiment analysis or topic modeling
  • Reflect meaningfully on your process

The challenge here is not just technical but analytical as well: you need to make choices, defend them, and explain what your analysis shows (and what it doesn’t).


Learning Objectives

By completing this assignment, you will:

  • Practice importing and wrangling real-world text data
  • Use tidy workflows to clean, filter, and structure text
  • Visualize patterns using meaningful plots or tables
  • Perform either sentiment analysis or topic modeling
  • Reflect on your process and communicate your results clearly

Step 1: Choose a Text Source

Suggested Data Sources

Pick one of the following. You can use any full-text source, but these curated options are known to be functional:

  • Gutenberg Books (via gutenbergr R package): Full-text novels and historical documents in the public domain.
  • Jane Austen Novels (via janeaustenr R package): Texts of six novels by Jane Austen.
  • State of the Union Speeches (via sotu R package): US presidential addresses through 2016.
  • US Economic News Articles (Kaggle): News articles related to US economic policy and events.
  • 2020 US Presidential Campaign Speeches (Github): A collection of speeches from the 2020 campaign.
  • Agora Election Speeches (Github): Annotated Greek election speeches from 2012–2023.
  • UN General Debate Corpus (Harvard Dataverse): Speeches by heads of state at the UN General Assembly.

Ensure that the text source is rich enough for tokenization and further analysis. Avoid fragmented or overly short entries.

You may use another source if it is stable and accessible within the assignment timeframe, but avoid spending your time hunting for data.
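As a sketch of what getting started looks like, the janeaustenr and gutenbergr options can be loaded in a couple of lines (assuming those packages are installed; the Gutenberg ID in the comment is Frankenstein's):

```r
# Load Jane Austen's six novels as a tidy data frame:
# one row per line of text, with `text` and `book` columns.
library(janeaustenr)

books <- austen_books()
head(books)

# gutenbergr works similarly for other public-domain texts:
# library(gutenbergr)
# frankenstein <- gutenberg_download(84)  # 84 = Frankenstein
```

A line-per-row data frame like this is the natural input for the tidytext workflow used in the rest of the assignment.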


Step 2: Clean & Preprocess the Text

  • Tokenize the text
  • Convert to lowercase
  • Remove punctuation, symbols, stop words
  • Consider regex filters for custom cleaning
  • Decide on your unit of analysis: chapter, paragraph, speaker, etc.
  • Optionally use stemming or lemmatization
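The cleaning steps above can be sketched with tidytext (a minimal example, assuming the dplyr, stringr, and tidytext packages and using the Jane Austen novels as input):

```r
library(dplyr)
library(stringr)
library(tidytext)

tidy_books <- janeaustenr::austen_books() %>%
  unnest_tokens(word, text) %>%           # tokenize; lowercases and strips punctuation by default
  anti_join(stop_words, by = "word") %>%  # remove common English stop words
  filter(!str_detect(word, "^[0-9]+$"))   # custom regex filter: drop bare numbers
```

Note that unnest_tokens() handles lowercasing and punctuation removal in one step; stemming or lemmatization would require an additional package such as SnowballC or textstem.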

Step 3: Exploratory Analysis

Produce at least two visualizations or one plot + one summary table. These might include:

  • Word frequency bar plots
  • tf-idf comparison across documents
  • Most common bigrams or trigrams

Include short written commentary explaining what your analysis shows.
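For instance, a word-frequency bar plot might be sketched like this (assuming dplyr, ggplot2, and tidytext, again using the Jane Austen novels):

```r
library(dplyr)
library(ggplot2)
library(tidytext)

word_counts <- janeaustenr::austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Bar plot of the 15 most frequent words
word_counts %>%
  slice_max(n, n = 15) %>%
  ggplot(aes(n, reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = NULL, title = "Most frequent words")
```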


Step 4: Choose One: Sentiment Analysis or Topic Modeling

Option A: Sentiment Analysis

  • Analyze sentiment across time, sections, or characters
  • Include at least one sentiment-based plot
  • Interpret what your result shows and what it leaves out
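One way to sketch a sentiment trajectory uses tidytext's bundled Bing lexicon and tidyr for reshaping (the 80-line chunk size below is an arbitrary illustrative choice):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)

# Net sentiment (positive minus negative words) per 80-line chunk of each novel
sentiment_by_chunk <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(index = row_number() %/% 80) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, index, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

ggplot(sentiment_by_chunk, aes(index, net)) +
  geom_col() +
  facet_wrap(~ book, scales = "free_x") +
  labs(x = "80-line chunk", y = "Net sentiment")
```

A caveat worth raising in your interpretation: lexicon-based scoring ignores negation, sarcasm, and context, which is part of "what it leaves out."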

Option B: Topic Modeling

  • Convert tidy data into a Document-Term Matrix
  • Briefly justify your choice of number of topics
  • Plot topic-word or document-topic output
  • Interpret topics meaningfully (what they represent, how reliable they are)
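These steps can be sketched with the topicmodels package (k = 4 below is purely illustrative; justify your own choice of k in the write-up):

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Word counts per document, cast into a Document-Term Matrix
dtm <- janeaustenr::austen_books() %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(book, word) %>%
  cast_dtm(book, word, n)

# Fit an LDA model; setting a seed makes the result reproducible
lda <- LDA(dtm, k = 4, control = list(seed = 123))

# beta = per-topic word probabilities; inspect the top words per topic
top_terms <- tidy(lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()
```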

Step 5: Reflect on the Process

In your write-up, answer:

  • What was hard? What worked well?
  • What problems did you face? How did you fix them?
  • What might you do differently next time?
  • Any insights about the data itself?

What to submit

Your GitHub repo should include everything you used to produce your analyses, such as R scripts and/or R Markdown documents, the original textual data (unless the files are too large to upload to GitHub – see A6 for details), etc. Make sure to stage-commit-push your original .Rmd file and its knitted .md output.

data/
  - textsource.csv (or other format)
analysis.Rmd
README.md (knitted from analysis.Rmd)
analysis.html (knitted from analysis.Rmd)
ai_log.html

In README.md you must:

  • explain the purpose of the repository
  • include an explanation of what your code does and how to use it, and list all libraries required to reproduce your analyses
  • include a description of the textual data
  • include all code, graphs, etc.
  • provide any other relevant information that the user needs to know in order to use your repo and replicate your results
  • quote all resources you consulted to complete the assignment
  • provide 1-2 paragraphs of reflections on what was hard/easy about this homework, what was enjoyable, problems you solved and how you solved them, helpful resources, etc. + list any collaborators and their role

To submit the assignment, push to your repository the last version of your assignment before the deadline. Then copy your repository URL (e.g., https://github.com/cfss-hmwks-s25/a7-jmclip) and submit it to Canvas under A7 before the deadline.

Rubric

Needs improvement: Not all elements listed in the instructions are addressed. Code does not run and/or has bugs. Code is short/elementary and poorly documented. No clear effort is made to pre-process the text for analysis, or no justification is provided for keeping content such as numbers or stop words. Results are poorly interpreted or misinterpreted. Visualizations do not go beyond the basics. Little attention is paid to reproducibility, and the code’s style is inconsistent.

Satisfactory: Solid effort. Hits all the elements. Finished all components of the assignment with only minor deficiencies. Easy to follow (both the code and the output).

Excellent: Displays in-depth understanding of course materials, including data analysis and coding skills. Code to process and analyze the data is complex/refined. Visualizations are excellent. Explanation of the chosen technique is accurate with an assessment of the appropriate caveats for what the technique can and cannot do. Interpretation of the results is clear and in-depth and shows engagement with the content of the textual data. Code is reproducible. Uses a sentiment analysis or topic model example not directly covered in class or considerably expands on the provided examples.

Grading Criteria

  • Completeness: All required steps completed (cleaning, EDA, analysis, reflection)
  • Code Quality: Clear, reproducible, well-commented code
  • Insight & Analysis: Good use of methods, thoughtful interpretation of results
  • Communication: Clear visualizations and logical structure in the write-up
  • Reflection: Honest discussion of challenges, what worked/didn’t, what was learned

Missing pieces or unclear documentation will reduce your score. We’re looking for thoughtfulness, not just code.