Overview

Due by 11:59 pm on July 16th.

The goal of this assignment is to practice the fundamentals of text analysis in R using the tidyverse.

Accessing the hw08 repository

Note that the repo you clone for this assignment is empty: add your data and code, then push them to your GitHub repo.

Cloning your hw08 repository

After you have accessed the hw08 repository (see above), follow the same steps you completed for hw1 to clone the repository.

General workflow

Your general workflow will be as follows:

Assignment description

Goal: practice the fundamentals of text analysis in R (import and pre-process data, perform exploratory analyses, and perform sentiment analysis OR topic modeling).

Instructions:

Examples to follow:

* The book Text Mining with R, especially the assigned chapters. Among the case studies, I recommend Chapter 9, but the other case studies also provide excellent insights.
* In-class materials

How much do you need to do?

Your main tasks are: import and pre-process the data, analyze them for general exploratory analysis, and then apply sentiment analysis OR topic modeling (do not do both, just pick one).
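The first two tasks, plus the sentiment analysis option, can be sketched roughly as follows. This is a minimal illustration, not a template you must follow: the `reviews` corpus, its `doc_id` labels, and the choice of the Bing lexicon are assumptions made for the example, and your own corpus will likely need different pre-processing decisions (which you should justify).

```r
library(dplyr)
library(tidytext)

# Toy corpus standing in for your own data (hypothetical documents)
reviews <- tibble::tibble(
  doc_id = c("r1", "r2", "r3"),
  text = c(
    "The acting was wonderful and the soundtrack was excellent.",
    "A terrible script with awful pacing and 2 boring scenes.",
    "Delightful visuals, but the ending was disappointing."
  )
)

# Pre-process: one lowercase token per row, punctuation stripped
tidy_words <- reviews %>%
  unnest_tokens(word, text) %>%
  filter(!grepl("^[0-9]+$", word)) %>%   # drop purely numeric tokens
  anti_join(stop_words, by = "word")     # drop common stop words

# Explore: most frequent remaining words across the corpus
word_counts <- tidy_words %>%
  count(word, sort = TRUE)

# Sentiment: positive and negative word counts per document,
# using the Bing lexicon (an assumption; other lexicons exist)
sentiment_by_doc <- tidy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(doc_id) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    net = positive - negative
  )

print(word_counts)
print(sentiment_by_doc)
```

Dropping numbers and stop words is a choice rather than a requirement. If your corpus gives you a reason to keep them, say so in your write-up.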

I expect you to use the class materials and the book Text Mining with R as templates to perform this type of analysis (do not reinvent the wheel). You can apply the templates to a novel corpus. You are also welcome to use one of the provided examples as your data source, as long as you expand on the provided code (e.g., if the readings perform sentiment analysis on a specific textual corpus, you can use the same corpus to perform topic modeling instead).
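If you pick topic modeling, the tidy workflow usually runs from token counts to a document-term matrix to a fitted model. The sketch below assumes the `topicmodels` package is installed and uses a hypothetical four-document corpus (`docs`) with an arbitrary choice of `k = 2` topics; on real data you would need to choose and justify the number of topics.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy corpus (hypothetical): two documents about space, two about cooking
docs <- tibble::tibble(
  doc_id = c("d1", "d2", "d3", "d4"),
  text = c(
    "The rocket launched and the astronauts reached orbit around the planet.",
    "Astronauts repaired the satellite while the rocket stayed in orbit.",
    "Simmer the garlic and onions, then add tomatoes to the sauce.",
    "The sauce needs garlic, basil, and ripe tomatoes."
  )
)

# Count words per document after removing stop words
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word)

# Cast the tidy counts into a document-term matrix
dtm <- cast_dtm(word_counts, doc_id, word, n)

# Fit a two-topic LDA model; the seed makes the fit reproducible
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities (beta): top three terms per topic
top_terms <- tidy(lda_fit, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

# Per-document topic proportions (gamma)
doc_topics <- tidy(lda_fit, matrix = "gamma")

print(top_terms)
print(doc_topics)
```

With a corpus this small the topics are not meaningful. The point is only the shape of the pipeline, which you can adapt to your own data.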

In all cases, make sure to cite your sources (assigned readings and any additional online tutorials or resources you rely on).

Suggested data sources

You can use any source of textual data. If you are not sure, here are some suggested texts you could use:

Submit the assignment

Your GitHub repo should include everything you have used to produce your analyses, such as R scripts and/or R Markdown documents, original textual data (unless they are too large to be uploaded on GitHub – see HW6 for details), etc. Make sure to stage-commit-push your original .Rmd file and its rendered .md file.

In your README.md:

* explain the purpose of the repository
* include an explanation of what your code does and how to use it, and list all libraries required to reproduce your analyses
* include a description of the textual data
* provide any other relevant information that the user needs to know in order to use your repo and replicate your results
* cite all resources you consulted to complete the assignment
* provide 1-2 paragraphs of reflections on what was hard/easy about this homework, what was enjoyable, problems you solved and how you solved them, helpful resources, etc.
* list any collaborators and their role

To submit the assignment, push the final version of your assignment to your repository before the deadline. Then copy your repository URL (e.g., https://github.com/cfss-hmwks-s23/hw08-jmclip) and submit it to Canvas under HW08 before the deadline.

Rubric

Needs improvement: Not all elements listed in the instructions are addressed. Code does not run and/or has bugs. Code is short/elementary and poorly documented. No clear effort is made to pre-process the text for analysis, or no justification is provided for keeping content such as numbers, stopwords, etc. Results are poorly interpreted or misinterpreted. Visualizations do not go beyond the basics. There is little attention to reproducibility and little consistency in the code’s style.

Satisfactory: Solid effort. Hits all the elements. Finished all components of the assignment with only minor deficiencies. Easy to follow (both the code and the output).

Excellent: Displays in-depth understanding of course materials, including data analysis and coding skills. Code to process and analyze the data is complex/refined. Visualizations are excellent. Explanation of the chosen technique is accurate with an assessment of the appropriate caveats for what the technique can and cannot do. Interpretation of the results is clear and in-depth and shows engagement with the content of the textual data. Code is reproducible. Uses a sentiment analysis or topic model example not directly covered in class or considerably expands on the provided examples.