For this assignment, you will be analyzing texts based on their patterns of frequently used words and word clusters. You will gain experience with two powerful text analysis software tools popularly used in digital humanities text analyses:
You have a choice of texts and text corpora (collections of texts) to work with and observe what you can of their word distributions using the corpus analysis tools we have been practicing with. During this unit, I will help guide your selection of texts to help make a meaningful hypothesis about how, for example, one text compares with a larger group of related texts, or one group of texts compares with another.
When you have found some meaningful and interesting patterns that seem worth comparing between the texts, you will write about your findings in a post on one of the websites you have created for this class. (You may use either site for this assignment: GitHub Pages or WordPress.) Write up a page that presents your comparison, provides images (screen captures), and links to the source texts you used and the data you gathered. Use the images to illustrate your essay, in which you point out interesting patterns that compare or contrast these documents in your distant reading of them through the corpus analysis tools.
For this assignment, you will create a comparison set of texts to analyze for their words and ngram patterns. You have many options! My recommendation is to attempt a comparison of one long text or a group of related texts with a larger comparison set. You may also compare texts by particular writers against each other.
I am making texts available on our class’s shared GitHub repository, the introDH-Hub. Be sure you have cloned the repo, and use git pull to pull in files and file collections as I update them for this assignment. Text collections I have prepared are in the textFiles/ folder.
Working with the texts and collections in our introDH-Hub, here are some ideas for comparison sets to explore:
If you have an idea for a selection of texts to compare that is not in the collections in our introDH-Hub, ask me (Dr. B) whether you can use them for this assignment. You may need to clean some of the texts that you pull from internet sources to remove their headers, long sections of footnotes, and anything else that is not part of the main text you want to analyze. As long as you can save the documents as electronic files in plain text and cut out any unnecessary material (like footnotes, headings, styling, etc.), you can work with them using our corpus analysis tools.
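The cleaning step described above can be sketched in Python. This is only a rough illustration, not a required part of the assignment. The marker patterns and the function names are hypothetical: they assume your downloaded text wraps its main body in lines like "*** START OF ... ***" and "*** END OF ... ***", as some online archives do. Check your own files and adjust the patterns to match what you actually see.

```python
import re
from pathlib import Path

# Hypothetical marker patterns (an assumption): edit these to match the
# header and footer lines that actually appear in your downloaded texts.
START_MARKER = re.compile(r"\*\*\* ?START OF .*?\*\*\*", re.IGNORECASE)
END_MARKER = re.compile(r"\*\*\* ?END OF .*?\*\*\*", re.IGNORECASE)


def strip_front_and_back_matter(raw):
    """Keep only the text between the start and end markers.

    If a marker is missing, nothing is trimmed on that side.
    """
    start = START_MARKER.search(raw)
    body_start = start.end() if start else 0
    end = END_MARKER.search(raw, body_start)
    body_end = end.start() if end else len(raw)
    return raw[body_start:body_end].strip()


def clean_file(source, destination):
    """Read a downloaded text, trim it, and save it as UTF-8 plain text."""
    raw = Path(source).read_text(encoding="utf-8", errors="replace")
    cleaned = strip_front_and_back_matter(raw)
    Path(destination).write_text(cleaned + "\n", encoding="utf-8")
```

For example, clean_file("download.txt", "novel_clean.txt") would write a trimmed copy next to the original (both filenames are placeholders). Always open the cleaned file and skim it, because a script like this cannot tell whether footnotes or editorial notes remain inside the main text.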
Save each file with .txt at the end of its filename so that the corpus analysis software can read the file. If a file is in Rich Text Format (.rtf), look for an option to convert it to plain text so you can save it with a .txt file extension.
Explore what you can see in the corpus analysis tools. For Voyant Tools, try consulting the Voyant Tools Help Guide and read more about views like TermsBerry, or see how to edit Voyant’s default Stop Word list (the list of words it routinely excludes when screening documents). If you are not sure you are seeing anything worth comparing in the documents you selected, try changing your approach. Try changing the ngram minimum and maximum values. A minimum of 2 may not show the most interesting patterns, so try starting at 3. Try looking at how a particular term collocates or clusters with others. You can always choose a different set of documents from the collection and continue experimenting.
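To get a feel for what the ngram size and the stop word list are doing, here is a minimal Python sketch of ngram counting. It is not how Voyant Tools or any other corpus tool is actually implemented. The function names, the tiny stop word list, and the simple tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption); real tools use much
# longer lists that you can usually edit.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in"}


def tokenize(text):
    """Lowercase the text and pull out runs of ASCII letters and apostrophes.

    This is a deliberate simplification: accented letters and digits are dropped.
    """
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(text, n=3, stop_words=()):
    """Count every run of n consecutive tokens in the text.

    Stop words are removed before the ngrams are formed, so words on either
    side of a removed word become neighbors. That is a simplifying choice
    made for this sketch.
    """
    tokens = [t for t in tokenize(text) if t not in stop_words]
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "It was the best of times, it was the worst of times."
trigrams = count_ngrams(sample, n=3)
bigrams_no_stops = count_ngrams(sample, n=2, stop_words=STOP_WORDS)
```

Here trigrams.most_common(5) lists the most frequent three-word sequences in the sample sentence. Comparing it with bigrams_no_stops shows how much the results shift when you change the ngram size or exclude common words.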
As you work on the corpus analysis, take notes on things that surprise or interest you. Can you see a strong pattern that makes one text or group of documents obviously different from another? Is it a pattern you would have guessed when you started, or something surprising?
Spend some time curating, reviewing, and thinking about your data. Curating your data involves taking screen captures and saving the image files with clear filenames so you know what you are looking at. (Remember, no spaces in filenames!) You will need to include some of your image files to help show your findings in the web post you are writing for this assignment.
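If you end up with many screen captures whose names contain spaces, a short script can rename them for you. The following is a minimal sketch. The function names and the choice of an underscore as the replacement character are assumptions, not a class requirement. Try it on a copy of your image folder first.

```python
import re
from pathlib import Path


def safe_filename(name):
    """Replace each run of whitespace in a filename with one underscore."""
    return re.sub(r"\s+", "_", name.strip())


def rename_without_spaces(folder):
    """Rename files in a folder so that none of their names contain spaces.

    Returns the list of new paths. A file is skipped rather than renamed if
    the new name is already taken, so nothing is overwritten.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        new_name = safe_filename(path.name)
        if not path.is_file() or new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

For example, safe_filename("Voyant trends Austen.png") gives "Voyant_trends_Austen.png", which is safe to reference from an HTML page.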
Write up a reflection post including images and screen captures from your analysis. Your post should present one pair of texts, or one trio of texts that you studied with this assignment. Present your findings: how do these texts compare and/or contrast with each other in what you could see of the distinct words and phrases they most frequently use?
Prepare your post as an HTML webpage on one of the websites you developed in the previous assignment (your choice: GitHub Pages, your WordPress site, or your personal PSU site). As you write, work with your screenshot images to help show your findings. Each image you include needs to be followed by some explanatory text describing what is significant in the image. For GitHub Pages, I recommend using the figure element as a container around the image and figcaption elements to help label your images.
If you prepared some texts for analysis that were not in my collection on introDH-Hub, please share them from the HTML page that you are developing for this assignment. You can link to them in your docs/ directory. If you are working with a big collection that is not in the introDH-Hub, you can just point us to a reliable link to your source for the files.
To complete the assignment, post a link to the page on your website hosting your Text Corpus Analysis on Canvas at the appropriate assignment link.