NLP/L2

Accessing Text Corpora

For this part, follow Chapter 2 of the textbook Natural Language Processing with Python.

Work through the following sections, reading the text and running the commands given in the tutorial:

  • 1.1 Gutenberg Corpus
  • 1.2 Web and Chat Text
  • 1.3 Brown Corpus
  • 1.4 Reuters Corpus
  • 1.5 Inaugural Address Corpus
  • 1.9 Loading your own Corpus
  • 4.1 Wordlist Corpora
  • 4.2 A Pronouncing Dictionary
  • 4.3 Comparative Wordlists
  • 4.4 Shoebox and Toolbox Lexicons
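
As a rough orientation before starting, the sketch below shows the kind of corpus reader calls these sections rely on (fileids(), raw(), words(), sents()), applied to the Gutenberg corpus. The helper name text_stats and the demo function are illustrative assumptions, not part of the tutorial; the sketch also assumes NLTK is installed and that the gutenberg corpus and the sentence tokenizer data have been fetched with nltk.download().

```python
def text_stats(raw, words, sents):
    """Return (avg word length, avg sentence length, lexical diversity).

    raw   -- the text as a single string
    words -- list of word tokens
    sents -- list of sentences, each a list of tokens

    text_stats is a hypothetical helper, not an NLTK function.
    """
    vocab = set(w.lower() for w in words)
    return (len(raw) / len(words),    # characters per token (whitespace included)
            len(words) / len(sents),  # tokens per sentence
            len(words) / len(vocab))  # tokens per distinct lowercased word


def demo():
    # Assumption: NLTK and the required corpus data are available locally.
    from nltk.corpus import gutenberg

    for fileid in gutenberg.fileids():
        stats = text_stats(gutenberg.raw(fileid),
                           gutenberg.words(fileid),
                           gutenberg.sents(fileid))
        print(fileid, [round(x) for x in stats])


if __name__ == "__main__":
    try:
        demo()
    except (ImportError, LookupError) as err:
        # LookupError is what NLTK raises when a data package is missing.
        print("NLTK or its corpus data is not available:", err)
```

The other corpora in the list (for example brown, reuters, inaugural) expose the same reader methods, so the loop in demo should carry over by swapping the imported corpus.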

Exercises

  1. List the basic corpus functionality defined in NLTK and give a usage example for each function.
  2. Use the corpus module to explore austen-persuasion.txt. How many word tokens does this book have? How many word types?
  3. Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?
  4. Pick a pair of texts and study their differences in terms of vocabulary, vocabulary richness, genre, etc. Can you find pairs of words with quite different meanings across the two texts, such as monstrous in Moby Dick and Sense and Sensibility?
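
A possible starting point for exercises 2 and 3 is sketched below. The helpers tokens_and_types and count_targets are hypothetical names introduced here for illustration; the sketch assumes NLTK is installed and that the gutenberg and state_union corpora have been downloaded. Whether types should be case-folded or should exclude punctuation is a choice left to the reader; here tokens are counted exactly as the corpus reader returns them.

```python
from collections import Counter


def tokens_and_types(words):
    """Return (number of word tokens, number of distinct word types)."""
    words = list(words)
    return len(words), len(set(words))


def count_targets(words, targets):
    """Count case-insensitive occurrences of each target word."""
    counts = Counter(w.lower() for w in words)
    # A Counter returns 0 for words that never occur.
    return {t: counts[t] for t in targets}


def demo():
    # Assumption: NLTK and both corpora are available locally.
    from nltk.corpus import gutenberg, state_union

    print(tokens_and_types(gutenberg.words("austen-persuasion.txt")))
    for fileid in state_union.fileids():
        words = state_union.words(fileid)
        print(fileid, count_targets(words, ["men", "women", "people"]))


if __name__ == "__main__":
    try:
        demo()
    except (ImportError, LookupError) as err:
        # LookupError is what NLTK raises when a data package is missing.
        print("NLTK or its corpus data is not available:", err)
```

The file names in the state_union corpus begin with the year of the address, so printing the counts in fileids() order gives a rough view of how usage changes over time.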