Welcome to the second installment of Using Academic Vocabulary, which focuses on corpus analysis. Corpus analysis draws on large bodies of language data, each of which is called a corpus (plural: corpora). Working with a large corpus lets a researcher take a careful look at patterns of lexical usage that would otherwise not be apparent to language users.
You might well wonder what types of questions corpus analysis can address. The answer is, of course, many types, including the following examples:
- ✓ Prepositions: Work for a company? At a company? In a company?
- ✓ Collocations: Ground floor or first floor?
- ✓ Lexical form: One way street? One-way street? Oneway street?
- ✓ Register: In academic writing, should I avoid "I" and "we"?
Class Material
Immediately below you'll find worksheets, both for our class activities and for practice in your spare time. The image to the right, you ask? That is, of course, a wordsmith.
Corpus @ Brigham Young University
Here you'll find a variety of corpora covering different languages, genres, and areas. As you'll see on the webpage, Mark Davies is the eminent gentleman behind this massive undertaking.
When using the various BYU corpora, you might find the following hints helpful.
- ✓ Asterisk (star) = wild card (any number of letters)
- ✓ Square brackets = lemmatization (various forms of a word)
- ✓ Equals sign = synonyms
- ✓ Question mark = wild card (just one letter or symbol)
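To get a feel for how the two wild cards in the hints above behave, here is a minimal sketch in Python. It is not the BYU search engine; it only imitates the star and the question mark using Python's standard `fnmatch` module, and both the word list and the `match_pattern` helper are made up for illustration.

```python
from fnmatch import fnmatchcase

def match_pattern(pattern, words):
    """Return the words that match a wild-card pattern.

    '*' stands for any number of characters, '?' for exactly one.
    (Hypothetical helper, only imitating the corpus interface.)
    """
    return [w for w in words if fnmatchcase(w, pattern)]

# A tiny invented word list standing in for a corpus vocabulary.
words = ["woman", "women", "womankind", "unlikely", "unfriendly", "unlike"]

print(match_pattern("wom?n", words))  # question mark: exactly one letter
print(match_pattern("un*ly", words))  # star: any number of letters
```

Note that square brackets mean something different in `fnmatch` (a set of characters) than in the corpus interface (lemmatization), so this sketch covers only the two wild cards.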
As you might have suspected, YouTube again has several very informative tutorials.
The Compleat Lexical Tutor
Another option is Tom Cobb's website called The Compleat Lexical Tutor.
The Michigan Corpus of Academic Spoken English (MICASE)
A further very useful option is MICASE, from the University of Michigan.
AntConc
Developed by Laurence Anthony just down the street from us at Waseda, this is a very useful set of tools. Here is his webpage, on which you'll find lots of information in addition to the various types of software (including AntConc) that he has developed. For our purposes in this course, here is the AntConc webpage.
Lest this all seem beyond comprehension, Dr. Anthony has provided a series of tutorials available on YouTube.
AntConc 3.4.0 Tutorial 1: Getting Started
AntConc 3.4.0 Tutorial 2: Concordance Tool - Basic Features
I will readily admit that the Keyword List tool was a mystery the first time that I tried it. Thankfully, this explains it nicely.
I'll leave it up to you to search for more helpful tutorials.
Building a Corpus
At some point you might find yourself in need of a specialized corpus. Of course, you could simply compile a very long text file all by yourself, which is not especially difficult. Should you need to convert a file from, say, PDF to plain text, one option is a free conversion website called Zamzar. A second option is to simply use the 'save as' routine.
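If you already have your texts saved as separate plain-text files, compiling them into one long file can be automated. The following is only a sketch under a few assumptions: the files end in `.txt`, they are encoded as UTF-8, and they sit together in one folder. The function name `build_corpus` and the folder and file names in the usage note are invented for illustration.

```python
from pathlib import Path

def build_corpus(source_dir, output_file):
    """Join every .txt file in source_dir into one corpus file.

    Returns the number of files that were combined.
    (Hypothetical helper; assumes UTF-8 plain-text files.)
    """
    source = Path(source_dir)
    output = Path(output_file)
    texts = []
    for path in sorted(source.glob("*.txt")):
        # Skip the output file itself if it lives in the same folder.
        if path.resolve() == output.resolve():
            continue
        texts.append(path.read_text(encoding="utf-8").strip())
    # Separate the individual texts with a blank line.
    output.write_text("\n\n".join(texts) + "\n", encoding="utf-8")
    return len(texts)
```

Calling `build_corpus("my_texts", "my_corpus.txt")` would then produce a single file that can be loaded into a tool such as AntConc.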
Statistics and Such Cruel Things
A quick and dirty statistics note here: you will encounter the term log-likelihood when assessing whether the difference in frequency between your target text and a given corpus is statistically significant. As usual, the two significance levels of note in our field are p < .05 and p < .01. When pondering log-likelihood results, values in excess of 6.63 indicate statistically significant results at p < .01 and those in excess of 3.84 hit the p < .05 mark.
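For the curious, here is a rough sketch of how a log-likelihood value can be calculated by hand. It assumes the two-term formula commonly used in corpus linguistics for comparing one word's frequency across two corpora; the software you use may calculate it somewhat differently. The function names and the frequencies in the example are invented for illustration. The cut-off values are the ones given above.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood for one word's frequency in two corpora.

    freq_* = how often the word occurs; size_* = total words in each corpus.
    (Hypothetical helper using the two-term formula.)
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero frequency contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll

def significance(ll):
    """Label a log-likelihood value using the two cut-offs above."""
    if ll > 6.63:
        return "p < .01"
    if ll > 3.84:
        return "p < .05"
    return "not significant"

# Invented example: 100 hits in a 10,000-word target text
# versus 50 hits in a 10,000-word reference corpus.
ll = log_likelihood(100, 10_000, 50, 10_000)
print(round(ll, 2), significance(ll))  # prints: 16.99 p < .01
```

When the word is equally frequent in both corpora, the observed and expected values coincide and the result is zero, which is a handy sanity check.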