Important Note for ICE Corpora Users: Extra-corpus Text



All ICE corpora contain sections of text that we refer to as "extra-corpus" text. These sections are enclosed between an opening symbol <X> and a closing symbol </X>. For example, a quotation from Shakespeare in any ICE corpus is marked in this way, since it is not contemporary English. Spoken texts contain many instances of "extra-corpus" speakers; e.g., speech by an American in the British corpus is enclosed in <X> and </X>.

So, for example, if you are studying British English in ICE-GB, you would want to exclude speech by Americans, as well as passages from Shakespeare. If you don't exclude this text before you search the corpus, your frequency counts (and overall token counts) will in some cases be highly inflated. This applies to every ICE corpus.
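
To see how much extra-corpus text can inflate a count, here is a minimal Python sketch. It is only an illustration, not part of any ICE tool: the sample sentence is invented, the function names are hypothetical, and it assumes that <X> ... </X> spans do not nest. The token count is a crude whitespace split, so the markers themselves are counted as tokens in the raw figure.

```python
import re

# Assumption: <X> ... </X> spans do not nest; they may run across line breaks.
EXTRA_CORPUS = re.compile(r"<X>.*?</X>", re.DOTALL)


def strip_extra_corpus(text: str) -> str:
    """Remove every <X>...</X> span, including the markers themselves."""
    return EXTRA_CORPUS.sub(" ", text)


def count_tokens(text: str) -> int:
    """Crude whitespace token count, for illustration only."""
    return len(text.split())


# Invented sample line, not taken from any ICE corpus.
sample = "I went to the shops <X> I went to the mall </X> yesterday"

raw_count = count_tokens(sample)                        # includes extra-corpus text
clean_count = count_tokens(strip_extra_corpus(sample))  # extra-corpus text removed

print(raw_count, clean_count)
```

In this invented sample, more than half of the raw tokens come from the extra-corpus span, which is the kind of inflation described above.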

If you are using Wordsmith, you can exclude extra-corpus text as follows:

Go to Settings > Adjust settings > Tags

In the dialog box that appears, click on ONLY PART OF FILE (at the bottom)

In the first box, type <X>, and in the one beside it, type </X>. This excludes everything between these two symbols in the corpus. The dialog box should look like this:


While you're at it, it's also a good idea to exclude editorial comments <&>...</&> and untranscribed text <O>...</O> (this is the upper-case letter O, not the digit zero). Refer to the ICE manual for further details of these. The dialog box should then look like this:


It's important to check these settings every time you use Wordsmith to search any of the ICE corpora.
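
If you process the corpus files with your own scripts rather than Wordsmith, the same three exclusions can be approximated in Python. This is a hedged sketch only: the function names and the sample line are made up for illustration, and it assumes that spans with the same marker do not nest. Check the ICE manual for the exact markup before relying on it.

```python
import re

# Marker names taken from the text above: extra-corpus text (X),
# editorial comments (&) and untranscribed text (upper-case O).
EXCLUDED_TAGS = ("X", "&", "O")


def build_exclusion_pattern(tags):
    """Build one regex matching <TAG> ... </TAG> for each listed tag."""
    parts = [rf"<{re.escape(tag)}>.*?</{re.escape(tag)}>" for tag in tags]
    # DOTALL lets a span run across line breaks; matching is case-sensitive,
    # so <O> (letter O) is never confused with lower-case <o>.
    return re.compile("|".join(parts), re.DOTALL)


def exclude_spans(text, tags=EXCLUDED_TAGS):
    """Return text with every excluded span removed, markers included."""
    return build_exclusion_pattern(tags).sub(" ", text)


# Invented sample line, not taken from any ICE corpus.
sample = "well <O> laughter </O> it is <&> unclear </&> fine <X> howdy </X> thanks"

# Collapse the leftover whitespace so the result is easy to read.
cleaned = " ".join(exclude_spans(sample).split())
print(cleaned)
```

As with the Wordsmith settings, the exclusion has to be applied every time, before any frequency or token counts are taken.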




