WordList

One of the types of objects in Praat. An object of class WordList contains a sorted list of strings in a system-independent format. WordList objects can be used for spelling checking after conversion to a SpellingChecker object.

1. How to create a WordList object

You will normally create a WordList object by reading a binary WordList file. You'll use the generic Read from file... command from the Open menu.

See below under 3 for how to create such a file.

2. What you can do with a WordList object

The main functionality of a WordList is its ability to tell you whether it contains a certain string. If you select a WordList, you can query the existence of a specific word by using the Has word command. You supply the word and press OK. If the WordList does contain the word, the value "1" will be written to the Info window; otherwise, the value "0" will be written.

3. How to create a binary WordList file

You can create a binary (compressed) WordList file from a simple text file that contains a long list of words. Perhaps such a text file has been supplied by a lexicographic institution in your country; because of copyright issues, such word lists cannot be distributed with the Praat program. To convert the simple text file into a compressed WordList file, you basically take the following steps:

Read Strings from raw text file: "lexicon.iso"
Genericize
Sort
To WordList
Save as binary file: "lexicon.WordList"

I'll explain these steps in detail. For instance, a simple text file may contain the following list of words:

cook
cooked
cookie
cookies
cooking
cooks
Copenhagen
Købnhavn
München
Munich
ångström

These are just 11 words, but the procedure will work fine if you have a million of them, and enough memory in your computer.

You can read the file into a Strings object with Read Strings from raw text file... from the Open menu in the Objects window. The resulting Strings object contains 11 strings in the above order, as you can verify by viewing them with Inspect.

In general, the Strings object will occupy a lot of memory, and be slow to read in. For instance, a certain list of more than 300,000 Dutch word forms occupies 3.6 MB on disk, and will occupy at least 7 MB of memory after it is read in. The extra 3.4 MB arise because the Strings object contains a pointer to each of the strings, and each of the strings is in a separately allocated part of the memory heap. Moreover, it takes 8 seconds on an average 1999 computer to read this object into memory. For these reasons, we will use the WordList object if we need a sorted list for spelling checking.

If you select the Strings, you can click the To WordList button. However, you will get the following complaint:

String "Købnhavn" not generic. Please genericize first.

This complaint means that the strings are still in your computer's native text format, which is ISO-Latin1 for Unix and Windows computers, or Mac encoding for Macintosh computers.

So you press the Genericize button. You can see that the Strings object changes to

cook
cooked
cookie
cookies
cooking
cooks
Copenhagen
K\o/bnhavn
M\u"nchen
Munich
\aongstr\o"m

The strings are now in the generic system-independent format that is used everywhere in Praat to draw strings (see Special symbols).
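
As a rough illustration of what genericizing does to the example list, here is a minimal Python sketch. It is not Praat's own code: the table `GENERIC_CODES` and the functions `genericize` and `is_generic` are hypothetical names, the table covers only the four special symbols that occur in the example, and "generic" is approximated as "ASCII only", which is an assumption.

```python
import unicodedata

# Partial, illustrative table: only the four symbols used in the example list.
# The backslash sequences are the ones shown in the genericized list above.
GENERIC_CODES = {
    "ø": "\\o/",   # o with slash
    "ü": '\\u"',   # u with dieresis
    "å": "\\ao",   # a with ring
    "ö": '\\o"',   # o with dieresis
}

def is_generic(text):
    # Assumption: a string counts as generic here if it is pure ASCII.
    return all(ord(ch) < 128 for ch in text)

def genericize(text):
    # Compose accented letters first, so that each one is a single character.
    composed = unicodedata.normalize("NFC", text)
    return "".join(GENERIC_CODES.get(ch, ch) for ch in composed)

for word in ["Købnhavn", "München", "ångström", "Munich"]:
    print(genericize(word))
```

Words that contain no special symbols, such as Munich, pass through unchanged.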

You can again try to click the To WordList button. However, you will get a complaint again:

String "Copenhagen" not sorted. Please sort first.

This complaint means that the strings have not been sorted in ASCII sorting order. So you click Sort, and the Strings object becomes:

Copenhagen
K\o/bnhavn
M\u"nchen
Munich
\aongstr\o"m
cook
cooked
cookie
cookies
cooking
cooks

The strings are now in ASCII sorting order, in which capitals come before lower-case letters, and backslashes come in between these two series.
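
The ordering can be reproduced outside Praat. The following Python sketch is only an illustration (the helper `first_unsorted` is a hypothetical name): Python compares strings character code by character code, which for these ASCII-only strings is the same as ASCII order.

```python
# The genericized example list, still in its original (unsorted) order.
words = [
    "cook", "cooked", "cookie", "cookies", "cooking", "cooks",
    "Copenhagen", "K\\o/bnhavn", 'M\\u"nchen', "Munich", '\\aongstr\\o"m',
]

def first_unsorted(strings):
    # Return the first string that is smaller than its predecessor, or None.
    for previous, current in zip(strings, strings[1:]):
        if current < previous:
            return current
    return None

print(first_unsorted(words))   # the string that breaks the order

# sorted() orders by character code: 'A'-'Z' (65-90), '\' (92), 'a'-'z' (97-122).
ascii_sorted = sorted(words)
for word in ascii_sorted:
    print(word)
```

The first string reported as out of order is Copenhagen, the same one that the complaint above mentions, and the sorted result matches the list shown after Sort.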

Clicking To WordList now succeeds, and a WordList object appears in the list. If you save it to a text file (with the Save menu), you will get the following file:

File type = "ooTextFile"
Object class = "WordList"

string = "Copenhagen
K\o/bnhavn
M\u""nchen
Munich
\aongstr\o""m
cook
cooked
cookie
cookies
cooking
cooks"

Note that the double quotes (") that appear inside the strings have been doubled, as is done everywhere inside strings in Praat text files.
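
The doubling rule is easy to mimic. This Python sketch is an illustration only, with hypothetical function names `quote` and `unquote`; it covers just the quoting of a single string value, not the rest of the text file format.

```python
def quote(text):
    # Surround the text with double quotes and double every quote inside it.
    return '"' + text.replace('"', '""') + '"'

def unquote(quoted):
    # Reverse of quote(): strip the outer quotes and undouble the inner ones.
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("not a quoted string")
    return quoted[1:-1].replace('""', '"')

original = 'M\\u"nchen'
written = quote(original)
print(written)
print(unquote(written) == original)
```

Reading the quoted form back gives the original string again, so no information is lost by the doubling.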

After you have created a WordList text file, you can create a WordList object just by reading this file with Read from file... from the Open menu.

The WordList object has two advantages over the Strings object: it takes up less memory and it is read in faster. It won't take up more memory than the original word list, because the WordList is stored as a single string: a contiguous list of strings, separated by new-line symbols. Thus, our 300,000-word list will take up only 3.6 MB, and be read in 4 seconds instead of 8.
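
To make the single-string representation concrete, here is a Python sketch of how a word could be looked up in such a string without splitting it into separate strings. How Praat actually searches is not described on this page; the binary search below and the name `has_word_in_blob` are assumptions made for illustration.

```python
def has_word_in_blob(blob, word):
    # blob: sorted words joined by "\n"; lo is always the start of a line,
    # hi is always the end of a line (a newline position or len(blob)).
    lo, hi = 0, len(blob)
    while lo < hi:
        mid = (lo + hi) // 2
        # Find the boundaries of the line that contains position mid.
        newline_before = blob.rfind("\n", lo, mid)
        start = lo if newline_before == -1 else newline_before + 1
        end = blob.find("\n", mid)
        if end == -1 or end > hi:
            end = hi
        candidate = blob[start:end]
        if candidate == word:
            return True
        if candidate < word:
            lo = end + 1      # continue in the lines after the candidate
        else:
            hi = start - 1    # continue in the lines before the candidate
    return False

sorted_words = [
    "Copenhagen", "K\\o/bnhavn", 'M\\u"nchen', "Munich", '\\aongstr\\o"m',
    "cook", "cooked", "cookie", "cookies", "cooking", "cooks",
]
blob = "\n".join(sorted_words)   # one contiguous string, no per-word pointers

print(has_word_in_blob(blob, "cookie"))
print(has_word_in_blob(blob, "cooker"))
```

Because the words are sorted, each step halves the part of the string that still has to be examined, which is one reason why the list has to be sorted before conversion.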

However, disk storage and reading can again be improved by compressing the word list. We can take advantage of the sorting, by noting for each entry how many leading characters are equal to those of the previous entry. The list then becomes something equivalent to

0 Copenhagen
0 K\o/bnhavn
0 M\u"nchen
1 unich
0 \aongstr\o"m
0 cook
4 ed
4 ie
6 s
5 ng
4 s
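
The counts in this list can be computed mechanically. The Python sketch below illustrates the idea only; the actual layout of the binary WordList file is not described here, and the function names are hypothetical. Each word is encoded as the number of leading characters it shares with the previous word, followed by the rest of the word, and decoding reverses this.

```python
def common_prefix_length(a, b):
    # Count how many leading characters a and b have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def front_encode(sorted_words):
    # Turn each word into (shared prefix length, remaining characters).
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = common_prefix_length(previous, word)
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded

def front_decode(encoded):
    # Rebuild each word from the previous word's prefix plus the stored rest.
    words = []
    previous = ""
    for shared, rest in encoded:
        word = previous[:shared] + rest
        words.append(word)
        previous = word
    return words

sorted_words = [
    "Copenhagen", "K\\o/bnhavn", 'M\\u"nchen', "Munich", '\\aongstr\\o"m',
    "cook", "cooked", "cookie", "cookies", "cooking", "cooks",
]
encoded = front_encode(sorted_words)
for shared, rest in encoded:
    print(shared, rest)
```

The first word has no predecessor, so it shares 0 characters. The saving grows with the size of the list, because in a long sorted list neighbouring words tend to share long prefixes.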

You can save the WordList compressed in this way as a binary file with Save as binary file.... For our 300,000-word list, this file takes up only 1.1 MB and can be read into memory (with Read from file...) in a single second. When read into memory, the WordList object is again expanded to 3.6 MB to allow rapid searching.


© ppgb, September 13, 2017