CPT Word Lists - How To


1. Introduction
2. Change, Save, and Restore Settings
   2.1 Default Directory
   2.2 Font
   2.3 Data Type
   2.4 File Encoding
   2.5 Operation
   2.6 Save and Restore Settings
3. Convert Text File
   3.1 Encoding/Transliteration/Letter Case
   3.2 Normalization
   3.3 Visual/Logical Order
4. Create Alphabet File
5. Create Word List
6. Sort Text Word List
   6.1 Word Length
   6.2 Reversed Word Letters
7. Create Dictionary
8. Compare Word Lists/Dictionaries
   8.1 Compare Words
   8.2 Spell Checking
9. Add Data to Word List/Dictionary
   9.1 Add Words to Word List
   9.2 Add Data to Dictionary
10. Extract Data from Word List/Dictionary
11. Create Parts of Phrases Dictionary
12. Frequency Counters and Multiple Source Files

1. Introduction

This document is a combination of user's guide and FAQ, and an addendum to the complete reference "CPT Word Lists". It explains in plain language how to use the program.

In case you do not read the complete reference, here is a short summary of how the program works.
The input file is called "Source" and the output file is called "Target". When a third file is involved (compare/add/delete operation), it is called "Base". Use the Source Data menu to select the input file and its characteristics. If a Base file is needed, select it via the Base Words menu. Use the Target Data menu to select the output file options and the operation. Finally, Target Data | Run defines the name of the output file and starts the processing. An additional window with messages will be shown. If there are no error messages, the operation has succeeded.

Most of the important operations implemented in the program are described below, but you should check the detailed list as well.

2. Change, Save, and Restore Settings

2.1 Default Directory

One of the first things to do is to define your working directory. Use File | Default Directory. This will save you a lot of time browsing the disk to select your input and output files.

2.2 Font

Use Source Data | Display Options dialog. The font defined there is used in all controls of the program.

2.3 Data Type

Use Source Data | Data Type for the input file, and Target Data | Data Type for the output file. If the Target file is CTree, use Target Data | CTree Options as well.

2.4 File Encoding

Use the Source Data | File Encoding, Locale dialog for the input file, and Target Data | File Encoding, Locale for the output file. If you cannot find the desired encoding, try the File | Set User Encoding dialog; if you can set the encoding there, start the File Encoding dialog again.

2.5 Operation

Use Target Data | Choose Action in order to define the desired operation.

2.6 Save and Restore Settings

There are many options to set for a particular task, and it is a good idea to save most of them. The first method is File | Save Settings as Default: when the program starts, it reads the last settings you saved as default. The second method is File | Save Settings in Projects, where you choose a file name. To restore these settings, use File | Read Settings from Projects and select your file.

 

3. Convert Text File

The operations described here can also be done as auxiliary operations when performing more complex tasks. Before you start with the settings, check the Projects directory for saved settings. If you find and load a suitable project, execute only steps D and H (read 3.1 carefully).

3.1 Encoding/Transliteration/Letter Case

A. Via Source Data | Data Type dialog set Plain Text.
B. Start Source Data | Advanced Text Options dialog. Clear all settings in all tabs.
C. Via Source Data | File Encoding, Locale dialog set the input encoding.
D. Open the input file via Source Data | Open (see the notes). If the file is not shown properly, check all settings in Source Data menu.
E. Via Target Data | Data Type dialog set Plain Text.
F. Via Target Data | File Encoding, Locale dialog set the output encoding.
G. Start Target Data | Choose Action dialog. Set Create New Target in General tab. In Options tab set optionally Change Letter Case to Lower/Upper and clear the other settings. Set only Encode Data in Target tab. Clear all settings in Affixes tab.
H. Define the output file and start the operation via Target Data | Run (see the notes). Check the messages and close the window.

The transliterators are implemented as custom converters (see CPT Converters), so as a first step you should set the converter via File | Set User Encoding. You can use Source Data | Transliteration Bar to check the behavior of these converters.
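
Outside the program, the effect of steps A to H can be pictured with a few lines of Python. This is only a sketch and not the program's own code: the function name convert_text and its parameters are made up for the illustration, and the encoding names are the ones Python knows, which may differ from the names shown in the File Encoding, Locale dialog.

```python
from typing import Optional


def convert_text(data: bytes, source_encoding: str, target_encoding: str,
                 letter_case: Optional[str] = None) -> bytes:
    """Decode with the Source encoding, optionally change case, re-encode."""
    text = data.decode(source_encoding)          # step C: input encoding
    if letter_case == "lower":                   # step G: Change Letter Case
        text = text.lower()
    elif letter_case == "upper":
        text = text.upper()
    return text.encode(target_encoding)          # step F: output encoding


# Example: a Windows-1251 (Cyrillic) byte string converted to UTF-8, lower case.
utf8_bytes = convert_text("Жук".encode("cp1251"), "cp1251", "utf-8", "lower")
```

Characters that do not exist in the target encoding raise an error in this sketch; how the program itself reports them is not covered here.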

3.2 Normalization

The steps are as above but step B is:

B. Start Source Data | Advanced Text Options dialog. In Normalize tab select the desired Unicode normalization. Clear everything in the other tabs.

The most frequently used operation is stripping accents (named 'No Accents' in the list). Russian readers should be warned that, according to the Unicode Standard, the letter IO (Ё) will be replaced by IE (Е), as expected, but the short I (Й) will be replaced by the plain I (И) as well. The other useful operation is compatibility composition (named 'Form KC' in the list). It replaces an accented letter in decomposed form (2 or more characters) with a single character.
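
Both operations can be tried with Python's standard unicodedata module. The sketch below assumes that 'No Accents' behaves like "decompose, then drop the combining marks"; the program may implement it differently, and the function names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose, drop combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(ch for ch in decomposed
                            if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", without_marks)


def form_kc(text: str) -> str:
    """Compatibility composition (Unicode normalization form KC)."""
    return unicodedata.normalize("NFKC", text)


# The Russian letters from the warning above lose their marks too:
# strip_accents("ёж") gives "еж" and strip_accents("й") gives "и".
# form_kc("e\u0301") turns 'e' + combining acute into the single letter "é".
```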

3.3 Visual/Logical Order

The steps are as in 3.1, but in the File Encoding, Locale dialogs you should set the flag RTL in Logical Order properly. To change the ordering from visual to logical, the flag should be off in the Source dialog and on in the Target dialog; for the inverse operation, set the flags the other way round.

 

4. Create Alphabet File

Alphabet files are used to create more sophisticated CTree dictionaries; they are not mandatory. First check the default file for your language in the 'alphabet' directory: just open it as a Source file (most of the files are encoded in UnicodeASCII).

You are probably reading this paragraph because of numerous "word skipped" warnings. The most common cause is that the source word list contains characters which are not letters of the alphabet. If you insist on including these words in the dictionary, use the program to create a new alphabet file. Generally, the steps are as in 3.1.

 

5. Create Word List

The words are extracted from the input file, sorted and written into the output file. How to define what is a "word" is explained in Word Properties dialog. The Source data type usually is Plain Text or HTML. The Target data type is Text Word List or CTree Dictionary. We will describe here only two common examples.

Text Word List from Plain Text
The steps A to D are as in 3.1.
E. Via Target Data | Data Type set Text Word List.
F. Via Target Data | File Encoding, Locale set the output encoding.
G. Start Target Data | Word Properties. In Filter tab set Letters. In Letters tab set Locale Letters and Include Marks. The second variant is to set Locale Alphabet File in Filter tab (the file is in 'alphabet' directory). In Length tab set your preferences.
H. Start Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to on, set JTree and Locale, optionally set Change Letter Case to Lower. In Target tab set Encode Data and Extract Word List.
I. Start the operation via Target Data | Run.
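
As an illustration of what this procedure produces, here is a rough Python equivalent. It is a sketch under simplifying assumptions: a "word" is taken to be any run of Unicode letters (combining marks are not included), the list is sorted by code point rather than by locale, and the name extract_word_list is hypothetical.

```python
import re


def extract_word_list(text: str, min_length: int = 1, lower: bool = True):
    """Extract the distinct words of a text and return them sorted."""
    # [^\W\d_]+ matches runs of letters: word characters minus digits and '_'.
    words = re.findall(r"[^\W\d_]+", text)
    if lower:
        words = [word.lower() for word in words]
    return sorted({word for word in words if len(word) >= min_length})


# One word per line, as in a Text Word List.
print("\n".join(extract_word_list("The cat saw the Cat, twice!")))
```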

CTree from Text Word List
A. Via Source Data | Data Type set Text Word List.
...
E. Via Target Data | Data Type set CTree Dictionary.
F. Via Target Data | File Encoding, Locale set the encoding and the locale of the CTree.
G. Start Target Data | Word Properties. In Filter tab set Whole Line as a Word and Strip Spaces.
H. Start Target Data | CTree Options. In Alphabet tab set Strict Alphabet and From File. If you have prepared your own alphabet file, click on the button 'Set Alphabet File' and select it. Clear everything in Data tab. In Title tab enter what you wish. In File tab set only Packed Format.
I. Start Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to on, set CTree and Locale, optionally set Change Letter Case to Lower/Upper. In Target tab set only Encode Data.
J. Define the output file (extension wlz) and start the operation via Target Data | Run.

How to add new words to your word list is described in 8.2 and 9.1.

 

6. Sort Text Word List

Sorting is normally part of word list creation, but here we describe two special types of sorting.

6.1 Word Length

Suppose you have a word list and you want it to be sorted for crossword reference by word length.
A. Via Source Data | Data Type set Text Word List.
B. Start Source Data | Advanced Text Options dialog and clear everything.
C. Via Source Data | File Encoding, Locale set the input encoding.
D. Open the input file via Source Data | Open.
E. Via Target Data | Data Type set Text Word List.
F. Via Target Data | File Encoding, Locale dialog set the output encoding.
G. Start Target Data | Word Properties dialog and set only Whole Line as a Word.
H. Start Target Data | Choose Action dialog. Set Create New Target in General tab. In Options tab set optionally Change Letter Case to Lower/Upper, set Sort Word List Using to on, JTree and Locale (note that you cannot use CTree for this operation). In Target tab set Encode Data and Sort by Word Length. Clear all settings in Affixes tab.
I. Define the output file and start the operation via Target Data | Run.
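
The resulting order can be sketched in Python as follows. The assumption here is that words of the same length stay in alphabetical order (by code point in this sketch, by locale in the program); sort_by_length is a hypothetical name.

```python
def sort_by_length(words):
    """Shortest words first; words of equal length in alphabetical order."""
    return sorted(words, key=lambda word: (len(word), word))


crossword_list = sort_by_length(["pear", "fig", "apple", "kiwi"])
```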

6.2 Reversed Word Letters

Suppose you have a word list and you want it sorted by reversed word letters, either to explore the word endings or because the word list uses an RTL script but is stored in visual order. The steps from A to G are as above.
H. Start Target Data | CTree Options dialog and clear everything (we will just use the CTree engine for sorting).
I. Start Target Data | Choose Action dialog. Set Create New Target in General tab. In Options tab set optionally Change Letter Case to Lower/Upper, set Sort Word List Using to on, CTree, Locale, and Reverse Word Letters. In Target tab set Encode Data only. Clear all settings in Affixes tab.
J. Define the output file and start the operation via Target Data | Run.
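
To see why this ordering is useful for exploring word endings, consider the following sketch. It assumes the words themselves are written unchanged and only the sort key is reversed; whether the program also writes the letters reversed is not shown here. The function name is hypothetical.

```python
def sort_by_reversed_letters(words):
    """Sort words by their letters read from the end to the beginning."""
    return sorted(words, key=lambda word: word[::-1])


# Words with the same ending become neighbours in the list.
by_endings = sort_by_reversed_letters(["walking", "table", "talking", "cable"])
```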

 

7. Create Dictionary

To create a CTree with clues you will need a source text dictionary file, a tags file, and optionally an alphabet file. The standard format for text dictionary is Text Delimited. The tags are explained here. The default tags file is '<Locale>.tag' from the 'alphabet' directory. If you are not going to include tags, you can copy the 'default.tag' to '<Locale>.tag'.
The source text can be in Word-Clue Format as well; in that case no tags are processed from the input.

A. Via Source Data | Data Type set Text Delimited and your delimiter character, or one of the supported Word-Clue formats.
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open the input file via Source Data | Open.
D. Via Target Data | Data Type set CTree Dictionary.
E. Via Target Data | File Encoding, Locale set the encoding and the locale of the dictionary.
F. Start the dialog Target Data | Word Properties. In Filter tab set Whole Line as a Word and Strip Spaces.
G. Start the dialog Target Data | CTree Options. In Alphabet tab set Strict Alphabet and From File. If you have prepared your own alphabet file, click on the button 'Set Alphabet File' and select it. If you don't want Strict Alphabet, set it to off. In Data tab as a minimum you should set Definitions/Clues. If you have a tags file, it should be set here as well. In Title tab enter what you wish. In File tab set Packed Format and optionally Locked.
H. Start the dialog Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to on, CTree and Locale. Set only Encode Data in Target tab. Clear all settings in Affixes tab.
I. Define the output file (extension dic) and start the operation via Target Data | Run. Check the messages and close the window.

Pictures

To include pictures in a CTree dictionary, do the following:
- use the Text Delimited input format with a clue tag starting with 'pic' and the picture file name as the clue, for example:
  picture description|0|0|0|pic pict.jpg
  All picture files are expected to be in the same directory as the source file;
- use a tags file which contains the 'pic' clue tag;
- change the picture settings in the 'Picture Options' dialog;
- in the 'CTree Options' dialog, 'Data' tab, check Has Pictures;
- create the dictionary via Target Data | Run.
The program will create an 'attached' picture dictionary file with the same name as the CTree file but with the extension 'pdc'. The pdc file must be in the same directory as the CTree file when the latter is opened, so keep the two files together.

To create a standalone picture dictionary (pdc) file, do the following:
- prepare a source file in Word-Clue format of the form:
  picture 1|p001.jpg
  picture 2|p002.jpg
  ...
  All picture files are expected to be in the same directory as the source file. You can use Target Data | Create Picture Dictionary Source File from Directory to generate the input file; you will probably have to edit the descriptions manually afterwards;
- in Source Data | Data Type set Word-Clue Format (variant <word><delimiter><clue>NL);
- in Target Data | Data Type set Picture Dictionary;
- change the picture settings in the 'Picture Options' dialog;
- create the dictionary via Target Data | Run.

 

8. Compare Word Lists/Dictionaries

The Source word list is compared to the Base word list and the differences are written into the Target Log file. Actually, the Source and the Base could be CTree with clues dictionaries as well, but only the words are compared.

8.1 Compare Words

A. Via Source Data | Data Type set the input data type. The options are: Text Word List, CTree Dictionary, or Text Delimited/Word-Clue with Field Number set to 1.
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open the input file via Source Data | Open.
D. Via Base Words | Open CTree Dictionary (or Open Text File) open the Base list.
E. Via Target Data | Data Type set Text Word List (it is the Log file).
F. Via Target Data | File Encoding, Locale set the output encoding.
G. Start the dialog Target Data | Word Properties. In Filter tab set Whole Line as a Word and Strip Spaces.
H. Start the dialog Target Data | Choose Action. In General tab set Compare Source to Base Words, select the type of the Log file and set No Source Words Sorting. In Options tab optionally set Change Letter Case to Lower. Set Sort Word List Using to off. Set only Encode Data in Target tab. Clear all settings in Affixes tab.
I. Define the output Log file and start the operation via Target Data | Run.
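
Conceptually the comparison is a set operation on the two word lists. The sketch below shows three possible results; only 'Difference: Source not in Base' is a Log file type named in this document, the other two keys and the function name are assumptions made for the illustration.

```python
def compare_word_lists(source, base):
    """Compare two word lists and return the differences, sorted."""
    source_set, base_set = set(source), set(base)
    return {
        "source_not_in_base": sorted(source_set - base_set),
        "base_not_in_source": sorted(base_set - source_set),  # assumed type
        "common": sorted(source_set & base_set),              # assumed type
    }


log = compare_word_lists(["apple", "pear", "plum"], ["pear", "fig"])
```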

8.2 Spell Checking

When the Source is Plain Text or HTML and the Target Log file is 'Difference: Source not in Base' you can collect new words for your word list via spell checking.
A. Via Source Data | Data Type set the input data type to Plain Text or HTML.
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open the input file via Source Data | Open.
D. Via Base Words | Open CTree Dictionary (or Open Text File) open the Base list.
E. Via Target Data | Data Type set Text Word List (it is the Log file).
F. Via Target Data | File Encoding, Locale set the output encoding.
G. Start the dialog Target Data | Word Properties. In Filter tab set Letters and optionally the flags in Letters tab as well.
H. Start the dialog Target Data | Choose Action. In General tab set Compare Source to Base Words, set the type of the Log file to 'Difference: Source not in Base', and set No Source Words Sorting to off. In Options tab optionally set Change Letter Case to Lower. Set Sort Word List Using to on, JTree, and Locale. In Target tab set Encode Data and Extract Word List. Clear all settings in Affixes tab.
I. Define the output Log file and start the operation via Target Data | Run.

Edit the new list manually to remove the errors, and use procedure 9.1 (input data type is Text Word List) to add the new words to your word list. When you are creating your word list from scratch, procedures 8.2 and 9.1 are repeated many times for different source text files. It is a good idea to use in procedure 8.2 a combined word list which contains both the 'good' and the 'bad' words, so that words you have already rejected are not reported again and you have less to edit.
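
The idea of the combined 'good' and 'bad' list can be sketched like this. The word extraction is simplified to runs of Unicode letters in lower case, and new_words is a hypothetical name; the program's own filters are set in the Word Properties dialog.

```python
import re


def new_words(text, good_words, bad_words=()):
    """Return the words of the text that are in neither list, sorted."""
    known = set(good_words) | set(bad_words)   # the combined Base list
    found = {word.lower() for word in re.findall(r"[^\W\d_]+", text)}
    return sorted(found - known)


# 'brwn' was rejected in an earlier round, so it is not reported again.
candidates = new_words("The quick brwn fox", ["the", "quick"], ["brwn"])
```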

 

9. Add Data to Word List/Dictionary

The simplest adding operation is to put more words in an existing word list. The second one is adding global tags, and the most complex one is to add Text Delimited data to CTree with tags and clues.

9.1 Add Words to Word List

A. Via Source Data | Data Type set the input type. The options are: Text Word List, Plain Text or HTML (with word extraction), CTree without tags/clues, CTree with clues (only the words are added), Text Delimited/Word-Clue format with Field Number set to "1".
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open the input file via Source Data | Open.
D. Load the Base words via Base Words | Open CTree Dictionary as Tree. The 'Open Text File' option is also possible.
E. Via Target Data | Data Type set Text Word List (it is the Log file).
F. Start the dialog Target Data | Word Properties. In Filter tab set Whole Line as a Word and Strip Spaces, or Letters if you are extracting words from Plain Text or HTML. In Length tab set your preferences.
G. Start the dialog Target Data | Choose Action. In General tab set Add Source to Base Words and the type of the Log file. Set No Source Words Sorting to on if the source is a sorted word list, and set it to off if you are extracting words from Plain Text (this sorting will save you a lot of time). In Options tab optionally set Change Letter Case, and set Sort Word List Using to on if you are extracting words. In Target tab set Encode Data, and also Extract Word List if you are extracting words. Clear all settings in Affixes tab.
H. Define the output Log file and start the operation via Target Data | Run.
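
In terms of plain lists, the add operation amounts to a merge without duplicates. A minimal sketch, assuming the Log file lists the words that were really added (the available Log file types are chosen in the General tab and are not listed here); add_words is a hypothetical name.

```python
def add_words(base, source):
    """Merge Source words into Base; return (merged list, added words)."""
    base_set = set(base)
    added = sorted(set(source) - base_set)      # words new to the Base
    merged = sorted(base_set | set(source))     # the updated word list
    return merged, added


merged, added = add_words(["fig", "pear"], ["apple", "pear", "plum"])
```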

9.2 Add Data to Dictionary

The steps for adding data to CTree with clues are the same with the following details:

 

10. Extract Data from Word List/Dictionary

There are several ways to extract data from a Base dictionary. The easiest one is Base Words | Dump in Text File. The second one is the Base Words | Search Words dialog: just type a regular expression and click on the 'Search and Save' button. If the Base is a CTree with tags and clues, there are two more ways, described below. Note that if the dictionary is protected via the Locked or Password flags in the CTree Options dialog, you will not be able to extract any data. Even the clipboard will be disabled if both flags are set.
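
For readers new to regular expressions, the search can be pictured as below. This sketch uses Python's regular expression syntax and assumes that the pattern must match the whole word; the syntax and matching rules of the Search Words dialog may differ, and search_words is a hypothetical name.

```python
import re


def search_words(words, pattern):
    """Return the words that match the regular expression completely."""
    regex = re.compile(pattern)
    return [word for word in words if regex.fullmatch(word)]


# 'c.t' means: 'c', any single letter, 't'.
three_letter = search_words(["cat", "cart", "coat", "cut", "dog"], r"c.t")
```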

Tags and Word Length Filters
A. Open the dictionary via Base Words | Open CTree Dictionary.
B. Via Target Data | Global Tags | Set New from Base start the dialog. In Tags tab select tags as filters. In Length tab select word lengths as additional filters.
C. Via Base Words | Extract in Text File define the output file and the data will be written in Text Delimited format.

Source Word List as Filter
A. Open the dictionary via Base Words | Open CTree Dictionary or Open Picture Dictionary.
B. Open the Source as Text Word List.
C. Via Base Words | Extract Source Words in Text File define the output file and the data will be written in Text Delimited format.

 

11. Create Parts of Phrases Dictionary

This is a crossword-style dictionary of phrases/proverbs, where the 'word' is one word from the phrase and the 'clue' is the phrase itself with that word replaced by "...". The source file should contain one phrase per line. For every word in the phrase the program creates a separate record (word | clue). The final crossword dictionary is made in two runs.

Create Text Delimited (parts of phrases operation)
A. Via Source Data | Data Type set Text Word List (it is a phrase per line).
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open the input file via Source Data | Open.
D. Via Target Data | Data Type set Text Delimited.
E. Via Target Data | File Encoding, Locale set the same data as in B.
F. Start the dialog Target Data | Word Properties. In Filter tab set Whole Line as a Word and Strip Spaces. In Length tab set Minimal Word Length to 3 (to avoid creating of records for words of length 1 and 2).
G. Start the dialog Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to off and Change Letter Case to Lower (recommended, valid for the words). In Target tab set Encode Data and Extract Word List. Clear all settings in Affixes tab.
H. Define the output file and start the operation via Target Data | Run.

Check the output file and remove the uninteresting records. You can also replace "..." by "_" as in American crosswords.
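
The records produced by this first run can be imitated with a short Python function. It is a sketch only: the record layout is reduced to word, delimiter, clue (the real Text Delimited output may contain more fields), words are runs of Unicode letters, and the name parts_of_phrases and its parameters are hypothetical.

```python
import re


def parts_of_phrases(phrase, min_length=3, placeholder="...", delimiter="|"):
    """Create one 'word|clue' record for every long enough word."""
    records = []
    for match in re.finditer(r"[^\W\d_]+", phrase):
        word = match.group()
        if len(word) < min_length:
            continue                      # skip words of length 1 and 2
        clue = phrase[:match.start()] + placeholder + phrase[match.end():]
        records.append(f"{word.lower()}{delimiter}{clue}")
    return records


for record in parts_of_phrases("Time is money"):
    print(record)
```

Passing placeholder="_" gives the American crossword style mentioned above.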

Add tags and create CTree
A. In Source Data | Data Type dialog temporarily set Text Delimited and set the delimiter character, then set Word-Clue Format and select the first variant from the list.
B. In Source Data | File Encoding, Locale dialog the input encoding is as in A above.
C. Open the input file (it is the output file from the previous run) via Source Data | Open.
D. Via Target Data | Data Type set CTree Dictionary.
E. Via Target Data | File Encoding, Locale set the encoding and the locale of the CTree.
F. In the dialog Target Data | Word Properties the settings are as above.
G. Start the dialog Target Data | CTree Options. In Alphabet tab set Strict Alphabet to off or if you have an alphabet file, do the proper settings. In Data tab set Definitions/Clues and click on the button 'Set Tags File'. Select the file 'xw.tag' from 'samples' directory. In Title tab enter what you wish. In File tab set Packed Format and optionally Locked.
H. Via 'Target Data | Global Tags | Set new from file | Other file' select the 'xw.tag' file again and in the dialog select only the line '<xc> clue'.
I. Start the dialog Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to on, CTree, and Locale if you have alphabet file, otherwise Byte or Unicode. Set Change Letter Case to Lower (recommended). Set Encode Data and Add Tags in Target tab. Clear all settings in Affixes tab.
J. Define the output file (extension dic) and start the operation via Target Data | Run.

 

12. Frequency Counters and Multiple Source Files

Suppose you have a collection of works by a classic writer and you are interested in statistics such as which words/letters the writer used and how often. The frequency counters operation extracts the words/letters from the source text files, counts them, optionally sorts them by the counters, and writes them into a Text Delimited file (word/letter : counter). To make the example more realistic, we assume that your collection is in one directory containing dozens of files. The important restriction is that all files must be in the same encoding.

A. Via Source Data | Data Type set Plain Text or HTML.
B. Via Source Data | File Encoding, Locale set the input encoding.
C. Open one of the files via Source Data | Open.
D. Via Target Data | Data Type set Text Delimited and ":" as Delimiter Character.
E. Via Target Data | File Encoding, Locale set the same data as in B.
F. Start the dialog Target Data | Word Properties. In Filter tab you can set Locale Alphabet File or Letters (and the additional options in Letters tab).
G. Start the dialog Target Data | CTree Options. Clear all settings because we are not going to create CTree file but to use CTree as engine. We have to note that the program uses only CTree engine for this operation.
H. Start the dialog Target Data | Choose Action. Set Create New Target in General tab. In Options tab set Sort Word List Using to on, CTree, and Locale. Optionally set Change Letter Case. In Target tab set Encode Data, Extract Word List, and Frequency Counters. Set Words or Letters and optionally Sort by the Counters (the biggest counter on top). Clear all settings in Affixes tab.
I. Check the menu item Target Data | All Files in Source Directory. This means that the rest of the files will be 'appended' to the Source file (see the notes as well).
J. Define the output file and start the operation via Target Data | Run. The processed files will be logged in the messages window.
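
To make the expected output concrete, here is a sketch of the same counting in Python. It assumes that words are runs of Unicode letters converted to lower case and that ties between equal counters are ordered alphabetically; all function names are hypothetical and the program's own filters may give slightly different counts.

```python
import re
from collections import Counter
from pathlib import Path


def count_frequencies(texts, unit="words"):
    """Count words (or letters) over an iterable of decoded texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[^\W\d_]+", text.lower())
        if unit == "words":
            counts.update(words)
        else:
            counts.update("".join(words))   # letters of the words only
    return counts


def read_directory(directory, encoding):
    """Yield the text of every file in a directory, all in one encoding."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            yield path.read_text(encoding=encoding)


def format_counters(counts, delimiter=":"):
    """Return 'item:counter' lines with the biggest counter on top."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{key}{delimiter}{value}" for key, value in ordered]
```

For a directory of UTF-8 files named, say, 'works' (a made-up name), the lines of the result file would be format_counters(count_frequencies(read_directory("works", "utf-8"))).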
