CPT Word Lists
- 1. Introduction
- 1.1 System Requirements 1.2 Features 1.3 Related Documents 1.4 About This Manual 1.5 Notation 1.6 Version
- 2. Overview of Data Processing
- 2.1 Source-Base-Target 2.2 Operations 2.3 Sorting Objects 2.4 Side Effects
- 3. Encoding, Byte and Unicode Modes
- 3.1 Converters 3.2 Unicode and Java
- 4. File Menu
- 4.1 Settings 4.2 User Defined Encoding Converter
- 5. Source Data Menu
- 5.1 Data Type 5.2 Advanced Text Options 5.2.1 Unicode Normalization 5.2.2 Custom RTL Conversion 5.2.3 Handle Hyphenation 5.3 File Encoding, Locale 5.4 Display Options 5.5 Transliteration Bar 5.6 Open 5.7 Search 5.8 Close
- 6. Base Words Menu
- 6.1 Pack Suffixes 6.2 Search Words
- 7. Target Data Menu
- 7.1 Data Type 7.2 File Encoding, Locale 7.3 Word Properties 7.4 CTree Options 7.5 Pictures 7.5.1 Picture Options 7.5.2 Test with a Picture 7.5.3 Create Picture Dictionary Source File 7.6 Tags 7.6.1 Global Tags 7.6.2 Tags Filters 7.7 Choose Action 7.7.1 General Tab 7.7.2 Options Tab 7.7.3 Target Tab 7.7.4 Affixes Tab 7.8 All Files in Source Directory 7.9 Run
- 8. Data Types and Operations
- 8.1 Source-Target 8.1.1 Operations without sorting 8.1.2 Operations with sorting 8.2 Base 8.3 Source-Base-Target 8.3.1 Base word lists 8.3.2 Base with tags/clues 8.3.3 Munched lists
- Appendix A: File Formats
- A.1 Alphabet Files A.2 Affixes Files A.3 Tags Files A.4 Tagged Word List A.5 Text Delimited A.6 Word-Clue Format A.7 IPA8 Files A.8 CTree Files
- Appendix B: Language Support
- B.1 Letter Case B.2 Writing Direction B.3 Character Shapes B.4 Composed Characters B.5 Huge Alphabet
CPT Word Lists is a collection of tools for processing word lists and text files, supporting Unicode and an unlimited number of encodings via the Java converters. Its main goal is to create dictionaries for the other CPT programs, but it is designed as, and can be used as, a completely independent program.
1.1 System Requirements
CPT Word Lists is available for Windows and Linux on PCs. Since it is a Java program, a Java run time environment (RTE) must be pre-installed. The program is tested with Sun's Java 1.1 - 1.7. The MS JVM (jview) can be used on Windows as well.
You should have 3 MB of disk space for the program and the sample files. The RAM requirements are dictated by your OS in most cases (but, for example, if you want to create a dictionary of millions of words with clues, you should have 2 GB of RAM).
1.2 Features
The set of operations over textual (plain text or HTML) files includes:
- browsing/searching in any standard encoding including decomposition and bidi support;
- extracting words, calculating letter and word frequencies;
- flexible 'word' definition and filtering;
- changing the encoding and the letter case, transliterating;
- standard Unicode and custom normalizations;
- logical/visual order conversion for RTL scripts;
- simple spell checking and tagging;
- creating highly compressed dictionaries, optionally including tags, definitions, and pictures (one million words can be stored in a browsable file of less than 1 MB);
- several types of sorting, including user-defined orders and alphabets (90 alphabets supplied);
- compare/add/delete functions over dictionaries;
- global assignment of tags and extracting subsets via selected tags;
- automatic or user defined suffixes packing;
- via user definitions: creating and expanding munched lists, creating and filtering tagged lists, translating tags in tagged lists, tagging;
- searching and extracting word patterns;
- creating inverted index for dictionary with definitions;
- three levels of protecting the dictionaries.
1.3 Related Documents
For the users of our crossword programs the top document in the hierarchy is "CPT - The Primer". That document has an addendum, "CPT Word Lists - How To".
Readers interested in the details of Unicode will need the book "The Unicode Standard" and the technical reports published at unicode.org.
The program supports the code points assigned in Unicode 5.1.0 (without supplemental characters); as a reference you can use the book or UnicodeData.txt from the official Unicode release.
1.4 About This Manual
Chapters 2 and 3 introduce the basic concepts; Chapters 4, 5, 6, and 7 are a detailed reference for the menus and dialogs. Chapter 8 is a summary of the operations and data types, and the file formats are explained in Appendix A. In Appendix B you can find a summary of the support for particular languages.
1.5 Notation
Some acronyms and special terms used in the text are explained here, sorted in 'natural' alphabetic order, not by importance:
bidirectional - describes languages such as Arabic and Hebrew, whose general flow of text proceeds horizontally from right to left, while numbers, English, and other left-to-right text are written from left to right. "Bidi" is a short form of the term; RTL stands for right-to-left and LTR for left-to-right.
BMP - Basic Multilingual Plane (i.e. Unicode without supplemental characters, the subset supported properly by this program.)
bpt - bit-per-tag, a type of tag.
btc - byte to character converter.
CLASSPATH - this could denote OS environment variable, or option when running JVM, showing the path(s), where the JVM will search for class/jar/zip/... files containing compiled Java programs or other data.
CPT - crossword power tools, our collection of programs.
ctb - character to byte converter.
CTree - binary dictionary file or internal sort engine.
fixed font - this font is used to mark menu items, check boxes, dialogs, etc.
grapheme - a glyph or "conceptual character" that can be represented by one or more Unicode characters.
natural sorting - the order of words in printed dictionaries.
npt - number-per-tag, a type of tag.
NSM - non-spacing mark, combining diacritical mark.
OS - the operating system you are running; Windows is used for Microsoft Windows 95/98/ME/NT/2K/XP/Vista/7, and Linux for popular distributions like Red Hat/Fedora, SUSE, Ubuntu, etc.
RTE - run time environment, your Java support, used to run this program.
ttf - true type font.
Unicode - the standard for a coded character set that aims to be a 'Universal Character Set', synchronized with ISO/IEC 10646.
\uxxxx - Hexadecimal notation for representing Unicode characters, where "\u" is the escape sequence and "xxxx" are the hex digits.
You could look at the glossary at unicode.org as well.
1.6 Version
The version of the program documented here is 1.4.2.
|2. Overview of Data Processing|
2.1 Source-Base-Target
The data processing is file based. The input file is called 'Source', and the output file is called 'Target'. When a third file is included (e.g. for compare/add/delete), it is called 'Base'. Via the Source Data menu the user has to select the input file and its characteristics. If a Base file is needed, it should be selected via the Base Words menu. Using the Target Data menu, the output file options and the proper operations should be selected. Finally, via Target Data | Run, the name of the output file is defined and the processing is started.
2.2 Operations
The operations involving Source files are generally divided into two groups, and the proper group is selected on the General tab of the Target Action dialog (select Target Data | Choose Action...). The first group is labeled 'Create New Target'; its operations need only an input and an output file. The specific operations from this group can be selected from the Target and Affixes tabs. All the rest on the General tab are operations from the second group, and they need a Base file as well. The Options tab contains operations that can be added to either of the two groups mentioned above. Special attention should be given to the check box 'Sort Word List Using', which has several sub-options: it defines the sorting type, the main processing mode - 'one byte' (when Byte is selected) or 'two bytes' (when Unicode or Locale is selected) - and the sorting engine (see Sorting Objects below). The modes are explained in Encoding, Byte and Unicode Modes. Note that some specific operations are determined by the input and output data types alone and have no special flags - see Data Types and Operations.
2.3 Sorting Objects
The sorting and other processing of word lists can be done using three main objects or 'engines': Arrays, JTrees, and CTrees. Arrays are linear lists of words (repetitions are possible before sorting). JTrees are red-black tree structures of unique words, implemented in Java, and CTrees are general tree structures (internally, 6 main variants), implemented in the C language. All three objects support one-byte mode, two-byte mode, and all sorting types. JTrees and Arrays support locale-type sorting via Java's locale collators; CTrees support locale sorting via user-defined alphabets (see Alphabet Files). Arrays do not support the delete operation. CTrees cannot be used for sorting a text word list by word length.
Usually Arrays are read/written as text files called 'text word lists'. When a JTree is used as the sorting engine, the target is a text word list or a picture dictionary. CTrees are read/written as binary files called 'dictionaries' and can be dumped as text files as well. The dictionaries contain additional information about encoding, sorting type, locale, etc. They can contain tags, clues, and pictures as well. In Unicode mode any character takes 2 bytes in the file; in byte mode a character can be represented by 1 byte, or by 5 bits or more, depending on the number of letters used. The CTree dictionaries are the most developed and are the standard crossword dictionaries for all CPT programs.
Note that CTrees will not properly sort and handle graphemes represented by more than one Unicode character. You should compose these characters (see Unicode Normalization) or use a JTree or Array in Locale mode as the sorting engine.
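The JTree idea - a tree of unique words ordered by a locale collator - can be sketched in plain Java with java.util.TreeSet and java.text.Collator. This is an illustration of the concept, not the program's actual engine:

```java
import java.text.Collator;
import java.util.Locale;
import java.util.TreeSet;

public class LocaleSortDemo {
    // Returns words sorted with a locale-aware collator, the way a
    // JTree-style engine sorts unique words in Locale mode.
    public static TreeSet<String> localeSort(Locale locale, String... words) {
        TreeSet<String> tree = new TreeSet<>(Collator.getInstance(locale));
        for (String w : words) tree.add(w); // duplicates collapse: the tree holds unique words
        return tree;
    }

    public static void main(String[] args) {
        // Plain code-point order puts "B" (U+0042) before "a" (U+0061);
        // a collator compares base letters first, so "a" comes before "B".
        System.out.println(localeSort(Locale.ENGLISH, "B", "a")); // [a, B]
    }
}
```

The same collator-backed tree is what makes 'natural sorting' of mixed-case word lists possible without lowercasing the data first.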
2.4 Side Effects
With CPT Word Lists many operations can be done in a single run. For example, take as input a Russian HTML file encoded in UTF-8 or Cp1251, extract the words, sort using the Russian locale, compare the words to a dictionary encoded in KOI8_R, and write the difference to a text file encoded in ISO 8859-5. This is an example of simple spell checking that hardly any other program can do for you.
As another example, we used CPT Word Lists to separate our Bulgarian word list into parts containing only nouns, verbs, and adjectives. Initially we did not have any tagged dictionary to do the tagging, so we did it another way. Via a user-defined affixes file (suffixes and tags in this case) and 'Loose Match Mode' we created a munched list (after many iterations). The munched list was edited manually to correct the loose-matching errors, and this way we expanded the initial word list to all grammatical word forms. The rest was easy: in seconds we created a tagged list, and finally, by filtering the tagged list, we got the different parts. Linguists might not like this 'morphology analysis' done by suffixes only, but it works for this and other languages. The human time spent creating our affixes file is comparable to the time needed to create lexicon and rules files for PC-Kimmo or Xerox's finite-state tools, but the computer time for the analysis is considerably shorter. Of course, the goals of our program and of the other tools mentioned are different.
|3. Encoding, Byte and Unicode Modes|
The different language scripts are supported by computers via hundreds of encoding schemes. The encodings are classified by the supported scripts and by the number of bytes used per character. For example, the European languages can use one byte, while some of the Asian languages and the Unicode schemes use two or more bytes.
3.1 Converters
The processing of differently encoded texts in Java is supported via a pair of converters (in recent Sun Java there is also a variant combined into a single object). The input converter is byte-to-character (btc), which translates bytes from the source encoding to Unicode characters. The output converter is character-to-byte (ctb), and its task is to translate Unicode characters to the target encoding. In the CPT programs most of the names of the encoding converters from Sun's Java international RTE are built in. There is a mechanism, via 'User Defined Encoding', to use any available converter when it is not in the built-in list. The same mechanism is used to select non-Sun converters developed by Netscape or by other programmers. In our dialog boxes, where the built-in list is shown, the encodings are ordered as follows:
- In the top half are the converters using one byte per character, grouped by script (e.g. Western/Latin, Cyrillic, ...). At the end of this half is 'Reserved for User Defined 8-bit Converter'.
- In the bottom half are the converters using two or more bytes per character (Japanese, Chinese, ..., Unicode). Again, at the end is 'Reserved for User Defined 16-bit Converter'.
The input and output files can be in any encoding, but internally our programs work only in 'one-byte' and 'two-byte Unicode' modes. This means that if the Source and Target encodings selected by the user are from the one-byte group and the sorting type is Byte, then the main processing is in one-byte mode. If the Source or Target encoding is from the two-or-more-bytes group and the sorting type is Unicode or Locale, then the Source is converted to Unicode, the processing is done in Unicode, and finally the text is converted from Unicode to the Target encoding. This is the general scheme for Arrays and JTrees. For CTrees, some additional space and time optimizations can be done via 'Strict Alphabets' and/or UTF-8.
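In modern Java the btc/ctb pair corresponds to InputStreamReader and OutputStreamWriter constructed with an encoding name. A minimal sketch of the general scheme (source bytes → Unicode → target bytes), using KOI8_R and ISO8859_5 as sample encodings:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class TranscodeDemo {
    // btc step: decode source bytes to Unicode characters;
    // ctb step: encode those characters into the target bytes.
    public static byte[] transcode(byte[] src, String from, String to) throws IOException {
        InputStreamReader btc = new InputStreamReader(new ByteArrayInputStream(src), from);
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStreamWriter ctb = new OutputStreamWriter(sink, to);
        for (int c; (c = btc.read()) != -1; ) ctb.write(c);
        ctb.close();
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // 0xC1 is Cyrillic small 'a' (\u0430) in KOI8-R; in ISO 8859-5 it is 0xD0.
        byte[] out = transcode(new byte[]{(byte) 0xC1}, "KOI8_R", "ISO8859_5");
        System.out.printf("%02X%n", out[0] & 0xFF); // D0
    }
}
```

Reading character by character is deliberately naive here; the point is only that the one-byte source never exists internally in any form other than Unicode characters.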
3.2 Unicode and Java
Java strings are represented internally in Unicode with the extension of 'surrogate pairs', called UTF-16 or 'Unicode 2.0', while the pure 16-bit form is called UCS-2 or 'original Unicode 1.1'. Java characters are 16-bit 'code elements', not 'code points', and obviously one Java character cannot hold a 'surrogate pair'. A single Java character cannot hold letters in decomposed form either; in this case the text should be composed - see Unicode Normalization.
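The 'code element vs. code point' distinction can be seen directly in Java; the supplementary character U+1D11E used here is just an illustration:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // MUSICAL SYMBOL G CLEF (U+1D11E) lies outside the BMP, so in
        // UTF-16 it needs a surrogate pair: two Java chars (code elements).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                          // 2 code elements
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        // A BMP character such as A-ring fits in a single char.
        System.out.println("\u00C5".length());                      // 1
    }
}
```

This is exactly why a BMP-only program can treat length-in-chars and length-in-characters as the same thing.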
There are several Unicode schemes for handling text in files (www.unicode.org, www.szybora.com,...). For reading/writing Unicode texts from/to files Sun developed converters, which do not change the 'Unicode encoding' but just the representation of characters as bytes:
- UnicodeLittle - Little Endian Unicode (UTF-16LE), prefaced by \uFFFE signature as byte order mark;
- UnicodeLittleUnmarked - same as the previous but without the marking signature (ctb only);
- UnicodeBig - Big Endian Unicode (UTF-16BE), prefaced by \uFEFF signature;
- UnicodeBigUnmarked - same as the previous but without the marking signature (ctb only);
- Unicode - Little or Big Endian Unicode, prefaced by the proper byte order mark;
- UTF-8 - the 7-bit ASCII text is one byte per character, while the other Unicode characters have variable length in bytes.
One of the converters built into our program is UnicodeASCII, which converts Unicode characters to \uxxxx and vice versa (see User Defined Encoding Converter).
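The byte-level differences among these schemes are easy to check from Java itself. The sketch below uses the java.nio charset names (available since Java 7), which wrap the same encodings as the historical converter names above:

```java
import java.nio.charset.StandardCharsets;

public class UnicodeSchemesDemo {
    public static void main(String[] args) {
        // UTF-8 is variable length: ASCII is 1 byte, the euro sign takes 3.
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);      // 1
        System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length); // 3

        // Little and Big Endian store the same character with swapped bytes.
        byte[] le = "A".getBytes(StandardCharsets.UTF_16LE); // 41 00
        byte[] be = "A".getBytes(StandardCharsets.UTF_16BE); // 00 41
        System.out.printf("%02X %02X | %02X %02X%n", le[0], le[1], be[0], be[1]);

        // The plain "UTF-16" charset prepends the byte order mark when
        // encoding, like the marked converters above: 2 BOM bytes + 2 data bytes.
        System.out.println("A".getBytes(StandardCharsets.UTF_16).length);     // 4
    }
}
```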
|4. File Menu|
The only traditional item in this menu is Exit. All others are for general settings.
4.1 Settings
Via Save Settings as Default you can save the settings you have made, and on the next start of the program you will have them. The passwords and Source/Base/Target file names are never saved, but the names of the auxiliary files from the CTree Options dialog are saved.
Save Settings in Projects... is similar, but saves to a separate file in the 'Projects' directory. There are many options to set for a particular task, and it is a good idea to keep them in a file. After installation, the 'Projects' directory will contain template settings for most of the groups of operations.
Via Read Settings from Projects... you can restore the saved settings.
Set User Encoding item is described below.
Via Default Directory you can set the default directory for the Open/Save file dialogs. If the default directory is not set, it will be the program installation directory.
Set Control Background Color is used to change the look and feel of the program. System, Custom A, and Custom B are predefined colors. Custom... starts the Java color chooser dialog. Gradient, when checked, forces the background of most controls to be painted with a gradient.
4.2 User Defined Encoding Converter
There are two positions in the CPT built-in list of converter names reserved for user-defined converters. The first is for an 8-bit converter and the second is for a 16-bit converter. You can set these converters, and check/find all available ones, via the User Encoding dialog. The Java converters are explained in Encoding, Byte and Unicode Modes above. To start the dialog, select File | Set User Encoding from the menu.
In the text fields User 8-bit Converter and Converter Display Name you have to enter the program name and the display name of the converter. The display name is free text, but the program name should be known (see below). By clicking the button on the right you can check whether that converter is available. The software tests the pair of ctb and btc converters, so you can receive two error messages when the converter is not found. The same applies to the user-defined 16-bit converter.
The combo box under the label Encodings and Display Names shows the built-in list of converter names, and below it are the display names. When the user converters are not defined, they are shown with "User8" and "User16" as converter program names and "Reserved for User Defined ..." as display names. Again, you can check whether any converter from the list is supported by your Java RTE by clicking the button on the right.
The last combo box and the supporting "Run Scan" button can be used to scan the Java CLASSPATH and find all available ctb converters. The list will be sorted by name and will also include the converters built into the current CPT program (like CyrMIK). Via the clipboard you can copy any converter name from the combo box and paste it into the User ... Converter text field. This ensures the correct name without "blind" checking. To test whether the paired btc converter is available as well, click the "Check Available" button. The answers are: "Yes", "btc only", "ctb only", and "No".
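In Java 1.4 and later the same information is exposed programmatically via java.nio.charset.Charset. The sketch below illustrates what the scan and the availability check do; it is not the program's actual code, and note that nio charsets always support decoding, so the "ctb only" answer cannot occur this way:

```java
import java.nio.charset.Charset;
import java.util.Map;

public class CharsetScanDemo {
    // Rough equivalent of "Check Available": "Yes", "btc only" (decode
    // without encode), or "No" (name unknown to this RTE).
    public static String checkAvailable(String name) {
        if (!Charset.isSupported(name)) return "No";
        return Charset.forName(name).canEncode() ? "Yes" : "btc only";
    }

    public static void main(String[] args) {
        // Rough equivalent of "Run Scan": every installed charset, sorted by canonical name.
        Map<String, Charset> all = Charset.availableCharsets();
        System.out.println(all.size() + " charsets installed");
        System.out.println(checkAvailable("UTF-8"));      // Yes
        System.out.println(checkAvailable("no-such-cs")); // No
    }
}
```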
After checking the user-defined converter program name and entering the display name, you can click on the OK button and the setting is done.
If you want to remove the setting, delete the converter program name from the text field and click OK. Note that if the removed converter has been used as the default converter for the Source/Target file encoding, you have to change these encodings as well; otherwise, later you will receive an error message that the User8/User16 converter is not found.
In addition to the Sun JRE converters, here are the converters built into this program:
- UnicodeASCII - the ctb converter translates any character outside the 7-bit ASCII range into Java hexadecimal notation (\uxxxx), the btc converter translates any \uxxxx string into Unicode character (if \u is not followed by four hexadecimal digits, this will be an error). The converter can be used as User16 converter only.
- KOI8_R is a converter (btc only, used if the JVM is not Sun's i18n, which includes both btc and ctb converters) for the Russian Cyrillic encoding defined in RFC 1489 (User8 converter).
- KOI8_U is a btc converter for the Ukrainian Cyrillic encoding defined in RFC 2319 (User8 btc converter).
- CyrBG is a btc converter for the old (Win 3.1) Bulgarian Cyrillic encoding (taken from ttf fonts - "Timok", "Hebar", ...; it is essentially Cp1251, but in the range from 0x80 to 0xbf there are differences, with the euro sign included at 0x88). It can be used as a User8 converter only.
- CyrMIK is a btc/ctb converter for the very old DOS Bulgarian Cyrillic MIK encoding (see www.szybora.com). It can be used as a User8 converter only.
- CyrMK is a btc converter for the old (Win 3.1) Macedonian Cyrillic encoding (taken from ttf fonts - "Macedonian Tms", "Macedonian Helv", ...; euro sign included at 0x80). It can be used as a User8 btc converter.
- ISIRI3342 is a btc/ctb converter for the 8-bit ISIRI 3342:1993 Iranian standard.
- VN1 is a btc/ctb converter for the 8-bit TCVN 5712:1993 Vietnamese standard.
- CCMH_Cyrl is a btc/ctb converter for Old Church Slavonic, from 7-bit ASCII Latin to Unicode Cyrillic and back.
- CCMH_Glag is a btc/ctb converter for Old Church Slavonic, from 7-bit ASCII Latin to Unicode Glagolitic and back.
- SR_Latn_Cyrl is a btc/ctb converter for Serbian, from Cp1250 Latin to Unicode Cyrillic and back.
- SH_Latn_Cyrl is a btc/ctb converter for the old Serbocroatian, from Cp1250 Latin to Unicode Cyrillic and back.
- MK_Latn_Cyrl is a btc/ctb converter for Macedonian, from Cp1250 Latin to Unicode Cyrillic and back.
CyrBG converts the Cyrillic letters 'I' with grave and 'i' with grave (0xAA and 0xBA) to \u040D and \u045D (the Unicode Cyrillic letters with grave). CyrMK converts the small Cyrillic letters 'i' with acute and 'ie' with acute (0x26 and 0x23) to \u045D and \u0450 (the Unicode Cyrillic letters with grave). The characters mentioned are not part of any other Cyrillic 8-bit encoding, and if the text includes any of them, you cannot convert it to Cp1251 or ISO8859_5; the only targets are Unicode and UTF-8.
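The ctb direction of the UnicodeASCII converter mentioned above can be sketched in a few lines. This is a simplified illustration of the escaping idea, not the program's actual converter:

```java
public class UnicodeAsciiDemo {
    // Pass 7-bit ASCII through unchanged; escape everything else as \uxxxx.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) sb.append(c);
            else sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("caf\u00e9")); // caf\u00e9 (four ASCII-escaped chars)
    }
}
```

The btc direction would do the reverse: parse each \u followed by four hex digits back into a single character, reporting an error otherwise.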
The converters defined/reported as 'btc only' can be used only on the input side, i.e. for re-encoding text to text. When you create a CTree it is not recommended to use a user-defined converter as the target encoding, because every time you open this CTree the same user-defined converter must be set. Since not all CPT programs have a User Encoding dialog, such dictionaries might be unreadable.
The last 5 converters are actually transliterators, not based on any official standard. They can be used as User16 converters (just a convention; for some operations like 'encode only' they can be used as User8 as well). All of them are included in the Transliteration Bar, where you can easily check their behavior. CCMH_Cyrl and CCMH_Glag are explained in detail in the file 'cu_Transliteration.txt' in the 'alphabet' directory; here is a sample picture:
The selected text from the text area is transliterated 'Glag to CCMH 7-bit' and then 'CCMH 7-bit to Cyrl', and the result is shown in the Transliteration Bar.
|5. Source Data Menu|
The items in this menu support all optional settings and operations for the source files. The essential items are Data Type and Open - you must have opened an input file to run any target processing.
5.1 Data Type
When this menu item is selected, you will see a dialog box with radio buttons for the input file data type.
Text Word List is the traditional format of a word per line.
Munched Word List is a word per line with an optional suffix mark. It is similar to the Unix ispell munched format, but not the same. The affix marks should be defined in a CPT Affixes file.
Tagged Word List is a word per line with optional tags, which should be defined in a CPT Tags file or CPT Affixes file.
CTree Dictionary is a binary file created by this program. For more details see Appendix A.8.
CPT Affixes is a text file defining affixes. For details see Appendix A.2.
CPT Tags is a text file defining tags. For details see Appendix A.3.
Plain Text is a text (or any binary) file that will be scanned to extract words or to be transformed.
HTML files can be in pure ASCII or one of the supported encodings like ISO8859_1, ..., UTF-8. Note that the encoding declared in the file (charset tag) will be ignored; the encoding used will be the one defined in the Source File Encoding dialog. The HTML tags are not interpreted (e.g. the "dir" attribute, CSS files, frames, etc.); the only interpretation is of character entity references (after '&', HTML 4.01 list) and decimal or hexadecimal codes (after '#'). The operations over these files are extracting words and "dehtml" - stripping the HTML tags.
Text Delimited is a "record per line" text format. Many database export utilities, and CPT Word Lists itself, produce such files. The text fields below the radio button are used to define the "delimiter character" of the fields in the record and the number of the field to be taken. To use TAB as the delimiter character, enter the string "\t". If the field number is set to 0, this means 'take all fields' and the file is expected to be in 'text dictionary' format (see Appendix A.5).
Word-Clue Format is similar to Text Delimited, but can contain a word and one or more clues, and no tags. You should select one of the supported variants from the list (see Appendix A.6).
Comments (--) means that there are lines starting with "--" which should be ignored. For the first variant any commented line starts with "--"; for all other variants a comment is a block starting and ending with a "--" line.
Note that the delimiter character (for the first variant) and the field number are valid here as well. Field number 1 is the word, 2 is the clue, and 0 is 'text dictionary'.
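Field extraction from a delimited record works along these lines. The helper below is hypothetical, shown only to make the delimiter/field-number convention concrete; the program's own parsing may differ in details:

```java
import java.util.regex.Pattern;

public class DelimitedDemo {
    // Returns field number n (1-based) from a record; n == 0 means
    // 'take all fields', i.e. return the whole line unchanged.
    public static String field(String line, char delimiter, int n) {
        if (n == 0) return line; // text dictionary mode
        String[] fields = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        return n <= fields.length ? fields[n - 1] : "";
    }

    public static void main(String[] args) {
        String record = "apple\tA round fruit\tnoun";
        System.out.println(field(record, '\t', 1)); // apple
        System.out.println(field(record, '\t', 2)); // A round fruit
    }
}
```

In Word-Clue Format the same convention applies: field 1 is the word and field 2 is the clue.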
5.2 Advanced Text Options
To start the dialog, select Source Data | Advanced Text Options from the menu.
5.2.1 Unicode Normalization
You can use the Normalize tab to select a normalization form and additional options. The idea is to transform the input text into one of the composed or decomposed forms, unifying Unicode characters for easier sorting and comparing of words. The standard forms are described in Unicode Technical Report #15. Here is a very brief summary.
In Unicode some "conceptual" characters or graphemes can be represented in two or more ways. The composition forms "shrink" the letter representation and the decomposition forms "expand" it. There are canonical and compatibility modes. Let's explain with an example: angstrom-sign (\u212b), A-ring (\u00c5), and A + ring (A + \u030a) are three variants of one letter glyph. The normalization of this letter is done as follows: if the input is angstrom-sign, A-ring, or A + ring, then for canonical decomposition the normalized form will be A + ring; for canonical composition the normalized form will be A-ring.
The next example involves compatibility forms. The Latin 'ffi' ligature has been used in many texts, and there is a single Unicode character, ffi-ligature (\ufb03). In new texts people use the 'ffi' string, and here the compatibility mode comes in.
If the input is 'ffi' string:
for canonical/compatible decomposition/composition, the normalized form will be 'ffi' string.
If the input is ffi-ligature:
for canonical decomposition/composition, the normalized form will be ffi-ligature;
for compatible decomposition/composition, the normalized form will be the 'ffi' string.
When the input text contains ligatures or characters in decomposed form (like A + ring), it should be normalized for the proper working of the word filters and the CTree style of sorting. In most cases the preferred form is one of the compositions.
The list in Unicode Normalization combo box contains the standard forms from Unicode Technical Report #15:
- Form C is canonical decomposition, followed by canonical composition;
- Form KC is compatibility decomposition, followed by canonical composition;
- Form D is canonical decomposition;
- Form KD is compatibility decomposition;
and several non-standard forms:
- Arabic Composition is composition of characters into ligatures;
- Thai Composition is composition of characters into syllable/ligature codes from the Unicode private area;
- Thai Decomposition is the reverse process of Thai Composition;
- No Accents is canonical decomposition followed by removal of accent marks. Jamo/Hangul are ignored (see below); combining characters that are not defined in the specification of canonical decomposition are not removed;
and on the top is:
- None - select it to disable any normalization.
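The standard forms above are available in Java 6 and later via java.text.Normalizer, so the angstrom and ffi examples can be verified directly (this program predates that API, so the code is an illustration, not its implementation):

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        // Angstrom sign, A-ring, and A + combining ring are canonically equivalent.
        String angstrom = "\u212B", aRing = "\u00C5", aPlusRing = "A\u030A";
        // Form C: canonical decomposition followed by canonical composition.
        System.out.println(Normalizer.normalize(angstrom, Normalizer.Form.NFC).equals(aRing));  // true
        // Form D: canonical decomposition.
        System.out.println(Normalizer.normalize(aRing, Normalizer.Form.NFD).equals(aPlusRing)); // true
        // The ffi ligature survives a canonical form but not a compatibility form.
        System.out.println(Normalizer.normalize("\uFB03", Normalizer.Form.NFC).equals("\uFB03")); // true
        System.out.println(Normalizer.normalize("\uFB03", Normalizer.Form.NFKC));                 // ffi
    }
}
```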
The check box Plus Excluded means "compose all characters" (according to the specification there is a so-called "Exclusion List", and the characters from that list should not be composed). This option can help you obtain proper composition of some texts, e.g. in Yiddish, despite the formal norms.
The check box Ignore Jamo/Hangul means "do not decompose Hangul syllables into the Jamo alphabet and vice versa". This is an option because the Unicode standard specifies that Hangul to Jamo is a canonical decomposition. If you want to keep these characters unchanged, set the check box.
If Arabic Composition is selected, via the combo box Arabic Ligature Set you can choose a subset of ligatures:
- Lam Alef is the minimal one supported by all Arabic fonts;
- Traditional Arabic is the set supported by the font having the same name;
- Unicode is the whole set defined in the currently supported version of the standard.
During composition of ligatures the forms (isolated, initial, medial, final) are determined according to the shaping rules. Note that the source text should be in logical order.
The check box Shape Letters As Well can be used to obtain full Arabic shaping of all characters in the text. If it is set, there is no need to set the Shaping flag in Display Options dialog.
If Thai Composition is selected, via the combo box Thai Set you can choose the set of custom codes:
- Syllables is the whole set;
- Single Cells is a reduced set, containing codes only for the base characters and the combining marks that can form a single composed cell or glyph.
The Thai Composition includes preprocessing that reorders marks: the tone marks are moved after the other marks, and when the set is Single Cells, the Sara Am is decomposed. Note that only the Syllables set supports proper sorting and word breaking. Do not expect any program outside the CPT kit to understand Thai composed text; you should use Thai Decomposition to export it.
The program uses tables built from UnicodeData.txt version 5.1.0 (BMP only), and the results of normalizers using other versions might differ from ours.
The normalization follows the Unicode encoding of the source text and always precedes all other conversions and filters.
For more details, see Appendix B.
5.2.2 Custom RTL Conversion
The options on the Bidi tab can be used to override the default bidi processing (when Custom RTL Conversion is set).
The group of radio buttons Default Line Direction offers the following options:
- None - this is the default: the direction is defined by the first character from the line having LTR or RTL bidi direction;
- LTR - force left to right;
- RTL - force right to left.
The radio button To Logical Order, when set, forces conversion of the input text from visual order to logical order.
The radio button To Visual Order, when set, forces conversion of the input text from logical order to visual order using one of the options:
- Unicode Bidi - the standard bidi conversion;
- BDO (Bidi Override) - similar to HTML 'BDO' option: only the explicit Unicode RTL/LTR codes are interpreted.
The CPT bidi processing claims conformance to Unicode Technical Report #9 (rev. 15). There are many similar implementations, and all are compatible - partially...
5.2.3 Handle Hyphenation
This option is valid for plain text input files. If the check box is set, the software follows this hyphenation rule: if a <hyphen> appears before the <Unicode NL>, and this sequence is preceded by at least two letters and followed by a letter, the sequence is stripped. This way the parts of the hyphenated word are concatenated into a single word string. The rule is too simple to cover all variants of hyphenation in all languages and should be used with care.
For <Unicode NL> we consider one or more of: '\r', '\n', '\u0015', '\u000b', '\u000c', and '\u2028'. The <hyphen> is defined by the radio buttons: Soft means any of '\u00ad', '\u2010', and '\u058a'; Hard means '-'.
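The rule can be sketched as a single regular expression. This simplified illustration handles only the hard hyphen and '\r'/'\n' newlines, not the full <hyphen>/<Unicode NL> sets:

```java
public class DehyphenateDemo {
    // Strip "<hyphen><newline>" when preceded by at least two letters
    // and followed by a letter, joining the halves of a hyphenated word.
    public static String dehyphenate(String text) {
        return text.replaceAll("(?<=\\p{L}\\p{L})-\\r?\\n(?=\\p{L})", "");
    }

    public static void main(String[] args) {
        System.out.println(dehyphenate("conca-\ntenated words")); // concatenated words
    }
}
```

Because the lookbehind demands two preceding letters, a line ending in a one-letter fragment (e.g. "a-") is left untouched, matching the rule's stated precondition.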
5.3 File Encoding, Locale
When you select Source Data | File Encoding, Locale, the dialog standard for CPT programs will be shown:
Usually you have to set these options just once; save them via File | Save Settings As Default and don't bother with them any more.
The Encoding combo box contains the built-in display names of the Java converters. Here you have to select the proper encoding of the input file. At the top of the list there is a special item, "Java Default (...)": it reflects the font.properties/fontconfig.bfc file of your Java RTE and is used when the encoding is not set explicitly.
RTL in logical order, if set, switches on the default bidi processing. This check box is enabled for locales that can use RTL scripts.
Country and Language combo boxes are used to select the Locale of the file (shown under the combo boxes). In some cases the locale of the file does not matter. It is used in the Search in Source dialog and when the input data type is Munched or Tagged Word List (for tag filters). In these cases the alphabet subdirectory is scanned for the affixes file (<Locale>.aff).
5.4 Display Options
Via the menu Source Data | Display Options you start the dialog for the display options used in the text area and in the text fields of the search dialogs in CPT Word Lists.
The combo boxes at the top define the font used for the display. Font Family shows the list of fonts supported by the Java RTE. For Java 2 versions (1.2 and above) and jview these are the native font names installed in your OS; for Java 1.1 these are only the logical font names defined in the "font.properties" file.
Via the Style and Size combo boxes you can select font properties such as italic style or size 12.
The Encoding combo box shows the default encoding. For the older Java 1.1 versions it can be used to change the encoding of the logical font without modifying the "font.properties" file. For example, if the current encoding is ANSI (Cp1252), you can change it to Thai (Cp874 or MS874). Note that this will work only if the Thai locale is installed on your Windows. Since the font-encoding problem is solved in Java 2 and jview, in these cases the combo box is disabled and you just have to select a native font that supports the desired script (Angsana New in the Thai example).
Under Linux this combo box is enabled for all Java versions because it is used to define the keyboard/clipboard converter (when one of the flags is set). We do not provide keyboard drivers. The keyboard converter is used to encode the 8-bit text from your Linux keyboard driver to Unicode.
The Clipboard converter will be shown only for Java 1.3.1 and above. If you exchange data with applications that properly support UTF8_STRING or COMPOUND_TEXT (all Java and KDE3 applications, Mozilla, etc.), you should switch off this flag. If it is switched on, the behavior is the same as for the keyboard converter.
When the View Source check box is set, the initial part of the input file will be shown in the CPT Word Lists text area when the file is opened. In the Lines to View text field you can enter how many lines to view. Showing just the initial part of the file is very convenient when the source is a huge file and you would otherwise have to wait a long time or could run into virtual memory problems.
The check box CTree Header, if set when the source is a CTree, enables the dictionary data (number of words, encoding, title data, word lengths, letter frequencies, etc.) to be displayed before the word list.
The check box Right Alignment, if set, forces right alignment of the text area and of the text fields of the search dialogs. This option is provided for the RTL scripts, but it is completely independent of the bidi processing.
5.5 Transliteration Bar
This check box menu item shows and hides an additional bar containing a text field, a button, and a combo box. You can select a transliteration operation from the combo box, and when the button is pressed, the text in the field will be transliterated. The behavior of the controls depends on all the settings you have made via the Source Data menu. The list of available operations is as follows:
- To \uxxxx and From \uxxxx are used to check the hexadecimal presentation of the characters (in the text field the default automatic conversion from \uxxxx is disabled). All to \uxxxx converts 7-bit ASCII codes as well.
- Source normalization and Source custom RTL will perform the operation as defined in Advanced Text Options dialog.
- To upper case and To lower case are for letter case conversion (with the Special Casing flag on).
- CCMH 7-bit to Cyrl, Cyrl to CCMH 7-bit, CCMH 7-bit to Glag, and Glag to CCMH 7-bit perform the operations as described in CPT Converters.
- SR Latn to Cyrl, Cyrl to SR Latn, MK Latn to Cyrl, Cyrl to MK Latn, SH Latn to Cyrl and Cyrl to SH Latn also perform the operations as described in CPT Converters with the addition that several Latin characters, which are not in Cp1250 (like NJ \u01CA), are converted to their Cyrillic equivalent as well.
- ISO 9 Cyrl to Latn, ISO 9 Latn to Cyrl, Cyrl-BGN Cyrl to Latn, Cyrl-BGN Latn to Cyrl, Cyrl-Names Cyrl to Latn, and Cyrl-Names Latn to Cyrl transliterations can be used only in this bar. Except for 'ISO 9', these are groups of transliterators for the languages Belorussian (be), Bulgarian (bg), Macedonian (mk), Russian (ru), Serbian (sr), and Ukrainian (uk). The program uses the locale of the Source file (defined in the 'Source File Encoding' dialog). If it is not in the list above, ru is used.
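The To \uxxxx and From \uxxxx operations described above amount to a simple escape codec. A minimal Java sketch with hypothetical names (not the program's actual code):

```java
// To \uxxxx: show each character above 7-bit ASCII (or every character,
// for "All to \uxxxx") as its four-digit hexadecimal escape.
// From \uxxxx: decode such escapes back to characters.
public class UxxxxCodec {
    public static String toUxxxx(String s, boolean all) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (all || c > 0x7f) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }
    public static String fromUxxxx(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\\' && i + 5 < s.length() && s.charAt(i + 1) == 'u') {
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 5;                         // skip the consumed escape
            } else out.append(s.charAt(i));
        }
        return out.toString();
    }
}
```

For example, fromUxxxx("\\u0031") yields "1" ("DIGIT ONE"), matching the example used in the Search Words section.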
5.6 Open
Use this menu item to start the Open file dialog. Select the input file and it will be "opened": shown in the text area (if the View Source option is set), the file name will appear in the status bar, and the target operations will be enabled. The target operations use the contents of the file, not the contents of the text area. After opening a file, do not change the Source options (data type, encoding, etc.): the result of the display and of the target operations could be unpredictable. If you change some of the settings (except the font), the file should be reopened.
5.7 Search
This menu item starts the Search in Source dialog box. The searching is done only in the text displayed in the text area. Most of the settings defined for the Source are used in this dialog (font, locale, normalization, bidi). For example, if some Unicode Normalization has been applied to the Source, it will be applied to the search patterns as well (e.g. Arabic Composition).
There are two main modes of searching: strings and words.
Match String enables the standard searching. In this mode you can use the \uxxxx notation (but no regular expressions) and the options:
- Ignore Case is implemented merely by converting the characters to lower case, with no special casing.
- Compatible Decomposition means to apply 'Unicode Normalization Form KD' to the pattern and the source. If the source is 'Thai Composed', 'Thai Decomposition' will be applied as well.
- Ignore Non-spacing Marks is enabled if the previous option is set. It means to strip all non-spacing marks from the pattern and the source.
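These three options map naturally onto the standard Java text APIs. A hedged sketch with hypothetical names, assuming java.text.Normalizer is available (it appeared in Java 6; the program also targets older Java versions, so its internal implementation necessarily differs):

```java
import java.text.Normalizer;

public class FoldPattern {
    // Ignore Case: plain per-character lower-casing, no special casing.
    public static String foldCase(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) out.append(Character.toLowerCase(c));
        return out.toString();
    }
    // Compatible Decomposition: Unicode Normalization Form KD.
    public static String nfkd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }
    // Ignore Non-spacing Marks: strip category Mn after decomposition.
    public static String stripMarks(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : nfkd(s).toCharArray())
            if (Character.getType(c) != Character.NON_SPACING_MARK)
                out.append(c);
        return out.toString();
    }
}
```

Both the pattern and the source text are passed through the same folding, so "e" will match "é" when both options are on.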
Match Word switches to target word searching. This mode emulates extracting the words from the source. The word filters from Target Data | Word Properties (Target Words dialog) will be used while searching words. There are some restrictions. If Whole Line as a Word is checked, it will be ignored and Letters mode will be used instead. If one of the Locale ... filters is checked, the locale of the source will be used. In the case of Locale Dictionary File, not all automatic conversions are supported (you will see a "... incompatible" message). To check your word filter settings, just click the "Find Next" button repeatedly (there is no need to enter a search pattern). If you want to find words containing specific patterns, you can enter a search word pattern as well. In this case the \uxxxx notation and the regular expressions described in the Base Words | Search Words dialog can be used.
Not in Base Words turns on simple spell checking. It is available only if some Base words are opened. On clicking the "Find Next" button, the next word from the Source will be searched in the Base words. If it is not found there, a "Found at ..." message will appear. Yes, this sounds strange, but this is the way spell checkers work. You cannot enter a search pattern in this mode.
Using the Down and Up radio buttons you can change the direction of the search. By clicking with the mouse in the text area you can fix the current position (which is not shown when no text is selected).
5.8 Close
When this menu item is selected, the source file will be "closed": the text area will be cleared and the target operations will be disabled.
|6. Base Words Menu|
This menu is used to select Base words for 'target operations with base words' like delete/add/compare, or for single 'base words operations' like search words, dump in text file, and pack suffixes (CTree, '5-bit per character' and 'long words' types only). After opening some Base words, the 'base word operations' will be enabled.
The item Open CTree Dictionary As... has three sub items:
- Tree (...) - use this for delete/add operations or for high speed; the memory needed is usually very big, depending on the number of words in the dictionary (opening a CTree with clues can be very slow because of the additional conversions);
- Mixed (...) - use this for high speed with CTrees with clues (add/delete not permitted);
- Packed (...) - use this only for search/compare operations, with low speed (not always) and very low memory needed.
Open Text File (target encoding, sort) is used to open a text word list as the base word list using the target settings - text files do not contain any information about the encoding, locale, etc. Note that the Target Sort type defines the 'one-byte' or 'two-byte Unicode' mode.
Close menu item will free the memory used and will clear the additional status bar for the Base words opened.
Search Words will start the Search Base Words dialog explained below.
All Embedded Data applies to the next three items below. When checked, all data of the dictionary will be dumped, not just the text word list. If the Base dictionary is a CTree with strict alphabet and tags, the alphabet and tags files will be dumped as well. If the Base is a picture dictionary, all pictures will be dumped too.
Dump in Text File starts the Save file dialog to select the file; the Base words will be written there as a text file using the current CTree encoding.
Extract in Text File is similar, but it is valid for CTrees with tags/clues and only if some tags selection is made via the Global Tags dialog. The words having the selected tags will be extracted into the file in text delimited format.
6.1 Pack Suffixes
Pack Suffixes is an optional menu item for CTrees ('5-bit per character' and 'long words' types) opened as Base words in packed mode and having no packed suffixes yet. This is a slow operation that can reduce the nominal dictionary size by 10% to 50%. After clicking the menu item, you will be asked for the new file name and then the operation will start. You can get an 'OK' or a 'Failed' message. The latter means that no common suffixes were found or that the packing would not reduce the nominal CTree size.
'5-bit per character' CTree: this packing of suffixes is different from the packing described in Target Action, Affixes tab, but the final format of the CTree is the same. When you have created a CTree with packed suffixes and it is opened in mixed mode, the core word letters will be expanded as a tree and the packed suffixes will remain packed. If opened in tree mode, all word letters will be expanded as a tree, and if add/delete operations are done, the dictionary will be saved without packed suffixes. If opened in packed mode, no change operations are allowed and the file will keep the packed suffixes.
'long words' CTree: this packing of suffixes just reduces the file size; a new file is created only if its size is at least 10% smaller. When this CTree is opened, the suffixes are always unpacked.
In CPT Word Lists the Base words are opened and read into RAM. If the CTree word list contains 1M or more words, the virtual memory might easily be exhausted, and this is the case when the mixed and packed modes should be used. The change (tree) mode is still available, but possibly with swapping: a 1MB packed file takes many MBs of RAM in tree mode. The text word list files are opened in 'Java Array' mode and the memory used should be a 'linear' function of the file size. 'Linear' is in quotes because the different JVMs have their own opinion about the heap used and the garbage collection.
Here are some figures (with 160MHz CPU used when the first version of the program was developed):
A 2MB text word list with 200K words takes 300KB as a CTree (5-bit per character) file. Packing the suffixes takes 60 seconds and the CTree file will be about 270KB. Opened in tree mode this file takes 3MB of RAM, in mixed mode 1.5MB, and in packed mode 300KB.
A 12MB text word list with 1M words takes 900KB as a CTree (5-bit per character) file. Packing the suffixes takes 5 minutes and the CTree file will be about 700KB. Opened in tree mode this file takes 13MB of RAM, in mixed mode 4MB, and in packed mode 1.5MB.
6.2 Search Words
To start this dialog, select Base Words | Search Words from the menu; you should have opened some Base words in advance.
In the text area of the dialog window, short information about the Base words will be shown. If the letter case of the word list is upper or lower, the searching will be done in ignore case mode; otherwise, the mode will be case sensitive. If the Base words are a CTree with tags, the 'display codes' of the tags will be appended to the found word. In the case of a CTree with clues, each tags-clue pair will be shown on two separate indented lines.
The search pattern you can enter is restricted to a subset of regular expression syntax, using the following special symbols:
- '*' matches 0 or more characters;
- '?' matches exactly one character;
- '|' starts an alternative pattern to search;
- '\char' is the 'escape' way to match char, where char is any Unicode character;
- '[set]' matches any one character from the set;
- '[^set]' matches any one character not in the set.
The set can have the following forms:
- 'charString' includes the characters from the string;
- 'char1-char2' includes the characters from char1 to char2 in ascending Unicode order.
Inside the set, '[', '*', '?', '^', and '\' have no special meaning, ']' is not allowed, and '-' is allowed only as a special character. This way, you can search for a string containing the character '?' using the pattern "*[?]*" ('?' is not interpreted as special). To search for '[' you have to use escaping ('\['). The Java hexadecimal notation (\uxxxx) for specifying a char can be used as well. For example, "\u0031" means "1", or "DIGIT ONE".
Since the idea is to match words, not strings, the meaning of '*' and '?' is the same as used in operating systems for file name expansion, but different from the classic regular expressions used in egrep and friends. Here are two examples:
- will match all words having 4 or more letters, starting with 'ase' but not including 'e' or 'x' as a fourth character;
- will match all 5-letter words starting with 'ba' and ending with 'a', and all 6-letter words starting with 'ga'.
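One way to interpret this pattern syntax is to translate it to java.util.regex and match against whole words. This is only an illustrative sketch with hypothetical names; the program implements its own matcher (it predates java.util.regex and also matches directly inside CTrees):

```java
import java.util.regex.Pattern;

public class WordPattern {
    // Translate the glob-like syntax above into an anchored Java regex.
    public static Pattern compile(String p) {
        StringBuilder re = new StringBuilder();
        boolean inSet = false;
        for (int i = 0; i < p.length(); i++) {
            char c = p.charAt(i);
            if (inSet) {
                if (c == ']') { inSet = false; re.append(']'); }
                else if (c == '\\' || c == '[') re.append('\\').append(c);
                else re.append(c);          // '-' and a leading '^' keep their set meaning
            } else switch (c) {
                case '*':  re.append(".*"); break;          // any run of characters
                case '?':  re.append('.');  break;          // exactly one character
                case '|':  re.append('|');  break;          // alternative pattern
                case '[':  inSet = true; re.append('['); break;
                case '\\':                                   // escape: next char is literal
                    if (i + 1 < p.length())
                        re.append(Pattern.quote(String.valueOf(p.charAt(++i))));
                    break;
                default:   re.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(re.toString());
    }

    // Whole-word match, as in file name expansion (not substring search).
    public static boolean matchesWord(String pattern, String word) {
        return compile(pattern).matcher(word).matches();
    }
}
```

With this reading, a pattern such as "ase[^ex]*" matches words of 4 or more letters whose fourth character is neither 'e' nor 'x' (the class matches exactly one character; '*' independently matches any run), and "ba??a|ga???a" matches the two word shapes from the second example.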
The syntax of the search pattern is checked, but you could receive an error message even for a well-formed expression due to restrictions in the implementation (e.g. protected CTrees).
If you want to stop a search that is producing a big list, just click the Stop button, which appears after the search starts. If you want to save the search result in a file, then after entering the search pattern, instead of pressing <ENTER>, click the left button and choose the file name; the result will be written there (unless the CTree is protected). If the dictionary contains pictures, they will be written as separate files as well.
When there are pictures shown in the text area, you can click on a picture with the right mouse button; a popup menu with two possible items appears:
- Copy Image - to copy it to the clipboard (available for Java 1.4 and above);
- Show Image - only when the picture is a thumbnail. The original picture will be shown in a separate window, where you have an additional popup menu item: Save File.
|7. Target Data Menu|
The items in this menu support all optional settings and operations for the target files. The essential items are Data Type, Choose Action, and Run. You must have an input file opened to be able to run the target processing.
7.1 Data Type
When this menu item is selected, you will see a dialog box with radio buttons for the output file data types. They are a subset of the input data types (except Alphabet File and Picture Dictionary), so you can refer to their descriptions under Source Data Type.
Picture Dictionary is to create separate picture dictionary (not CTree with pictures).
Alphabet File is used to create an alphabet file based on the source words. You may need to edit this file, because the current Java locale collator is used for the sorting. For details see Appendix A.1.
Note that not all combinations of input data type and output data type are valid. The details are in Data Types and Operations below. If you have selected an invalid combination or operation, you will see an error message.
7.2 File Encoding, Locale
The dialog items are the same as in the corresponding Source dialog, but they apply to the output file.
The difference is that the locale setting is used in more cases. If the word filters (Target Words dialog) have Locale Letters/Alphabet File set, or the CTree is defined to use Strict Alphabet, the alphabet subdirectory will later be scanned for an alphabet file named <Locale>. Similarly, for Locale Dictionary File a <Locale>.dic file will be needed.
UTF-8 and CTree
If the clues contain a relatively small quantity of real Unicode characters (not basic Latin, or not in Cp1251), you can select UTF-8 encoding in order to reduce the file size. The dictionary will appear externally as Unicode, but the clues will be encoded in UTF-8. If the target locale is a Cyrillic one (be, bg, mk, ru, sr, uk), the clues will be encoded in a custom UTF-8C encoding, which is Cp1251 plus UTF-8.
7.3 Word Properties
Use this menu item to start the Target Words dialog for defining the word filters, i.e. what should be considered a 'word'.
Generally, a word is a sequence of 'word-characters'. In the input text, the words are bounded by 'non-word-characters'. As a minimum, 'non-word-characters' include the line breaks. The idea of the filters is to define what to include in the 'word-characters' class; the rest of the Unicode characters are added to the other class.
From the Filter tab you have to select the main filter.
Whole Line as a Word means that actually no filters will be applied (just the word length) and the 'word' will be the input text line, or a field from a record in the case of input Text Delimited/Word-Clue.
If Allow One Space is set, the program will cut the input line before the second space; this usually means that the first two words will be taken. Allow Two Spaces is similar: the first three words. Strip Spaces will remove the spaces before the first letter and after the last letter of the word. Stop Characters - you can enter one or more characters in this field; the word will be cut at the first position where one of these characters is met.
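The space-based cutting can be sketched as a small helper (hypothetical name, assuming the 'cut before the N+1-th space' behavior described above):

```java
public class SpaceCut {
    // Allow One Space corresponds to allowed = 1 (keep at most two words),
    // Allow Two Spaces to allowed = 2 (keep at most three words).
    public static String allowSpaces(String line, int allowed) {
        int seen = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == ' ' && ++seen > allowed)
                return line.substring(0, i);    // cut before this space
        }
        return line;                            // few enough spaces: keep it all
    }
}
```

For example, allowSpaces("one two three four", 1) yields "one two".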
Letters switches on various filters: the 'word-characters' class will include all Unicode characters having a letter category (Lu, Ll, Lt, Lm, and Lo). Note that the check boxes in the Letters tab can expand or shrink this class.
Locale Alphabet File, if set, defines the 'word-characters' class to include only the characters from the locale alphabet file. For the format of the alphabet files see Appendix A.1.
Locale Dictionary File, if set, defines the 'word-characters' class to include only the characters from the dictionary file, which should be in the 'alphabet' subdirectory and should have the name <Locale>.dic. In this case, the dictionary and a recursive program are used to break the input character sequences into words. The current version contains only one dictionary, "th.dic" for the Thai language, but you can create a CTree word list for any locale and use it. For the options, see the Dictionary tab below. This flag is valid when the Source is Plain Text or HTML.
In the Length tab you can use the Minimal/Maximal Word Length text fields to enter the limits of the word length, counted in characters. The range is from 1 to 1000.
In the Letters tab there are many check boxes to refine the 'word-characters' class.
Since a lot of Unicode characters are letters, you can use Locale Letters as an excluding filter. When it is set, the program will reject all words having letters not in the alphabet defined by the current locale. If the alphabet file is not found, the program will use internal tables (see the appendix) or will ignore this flag.
Include Marks will include all Unicode characters having a mark category (Mn, Mc, Me) in the 'word-characters' class.
Include Apostrophe/Hyphen/Dot will include the corresponding ASCII character in the 'word-characters' class.
Include Numbers will include all Unicode characters having a number category (Nd, Nl, No) in the 'word-characters' class.
No All-Uppercase, if set, will exclude words having all letters in upper case.
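The basic Letters filter can be approximated with the standard Character category API. A sketch with hypothetical names; the real program layers the other check boxes (marks, apostrophe/hyphen/dot, numbers, case) on top of this basic loop:

```java
import java.util.ArrayList;
import java.util.List;

public class LetterWords {
    // A word is a maximal run of characters whose Unicode category is
    // Lu, Ll, Lt, Lm, or Lo; everything else is a word boundary.
    static boolean isWordChar(char c) {
        switch (Character.getType(c)) {
            case Character.UPPERCASE_LETTER:   // Lu
            case Character.LOWERCASE_LETTER:   // Ll
            case Character.TITLECASE_LETTER:   // Lt
            case Character.MODIFIER_LETTER:    // Lm
            case Character.OTHER_LETTER:       // Lo
                return true;
            default:
                return false;
        }
    }
    public static List<String> split(String line, int minLen, int maxLen) {
        List<String> words = new ArrayList<>();
        StringBuilder w = new StringBuilder();
        for (char c : (line + "\n").toCharArray()) {   // trailing NL flushes last word
            if (isWordChar(c)) { w.append(c); continue; }
            if (w.length() >= minLen && w.length() <= maxLen) words.add(w.toString());
            w.setLength(0);
        }
        return words;
    }
}
```

For example, split("Hello, world! 42", 1, 1000) yields the words "Hello" and "world"; the digits are rejected because Nd is not a letter category.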
The Dictionary tab offers the following options:
- Packed Format - to open the CTree in packed format.
- Mark New Words - the words not in the dictionary will be marked with "***" (do not set this when creating a CTree - the stars would be written);
- New Words Action - Locale Default - if you do not want to go into details, select this option; otherwise, read below.
- New Words Action combo box allows you to select how to form new words when the program is not able to break the input sequence completely. The locale characters are divided into two classes: non-starters (ns), i.e. punctuation and marks, and starters (st), i.e. letters, which can start a word. The unbroken part can form a new word separately or in combination with the previous (<) or the next (>) word. Below, 'st' denotes a sequence whose first character is a starter, and 'ns' denotes a sequence of non-starters only. Here are the options:
- A. ns + st - single word;
- B. ns | st - two words;
- C. ns | st> - nonstarters in separate word and the following starters combined with the next word;
- D. <ns | st - nonstarters combined with the previous word and the following starters in separate word;
- E. <ns | st> - nonstarters combined with the previous word and the following starters combined with the next word;
- F. <ns + st - all combined with the previous word;
- G. ns + st> - all combined with the next word;
- H. <ns + st> - all combined with the previous and the next word;
- C,E Maximal Starters defines the maximal number of starters that can be used for options C and E. If the number of starters is less than this maximum, the algorithm will use the preceding options (B and D).
7.4 CTree Options
If you have selected CTree Dictionary as the target data type or CTree as the sorting engine, use this menu item to start the CTree Options dialog and to define some additional, CTree-specific settings. Note that CTrees have many parameters, which are set via all the Target Data menu items, not only via this dialog.
Once the options are set during dictionary file creation, they cannot be changed later. If you need to change something, create a new dictionary using the old one as Source. As a rule, when recreating a CTree, the word filters should be switched off.
Some of the optional data and tables can be seen only when the CTree Header flag is set in the Source Data | Display Options dialog.
When Strict Alphabet is set, the ordered list of locale characters will be included in the dictionary and will be used as a word filter and as the sorting order. If the number of characters used is less than 33, 5 bits per character will be used internally. This is the case for most alphabets when only the upper or the lower character case is used. Some of the specific CTree operations are implemented only for this 5-bit per character format.
The radio button From File means that the alphabet characters will be taken from the locale alphabet file, which by default is expected to be in the alphabet directory. If you want to use another file, click the 'Set Alphabet File' button. If you need to revert to the default, click the button and select 'Cancel' in the Open file dialog.
Built in means that the program will use internal tables (3 alphabets in this version: Bulgarian-Cp1251, English-ASCII, and Russian-Cp1251).
The check box Lower Case < Upper Case means that a lower case letter sorts before the corresponding upper case letter. This is valid when both upper and lower case characters are used and the sorting type is Locale. Note that there is no similar option for Array and JTree: the standard Java collators use fixed tables.
Letter Frequency Counters means to maintain a table containing the frequency of every unique letter in the words of the dictionary.
Crossword Words switches on the 'crossword transliteration'. All non-letter symbols will be ignored. For the English locale the accents will be removed and some non-English letters will be replaced. Additionally, you have to select Change Letter Case and Lower in the 'Target Action' dialog, 'Options' tab.
Word Length Counters means to maintain a table containing a word counter for every word length in the dictionary.
The check boxes denote additional data to be added to every word in the dictionary. When tags/clues are added, the dictionary is referred to as 'CTree with tags/clues' in this document. The tags should be defined in a CPT Tags file (see Appendix A.3), which by default is expected to be in the alphabet directory. If you want to use another file, click the 'Set Tags File' button.
The 'Set IPA8 File' button is used to switch on the 'IPA8' feature and to select the file. 'IPA8' means that you have clues with 'ipa8' tags, encoded in a custom 8-bit encoding. For details see Appendix A.7 and the 'samples' directory.
Morphology Tags means to add 'morpho' tags.
User Tags means to add 'user' tags.
Topic Tags means to add 'topic' tags.
Definitions/Clues means to add 'clue' tags and definitions.
Strict Alphabet restrictions/conversions are never applied to the text of the definitions. If you have selected UTF-8 as the target file encoding, the dictionary will be maintained externally as Unicode, but inside the file the definitions will be in UTF-8. If the locale is a Cyrillic one, the definitions will be written in the custom UTF-8C encoding (Cp1251 plus UTF-8).
Has Pictures should be checked when you want to include pictures. Actually, the program will create an additional picture dictionary file having the same name as the CTree but with the extension 'pdc'.
Inverted Index means to create an index of the clues' usage in the main word list (i.e. for every clue, a list of the words which use/refer to this clue). The idea is to be able to browse the file in both directions. This inverted index is used in the other CPT programs like CPT Dictionary. For bigger dictionaries it is an expensive operation: 150K words with clues will take 4 minutes and add 40% to the file size. Note also that CPT Dictionary can create its own inverted index in a separate file. If the dictionary is intended to be stored on read-only media, or if you want to save the time of all users, you can include the inverted index via this option.
When you add 'user', 'topic', or 'clue' tags, the Source data type should be Text Delimited (text dictionary format) or equivalent CTree.
You can use the text fields in this tab to enter optional user data, which will be written in the dictionary file. The text is not restricted by the alphabet or by the encoding of the dictionary (it is saved in Unicode).
Pack Suffixes is valid for 'long words' and '5-bit per character' CTree types (without clues). The program will try to reduce the CTree size by packing the common suffixes. If the operation fails, the original CTree will be saved (see Pack Suffixes above).
Long Words or Phrases will force internal structures, which are more efficient for packing long words or phrases.
Packed Format means to compress the data.
Locked means 'no more changes': it disables the dumping, the editing, and the opening as Source. Copying to the clipboard will be restricted to a single line in all CPT programs.
Protect with Password (if a password is entered) forces the same protection scheme as above, but the password will be required to open the file. If you have forgotten the password, the file is lost.
7.5 Pictures
Picture Options... and Test with a Picture... are intended for working with pictures. Here you can set and test various options when creating a Picture Dictionary or a CTree with pictures.
Generally, the following image formats are supported: JP2, JPC, PGM, PGX, PNM, PPM, JPG, GIF, BMP. The program will check whether Sun's "JAI Image I/O Tools" or "Java Advanced Imaging API" is installed and will use them. So, depending on the JVM, you may also be able to work with: ICO, CUR, WMF, EMF, TIF, PNG, WBMP, PCX, FPX.
As thumbnails, only JPG, JPC, and GIF are supported.
To include a picture in a CTree you have to use a clue tag starting with 'pic' and the picture file name as a clue:
To include a picture in a Picture Dictionary just use a line:
Probably this is the place to remind you that for Picture Dictionary the sorting engine is JTree, and only one picture is maintained per key.
7.5.1 Picture Options
This is the dialog where you can set how the pictures will be included. The settings are global for all pictures processed for a dictionary file.
Include Unsupported means that the program will not reject any input picture.
Convert MS pictures to BMP will be shown only with the MS JVM. Before any other processing, ICO, CUR, WMF, and EMF files will be converted to BMP.
Convert Pictures to, when checked, forces conversion of the input pictures to one of the JPEG 2000 formats, JP2 or JPC. JPC gives a slightly smaller file size because no custom color tables are included.
Rate % defines the compression as a ratio in percent between the picture size in memory and the output file size. 100% is lossless. The important difference between JPEG and JPEG 2000 is that for good quality of JPG files a value of 75% is OK, while for JP2 files the same quality can be achieved with a value of 20%, i.e. much better compression. You can enter -1 to let the program choose the percentage for you.
Gray is to convert images to gray scale.
Resize - check it to change the size of the picture according to the values entered in Width and Height (in pixels).
Area defines that the input height and width will be changed proportionally to meet the area defined by the given values.
Down is to scale down: the picture will be resized only if the input height and/or width are bigger than the given values.
Up is to scale up: the picture will be resized only if the input height and/or width are smaller than the given values.
Fixed - the output width and height will be exactly the given values; this is the only non-proportional resize.
Speed over Quality selects a faster algorithm.
Add Thumbnails - check it if you want every picture to be accompanied by a thumbnail in the dictionary.
Include - the program will look for a thumbnail file in the same place as the picture file, but with a name having 't' appended. If the picture is pic100.bmp, the thumbnail is expected to be pic100t.jpg, pic100t.jpc, or pic100t.gif.
Make - the program will create the thumbnail.
JPG and JPC define the image format. With MS JVM you will see GIF instead of JPG.
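The thumbnail lookup rule from the Include option can be expressed as a small helper (hypothetical name, mirroring the naming convention described above):

```java
public class ThumbName {
    // For pic100.bmp, the thumbnail candidates are pic100t.jpg,
    // pic100t.jpc, and pic100t.gif, tried in the supported formats.
    public static String[] candidates(String pictureFile) {
        int dot = pictureFile.lastIndexOf('.');
        String base = dot < 0 ? pictureFile : pictureFile.substring(0, dot);
        return new String[] { base + "t.jpg", base + "t.jpc", base + "t.gif" };
    }
}
```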
7.5.2 Test with a Picture
The idea of this menu item is to check all the settings from the previous dialog. You can use it to convert individual files as well.
First you have to select the picture file.
The program will process it and the following windows will be shown:
- a message dialog showing the lengths in bytes of the input file (not the size in memory), of the output file, and of the thumbnail;
- a window with the converted picture; via a right mouse click you can save the file;
- a window with the thumbnail; via a right mouse click you can save the file.
7.5.3 Create Picture Dictionary Source File
When Create Picture Dictionary Source File from Directory... is selected you will see 'Open Directory' dialog and then 'Save' dialog.
The program will take the names of all recognized picture files from the selected directory, create a source file for a picture dictionary, and save it under the name you selected in the second dialog. For example, if Picture0005.jpg and Picture0028.jpg are in the directory, the saved file will contain:
7.6 Tags
There are two dialogs for the selection of tags for operations like filtering, extraction, or adding tags. The first (Global Tags) has access to all possible tags groups, while the second can operate on a single group only. The dialogs are connected to different operations and support different selection schemes.
7.6.1 Global Tags
Use this dialog to select subsets of tags from the available tags groups for the operations 'add tags' and 'extract base words using tags selection and word length filters'. The menu item Target Data | Global Tags has three sub-items:
- Change current... will start the dialog with the existing settings;
- Set new from file | <Locale>.tag will take the default file from alphabet directory and then will start the dialog;
- Set new from file | Other file... will ask for a CPT Tags file name and then will start the dialog;
- Set new from Base... will take the tags from the current open Base CTree and then will start the dialog.
To clear the global tags setting select Set new from file | Other file... and then 'Cancel' in Open dialog.
Use the combo box to select one of the groups. The tags from this group will be shown in the list box below. Each tag is presented by its code in angle brackets together with the display name. For details see Appendix A.3.
For the extraction operation you can select any subset of tags from any group. This is not true for the 'add tags' operation: if the group contains npt (number per tag) assignments, you can select only one of them; if there are bpt (bit per tag) assignments, you can select all or a subset of them; if there are '/subgroups/', you can select tags from one subgroup only. If the selection is not consistent, you will see an error message later, when you start the operation.
The 'Select All' button selects all items in the current list.
Ignore Unselected npt/bpt is used for the extract operation when the list contains both npt and bpt type tags. If you need to select tags from, say, the bpt group only, check this box; it means "ignore the unselected tags from the other group". If not checked, the filter will skip words having an unselected npt tag even if they have selected bpt tags.
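The decision this check box controls can be sketched as follows. This is only an illustration of the semantics described above, under the assumption that a word has at most one npt tag; all class, method, and parameter names are ours:

```java
import java.util.Collections;
import java.util.Set;

public class NptBptFilter {
    // Decide whether a word passes the extract filter.
    // nptTag: the word's npt tag (null if none); bptTags: the word's bpt tags.
    // selectedNpt / selectedBpt: the tags chosen in the dialog.
    static boolean passes(String nptTag, Set<String> bptTags,
                          Set<String> selectedNpt, Set<String> selectedBpt,
                          boolean ignoreUnselected) {
        boolean nptSelected = nptTag != null && selectedNpt.contains(nptTag);
        boolean bptSelected = !Collections.disjoint(bptTags, selectedBpt);
        if (ignoreUnselected) {
            // Unselected tags from the other group do not exclude the word.
            return nptSelected || bptSelected;
        }
        // Strict mode: an unselected npt tag skips the word even if
        // one of its bpt tags is selected.
        if (nptTag != null && !selectedNpt.contains(nptTag)) return false;
        return nptSelected || bptSelected;
    }
}
```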
This tab holds additional filters for the extract operation: only the words having the selected word lengths will be extracted.
Select From List enables the list box so you can select exactly the lengths you wish. Remember that you can make multiple selections with the mouse while holding the Ctrl key.
All means no length filter. The remaining radio buttons, from <7 to >=33, can be used to select a range of lengths.
7.6.2 Tags Filters
If the target operation is Filter Tagged Word List, you can use this dialog to set how the tags will be used as filters. The menu item Target Data | Tags Filters has three sub-items: Tags File (morpho)... and Tags File (topic)... take tags from the 'morpho' or 'topic' group (see Appendix A.3); Affixes File... takes tags from the 'suffixes' section (see Appendix A.2). When you select one of the sub-items, the program will take the current Source locale and read the corresponding tags or affixes file from the alphabet directory. If the file is OK, you will see the dialog.
The tags from the file are sorted and shown in the list. Each tag is preceded by a mark (0, 1, 2), which defines the role of the tag. At the top of the list a special entry is included - use it to define what to do with words that have no tags.
Use the keyboard to change the mark of the currently selected item in the list. Pressing Space or Enter, or double-clicking with the mouse, cycles the mark through 0, 1, 2. The keys Insert and 1 set the mark to 1, the keys Delete and 2 set it to 2, and the key 0 sets it to 0. If the Ctrl key is held as well, all marks in the list are set.
Mark 0 means that this tag will be ignored (will not be considered as a filter).
Mark 1 means that if the word has this tag, it can be included in the target.
Mark 2 means that if the word has this tag, it will not be included in the target.
Since any word can have many tags, the value of the tag's mark is used
as a weight (precedence) of the tag. If the word has at least one tag marked
as 2 it will be excluded. If the word has tags marked as 1 but no tags
marked with 2, it will be included. If the word has tags marked with 0
only, it will not be included. If the word has no tags, the filter will
be the mark of the special element at the top of the list.
The tags filters actually used are those from the most recently started dialog.
7.7 Choose Action
This is the most important and the most complex dialog. Here you have to define the target operation, the byte mode, and all operation details. The dialog is called Target Action; to start it, select Target Data | Choose Action from the menu. Note that there are no consistency/dependency checks in this dialog. The checks are done when you start the target operation, and irrelevant flags are ignored.
7.7.1 General Tab
Here you have to select the main operation group: Source-Target (Create New Target radio button) or Source-Base-Target (all other radio buttons):
For the Source-Base-Target group the input files are the Source and the opened Base words. The output files are the modified Base words (same file name) and an optional Log File, which is a text word list. If a Save As dialog appears, it is for the Log File (with one exception for the munched lists).
Via the radio buttons Add Source to Base Words, Delete Source from Base Words, and Compare Source to Base Words select the desired operation. The check box Replace Mode is valid when adding data to a CTree with tags/clues: the word and all its data are deleted first, and then the addition(s) are done. Note that if you want to change only one data element of a word, the other data elements for that word should be supplied in the Source as well.
Via the combo box Add/Delete/Compare Log File define the contents of the Log File. The options are:
- None - no Log File;
- Difference: Source not in Base - list of Source words not found in the Base words;
- Common: Source in Base - list of Source words, which are found in the Base words;
- Source Marked - all Source words, with mark '-' when the word is not found in the Base words and mark '=' when found;
- Source Marked with Tags - all Source words found in the Base words will be tagged. The Base words should be CTree having tags.
No Source Word Sorting means that the operation of reducing all Source words to a list of unique words will be omitted. This reduces the working memory needed but increases the number of add/compare/delete operations. If you have to process a huge Source file, set this flag and switch off the Log File. If you perform a tagging operation and want to preserve the original word sequence from the Source file, set this flag as well.
7.7.2 Options Tab
Change Letter Case and Upper/Lower are used to transform the characters to all upper or all lower case. This operation is common for word lists and is essential for CTrees - it can halve the number of characters used. Note that you should use this operation for CTree creation if you want to enable the single case mode, even when all source words are already converted to lower/upper case. If you set the Special Casing check box, the changing of the letter case will follow the Unicode addendum SpecialCasing.txt. Special casing maps one lower case character to 2 or 3 upper case characters; these special lower case characters include the German es-zed, ligatures, many Greek characters, and several accented Latin letters. There is special case handling for Lithuanian as well. The special Turkish mappings are handled by the standard casing, so there is no need to set the flag for this locale.
Via Sort Word List Using you have to select the byte mode, type of sorting and the sort engine.
Byte selects one-byte mode and binary sorting. Unicode selects two-byte mode and binary Unicode sorting. Locale selects two-byte mode and locale sorting. In this mode the Arrays and JTrees engines will use the standard Java collators to obtain the natural sorting. Since the locale sorting of CTrees is not exactly 'natural' when you have upper and lower case characters, you should use 'Strict Alphabet' to be able later, in CPT Dictionary, to obtain the proper natural order of the display list.
Via Array, JTree, and CTree select the corresponding sorting engine.
Reverse Word Letters can be used for sorting 'backwards' - the words are compared starting from the last character (e.g. Hebrew script stored in visual order). Actually, the words are handled internally by the engine in reversed form, but when they have to be displayed or exported to text file, they are reversed again.
7.7.3 Target Tab
These options are for selecting a specific operation from the Source-Target group.
Encode Data should be set when you want to explicitly change the encoding of the Source file, or for operations that do not include sorting, like normalization or the 'dehtml' operation.
Extract Word List is used when the target data type is text word list or CTree dictionary and we want to extract the unique words from the Source.
Frequency Counters is used when we want to count the unique words or letters from the Source. The target data type should be Text Delimited. For word frequency select Words, and for letter frequency select Letters. If you want a list sorted in descending order of the frequency counters, set Sort by Counters.
Sort by Word Length is valid only when the target data type is 'Text Word List' and the sorting engine is not CTree. If checked, the word length will be used as the primary key for the sorting.
Add Tags can be set when the input file has type Word List or Text Delimited/Word-Clue (text dictionary) or CTree. This operation will add tags to all words from the Source according to the selection in Global Tags Setting dialog. If the source already contains tags, the new tags are added via simple bitwise-or operation without any consistency checks.
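The "simple bitwise-or" merge mentioned above amounts to one line of code. A sketch of the described behavior, assuming the tags of a word are kept as a bit mask; the class and method names are ours:

```java
public class TagMerge {
    // Adding tags to a word that already has tags: the masks are simply
    // ORed together, with no consistency checks.
    static int addTags(int existingTags, int newTags) {
        return existingTags | newTags;
    }
}
```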
7.7.4 Affixes Tab
The options here are valid for operations with affixes.
Expand Munched Word List should be set when the input file has type munched word list. This operation will "expand" the words from the Source according to the suffixes defined in the locale affixes file. The supported Target types are Text Word List (expanding only), or Tagged Word List (expanding plus tagging).
If Output Words with Affixes Only is set, the words from the input file without affixes will be ignored (used when Expand Munched Word List is set).
There is no separate operation flag for creating munched lists. When the Source is an Affixes file and the Base is a CTree in 5-bit format, the operation is creating a munched list, and the supported Target types are:
- Munched Word List - the munched list will be written into the target file, in most cases you should use this target type to be able to check the munched list.
- CTree - the munched list will be written as CTree with suffixes packed. Note that the Base Words | Pack Suffixes operation will make the same CTree format but the suffixes to be packed will be selected by the program, not from the Affixes file.
- Text Word List - the munched word list will be expanded into normal word list.
The encoding of the target file will be the same as the CTree opened as Base words (the target encoding will be ignored).
Loose Affixes Match will switch on the "loose match" algorithm when creating munched list. The idea is to assign a suffix to a root word even when not all word forms from the suffix definition are available in the CTree. If this flag is not set, the program will work in "strict match" mode - the suffix will be assigned when all word forms from the suffix definition are found in the tree below the root word.
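The difference between strict and loose match can be sketched as follows. This is only an illustration of the rule described above, with a Set standing in for the CTree; all names are ours:

```java
import java.util.Set;

public class SuffixMatch {
    // Check whether a suffix definition can be assigned to a root word.
    // In strict mode every word form (root + suffix) must exist in the
    // word set; in loose mode one matching form is enough.
    static boolean matches(String root, String[] suffixes,
                           Set<String> words, boolean loose) {
        int found = 0;
        for (String s : suffixes) {
            if (words.contains(root + s)) found++;
        }
        return loose ? found > 0 : found == suffixes.length;
    }
}
```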
Exclude Words Mode is an addendum to the "loose match" algorithm. The program will remove some words in the hope of better matching results.
7.9 Run
When you have opened some input file, this menu item will be enabled. In most cases it will start the Save file dialog to select the output file, and then the target operation will be started. Note that if the operation includes Base words, the output file is the Log File, and the Base words file will be overwritten if changed.
The program will check the consistency of the selected options, and this is the moment when most of the warnings and errors will be shown. If the options are accepted, the processing will start and a new window with messages will appear. This window is not closed by the program; close it yourself after checking the messages. For most of the operations you can stop the processing by clicking the Stop button in the Messages window. If you close the Messages window before the job is done, a notification will be shown when the operation is finished, but you will not see the possible errors or warnings.
|8. Data Types and Operations|
This is a short summary of the implemented operations depending on the Source, Base, and Target data types. The optional Unicode normalization, change encoding, change letter case, and change logical/visual order (bidi conversion) can be used with most of the operations. Where some of these optional conversions are not supported for a particular operation, this is explicitly noted.
In the 'Projects' directory there are files with template settings for some of the operations.
8.1 Source-Target
The Create New Target flag should be set in the General tab of the Target Action dialog.
8.1.1 Operations without sorting
The words are not extracted from the input file (the only exception is the operation "Parts of Phrases"). Do not set the flag Sort Word List Using. You can set Encode or Change Letter Case or Add Tags.
Source: Plain Text, Target: Plain Text, Alphabet File.
Source: HTML, Target: Plain Text, Alphabet File - strip HTML tags (no special flag).
Source: CPT Tags, Target: Plain Text - dump codes of tags only (no special flag), no optional conversions.
Source: Text Word List, Target: Alphabet File, Text Word List, or Text Delimited. The Target Text Delimited can mean two different operations. If Add Tags is set, it is exactly the adding tags operation. If it is not set and Extract Word List is set, it is the special operation "Parts of Phrases": for any word (of length >= Minimal Word Length) in the phrase, the program will create a record with that word and, as a clue, the phrase with the word replaced by "...".
Source: Tagged Word List, Target: Tagged Word List - translate tags (no special flag), no bidi conversion. The default tags are 'morpho'. If you want to use 'topic' tags for the translation, remove 'map-morpho' and include only 'map-topic' in the Tags file.
Source: Text Delimited/Word-Clue Format (text dictionary), Target: Text Delimited - adding tags (Add Tags should be set), translate tags (no special flag, not for Word-Clue Format), convert Word-Clue Format to Text Delimited (no special flag).
8.1.2 Operations with sorting
Sort Word List Using and the sort options flags should be set. The words will be extracted from the input file and sorted.
Source: Text Word List, Target: Text Word List or CTree Dictionary (with or without Add Tags) or Text Delimited (frequency counters).
Source: Plain Text, Target (Extract Words set): Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: CTree Dictionary, Target: Text Word List or CTree Dictionary or Text Delimited (letter frequency counters); the CTree can have tags or clues as well.
Source: CTree Dictionary with tags or clues, Target: CTree Dictionary with tags or clues. Add Tags is supported. This is the first case where the tags/clues from the Source CTree are used; the second case is Base with tags/clues, described below. In the other cases the Source CTree is treated as a Text Word List.
Source: Text Delimited/Word-Clue Format (not text dictionary), Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: Text Delimited/Word-Clue Format (text dictionary), Target: CTree Dictionary with tags/clues (with or without Add Tags). This is the standard mode for creating dictionaries. If you want to add pictures, only Text Delimited can be used.
Source: Word-Clue Format ('<word><delimiter><clue>NL' variant), Target: Picture Dictionary. This is the only way to create a separate picture dictionary.
Source: HTML, Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
Source: Tagged Word List (creating CTree with 'morpho' tags), Target: CTree Dictionary with tags.
Source: Tagged Word List (tags filters set), Target: Text Word List or CTree Dictionary or Text Delimited (frequency counters).
8.2 Base
These are the operations on Base words, described in the Base Words Menu section. Note that if 'All Embedded Data' is checked, additional files will be created.
Base words: CTree, output: Text Word List (dump), CTree (pack suffixes).
Base words: CTree with tags/clues, output: Tagged Word List (dump) or Text Delimited (extract by filters or by Source word list).
Base words: CTree (with or w/o tags/clues) or Text Word List - search and extract words (using the dialog), output: Text Word List or Tagged Word List (using 'display codes' of tags).
8.3 Source-Base-Target
8.3.1 Base word lists
One of the Base words operation flags (add, delete, compare) should be set. The Source can be: Text Word List, Text Delimited/Word-Clue Format, CTree, HTML, or Plain Text. The words are extracted from the input file and optionally sorted. The Base is Text Word List, Picture Dictionary, or CTree (with or without tags; Add Tags is not supported in all cases). The Target is the Log File - set type Text Word List in the Target Data | Data Type dialog (even if you mean a 'morpho' tagging operation).
8.3.2 Base with tags/clues
The Source is Tagged Word List, Text Delimited/Word-Clue Format (text dictionary), or CTree with tags/clues. Add Tags is not supported in all cases. The Base is CTree with tags/clues. The Target is the Log File - see the setting above, but with one exception: if you mean a tagging operation including user/topic/clue tags, set the Target type to Text Delimited. The Log File can contain additional notes when the tags differ. The operation of adding words with tags only to the Base CTree is as follows:
- if the word does not exist, the input word and its tags are added;
- if the word exists and has no tags, the tags from the input word are added;
- if the word exists and has tags, the tags from the input are ignored.
The delete operation when the Source has tags/clues is not available. If you need to delete words from a CTree with tags/clues, use a Text Word List as the Source. In this case any word from the Source list and its linked clues (if not used elsewhere) will be deleted. If you need to make changes in a word's data, use 'add with replace'.
8.3.3 Munched lists
The Source is an Affixes file, the Base is a CTree in 5-bit format, Target: Munched Word List, CTree, or Text Word List (no special flag, see Affixes Tab above); no optional operations.
|Appendix A: File Formats|
A.1 Alphabet Files
These files define locale alphabets and the corresponding sort order. They are used when the Locale Alphabet File or Locale Letters flags are set in the word filters (Target Words dialog), or, when using CTrees, when the flags Strict Alphabet and From File are set (in the CTree Options dialog, Alphabet tab). In the last case the alphabet file is included into the CTree file.
The locale alphabets usually reside in 'alphabet' directory. The names of the files are the same as ISO-639 language codes, optionally followed by "_" + ISO-3166 country code. For example, "el" and "el_GR" are locale names for Greek. The names could be four-letter script codes from ISO 15924:2004 as well. Note that you can override the defaults by setting explicitly the file in CTree Options dialog.
The files can be written in any of the supported encodings, but the name of the converter should be given in the first line. Each line should contain an upper case letter, optionally followed by its lower case letter. When the lower case letter is not given, the program will add it using the Unicode tables. The order of the letters in the file defines the sorting order used by CTrees. The position of the upper case before the lower case implies the default rule that upper case letters sort before lower case ones. To change this rule, set the flag labeled 'Lower Case < Upper Case' in the CTree Options dialog, Alphabet tab.
Generally, you can include almost any Unicode character in this file (line breaks cannot be part of the alphabet). Be careful not to put in unnecessary spaces, because the space is a valid entry as well. Keep in mind that CTrees may not handle properly graphemes presented by more than one character (surrogates and decomposed characters), and when the number of characters used is more than 32, some of the special CTree operations (5-bit compression and affixes) will not be available.
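Building the sort order from such a file can be sketched as follows. A minimal illustration of the rules described above (line order defines rank, upper case before lower case by default); the class and method names are ours, and the real program reads the converter name from the first line, which is skipped here:

```java
import java.util.HashMap;
import java.util.Map;

public class AlphabetOrder {
    // Build a sort-rank map from alphabet file lines. Each line holds an
    // upper case letter optionally followed by its lower case form; the
    // line order defines the sorting order used by CTrees.
    static Map<Character, Integer> ranks(String[] lines, boolean lowerFirst) {
        Map<Character, Integer> rank = new HashMap<>();
        int r = 0;
        for (String line : lines) {
            char upper = line.charAt(0);
            char lower = line.length() > 1 ? line.charAt(1)
                                           : Character.toLowerCase(upper);
            // By default upper case sorts before lower case; the
            // 'Lower Case < Upper Case' flag flips the pair.
            rank.put(lowerFirst ? lower : upper, r++);
            rank.put(lowerFirst ? upper : lower, r++);
        }
        return rank;
    }
}
```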
There are many alphabet files in the directory, ready to be checked and changed. For each alphabet we have included short notes in the Readme.txt file.
CJK Locale Letters
Since it is difficult to prepare CJK locale alphabet files, here are the tables, used by the program:
Radicals: from \u2E80 to \u2FD5;
Symbols, Numerals: from \u3005 to \u303B;
Hiragana, Katakana: from \u3041 to \u31FF;
Ideographs: from \u3400 to \u9FC3;
CJK Compatibility: from \uF900 to \uFAD6;
Halfwidth Katakana: from \uFF66 to \uFF9F.
For Chinese locales the ranges are the same but without Hiragana and Katakana.
For Korean locales:
Jamo: from \u1100 to \u11F9;
Compatibility Jamo: from \u3131 to \u327D;
Hangul Syllables: from \uAC00 to \uD7A3;
Halfwidth Hangul: from \uFFA0 to \uFFDC.
We do not pretend that these selections are the best choice, so, any suggestions and comments are welcome.
There are internal tables for other languages as well, but we recommend using alphabet files to be sure that you get what you want.
A.2 Affixes Files
These files define word affixes with optional tags. The affixes files are used for the following operations: create munched list, create tagged list, filter tagged list, and pack CTrees via user defined suffixes (see the restrictions). The files are locale specific and are expected to be in the 'alphabet' directory. The name of a file is the locale name plus the extension "aff"; for example, "el.aff" is the name for the Greek locale. The encoding used should be given in the first line. Any statement should take only one line. The data is structured in sections. At the top are the optional set variable definitions:
set c STLKPMNVC
set v AEIOU
A set variable name is just one character and, when used in the text, should be prefixed with a dollar sign.
The suffixes section should start with the statement:
The section can contain one or more suffix definitions. The format of a suffix definition is like this:

<sufdef +TAG1+TAG2
ED
ING +TAG3
$vBLE
>
where "<" is the start and ">" is the end mark, and "sufdef" is any string of at most 7 characters giving the suffix definition name; this name will appear in the munched list. "+TAG1+TAG2" is a list of global tags valid for all suffixes in the definition. The tags should have the plus sign as the first character (used only once). They can be from the 'morpho' group in the CPT Tags file, but this is not mandatory. "ED" is a suffix from the definition; it will match any word ending with this string. "ING" is another suffix, having the private tag "+TAG3". "$vBLE" is a suffix with a set variable - it will match all words ending with one of the letters from the set, followed by "BLE". The set variables are just syntactic sugar (macros). You can use only one variable in the definition body. This variable will generate n separate suffix definitions, where n is the number of letters in the set, and the letter from the set will be appended to the name. In this case the names are: sufdefA, sufdefE, sufdefI, sufdefO, sufdefU. In the process of making the munched list, all suffixes from the definition should match in order to assign that definition. For example, the words from the input:
will give (in strict match mode):
in the munched list. And when the munched word is expanded (with tags included), the result will be this:
There is no ignore case mode and you may need to use Change Letter Case flag to match the input and the suffix definitions.
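The set variable expansion described above can be sketched as follows. This mirrors only the rule stated in the text (one definition per set letter, the letter appended to the name); all class, method, and parameter names are ours:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SetVarExpand {
    // Expand a set variable in a suffix body: "$vBLE" with v = AEIOU
    // becomes one definition per set letter, with the letter appended
    // to the definition name and substituted into the suffix.
    static Map<String, String> expand(String defName, String body,
                                      char var, String setLetters) {
        Map<String, String> defs = new LinkedHashMap<>();
        for (char c : setLetters.toCharArray()) {
            defs.put(defName + c, body.replace("$" + var, String.valueOf(c)));
        }
        return defs;
    }
}
```

For sufdef with body "$vBLE" and v = AEIOU this produces sufdefA = ABLE through sufdefU = UBLE.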
In the body of a definition you can use other suffix definitions (like inline subroutines) with these restrictions: only one per suffix line, the used definition must already be specified, and the used definitions are prefixed with a forward slash and are the last part of the line:
This definition will give "AS/asklike" in the munched list for the sample above.
You can hide some of the used definitions by making them 'local definitions' (macros) explicitly. The 'local definitions' are local to the affixes file (like set variables) and are not used in the munched list creation. The slash should be the first letter in the name:
When referring to 'local definitions', do not add an additional slash.
Prefixes are not supported in this version, and the prefixes section will be ignored.
The characters "<", ">", "/", "+", and "#" are reserved and should be used according to the context specified. Well, "#" is not specified yet - the comment part of a line should have it in the first position:
# end of Affixes.
A.3 Tags Files
The encoding used should be given in the first line. Any statement should take only one line. The '#' character is a comment mark (to the end of the line). Below, italic font is used for non-terminals, "[...]" denotes an optional item, and 'number_bits' denotes the number of bits available in the CTree for a particular tag group.
One file can contain one or more tag groups. Each group has a fixed number of bits in the CTree file. The maximum number of binary tag codes in a group is 2**number_bits. The tag group definition is as follows:
where group_id could be one of the following reserved names:
- morpho - morphology or grammar tags, 8 bits, usually used in Affixes file and tagged word lists;
- map-morpho - morpho tags translation (see the notes below);
- user - user tags, 2 (or 4) bits, usually used for crosswords;
- map-user - user tags translation;
- topic - topic tags (math, history, astronomy, etc.), 16 bits, can be used instead of morpho tags for tagged word lists operations;
- map-topic - topic tags translation;
- clue - definitions or clues tags, 4 bits, used in CTrees with clues;
- map-clue - clue tags translation.
macros is a list of one or more macro definitions having the form:

</macro_name mtag_list >

mtag_list is a list of one or more tag code definitions having the form:
where number_per_tag_code is a string containing one or more tags, like Noun or Noun+PL. This code will take a number (one of all the tag binary codes) from the group. The optional display_code is a describing string like noun or noun,plural (if the group is map-..., it is not optional, but is again a number_per_tag_code giving the translation of the tag). bit_per_tag_code is a single tag taking one bit from the group. Note again that for any group there are 2**number_bits binary codes but only number_bits bits. macro_name is a defined macro, which will be expanded.
tag_list is a list of one or more tag code definitions having the form:
where subtag_group is a sub-group definition, having the form:
The maximum number of sub-groups is 2**(number_bits/2). For example, the morpho group has 8 bits, so this number is 16. Inside a sub-group there are at most 2**(number_bits/2) binary codes; number_per_tag_codes and bit_per_tag_codes are assigned locally for the sub-group. As shown in the syntax definition, sub-groups cannot be nested.
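The capacity arithmetic can be checked with a couple of one-liners (a sketch; the class and method names are ours):

```java
public class TagCapacity {
    // A group with number_bits bits holds 2**number_bits binary codes.
    static int codes(int numberBits) {
        return 1 << numberBits;
    }

    // The maximum number of sub-groups is 2**(number_bits/2).
    static int maxSubgroups(int numberBits) {
        return 1 << (numberBits / 2);
    }
}
```

For the 8-bit morpho group this gives 256 codes and at most 16 sub-groups; for a 4-bit group like clue, 16 codes and 4 sub-groups.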
The reserved characters <>^+/# should not be used in identifiers.
The semantics restrictions are as follows:
- 0 binary code in all groups is reserved for 'no tags';
- if bit_per_tag_codes are used in a group or sub-group, they can be preceded by number_per_tag_codes and cannot be followed by anything else;
- if subtag_groups are used, they can be preceded by number_per_tag_codes and cannot be followed by anything else.
There is a restriction on the user group: in most cases it is 4 bits, but if the CTree is in 2-bytes-per-character format and all tag groups are included, the user group is not fully supported (the 2 most significant bits are ignored, so it is effectively 2 bits).
Notes on translating tags:
If you need to translate the tags in a tagged word list, the group map-morpho should be defined. If it is not defined, or if you just change the group_id of the existing morpho to map-morpho, the display-codes will appear in the output tagged list. Instead of map-morpho you can use map-topic, but the first one should be removed from the file.
For the translation of tags in text delimited file, all map-... groups should be defined.
If there were errors in the translation, the erroneous lines in the output file will be prefixed with "***Err ".
Here are some samples with the clue group (4 bits) and explanations:
Cp1252
# The tags file can contain more than one valid
# 'clue' groups, but the last one will be used.
# Below, 'matches' means the combination of
# tags in the input tagged text that can be matched
# by the tag group.
<clue          # all codes are 'number_per_tag'
0              # unused
t1 tag1        # code binary 0001
t2 tag2        # 0010
...
t15 tag15      # 1111
>
# the matches above are: +t1 or +t2 or ... +t15
# (only one tag code can be used per word)
<clue          # all codes are 'bit_per_tag'
^t1 tag1       # xxx1 (x means don't care value)
^t2 tag2       # xx1x
^t3 tag3       # x1xx
^t4 tag4       # 1xxx
>
# the matches above are:
# +t1 or +t1+t2 or +t2 or ... +t1+t2+t3+t4
# (all tag combinations can be used per word)
# note that 0 code is not included, because
# it is taken from the first bit_per_tag
<clue          # mixed (number_per_tag have different semantics)
0              # unused
t1 tag1        # xx01
t2 tag2        # xx10
t3 tag3        # xx11
^t4 tag4       # x1xx
^t5 tag5       # 1xxx
>
# the matches above are: +t1 or +t2 or +t3 or +t1+t4
# or +t1+t5 or ... +t3+t4+t5
<clue          # number_per_tag and subtags
0              # unused
t1 tag1        # 0001
t2 tag2        # 0010
t3 tag3        # 0011
<st1 subtag1   # 01xx (subtag mask)
st11 subtag11  # 0100
st12 subtag12  # 0101
st13 subtag13  # 0110
st14 subtag14  # 0111
>
<st2 subtag2   # 10xx (subtag mask)
st21 subtag21  # 10x0
st22 subtag22  # 10x1
^st23 subtag23 # 101x
>
<st3 subtag3   # 11xx (subtag mask)
^st31 subtag31 # 11x1
^st32 subtag32 # 111x
>
>
# the matches above are: +t1 or +t2 or +t3 or +st1+st11
# or ... +st2+st21+st23 or ... +st3+st32
<clue          # with macros
</mac0 0 t1 t2 t3 >
</mac t_1 t_2 t_3 t_4 >
<st0 /mac0 >
<st1 /mac >
<st2 /mac >
<st3 /mac >
>
# end of sample tags file
A.4 Tagged Word List
The format of the file is a word with optional tags per line:
word1 +tag1+tag4
word2
word3+tag3
...
wordN+tag7+tag1+tag4
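Parsing one line of this format can be sketched as follows. A minimal illustration only, splitting on the '+' tag prefix; the class and method names are ours:

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedLine {
    // Split one tagged word list line into the word and its tags.
    // Tags are '+'-prefixed and appended to the word, optionally
    // after a space.
    static List<String> parse(String line) {
        List<String> parts = new ArrayList<>();
        String[] pieces = line.trim().split("\\+");
        parts.add(pieces[0].trim());          // the word itself
        for (int i = 1; i < pieces.length; i++) {
            parts.add(pieces[i]);             // each tag without '+'
        }
        return parts;
    }
}
```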
A.5 Text Delimited
The format of the file is one 'record' per line. The fields in a record are separated by a special character, the delimiter (usually '|', a comma, etc.). As a delimiter you should select a character that is not part of the text of the fields (this is not always an easy task).
field11 | field12 | field13
field21
...
fieldN1|fieldN2
The text dictionary format should have the following strict field order (even for RTL texts stored in visual order):

word | morpho-tags | user-tags | topic-tags | clue-tags | clue
The unused fields at the end of the line can be omitted, as in the previous example. For the creation of a CTree with clues you can have more than one line per word (all these lines should start with the same word).
A.6 Word-Clue Format
This is a format with several variants, intended to supply words and clues in a freer style. Tags are not supported here, but you can use the Add Tags operation. You may need to define several clues per word, because one clue is limited to 1024 characters.
The variants are presented via a simplified syntax, where NL means the New Line character (or Carriage Return and New Line), 2NL means one or more empty lines, TAB means the Horizontal Tabulation character, and <word> and <clue> are strings without NL. Leading and trailing spaces are stripped. A NL in a multiline clue is replaced with a space. TABs and multiple spaces in a string are replaced with one space.
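The whitespace normalization described above can be sketched in a few lines (an illustration, not the program's actual code):

```python
import re

# Collapse TABs and runs of spaces to a single space and strip the ends,
# as the word-clue format description specifies.
def normalize(s):
    return re.sub(r"[ \t]+", " ", s).strip()

assert normalize("  a\t b   c ") == "a b c"
```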
This is a single line per word-clue. The delimiter is defined by the user. Here is a sample with ',' as the delimiter:
word1, clue1
word2
word3, clue31
word3, clue32
...
wordN, clueN
<word>NL, <clue1>NL, <clue2>NL, ...2NL
The word-clue is presented by several lines. Each line defines a separate clue. Here is a sample with a commented word-clue (the flag Comments (--) should be set):
word1
clue1

--start of comment
word2
clue21
clue22
--end of comment

...
wordN
clueN1
clueN2
clueN3
<word>NL, <clue1>NL..., TAB<clue2>NL..., ...2NL
In this variant one clue can be presented by several lines. To start a new clue, use TAB (actually, the first clue can also start with TAB, but it is not required and will be ignored). Here is a sample:
word1
clue11
clue11
	clue12
clue12
...
wordN
clueN1
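A sketch of reading one such block, assuming the rules above (the word is the first line, a leading TAB starts a new clue, other lines continue the current one, and line breaks inside a clue become spaces). The function is illustrative:

```python
# Hypothetical parser for the multi-line clue variant of the word-clue format.
def parse_block(block):
    lines = block.split("\n")
    word = lines[0].strip()
    clues = []
    for line in lines[1:]:
        if line.startswith("\t") or not clues:
            clues.append(line.strip())          # TAB starts a new clue
        else:
            clues[-1] += " " + line.strip()     # NL inside a clue becomes a space
    return word, clues

assert parse_block("word1\nclue11\nclue11\n\tclue12\nclue12") == \
    ("word1", ["clue11 clue11", "clue12 clue12"])
```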
%h <word>NL, %dNL, <clue1>NL, <clue2>NL, ...2NL
This is the popular "%h%d" text dictionary format used on Linux, but each clue can use only one line.
%h word1
%d
clue11
clue12
...
%h wordN
%d
clueN1
clueN2
%h <word>NL, %dNL, <clue1>NL..., TAB<clue2>NL..., ...2NL
This again is the "%h%d" format but with several lines per clue and a TAB to start a new clue:
%h word1
%d
clue11
clue11
	clue12
clue12
clue12
...
%h wordN
%d
clueN1
DSL is the source text format for ABBYY Lingvo dictionaries. All control statements are ignored and only the plain data is processed (i.e. the encoding should be set as usual in the Source File Encoding dialog).
word1
	clue11
	clue12
word2
word3
	clue21
	clue22
wordN
	clueN1

Note that word2 and word3 have the same clues in this format.
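The clue-sharing rule for consecutive headwords can be sketched as follows, assuming (as in DSL) that headwords start at the beginning of the line and their definitions are indented. This is a rough illustration, not the program's parser:

```python
# Rough sketch: consecutive non-indented headwords share the indented
# clue lines that follow them.
def parse_dsl(text):
    entries, pending, in_clues = {}, [], False
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":          # indented: clue for all pending words
            for w in pending:
                entries.setdefault(w, []).append(line.strip())
            in_clues = True
        else:                         # headword
            if in_clues:
                pending, in_clues = [], False
            pending.append(line.strip())
    return entries

sample = "word1\n clue11\n clue12\nword2\nword3\n clue21\n clue22"
assert parse_dsl(sample) == {
    "word1": ["clue11", "clue12"],
    "word2": ["clue21", "clue22"],
    "word3": ["clue21", "clue22"],
}
```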
A.7 IPA8 Files
When you have clues which are pronunciations, you can use the 'IPA8' feature to reduce the CTree size or to enter the IPA codes more easily. This feature is not supported if the CTree is in Unicode (where you are free to use all codes from the IPA block). When the CTree has clues encoded in custom 'IPA8', the text is converted into Unicode before being displayed in the text area of the searching modules (the 'Search in Base Words' dialog and the CPT Dictionary program). All other modules ignore the 'IPA8' feature and process the encoded text as any other clue text.
The 'IPA8' file has almost the same format as the Alphabet file, but it defines a custom 8-bit IPA encoding and is itself encoded in UTF-8 or Unicode. Each line should contain 2 characters: the first character is the 8-bit code and the second character is the corresponding Unicode character (usually from the IPA block). Since IPA uses the base Latin characters as well, you are not required to define these characters, but you can overwrite them. For example, if 'k' is not redefined, it will not be recoded, or you can define 'I' to be 'Latin letter small capital I' (\u026a). Note that if you redefine characters outside the 7-bit ASCII range, they have to be included with the corresponding Unicode code; for example, if you want the Cyrillic small letter sha (\u0448) to be the IPA letter esh (\u0283), the file should include the line:

шʃ

To use 'IPA8' encoded clues, the following requirements apply:
- the clue should have only 'number_per_tag_code' starting with "ipa8";
- the 'IPA8' encoded text in the clue should be enclosed in square brackets ('[' and ']'), and the codes of the brackets should not be redefined;
- the IPA8 file should not redefine characters which are not part of the source file encoding (this is a global CPT requirement - such characters will be shown as '?').
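The recoding of an 'IPA8' clue can be sketched as below. The mapping table and function are made-up illustrations of the scheme: only text inside square brackets is recoded, character by character, and unmapped characters (such as plain Latin 'k') pass through unchanged:

```python
# Hypothetical IPA8 mapping table: 8-bit code -> Unicode IPA character.
IPA8 = {"I": "\u026a", "S": "\u0283"}

def decode_ipa8(clue):
    out, inside = [], False
    for ch in clue:
        if ch == "[":
            inside = True
        elif ch == "]":
            inside = False
        # recode only inside brackets; the brackets themselves are kept
        out.append(IPA8.get(ch, ch) if inside and ch != "[" else ch)
    return "".join(out)

assert decode_ipa8("kit [kIt]") == "kit [k\u026at]"
```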
For a complete template set of files see the 'samples' directory.
A.8 CTree Files
Here are internal details of CTree files, which will probably not be interesting for most users, but can help to understand the numerous options of dictionary creation.
A.8.1 Packing Variants
Combining the variants described here, you can reach a very high compression ratio - several times better than general-purpose compression utilities can give. Note that our dictionaries are browsable even in packed format (see the Notes above). Let's start with statistics of sample word lists:
The 'text' row shows the source file size (all file sizes are in bytes). The 'sq' row shows the file size produced by the Unix sq utility (part of ispell). The '-' mark in the sq row means that the source file is in Unicode and no data is available. The 'CTree' row shows the file size produced by CPT Word Lists, using a 'Strict Alphabet' with the number of letters shown. The '+' mark means 'the same file zipped'. The 'ratio' row is 'text size/CTree size'. The column '5000k' denotes an artificial word list generated using the English alphabet. It is included to show how big the compression ratio can be (the total ratio of text/CTree+ = 44638.86 is remarkable).
A.8.1.1 Letters
Regardless of the encoding, if 'Strict Alphabet' is used, the letters are packed using the fixed alphabet as a lookup table. If the number of different letters is less than 33, each character is coded in 5 bits; if less than 63, in 6 bits; if less than 256, in 1 byte; if less than 8191, in 13 bits; and in all other cases, in 2 bytes. If the alphabet is not given, one or two bytes are used depending on the encoding. The 5-bit, 6-bit and 13-bit per character formats are supported only for pure word lists (CTrees without tags and clues).
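The bit-width rule above can be expressed as a small function (an illustration of the stated thresholds, not code from the program):

```python
# Bits per letter as a function of alphabet size, per the thresholds
# quoted in the text: <33 -> 5 bits, <63 -> 6, <256 -> 8, <8191 -> 13,
# otherwise 16 (2 bytes).
def bits_per_letter(alphabet_size):
    if alphabet_size < 33:
        return 5
    if alphabet_size < 63:
        return 6
    if alphabet_size < 256:
        return 8
    if alphabet_size < 8191:
        return 13
    return 16

assert bits_per_letter(26) == 5      # e.g. plain English alphabet
assert bits_per_letter(60) == 6
assert bits_per_letter(200) == 8
assert bits_per_letter(5000) == 13
```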
Note that the clues are always encoded in single bytes or in Unicode. If you have clues using IPA codes (supported by Unicode only), there is the 'IPA8' feature to keep the single-byte encoding of the clues in the CTree file.
A.8.1.2 Words
The sorted words are packed using the simple idea that repeated initial parts of the words have only one copy in the file. This is the standard packing for all CTree files. If the dictionary contains many long words or is a list of phrases, set the Long Words or Phrases flag in the CTree Options dialog, Data tab. This way you can eliminate the worst cases of word packing, but the 5-, 6- and 13-bit per character formats will not be available.
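The idea of storing repeated initial parts only once resembles classic front coding of a sorted list: each word is stored as the number of leading characters shared with the previous word plus the differing tail. A minimal sketch of that general technique (not the CTree's actual on-disk layout):

```python
# Front-coding sketch: encode each sorted word as (shared-prefix length, tail).
def front_encode(sorted_words):
    prev, out = "", []
    for w in sorted_words:
        n = 0
        while n < min(len(prev), len(w)) and prev[n] == w[n]:
            n += 1
        out.append((n, w[n:]))
        prev = w
    return out

assert front_encode(["car", "card", "care", "cat"]) == \
    [(0, "car"), (3, "d"), (3, "e"), (2, "t")]
```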
A.8.1.3 Suffixes
The common suffixes are removed from the word list and stored separately. This is implemented only for the '5-bit per character' and 'long words' variants. If the 'words' are phrases or the encoding used is Unicode, the size reduction can reach 50%. For most word lists the reduction is 10% or less. When the program reports that the suffix packing is OK but the real file size is not much smaller, this could be because of the additional packing described below.
A.8.1.4 Files
Optionally, the content of the file is additionally packed using an internal general-purpose compression routine - see Packed Format in CTree Options.
A.8.2 Contents Variants
The dictionary can include a word list, optional header data, optional data fields and optional clues/definitions. If it has pictures, they are kept in an additional picture dictionary file, and a clue is a key into the picture file.
A.8.2.1 Header
This part includes title data, word length counters and letter frequency counters. If an alphabet, tags or packed suffixes are used, they are included here as well.
A.8.2.2 Word List
This is the minimal content of the file. The words are stored in several formats, depending on the letter packing.
A.8.2.3 Words and Data Fields
Any word can have an attached data field. It could be a counter or tag bits. The size of the field is 1, 2, 3 or 4 bytes, depending on the data contents. These dictionaries are referred to as 'CTree with tags'.
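Choosing the smallest field size that can hold a value can be sketched as follows (an assumption about how the 1-4 byte choice works, not the program's actual rule):

```python
# Sketch: the smallest width of 1..4 bytes that can hold an unsigned value.
def field_bytes(value):
    for n in (1, 2, 3, 4):
        if value < 1 << (8 * n):
            return n
    raise ValueError("field too large")

assert field_bytes(200) == 1        # fits in one byte
assert field_bytes(70000) == 3      # needs three bytes
```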
A.8.2.4 Full Dictionary
Any word has an attached list of data couples. A data couple contains a data field and a clue field. The clue field contains clue tag bits and a pointer to the clue. The clues/definitions themselves are stored in a second word list without 'alphabet packing' (the restrictions of the strict alphabet are not applied, but the clue length is limited to 1024 characters). The main word list and the clues are saved in a single dictionary file. These dictionaries are referred to as 'CTree with clues'. If you have selected 'inverted index creation', the index will be included in the file as well. The program CPT Dictionary can also create an inverted index, but it stores it in a separate file.
Appendix B: Language Support
B.1 Letter Case
The Armenian and modern European languages using the Cyrillic, Greek and Latin scripts have upper and lower case letters in their alphabets. All other alphabets are considered to have caseless letters and the conversion instructions are ignored. The standard case conversion is supported via built-in tables based on the BMP of Unicode 5.1.0 plus the Turkish dotless 'i'. Title case letters are converted to lower/upper case using a double lookup in the tables. The special casing, which maps one lower case character to 2 or 3 upper case characters, is supported as well (see Options Tab).
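Python's built-in Unicode case tables illustrate the cases mentioned above, including the special casing that expands one lower case character into several upper case ones and the title case letters:

```python
# Special casing: one lower case character maps to two upper case characters.
assert "ß".upper() == "SS"    # German sharp s

# Title case letter U+01C5 (Dz with caron) maps both down and up.
assert "ǅ".lower() == "ǆ"
assert "ǅ".upper() == "Ǆ"

# Turkish dotless i uppercases to plain Latin I.
assert "ı".upper() == "I"
```

Note that plain `str.upper()`/`str.lower()` in Python are not locale-aware, so the Turkish 'I' -> 'ı' direction needs locale-specific handling, which is exactly why the program treats the dotless 'i' specially.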
B.2 Writing Direction
The default standard support is for the left-to-right direction. The vertical direction is not supported. Here, the special support for Hebrew/Yiddish and all Arabic-script languages, which use the right-to-left direction, will be described.
We have to start with two terms: 'logical order' and 'visual order'. The logical order is the order in which characters and words are read and written. The visual order is the order in which the characters appear on the display or on paper. RTL texts can be stored in memory (RAM or file) in visual or in logical order. The Unicode standard defines the logical order as the default. On the other hand, some Hebrew web pages use the visual order as a 'de facto standard'.
When the RTL text is stored in visual order, it is processed as LTR text. In this case you can set the check box Right Alignment in Display Options and the text will appear naturally for the script. If you want to create a CTree in visual order, you have to set the check box Reverse Word Letters in the Target Tab for proper sorting.
When the RTL text is stored in logical order, it is supported by the special bidi processing, specified in Unicode Technical Report #9. To switch on the bidi, you have to set the check box RTL in logical order (see File Encoding, Locale) and/or choose the options described in Custom RTL Conversion.
The main goal of the bidi processing in this program is to supply transparent conversion to/from visual/logical order for the other text processing modules. Our text area display control is actually not bidi-enabled. If the RTL text is in logical order, it is automatically converted and stored there in visual order. This way it appears naturally, but the multi-line selection and the searching direction are internally LTR. This implies that in some cases you should check Ignore Non-spacing marks for the searching. On the other hand, you can select smoothly exactly what you see on the screen, without the bidi jumping. Our text field display control has experimental bidi support without jumping selection, but when you edit/enter text using the '\uxxxx' notation, the cursor position might not be handled properly. Entering some forms of regular expressions might also be difficult because of the bidi processing. Some of the problems can be solved by changing the alignment: right alignment implies a default RTL line direction and left alignment - LTR. You can use the Ctrl+B keys to switch the bidi display off/on (but not the bidi processing of the final text). The communication of both display controls with the clipboard is always in logical order and in Unicode encoding.
When the program has to process Text Delimited data, it forces the user delimiter character to be treated as a paragraph separator by the bidi processing. The other 'deviations' from the standard bidi conversion are: the mirroring (step L4) is done before the reordering (step L2), and the reordering of combining marks (step L3) is not performed (otherwise the process would not be reversible). We also have to note that the 'reverse' bidi (visual to logical order) is not standardized.
B.3 Character Shapes
The special shaping support is for the Arabic script, according to the rules specified in Unicode. The shaping of displayed text (without ligatures) is controlled via the check box Shaping, described in Display Options. The shaping of processed text is controlled via 'Arabic Composition' and Shape Letters..., described in Unicode Normalization; it includes ligaturization as well. The shapes of the Syriac script are not defined in Unicode and are not supported yet. There are also some Arabic shapes which are still missing in Unicode.
The program has a simple algorithm for handling decomposed characters with non-spacing marks. It is on when Shaping is checked. You should use it only when the font does not handle the combining marks properly.
B.4 Composed Characters
All standard Unicode compositions/decompositions are supported as described in Unicode Normalization. Here we will give more information about the non-standard ones.
The 'Arabic Composition' processing uses tables stored in "ar_lig.dic", which is a CTree with tags and 'clues' (you can dump, edit and recreate it). The standard Unicode composition is not able to handle Arabic ligatures and their shapes, and this was the reason to create special modules and a dictionary. The 'words' in the dictionary are the decomposition codes of the ligatures and the 'clues' are the codes of the ligature shapes as specified in Unicode. The 'clue tags' define the shape type: 'ini' for initial, 'iso' for isolated, 'med' for medial and 'fin' for final. The 'morpho tags' define the desired subset of the ligatures.

You can find out which ligature glyphs are supported by a particular font just by opening the dictionary in the source text area (the TTF "Arial Unicode MS" contains most of the used glyphs). For example, to see the composition at work, set the Source locale to 'fa' (Persian, Farsi), set Unicode Normalization to 'Arabic Composition', switch 'Shaping' in the Display Options dialog off, and then open 'ar_lig.dic' in the Source text area. All lines having the tag 'iso' should have the same glyphs on the left side and on the right side (except if there is the tag 'nfa' - not Farsi). And if you set the locale to 'ar', the lines having the tag 'nsa' (not standard Arabic) will not be composed, but all others having the 'iso' tag should be composed properly. To see the source characters from the file, switch off the composition and reopen.
The decomposition is handled properly via the standard Unicode compatible decomposition.
You can create a text word list or CTree in Arabic composed form plus letter shaping by using Unicode as the target encoding; this way it will not need additional processing each time it is browsed. To create such a CTree with 'Strict Alphabet' you have to use the whole "ar_ALL" from the alphabet directory, or, if you don't care about the file size and the sorting, just set no 'Strict Alphabet'.
The 'Thai Composition/Decomposition' processing uses tables stored in "th_lig.dic", which is a CTree with tags and 'clues'. The 'words' are 4187 sequences of Thai characters aimed to cover most of the syllables. The 'clues' are codes from the Unicode private use area. The 'morpho tags' are Unicode character classes such as 'Lo', 'Nd', 'Po', 'Mn', and one additional: 'Lns', which stands for 'letter non-starter' (see Dictionary tab in). The 'clue tags' define the subset: 'co' for all composed or 'Syllables', and 'sc' for 'Single cells'.
The goal of introducing this custom Thai composition is to make it easier to handle the native Thai dictionary sorting and the breaking of spaceless sentences into words. You can process Thai texts without composition, but the sorting of leading vowels will not be handled. You can also replace the locale dictionary "th.dic" with another one, not in composed form, but the breaking of words will be a hundred times slower. Changing "th_lig.dic" is not recommended, because you will lose all composed data (including "th.dic") and you will need the program used to create the codes. If you really need to change it, you have to decompose everything with the old CTree before using the new one.
To create a Thai composed CTree without much effort, specify Unicode as the target encoding and no 'Strict Alphabet'. If a Locale Alphabet is used together with Thai composed text, you have to create a new alphabet file with the codes from "th_lig.dic". To browse a Thai composed CTree without explicit decomposition, you have to set the check box Shaping, described in Display Options.
There is one more special composition - 'Yiddish Composition'. It is obtained via the standard compatible composition with the check box Plus Excluded set. To create word lists or a CTree in Yiddish composed form, use Unicode as the target encoding and the supplied "yi" alphabet.
The decomposition is handled properly via the standard Unicode compatible decomposition.
B.5 Huge Alphabet
The CJK (Chinese, Japanese, Korean) scripts use thousands of characters. Since there is no restriction on the number of letters, these languages are not considered "complex" and there is no special additional support.
If you use codes from the Unicode Surrogate Block, they will be treated as unrecognized characters without any support.