NAME
wordtree - perform set operations on lists
SYNOPSIS
wordtree [-v] [-r] [-c] -u -d <file-to-process>
wordtree [-v] [-r] [-x ext] [-i] -d <wordlist> ... -f
<file-to-process> ...
wordtree [-v] [-r] [-x ext] [-i] -s -d <wordlist> ... -f
<file-to-process> ...
wordtree [-v] [-r] [-x ext] -t [number] -d <tail-file> ...
-f <file-to-process> ...
DESCRIPTION
wordtree performs set operations on lists of words. For
example, it can print the words that occur in a text file
but not in a database file. Because wordtree stores the
characters of each word in a tree structure, checking
whether a word is in a given language (or set) is reasonably
fast. Each file must contain one word per line, without
leading or trailing spaces or punctuation marks. The words
in these files need not be sorted.
First, all words from the database files (option -d) are
loaded into the tree. Then every file given with the -f
option is checked against it. All checking is done in
lowercase, so "heLLo" and "HEllo" are treated as the
same word. The language specific chars [
] are also supported.
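The default check described above can be sketched as follows. This is a minimal illustration, not wordtree's actual code: the class WordTree and the function missing_words are hypothetical names, nested dictionaries stand in for the real tree nodes, and returning the words in their original spelling is an assumption.

```python
class WordTree:
    """Hypothetical character tree (trie) for illustration only."""

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word.lower():          # all checking is done in lowercase
            node = node.setdefault(ch, {})
        node[""] = True                  # empty key marks the end of a word

    def __contains__(self, word):
        node = self.root
        for ch in word.lower():
            node = node.get(ch)
            if node is None:             # no branch for this character
                return False
        return "" in node                # a prefix alone is not a word


def missing_words(database_words, text_words):
    """Return the text words that are NOT in the database (default mode)."""
    tree = WordTree()
    for word in database_words:
        tree.insert(word)
    return [word for word in text_words if word not in tree]
```

For instance, missing_words(["hello", "world"], ["HEllo", "tree"]) keeps only "tree", because "HEllo" matches "hello" after lowercasing.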
wordtree can read its default options from a personal
options file, ~/.wordtreerc. wordtree also understands
long-form arguments. Type wordtree --help to get a full
listing.
OPTIONS
-d filename
Use filename as an input list to build the database. All
following arguments that are not flags are treated as
additional files to read. Any words they contain that are
not already in the database tree are added.
-f filename
Process the file filename. All following arguments
that are not flags are treated as additional files to
process. The resulting output is placed in the file
filename.extension. The extension defaults to out and
can be changed with the -x option.
-u Print a sorted list of the unique words in the database.
Because all words are stored in a tree, sorting is simply
done by recursively traversing the tree and printing all
words in it. No separate sort operation is needed because
the words are already inserted in the right order. Unlike
all other operations, which place their result in a file,
this one prints its result to stdout.
-c Count the occurrences of each word. This option is
only valid in conjunction with the -u flag. It is sometimes
useful to pipe the output through the command sort -r.
-t Tail mode on. Print those words that end in a postfix
(suffix) listed in the database. If -t is followed by a
number, a check matches only if the length of the postfix
found is not shorter than this number.
-s Split mode on. Search the text file for words that are
formed of two words from the database.
-i Invert mode on. Print a list of words that are in both
the database and the text file. Normal operation is to
print those words that are NOT already in the database.
This flag does nothing if used together with -t. If
used in conjunction with -s, wordtree prints a list of
words that are not formed of two words from the
database.
-r Toggle revert mode flag. If revert mode is on, all
words from the database files are inserted spelled
backwards. This is useful for finding pairs of words
that form a palindrome when concatenated.
-x extension
Append .extension to the filename of each processed
file to form the name of the results file. The default
extension is out.
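The -u/-c, -s and -t modes described above can be sketched on a small tree as follows. This is an illustrative sketch under stated assumptions, not wordtree's implementation: all function names are hypothetical, nested dictionaries stand in for the tree nodes, the sort order is Python's code-point order (the real tool may order characters differently), and has_tail assumes that a word which is itself in the database counts as its own postfix.

```python
END = ""  # key marking the end of a word; its value is the occurrence count


def insert(root, word):
    node = root
    for ch in word.lower():
        node = node.setdefault(ch, {})
    node[END] = node.get(END, 0) + 1


def contains(root, word):
    node = root
    for ch in word.lower():
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def walk(node, prefix=""):
    """Yield (word, count) pairs in sorted order, as -u with -c would list."""
    if END in node:
        yield prefix, node[END]
    for ch in sorted(key for key in node if key != END):
        yield from walk(node[ch], prefix + ch)


def is_split(root, word):
    """Split mode: the word is formed of two database words."""
    return any(contains(root, word[:i]) and contains(root, word[i:])
               for i in range(1, len(word)))


def has_tail(root, word, min_len=1):
    """Tail mode: the word ends in a database word of at least min_len chars."""
    return any(contains(root, word[i:])
               for i in range(0, len(word) - min_len + 1))
```

With the database words pan, cake, Pan and ache, walk lists ache, cake and pan in that order with pan counted twice, is_split accepts "pancake", and has_tail accepts "headache" for a minimum length of 4 but not 5.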
ENVIRONMENT
The HOME variable is used to locate the file from which
default options are read. If a file .wordtreerc also exists
in the current directory, that file is used instead.
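The lookup order described above can be sketched as follows. The function name find_options_file and its parameters are hypothetical, and how wordtree behaves when HOME is unset is not documented; the sketch simply assumes that case yields no home candidate.

```python
import os


def find_options_file(cwd=".", home=None):
    """Return the path of the options file to read, or None if there is none."""
    if home is None:
        home = os.environ.get("HOME")
    # A .wordtreerc in the current directory takes precedence.
    candidates = [os.path.join(cwd, ".wordtreerc")]
    if home:
        candidates.append(os.path.join(home, ".wordtreerc"))
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```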
BUGS
Flags cannot be concatenated. For example -vcud will not
work.
wordtree is very memory hungry. Due to the design (a
tree with degree 47), every char from the word lists needs
about 6 to 10 bytes of memory. Several optimizations have
already been made. Efficiency improves with larger
word lists, because a tree in effect compresses the
shared prefixes of the words. Nevertheless, if
the whole tree fits into memory, the time to check
a word depends only on the length of the word,
regardless of how many words are in the database. In other
words, the cost function is O(l)+O(n)*m, where l is the
length of the database, n is the length of the word being
checked, and m is the number of checks.
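The prefix compression mentioned above can be illustrated by comparing the number of characters in a word list with the number of tree nodes needed to store it. This is a sketch with hypothetical names; nested dictionaries stand in for the real degree-47 nodes, so it shows node counts only and says nothing about wordtree's actual byte usage.

```python
def build(words):
    """Build a character tree from nested dicts; "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def count_nodes(node):
    """Count the character nodes below node (end markers excluded)."""
    return sum(1 + count_nodes(child)
               for key, child in node.items() if key != "")


words = ["car", "card", "care", "cart"]
total_chars = sum(len(w) for w in words)   # 15 characters in the list
nodes = count_nodes(build(words))          # 6 nodes: c, a, r, d, e, t
```

The four words share the prefix "car", so 15 input characters collapse into 6 nodes. Looking up a word still follows one branch per character, which is why the check time depends on the word length rather than on the size of the database.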
SEE ALSO
tricount(1), sort(1), uniq(1)
AUTHOR
Thomas Pantzer, pantec@informatik.uni-leipzig.de, 1999
wordtree is inspired by a Turbo Pascal program written by
Thomas Pantzer in 1993.