NAME

     wordtree - perform set operations on lists


SYNOPSIS

     wordtree [-v] [-r] [-c] -u -d <file-to-process>

     wordtree [-v] [-r]  [-x  ext]  [-i]  -d  <wordlist>  ...  -f
     <file-to-process> ...

     wordtree [-v] [-r] [-x ext] [-i] -s  -d  <wordlist>  ...  -f
     <file-to-process> ...

     wordtree [-v] [-r] [-x ext] -t [number] -d  <tail-file>  ...
     -f <file-to-process> ...



DESCRIPTION

     wordtree can perform set operations on lists of words.  For
     example, one can print a list of the words that are found in
     a text file but not in a database file.  Since wordtree uses
     a tree structure to hold the characters of the words,
     checking whether a word is in a given language (or in a set)
     is reasonably fast.  Each file must contain one word per
     line, without leading or trailing spaces or punctuation
     marks.  It makes no difference whether the words in these
     files are sorted.  First, all words from the database files
     (option -d) are loaded into the tree.  Then all files given
     by the -f option are checked against it.  All checking is
     done in lowercase, so "heLLo" and "HEllo" are treated as the
     same word.  The language specific chars [ ] are also
     supported.
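
     The lookup described above can be sketched as follows.  This
     is a minimal illustration in Python, not the actual wordtree
     implementation: the names WordTree and words_not_in_database
     are hypothetical, and a dict per node is assumed in place of
     the fixed-degree tree that wordtree itself uses.

```python
class WordTree:
    """Minimal trie sketch: one node per character, shared prefixes stored once."""

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a complete word ends at this node

    def insert(self, word):
        node = self
        for ch in word.lower():  # all checking is done in lowercase
            node = node.children.setdefault(ch, WordTree())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for ch in word.lower():
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


def words_not_in_database(database_words, text_words):
    """Return the text words that are missing from the database (default mode)."""
    tree = WordTree()
    for word in database_words:
        tree.insert(word)
    return [word for word in text_words if word not in tree]
```

     Each lookup walks one node per character, so its cost
     depends on the length of the word and not on the number of
     words stored.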

     wordtree can read its default options from a personal
     options file, ~/.wordtreerc.  wordtree also understands
     long form arguments.  Type wordtree --help to get a full
     listing.



OPTIONS

     -d filename
          Use filename as input list to build the database. All
          following arguments that are not flags are treated as
          additional files to read. If they contain words not
          already in the database tree, these words are added.


     -f filename
          The file filename is processed. All following arguments
          that are not flags are treated as additional files to
          process. The resulting output is placed in the file
          filename.extension. The extension defaults to out and
          can be changed with the -x option.

     -u    Print a sorted list of the unique words in the database.
          Since all words are stored in a tree, sorting is simply
          done by recursively traversing the tree and printing all
          words in it. No real sort operation is needed because
          the words are already inserted in the right order.
          Unlike all other operations, which place their result
          in a file, this one prints its result to stdout.


     -c    Count the occurrences of each word. This option is
          only valid in conjunction with the -u flag. Sometimes
          it is useful to pipe the output through the command
          sort -r.


     -t    Tail mode on. Print those words which have a postfix
          (suffix) listed in the database.  If the flag is
          followed by a number, a check matches only if the found
          postfix is at least that many characters long.


     -s    Split mode on. Search the text file for words that are
          formed of two words from the database.


     -i    Invert mode on. Print a list of words that are in both
          the database and the text file. Normal operation is to
          print those words that are NOT already in the database.
          This flag does nothing if used together with -t. If
          used in conjunction with -s, wordtree prints a list of
          words that are not formed of two words from the
          database.


     -r    Toggle revert mode flag. If revert mode is on, all
          words from the database files are inserted in reversed
          form. This is useful for finding pairs of words that
          form a palindrome when concatenated.



     -x extension
          Append <dot>extension  to  the  filename  of  processed
          files to form the name of the results file. The default
          extension is out.
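
     The split (-s) and tail (-t) checks described above can be
     sketched as follows.  This is a hedged illustration in
     Python, not wordtree's own code: the function names are
     hypothetical, a plain set stands in for the tree, and the
     sketch assumes that a whole word counts as its own postfix.

```python
def split_matches(database, words, invert=False):
    """Sketch of -s: keep words formed of two database words.

    With invert=True (as with -i), keep the words that are NOT
    formed of two database words.
    """
    db = {w.lower() for w in database}
    result = []
    for word in words:
        w = word.lower()
        # Try every position that splits the word into two non-empty parts.
        is_split = any(w[:i] in db and w[i:] in db for i in range(1, len(w)))
        if is_split != invert:
            result.append(word)
    return result


def tail_matches(tails, words, min_length=1):
    """Sketch of -t: keep words ending in a listed postfix.

    A postfix only counts if it has at least min_length characters.
    """
    db = {t.lower() for t in tails}
    result = []
    for word in words:
        w = word.lower()
        # w[i:] is the postfix starting at position i; its length is len(w) - i.
        if any(len(w) - i >= min_length and w[i:] in db for i in range(len(w))):
            result.append(word)
    return result
```

     For instance, with the database words "sun" and "flower",
     split_matches would keep "sunflower"; with the tails "ing"
     and "ed" and min_length=3, tail_matches would keep "walking"
     but not "walked".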




ENVIRONMENT

     The HOME variable is used to determine the file to read
     default options from.  If a file .wordtreerc also exists
     in the current directory, that one is used instead.
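
     The lookup order described above can be sketched as follows.
     This is an illustration in Python only: find_options_file
     and its parameters are hypothetical names, and wordtree's
     actual lookup code may differ in detail.

```python
import os


def find_options_file(cwd=None, home=None, name=".wordtreerc"):
    """Return the options file to use, or None if there is none.

    A file in the current directory takes precedence over the
    one in the directory named by HOME.
    """
    cwd = os.getcwd() if cwd is None else cwd
    home = os.environ.get("HOME") if home is None else home

    local = os.path.join(cwd, name)
    if os.path.isfile(local):
        return local

    if home:
        personal = os.path.join(home, name)
        if os.path.isfile(personal):
            return personal
    return None
```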


BUGS

     Flags cannot be concatenated. For example  -vcud   will  not
     work.

     wordtree is very memory hungry. Due to the design (a
     tree with degree 47), every char from the word lists needs
     about 6 to 10 bytes in memory.  Several optimizations have
     already been made.  The efficiency is better with larger
     wordlists, because a tree provides some kind of
     compression for the prefixes of the words. Nevertheless, if
     the whole tree fits into your memory, the time for accepting
     a word is only affected by the length of the word,
     regardless of how many words are in the database.  In other
     words, the cost function is O(l)+O(n)*m where l is the
     length of the database, n is the length of the word being
     checked and m is the number of checks.
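
     The memory figure above allows a rough estimate before
     loading a large word list.  The sketch below is a
     back-of-the-envelope calculation in Python that simply
     applies the 6 to 10 bytes per character quoted above; the
     function name is hypothetical and real usage will vary with
     how many prefixes the words share.

```python
def estimate_memory_bytes(words, low=6, high=10):
    """Return a (low, high) estimate of the tree size in bytes.

    Assumes every character of the word list costs between
    `low` and `high` bytes, as quoted in the text above.
    """
    total_chars = sum(len(word) for word in words)
    return total_chars * low, total_chars * high
```

     A list containing 10 characters in total would thus be
     estimated at 60 to 100 bytes.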


SEE ALSO

     tricount(1), sort(1), uniq(1)



AUTHOR

     Thomas  Pantzer,  pantec@informatik.uni-leipzig.de,  1999.
     wordtree is inspired by a Turbo Pascal program written by
     Thomas Pantzer in 1993.