/*
 * regex - Regular expression pattern matching and replacement
 *
 *    By: Chas Leis
 *        Bowman Data
 *
 * Added escape and posix-like character classes defined by
 *
 *  Character classes are denoted using the syntax "[:classname:]"
 *  within a set declaration, for example "[[:space:]]" is the set
 *  of all whitespace characters. Character classes are only available
 *  if the flag regbase::char_classes is set. The available character
 *  classes are:
 *
 *  alnum   Any alphanumeric character.
 *  alpha   Any alphabetical character a-z and A-Z.
 *              Other characters may also be included depending upon the locale.
 *  blank   Any blank character, either a space or a tab.
 *  cntrl   Any control character.
 *  digit   Any digit 0-9.
 *  graph   Any graphical character.
 *  lower   Any lower case character a-z. Other characters may also be
 *              included depending upon the locale.
 *  print   Any printable character.
 *  punct   Any punctuation character.
 *  space   Any whitespace character.
 *  upper   Any upper case character A-Z. Other characters may also be
 *              included depending upon the locale.
 *  xdigit  Any hexadecimal digit character, 0-9, a-f and A-F.
 *  word    Any word character - all alphanumeric characters plus the underscore.
 *  unicode Any character whose code is greater than 255; this applies to
 *              the wide character traits classes only.
 *
 *  There are some shortcuts that can be used in place of the character
 *  classes. Provided the flag regbase::escape_in_lists is set, you can use:
 *  \w in place of [:word:]
 *  \s in place of [:space:]
 *  \d in place of [:digit:]
 *  \l in place of [:lower:]
 *  \u in place of [:upper:]
 *  \x in place of [:xdigit:]     personal extension
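 *
 *  As a rough illustration of the two tables above, the sketch below maps
 *  class names and shortcut letters onto <cctype> predicates. The helper
 *  names InCharClass and ClassForEscape are hypothetical and are not part
 *  of this file; "unicode" is left out because it only applies to the
 *  wide character traits classes.
 *
```cpp
#include <cctype>
#include <cstring>

// Hypothetical helper (an assumption, not this file's API): returns true if
// the byte c belongs to the named class. Results for alpha, lower, upper
// and friends follow the current C locale, as the table above notes.
bool InCharClass(const char *name, unsigned char c) {
    if (std::strcmp(name, "alnum") == 0)  return std::isalnum(c) != 0;
    if (std::strcmp(name, "alpha") == 0)  return std::isalpha(c) != 0;
    if (std::strcmp(name, "blank") == 0)  return c == ' ' || c == '\t';
    if (std::strcmp(name, "cntrl") == 0)  return std::iscntrl(c) != 0;
    if (std::strcmp(name, "digit") == 0)  return std::isdigit(c) != 0;
    if (std::strcmp(name, "graph") == 0)  return std::isgraph(c) != 0;
    if (std::strcmp(name, "lower") == 0)  return std::islower(c) != 0;
    if (std::strcmp(name, "print") == 0)  return std::isprint(c) != 0;
    if (std::strcmp(name, "punct") == 0)  return std::ispunct(c) != 0;
    if (std::strcmp(name, "space") == 0)  return std::isspace(c) != 0;
    if (std::strcmp(name, "upper") == 0)  return std::isupper(c) != 0;
    if (std::strcmp(name, "xdigit") == 0) return std::isxdigit(c) != 0;
    // word: all alphanumeric characters plus the underscore
    if (std::strcmp(name, "word") == 0)   return std::isalnum(c) != 0 || c == '_';
    return false; // unknown class name
}

// Hypothetical helper: maps the letter after a backslash to the class name
// it stands for, or returns nullptr if the letter is not a shortcut.
const char *ClassForEscape(char letter) {
    switch (letter) {
    case 'w': return "word";
    case 's': return "space";
    case 'd': return "digit";
    case 'l': return "lower";
    case 'u': return "upper";
    case 'x': return "xdigit";
    default:  return nullptr;
    }
}
```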
 *
 *   Added the optional operator '?' (seems to work OK).
 *
 *   Made changes to the "user interface" routines.
 *   Removed CharacterIterator class.  Use straight char pointers.
 *
 * -------------------------------------------------------------
 *
 *
 * Previously by:  Ozan S. Yigit (oz)
 *                 Dept. of Computer Science
 *                 York University
 *
 * Original code available from http://www.cs.yorku.ca/~oz/
 * Translation to C++ by Neil Hodgson neilh@scintilla.org
 * Removed all use of register.
 * Converted to modern function prototypes.
 * Put all global/static variables into an object so this code can be
 * used from multiple threads etc.
 *
 * These routines are the PUBLIC DOMAIN equivalents of regex
 * routines as found in 4.nBSD UN*X, with minor extensions.
 *
 * These routines are derived from various implementations found
 * in software tools books, and Conroy's grep. They are NOT derived
 * from licensed/restricted software.
 * For more interesting/academic/complicated implementations,
 * see Henry Spencer's regexp routines, or GNU Emacs pattern
 * matching module.
 *
 * Modification history removed.
 *
 * Interfaces:
 *      RegExp::Compile:        compile a regular expression into an NFA.
 *
 *          char *RegExp::Compile(s)
 *          char *s;
 *
 *      RegExp::Search:        execute the NFA to match a pattern.
 *
 *          int RegExp::Search(s)
 *          char *s;
 *
 *      RegExp::ModifyWord:     change RegExp::Search's understanding of what a "word"
 *          looks like (for \< and \>) by adding into the
 *          hidden word-syntax table.
 *
 *          void RegExp::ModifyWord(s)
 *          char *s;
 *
 *      RegExp::Substitute: substitute the matched portions in a new string.
 *
 *          int RegExp::Substitute(src, dst)
 *          char *src;
 *          char *dst;
 *
 *      re_fail:    failure routine for RegExp::Search.
 *
 *          void re_fail(msg, op)
 *          char *msg;
 *          char op;
 *
 * Regular Expressions:
 *
 *      [1]     char    matches itself, unless it is a special
 *                      character (metachar): . \ [ ] * + ^ $
 *
 *      [2]     .       matches any character.
 *
 *      [3]     \       matches the character following it, except
 *                      when followed by a left or right round bracket,
 *                      a digit 1 to 9 or a left or right angle bracket.
 *                      (see [7], [8] and [9])
 *                      It is used as an escape character for all
 *                      other meta-characters, and itself. When used
 *                      in a set ([4]), it is treated as an ordinary
 *                      character.
 *
 *      [4]     [set]   matches one of the characters in the set.
 *                      If the first character in the set is "^",
 *                      it matches a character NOT in the set, i.e.
 *                      complements the set. A shorthand S-E is
 *                      used to specify a set of characters S up to
 *                      E, inclusive. The special characters "]" and
 *                      "-" have no special meaning if they appear
 *                      as the first chars in the set.
 *                      examples:        match:
 *
 *                              [a-z]    any lowercase alpha
 *
 *                              [^]-]    any char except ] and -
 *
 *                              [^A-Z]   any char except uppercase
 *                                       alpha
 *
 *                              [a-zA-Z] any alpha
 *
 *      [5]     *       any regular expression form [1] to [4], followed by
 *                      the closure char (*), matches zero or more
 *                      occurrences of that form.
 *
 *      [6]     +       same as [5], except it matches one or more.
 *
 *      [6a]    ?       same as [5], except it matches zero or one.
 *
 *      [7]             a regular expression in the form [1] to [10], enclosed
 *                      as \(form\) matches what form matches. The enclosure
 *                      creates a set of tags, used for [8] and for
 *                      pattern substitution. The tagged forms are numbered
 *                      starting from 1.
 *
 *      [8]             a \ followed by a digit 1 to 9 matches whatever a
 *                      previously tagged regular expression ([7]) matched.
 *
 *      [9] \<          a regular expression starting with a \< construct
 *          \>          and/or ending with a \> construct, restricts the
 *                      pattern matching to the beginning of a word, and/or
 *                      the end of a word. A word is defined to be a character
 *                      string beginning and/or ending with the characters
 *                      A-Z a-z 0-9 and _. It must also be preceded and/or
 *                      followed by any character outside those mentioned.
 *
 *      [10]            a composite regular expression xy where x and y
 *                      are in the form [1] to [10] matches the longest
 *                      match of x followed by a match for y.
 *
 *      [11]    ^       a regular expression starting with a ^ character
 *              $       and/or ending with a $ character, restricts the
 *                      pattern matching to the beginning of the line,
 *                      or the end of line. [anchors] Elsewhere in the
 *                      pattern, ^ and $ are treated as ordinary characters.
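 *
 *  To make the "longest match" wording in [5] and [10] concrete, here is
 *  a minimal sketch of greedy closure with back-off. It handles only
 *  forms [1], [2] and [5] (literal characters, '.', and '*'), works on
 *  the pattern text directly rather than on the compiled NFA, and uses
 *  the hypothetical names MatchHere and MatchStar. It is an illustration
 *  under those assumptions, not the code in this file.
 *
```cpp
// Returns the number of characters matched at the start of text,
// or -1 if the pattern does not match there.
int MatchHere(const char *re, const char *text);

// Matches zero or more of c, followed by the rest of the pattern.
// The longest run of c is tried first; on failure one character is
// given back at a time, which yields the longest overall match.
int MatchStar(char c, const char *re, const char *text) {
    const char *t = text;
    while (*t != '\0' && (c == '.' || *t == c))
        ++t;                                   // consume as much as possible
    for (;;) {
        int rest = MatchHere(re, t);
        if (rest >= 0)
            return static_cast<int>(t - text) + rest;
        if (t == text)
            return -1;                         // nothing left to give back
        --t;
    }
}

int MatchHere(const char *re, const char *text) {
    if (re[0] == '\0')
        return 0;                              // end of pattern: empty match
    if (re[1] == '*')
        return MatchStar(re[0], re + 2, text);
    if (*text != '\0' && (re[0] == '.' || re[0] == *text)) {
        int rest = MatchHere(re + 1, text + 1);
        return rest < 0 ? -1 : rest + 1;
    }
    return -1;
}
```
 *
 *  In this sketch, MatchHere("fo*o", "fooo") first lets o* take all three
 *  o characters, then gives one back so the final o can match.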
 *
 *
 * Acknowledgements:
 *
 *  HCR's Hugh Redelmeier has been most helpful in various
 *  stages of development. He convinced me to include BOW
 *  and EOW constructs, originally invented by Rob Pike at
 *  the University of Toronto.
 *
 * References:
 *      Software tools                  Kernighan & Plauger
 *      Software tools in Pascal        Kernighan & Plauger
 *      Grep [rsx-11 C dist]            David Conroy
 *      ed - text editor                Un*x Programmer's Manual
 *      Advanced editing on Un*x        B. W. Kernighan
 *      RegExp routines                 Henry Spencer
 *
 * Notes:
 *
 *  This implementation uses a bit-set representation for character
 *  classes for speed and compactness. Each character is represented
 *  by one bit in a 128-bit block. Thus, CCL always takes a
 *  constant 16 bytes in the internal NFA, and RegExp::Search does a single
 *  bit comparison to locate the character in the set.
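 *
 *  The bit-set layout described above can be sketched as follows. The
 *  names CharSet, AddChar, AddRange, Complement and InSet are hypothetical,
 *  and masking each character to 7 bits is an assumption made so that any
 *  byte maps into the 128-bit block; the real routines may differ.
 *
```cpp
#include <cstring>

const int kBitBlock = 16;   // 16 bytes * 8 bits = 128 bits, one per character

struct CharSet {
    unsigned char bits[kBitBlock];
};

void ClearSet(CharSet &set) {
    std::memset(set.bits, 0, sizeof set.bits);
}

// Sets the bit for c: the high 4 of its 7 bits pick the byte,
// the low 3 bits pick the bit within that byte.
void AddChar(CharSet &set, unsigned char c) {
    c &= 0x7F;  // assumption: fold into the 7-bit range
    set.bits[c >> 3] |= static_cast<unsigned char>(1u << (c & 7));
}

// Adds every character from first up to last, inclusive (the S-E shorthand).
void AddRange(CharSet &set, unsigned char first, unsigned char last) {
    for (int c = first; c <= last; ++c)
        AddChar(set, static_cast<unsigned char>(c));
}

// Flips every bit, as a leading "^" in a set would require.
void Complement(CharSet &set) {
    for (int i = 0; i < kBitBlock; ++i)
        set.bits[i] = static_cast<unsigned char>(~set.bits[i]);
}

// Membership is a single index plus one bit test.
bool InSet(const CharSet &set, unsigned char c) {
    c &= 0x7F;
    return (set.bits[c >> 3] & (1u << (c & 7))) != 0;
}
```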
 *
 * Examples:
 *
 *  pattern:    foo*.*
 *  compile:    CHR f CHR o CLO CHR o END CLO ANY END END
 *  matches:    fo foo fooo foobar fobar foxx ...
 *
 *  pattern:    fo[ob]a[rz]
 *  compile:    CHR f CHR o CCL bitset CHR a CCL bitset END
 *  matches:    fobar fooar fobaz fooaz
 *
 *  pattern:    foo\\+
 *  compile:    CHR f CHR o CHR o CHR \ CLO CHR \ END END
 *  matches:    foo\ foo\\ foo\\\  ...
 *
 *  pattern:    \(foo\)[1-3]\1  (same as foo[1-3]foo)
 *  compile:    BOT 1 CHR f CHR o CHR o EOT 1 CCL bitset REF 1 END
 *  matches:    foo1foo foo2foo foo3foo
 *
 *  pattern:    \(fo.*\)-\1
 *  compile:    BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
 *  matches:    foo-foo fo-fo fob-fob foobar-foobar ...
 *
 *
 */