/*
 * regex - Regular expression pattern matching and replacement
 *
 * By:  Chas Leis
 *      Bowman Data
 *
 * Added escape and posix-like character classes defined by
 *
 * Character classes are denoted using the syntax "[:classname:]"
 * within a set declaration, for example "[[:space:]]" is the set
 * of all whitespace characters. Character classes are only available
 * if the flag regbase::char_classes is set. The available character
 * classes are:
 *
 *      alnum   Any alphanumeric character.
 *      alpha   Any alphabetical character a-z and A-Z.
 *              Other characters may also be included depending upon the locale.
 *      blank   Any blank character, either a space or a tab.
 *      cntrl   Any control character.
 *      digit   Any digit 0-9.
 *      graph   Any graphical character.
 *      lower   Any lower case character a-z.
 *              Other characters may also be included depending upon the locale.
 *      print   Any printable character.
 *      punct   Any punctuation character.
 *      space   Any whitespace character.
 *      upper   Any upper case character A-Z.
 *              Other characters may also be included depending upon the locale.
 *      xdigit  Any hexadecimal digit character, 0-9, a-f and A-F.
 *      word    Any word character - all alphanumeric characters plus the underscore.
 *      unicode Any character whose code is greater than 255; this applies to
 *              the wide character traits classes only.
 *
 * There are some shortcuts that can be used in place of the character
 * classes. If the flag regbase::escape_in_lists is set, you can use:
 *
 *      \w in place of [:word:]
 *      \s in place of [:space:]
 *      \d in place of [:digit:]
 *      \l in place of [:lower:]
 *      \u in place of [:upper:]
 *      \x in place of [:xdigit:]       (personal extension)
 *
 * Added the optional operator '?' (seems to work ok).
 *
 * Made changes to the "user interface" routines.
 * Removed CharacterIterator class. Use straight char pointers.
 *
 * -------------------------------------------------------------
 *
 * Previously by:  Ozan S. Yigit (oz)
 *                 Dept. of Computer Science
 *                 York University
 *
 * Original code available from http://www.cs.yorku.ca/~oz/
 * Translation to C++ by Neil Hodgson neilh@scintilla.org
 * Removed all use of register.
 * Converted to modern function prototypes.
 * Put all global/static variables into an object so this code can be
 * used from multiple threads etc.
 *
 * These routines are the PUBLIC DOMAIN equivalents of regex
 * routines as found in 4.nBSD UN*X, with minor extensions.
 *
 * These routines are derived from various implementations found
 * in software tools books, and Conroy's grep. They are NOT derived
 * from licensed/restricted software.
 * For more interesting/academic/complicated implementations,
 * see Henry Spencer's regexp routines, or GNU Emacs pattern
 * matching module.
 *
 * Modification history removed.
 *
 * Interfaces:
 *      RegExp::Compile:        compile a regular expression into an NFA.
 *
 *                      char *RegExp::Compile(s)
 *                      char *s;
 *
 *      RegExp::Search:         execute the NFA to match a pattern.
 *
 *                      int RegExp::Search(s)
 *                      char *s;
 *
 *      RegExp::ModifyWord:     change RegExp::Search's understanding of what
 *                              a "word" looks like (for \< and \>) by adding
 *                              into the hidden word-syntax table.
 *
 *                      void RegExp::ModifyWord(s)
 *                      char *s;
 *
 *      RegExp::Substitute:     substitute the matched portions in a new string.
 *
 *                      int RegExp::Substitute(src, dst)
 *                      char *src;
 *                      char *dst;
 *
 *      re_fail:                failure routine for RegExp::Search.
 *
 *                      void re_fail(msg, op)
 *                      char *msg;
 *                      char op;
 *
 * Regular Expressions:
 *
 *      [1]     char    matches itself, unless it is a special
 *                      character (metachar): . \ [ ] * + ^ $
 *
 *      [2]     .       matches any character.
 *
 *      [3]     \       matches the character following it, except
 *                      when followed by a left or right round bracket,
 *                      a digit 1 to 9 or a left or right angle bracket.
 *                      (see [7], [8] and [9])
 *                      It is used as an escape character for all
 *                      other meta-characters, and itself. When used
 *                      in a set ([4]), it is treated as an ordinary
 *                      character.
 *
 *      [4]     [set]   matches one of the characters in the set.
 *                      If the first character in the set is "^",
 *                      it matches a character NOT in the set, i.e.
 *                      complements the set. A shorthand S-E is
 *                      used to specify a set of characters S up to
 *                      E, inclusive. The special characters "]" and
 *                      "-" have no special meaning if they appear
 *                      as the first chars in the set.
 *                      examples:        match:
 *
 *                              [a-z]    any lowercase alpha
 *
 *                              [^]-]    any char except ] and -
 *
 *                              [^A-Z]   any char except uppercase
 *                                       alpha
 *
 *                              [a-zA-Z] any alpha
 *
 *      [5]     *       any regular expression form [1] to [4], followed by
 *                      closure char (*) matches zero or more matches of
 *                      that form.
 *
 *      [6]     +       same as [5], except it matches one or more.
 *
 *      [6a]    ?       same as [5], except it matches zero or one.
 *
 *      [7]             a regular expression in the form [1] to [10], enclosed
 *                      as \(form\) matches what form matches. The enclosure
 *                      creates a set of tags, used for [8] and for
 *                      pattern substitution. The tagged forms are numbered
 *                      starting from 1.
 *
 *      [8]             a \ followed by a digit 1 to 9 matches whatever a
 *                      previously tagged regular expression ([7]) matched.
 *
 *      [9]     \<      a regular expression starting with a \< construct
 *              \>      and/or ending with a \> construct, restricts the
 *                      pattern matching to the beginning of a word, and/or
 *                      the end of a word. A word is defined to be a character
 *                      string beginning and/or ending with the characters
 *                      A-Z a-z 0-9 and _. It must also be preceded and/or
 *                      followed by any character outside those mentioned.
 *
 *      [10]            a composite regular expression xy where x and y
 *                      are in the form [1] to [10] matches the longest
 *                      match of x followed by a match for y.
 *
 *      [11]    ^       a regular expression starting with a ^ character
 *              $       and/or ending with a $ character, restricts the
 *                      pattern matching to the beginning of the line,
 *                      or the end of line. [anchors] Elsewhere in the
 *                      pattern, ^ and $ are treated as ordinary characters.
 *
 *
 * Acknowledgements:
 *
 *      HCR's Hugh Redelmeier has been most helpful in various
 *      stages of development. He convinced me to include BOW
 *      and EOW constructs, originally invented by Rob Pike at
 *      the University of Toronto.
 *
 * References:
 *      Software tools                  Kernighan & Plauger
 *      Software tools in Pascal        Kernighan & Plauger
 *      Grep [rsx-11 C dist]            David Conroy
 *      ed - text editor                Un*x Programmer's Manual
 *      Advanced editing on Un*x        B. W. Kernighan
 *      RegExp routines                 Henry Spencer
 *
 * Notes:
 *
 *      This implementation uses a bit-set representation for character
 *      classes for speed and compactness. Each character is represented
 *      by one bit in a 128-bit block. Thus, CCL always takes a
 *      constant 16 bytes in the internal nfa, and RegExp::Search does a single
 *      bit comparison to locate the character in the set.
 *
 * Examples:
 *
 *      pattern:        foo*.*
 *      compile:        CHR f CHR o CLO CHR o END CLO ANY END END
 *      matches:        fo foo fooo foobar fobar foxx ...
 *
 *      pattern:        fo[ob]a[rz]
 *      compile:        CHR f CHR o CCL bitset CHR a CCL bitset END
 *      matches:        fobar fooar fobaz fooaz
 *
 *      pattern:        foo\\+
 *      compile:        CHR f CHR o CHR o CHR \ CLO CHR \ END END
 *      matches:        foo\ foo\\ foo\\\ ...
 *
 *      pattern:        \(foo\)[1-3]\1  (same as foo[1-3]foo)
 *      compile:        BOT 1 CHR f CHR o CHR o EOT 1 CCL bitset REF 1 END
 *      matches:        foo1foo foo2foo foo3foo
 *
 *      pattern:        \(fo.*\)-\1
 *      compile:        BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END
 *      matches:        foo-foo fo-fo fob-fob foobar-foobar ...
 *
 */
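/*
 * Illustration only: the Notes above describe character classes stored as
 * one bit per character in a 128-bit (16-byte) block. A minimal sketch of
 * that idea might look like the following. The names CharSet, add, addRange
 * and contains are hypothetical and are not taken from the RegExp class;
 * the byte/bit layout (c >> 3, c & 7) is an assumption, not necessarily the
 * layout used in the compiled nfa.
 */

```cpp
#include <array>
#include <cstddef>

// Hypothetical 128-bit character set: one bit per 7-bit character code.
struct CharSet {
    static constexpr std::size_t kBits  = 128;
    static constexpr std::size_t kBytes = kBits / 8;   // constant 16 bytes

    std::array<unsigned char, kBytes> bits{};          // all bits start cleared

    // Mark a single character as a member; codes >= 128 are ignored.
    void add(unsigned char c) {
        if (c < kBits)
            bits[c >> 3] |= static_cast<unsigned char>(1u << (c & 7));
    }

    // Mark the inclusive range lo..hi, as in the S-E shorthand of rule [4].
    void addRange(unsigned char lo, unsigned char hi) {
        for (unsigned int c = lo; c <= hi; ++c)
            add(static_cast<unsigned char>(c));
    }

    // Membership is one index plus one bit test.
    bool contains(unsigned char c) const {
        return c < kBits && (bits[c >> 3] & (1u << (c & 7))) != 0;
    }
};
```

/*
 * With this sketch, the set [ob] from the example pattern fo[ob]a[rz] would
 * be built with add('o') and add('b'); a complemented set such as [^A-Z]
 * could be formed by inverting every byte after addRange('A', 'Z').
 */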