How to write a fast catch-all RegExp

In How to write a safe catch-all RegExp I suggested using (?:\w|\W)* to match any character in a regular expression. That expression is correct and safe, and the same holds for its siblings (?:\s|\S)* and (?:\d|\D)*.

On a large text, however, these expressions perform poorly. I’ve prepared a simple test page where GeSHi’s engine file, which is almost 120KB, is matched against the regular expression you enter.

With (?:\w|\W)*, Firefox 2 performs quite well, at about 100 ms on my PC, but Internet Explorer 7 takes more than 7 minutes!!!

The best catch-all regular expression is [\w\W]*, which takes about 50 ms in FF2 and 0 ms in IE7!!! (yes, zero milliseconds)
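To try the comparison outside the test page, here is a rough timing sketch. The timeMatch helper and the repeated stand-in subject are assumptions made for illustration (the original test used GeSHi’s engine file), and a current engine will not necessarily show the gap reported above for IE7.

```javascript
// Hypothetical helper: runs one regular expression against a subject
// and reports the elapsed time and the length of the matched text.
function timeMatch(regex, subject) {
  const start = Date.now();
  const match = regex.exec(subject);
  return { ms: Date.now() - start, length: match ? match[0].length : -1 };
}

// Stand-in subject of roughly 120KB (assumption: any large text will do).
const subject = "function f(a, b) {\n  return a + b; // sum\n}\n".repeat(2700);

// Alternation inside a group versus a single character class.
const alternation = timeMatch(/(?:\w|\W)*/, subject);
const charClass = timeMatch(/[\w\W]*/, subject);

console.log("(?:\\w|\\W)* took", alternation.ms, "ms");
console.log("[\\w\\W]*    took", charClass.ms, "ms");
```

Both expressions should consume the whole subject; only the time they need differs.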

Chili 2.0 Released Today

UPDATE: Chili 2.1 has been released

Changes

  • added support for a much better recipe format
  • optimized regular expressions for speed
  • moved project hosting to Google Code: jquery-chili-js
  • removed support for the previous recipe format
  • removed support for metaobjects
  • removed support for nearly obsolete features

A new recipe format

Chili 2.0 supports a new recipe format which is a bit more structured than the old one and will make hard-to-highlight languages a thing of the past. BTW, Chili 2.0 won’t accept old recipes, although a simple manual conversion is possible.

How to convert old recipes to the new format

First of all let’s see an example of how to convert a recipe in the old format to the new one.

What follows is part of a recipe (for JavaScript) in the old format:

{[ .js-old | .hilite( =javascript= ) ]}

And here is the same part in the new format:

{[ .js-new | .hilite( =javascript= ) ]}

As you see, they are very similar. Here are all the differences:

  1. property names beginning with an underscore are reserved words
  2. _name didn’t exist before
  3. _case replaces the old ignoreCase; _case is true if the language is case sensitive, and false otherwise; _case defaults to false
  4. _main replaces the old steps
  5. _match replaces the old exp
  6. _style didn’t exist before
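The renames in the list above can be sketched as a small converter. This is only an illustration, not part of Chili: convertRecipe is a hypothetical helper, and it assumes an old recipe is a plain object with an optional ignoreCase flag and a steps object whose entries carry an exp property. Styles are left out, since _style has no counterpart in the old format.

```javascript
// Hypothetical converter from the old recipe shape to the new one.
// Assumed old shape: { ignoreCase: boolean, steps: { stepName: { exp: RegExp } } }
function convertRecipe(oldRecipe, name) {
  const main = {};
  for (const stepName of Object.keys(oldRecipe.steps || {})) {
    // _match replaces the old exp
    main[stepName] = { _match: oldRecipe.steps[stepName].exp };
  }
  return {
    _name: name,                   // did not exist before
    _case: !oldRecipe.ignoreCase,  // true means case sensitive, so the flag is inverted
    _main: main                    // replaces the old steps
  };
}

// Usage with a made-up one-step recipe.
const oldJs = {
  ignoreCase: false,
  steps: { ml_comment: { exp: /\/\*[\w\W]*?\*\// } }
};
const newJs = convertRecipe(oldJs, "js");
console.log(Object.keys(newJs._main));
```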

_style makes a big difference with respect to the past: CSS styles can now be embedded into the recipe. Separate stylesheets are no longer supported by the autoloading engine, but you can load them by yourself, if you prefer to keep them apart.

Another important difference about CSS is that Chili now builds the class associated with a step by prefixing the step name with the recipe’s _name, separated by __ (a double underscore). So, for example, the class for the multiline comment will be js__ml_comment.
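As a one-line sketch of that naming rule (stepClass is a hypothetical name, not a Chili function):

```javascript
// Joins the recipe _name and the step name with a double underscore.
function stepClass(recipeName, stepName) {
  return recipeName + "__" + stepName;
}

console.log(stepClass("js", "ml_comment")); // js__ml_comment
```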

More expressive power with the new recipe format

Besides the minor changes described in the previous section, the new format is much more powerful thanks to the major improvement I’m going to describe here.

The old recipe format supported an optional replacement property of a step, by means of which you could customize how the highlighting was applied to the matched text. Such a feature was useful when you captured subexpressions and wanted to highlight them separately.

Now _replace replaces replacement (sic) and it’s still optional.

As before, if you don’t specify a _replace property, Chili will default to <span class="$0">$$</span>, where $0 and $$ refer to the name of the current step and the matched text respectively.
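To make the two placeholders concrete, here is a minimal sketch of how such a template could be expanded. It is not Chili’s actual code: expandReplace and escapeHtml are hypothetical helpers, and escaping the matched text for HTML is an assumption.

```javascript
// Escapes the characters that are significant in HTML text.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Hypothetical expansion of a string _replace:
// $0 becomes the class of the current step, $$ becomes the matched text.
function expandReplace(template, stepClass, matched) {
  return template.replace(/\$(\$|0)/g, function (token, kind) {
    return kind === "0" ? stepClass : escapeHtml(matched);
  });
}

const DEFAULT_REPLACE = '<span class="$0">$$</span>';
const out = expandReplace(DEFAULT_REPLACE, "js__ml_comment", "/* a < b */");
console.log(out); // <span class="js__ml_comment">/* a &lt; b */</span>
```

A replacer function is used instead of a replacement string so that dollar signs in the matched text are not reinterpreted by String.prototype.replace.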

As before, _replace can also be a different string expression, like in the following step, extracted from the MySQL recipe

{[ .replace-string | .hilite( =javascript= ) ]}

What to note in the above example:

  1. you can specify a style for each of the involved classes, by using properties of an object
  2. you must use a span for applying style to a text run

A big improvement

In Chili 2.0, _replace can also be a function, like in the following step, extracted from the new HTML recipe

{[ .replace-function | .hilite( =javascript= ) ]}

What to note in the above example:

  1. when _replace is a function, it receives match and submatches as arguments, and inside
  2. there is a valid this object which contains a magic x method, by which
  3. a string can be escaped for HTML (like open and close), or
  4. a string can be transformed by an expression (like content)

Chili 2.0 recipes are modular

A Chili 2.0 recipe contains blocks (like _main), which contain steps (like tag_start). An expression can be built for referencing each module, be it a recipe, a block, or a step. For example, /tag_attrs is the tag_attrs block in the current recipe.

One method to highlight them all

The JavaScript code inside a _replace function can use the x method of this.

x takes two arguments: a subject to process, and an optional module to use.

x returns the subject escaped for HTML if no module is given or the module is not available; otherwise it returns the result of applying the module to the subject using Chili 2.0.

If the ChiliBook option recipeLoading is true, any unavailable module will be automatically loaded.
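The behavior of x described above can be sketched as follows. This is an illustration under stated assumptions, not Chili’s implementation: the modules registry, the fake js highlighter, and the way the step is invoked through String.prototype.replace are all made up, and automatic loading of unavailable modules is left out.

```javascript
// Escapes the characters that are significant in HTML text.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical registry of available modules: module path -> highlighter.
const modules = {
  js: function (subject) {
    return '<span class="js">' + escapeHtml(subject) + "</span>";
  }
};

// Sketch of x: escape when no module is given or the module is unavailable,
// otherwise apply the module to the subject.
function x(subject, module) {
  if (!module || !Object.prototype.hasOwnProperty.call(modules, module)) {
    return escapeHtml(subject);
  }
  return modules[module](subject);
}

// A made-up step whose _replace is a function. It receives the match and
// the submatches as arguments and reaches x through this.
const step = {
  _match: /(<script>)([\w\W]*?)(<\/script>)/,
  _replace: function (all, open, content, close) {
    return this.x(open) + this.x(content, "js") + this.x(close);
  }
};

// Stand-in for the engine: call _replace with a this object exposing x.
const html = "<script>a<b</script>".replace(step._match, function (...args) {
  return step._replace.apply({ x: x }, args);
});
console.log(html);
```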

The new HTML recipe

As an example, here is the new HTML recipe

{[ .html-new | .hilite( =javascript= ) ]}

What to note in the above example:

  1. a _replace function can be used for applying a recipe to a text run inside another recipe, like the script and style steps, where highlighting of script and style elements is delegated to the js and css recipes respectively
  2. a _replace function can be used for isolating the parsing of a text run, like the tag_attrs step, where highlighting of name/value pairs happens only in the context of tag attributes

Module paths

A module path is an expression that identifies a Chili 2.0 module. A path has three components (though some can be hidden) separated by a / (forward slash), each with a specific meaning: recipe / block / step. (white space added for clarity)

Here is a list of all the combinations in a module path:

  • recipe
    a module path like css refers to the entire css recipe
  • recipe / block
    a module path like css/definition refers to the definition block of the css recipe
  • recipe / block / step
    a module path like css/definition/property refers to the property step of the definition block of the css recipe
  • / block
    a module path like /definition refers to the definition block of the current recipe
  • / block / step
    a module path like /definition/property refers to the property step of the definition block of the current recipe
  • / / step
    a module path like //property refers to the property step of the current block of the current recipe

As you see, leading slashes have a meaning.
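The combinations listed above can be sketched as a small resolver. resolvePath and the shape of its current argument are hypothetical, not part of Chili, and only the six listed forms are handled.

```javascript
// Hypothetical resolver: splits a module path into recipe / block / step
// and fills hidden components from the current context.
// A null block or step means the path does not go that deep.
function resolvePath(path, current) {
  const parts = path.split("/");
  if (parts.length > 3) {
    throw new Error("too many components in module path: " + path);
  }
  const [recipe, block, step] = parts;
  return {
    recipe: recipe || current.recipe,               // empty: current recipe
    block: block || (step ? current.block : null),  // empty before a step: current block
    step: step || null
  };
}

// Usage: pretend the engine is inside the definition block of the css recipe.
const here = { recipe: "css", block: "definition" };
console.log(resolvePath("js", here));
console.log(resolvePath("/definition/property", here));
console.log(resolvePath("//property", here));
```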

Remember
  • a recipe module invocation tries to match all the steps of the _main block
  • a block module invocation tries to match all its steps
  • a step module invocation tries to match just itself
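These three rules can be sketched as a function that lists the steps an invocation would try. Everything here is illustrative: stepsToTry, the target object, and the step names in the sample css recipe are assumptions, not Chili’s own code or recipe.

```javascript
// Made-up recipe table; block and step names are for illustration only.
const recipes = {
  css: {
    _name: "css",
    _main: {
      comment: { _match: /\/\*[\w\W]*?\*\// },
      selector: { _match: /[^{}]+(?=\{)/ }
    },
    definition: {
      property: { _match: /[\w-]+(?=\s*:)/ },
      value: { _match: /:[^;}]+/ }
    }
  }
};

// Hypothetical helper: target is { recipe, block, step }, with null for
// components that the module path does not name.
function stepsToTry(recipes, target) {
  if (target.step) {
    return [target.step];                     // a step tries just itself
  }
  const recipe = recipes[target.recipe];
  const block = recipe[target.block || "_main"];
  return Object.keys(block);                  // a block (or _main) tries all its steps
}

console.log(stepsToTry(recipes, { recipe: "css", block: null, step: null }));
console.log(stepsToTry(recipes, { recipe: "css", block: "definition", step: null }));
console.log(stepsToTry(recipes, { recipe: "css", block: "definition", step: "property" }));
```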

Help request

I think that Chili 2.0 is pretty good at highlighting, but it needs more good recipes to succeed. For this release I’ve rewritten some from scratch and converted some others. I’m not a good programmer in languages other than the ones for which I rewrote a recipe. If you are, and you have the time and the will, you could write a Chili 2.0 recipe for your favorite language, together with a couple of working samples, and send it all to me. I’d be very happy to add your contributed recipes to the project as soon as they are available.

Rewritten
  • CSS
  • HTML
  • JavaScript
  • PHP
Converted
  • C++
  • C#
  • Delphi
  • Java
  • LotusScript
  • MySQL

Setup and Examples

Here is the start page for Chili 2.0 where you’ll find setup instructions and some examples.