This document explains the purpose of search expressions and describes their syntax.
In HTML and other tag-type documents (style sheets, XML documents, etc.), the most common type of text search involves the definition of a start text segment and an end text segment. Typically, one is looking for a text segment within a specific tag (e.g. script segments) or for a text segment that contains the definition of a parameter (e.g. SRC="embedded_picture"). A simple text search cannot accomplish such a task. Regular expressions are a powerful search language for any type of search task, but building a regular expression requires programming skills. Search expressions are a simpler search technology designed especially for tag-type documents.
Search expressions are constructed to search for text segments that start with a well-defined word chain and end with another word chain. They are most efficient at finding parameter definitions and tagged text within a tag-type document. They are not appropriate for other types of documents, nor for searching within the natural-language sections of any document.
Tag-type documents typically contain:
Tagged text, such as <TAG1 bla bla bla </TAG1>
Parameter definition such as: PARAM1 = "value2"
Tags and parameters are case insensitive in most documents: <TAG and <tag are identical tags, and src = "value" and SRC="value" both represent the same SRC parameter definition.
To find any text contained within a specific tag, one can define an expression with a start search word (the start tag), an "any" wildcard character (AWC) followed by an end search word. For the above example and using * as the AWC, the search expression would be: <TAG1*</TAG1>.
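For readers who know regular expressions, the tagged-text search above can be approximated in Python. This is only a sketch of the idea, not the search expression engine itself: the function name tagged_text_pattern is hypothetical, and stopping at the first end word (the shortest match) is an assumption.

```python
import re

def tagged_text_pattern(expression, awc="*"):
    """Turn START<awc>END into a regex; a rough stand-in, not the real engine."""
    start, _, end = expression.partition(awc)
    # Escape both words so < and / stay literal; (.*?) stands in for the AWC.
    # Shortest-match behaviour is an assumption made for this sketch.
    return re.compile(re.escape(start) + "(.*?)" + re.escape(end),
                      re.IGNORECASE | re.DOTALL)

pattern = tagged_text_pattern("<TAG1*</TAG1>")
match = pattern.search("before <tag1 bla bla bla </TAG1> after")
print(match.group(0))  # the whole tagged segment
print(match.group(1))  # only the text covered by the AWC
```

The IGNORECASE flag reflects the remark that tags are case insensitive in most documents.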
Finding a parameter definition is somewhat more challenging because optional whitespace (i.e. a space, a tab, a new line, etc.) can be placed before and/or after the equal sign. The value of a parameter is commonly placed between double quotes, but single quotes are also acceptable. If the value does not contain whitespace, no quotes are needed. The following examples define exactly the same parameter and are valid in HTML:
There are circumstances where a whitespace is required, for example, between two parameter definitions as in:
<IMG width=60 SRC="image.gif" height=20>
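The whitespace and quoting variations described above can be checked with a short Python sketch. The sample strings and the names SRC_PARAM and src_value are made up for illustration; the regular expression is one possible way to accept all the variants, not part of the search expression syntax.

```python
import re

# Optional whitespace around "=", then a double-quoted, single-quoted or bare value.
SRC_PARAM = re.compile(
    r"""\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))""",
    re.IGNORECASE,
)

def src_value(text):
    """Return the SRC value found in text, or None if there is none."""
    match = SRC_PARAM.search(text)
    if match is None:
        return None
    # Exactly one of the three groups is set, depending on the quoting style.
    return next(group for group in match.groups() if group is not None)

# Hypothetical samples: same parameter, different spacing and quoting.
samples = ['SRC="image.gif"', "src = 'image.gif'", "Src=image.gif", 'SRC\n=\t"image.gif"']
values = [src_value(sample) for sample in samples]
print(values)
```

Writing such a pattern by hand is exactly the kind of effort that search expressions are meant to avoid.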
The search expression is designed for such tasks by defining a whitespace wildcard character (WWC) that separates the words (or single-character words) forming a search word chain. Continuing the last example, and using the caret character as the WWC, the search expression would be: SRC^=^"http://*". The start search word chain contains the following sequence of words: SRC, = and "http://. The end search word chain contains only a one-character word: ". The WWC here is a none-to-many WWC, meaning that no whitespace, or many whitespaces, can separate each search word.
If at least one whitespace is required, for example before the SRC word and after the last " character, a one-to-many WWC is needed. The same WWC character can be used for this: when it is placed at the beginning and/or at the end of a search word chain, it means that at least one whitespace is required. To continue with the previous example, a better, more discriminating search expression would be: ^SRC^=^"http://*"^. To search for parameter definitions that do not contain quotes, the search expression would be: ^SRC^=^http://*^. If at least one whitespace is required between two words of a search word chain, the same WWC can be used twice in a row, as in the following example: ^SRC^^=^^"http://*"^.
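To make the difference between the none-to-many and one-to-many readings concrete, here is a hand-written regular expression equivalent of two of the expressions above, as a Python sketch. The URLs and the DATASRC parameter name are hypothetical test data, and the regular expressions are an approximation of the intended behaviour.

```python
import re

# SRC^=^"http://*"    : none-to-many whitespace between the words.
loose = re.compile(r'SRC\s*=\s*"http://(.*?)"', re.IGNORECASE | re.DOTALL)
# ^SRC^=^"http://*"^  : also requires whitespace before SRC and after the closing quote.
strict = re.compile(r'\sSRC\s*=\s*"http://(.*?)"\s', re.IGNORECASE | re.DOTALL)

tag = '<IMG width=60 SRC = "http://example.com/a.gif" height=20>'
other = '<IMG DATASRC="http://example.com/b.gif">'  # hypothetical look-alike parameter

print(loose.search(tag).group(1))
print(strict.search(tag).group(1))
# The loose form also hits the look-alike parameter; the strict form does not.
print(bool(loose.search(other)), bool(strict.search(other)))
```

This is why the version with a WWC at both ends is described as more discriminating.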
Using only two wildcard characters, an AWC and a WWC, search expressions can be created for tagged text and parameter definitions in most common search applications.
Once a text is found, it can be modified or deleted. If several search expressions are used on the same text to modify or delete what they find, the order in which they are executed matters. For example, applying expression 1 and deleting the found text, then applying expression 2 and deleting the found text, may give a different result than executing expression 2 before expression 1. Therefore, search expressions are executed in order of priority.
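A small Python sketch shows how the order of two deletions can change the result. The text and the two tag names are invented for the demonstration, and regular expressions stand in for the search expressions <A*</A> and <B*</B>.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
delete_a = re.compile(r"<A.*?</A>", FLAGS)  # stands in for <A*</A>
delete_b = re.compile(r"<B.*?</B>", FLAGS)  # stands in for <B*</B>

# Hypothetical text in which the two tagged segments overlap.
text = "<A> one <B> two </A> three </B>"

a_then_b = delete_b.sub("", delete_a.sub("", text))
b_then_a = delete_a.sub("", delete_b.sub("", text))

print(repr(a_then_b))
print(repr(b_then_a))
```

Whichever expression runs first removes part of what the other one needs in order to match, so the two orders leave different text behind.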
There are situations where some of the search words contain the same character as a wildcard character. Search expressions provide a means to redefine those wildcard characters.
If many search expressions have been created, one may want to keep all of them and use a flag to disable some; the search expression syntax accommodates such a flag.
The search expression contains a start search word chain, an AWC and an end search word chain, with no space in between. If more than one AWC is found, only the first and the last are taken into account; together they count as a single AWC, and any text between them is ignored. For example, <TAG*hell*heaven*</TAG> and <TAG*</TAG> are equivalent. If the AWC is absent, the search is made for the word chain only.
A search word chain contains words (i.e. sequences of characters without a space or WWC; a one-character word is permissible) separated by WWCs. In other words, the WWCs are search word separators. By default, a WWC stands for none-to-many whitespaces between the preceding word and the next word of a search word chain. A WWC at the beginning and/or end of a word chain means that at least one whitespace is required in that position, i.e. it is a one-to-many whitespace wildcard. Two or more consecutive WWCs are interpreted collectively as a one-to-many whitespace wildcard.
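The AWC rule can be sketched in a few lines of Python. The function name split_expression is hypothetical; the sketch only separates the start and end word chains and does not perform any search.

```python
def split_expression(expression, awc="*"):
    """Return (start_chain, end_chain); end_chain is None when no AWC is present."""
    first = expression.find(awc)
    if first == -1:
        # No AWC: the whole expression is a single word chain.
        return expression, None
    last = expression.rfind(awc)
    # Everything between the first and the last AWC is discarded.
    return expression[:first], expression[last + len(awc):]

print(split_expression("<TAG*hell*heaven*</TAG>"))
print(split_expression("<TAG*</TAG>"))
print(split_expression("SRC"))
```

Both expressions with an AWC reduce to the same pair of word chains, which is why they are equivalent.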
For example, with the search word chain: ^One^two, the following text will be targeted (shown in bold):
but not
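The word chain rules can be sketched as a translation into a regular expression. This Python sketch is an illustration under assumptions: the function name chain_to_regex is hypothetical, "whitespace" is taken to mean what the regex class \s matches, the test strings are invented, and case-insensitive matching is assumed.

```python
import re

def chain_to_regex(chain, wwc="^"):
    """Translate a search word chain into regex source (illustrative sketch)."""
    parts = []
    i, n = 0, len(chain)
    while i < n:
        start = i
        if chain[i] == wwc:
            while i < n and chain[i] == wwc:
                i += 1
            at_edge = start == 0 or i == n
            # A doubled WWC, or a WWC at either end, requires at least one whitespace.
            parts.append(r"\s+" if (i - start) >= 2 or at_edge else r"\s*")
        else:
            while i < n and chain[i] != wwc:
                i += 1
            parts.append(re.escape(chain[start:i]))
    return "".join(parts)

pattern = re.compile(chain_to_regex("^One^two"), re.IGNORECASE)

for sample in ["say One two", "say Onetwo", "say One   two", "sayOne two", "One two"]:
    print(repr(sample), bool(pattern.search(sample)))
```

In this sketch the last two samples are not targeted, because no whitespace precedes the word One.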
A search expression is preceded by a priority parameter and may be followed by a modifier parameter; each is separated from the expression by a space.
The priority parameter contains only one character and it defines the execution priority of the search expression. The ASCII code representing this character defines the priority level; the lower the value, the higher the priority. For example, 2 comes before 3, and C comes after 3.
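The priority rule amounts to sorting by character code. A minimal Python sketch, assuming each line starts with its priority character (the three sample lines are made up):

```python
# Hypothetical list of search expressions, each preceded by its priority character.
expressions = [
    'C <SCRIPT*</SCRIPT>',
    '3 ^SRC^=^"http://*"^',
    '2 <TAG1*</TAG1>',
]

# The lower the ASCII code of the first character, the earlier the expression runs.
ordered = sorted(expressions, key=lambda line: ord(line[0]))

for line in ordered:
    print(ord(line[0]), line)
```

The codes are 50 for 2, 51 for 3 and 67 for C, so the expressions run in that order.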
The modifier parameter contains 0 to 3 consecutive characters (no space in between). Here are the possibilities:
To enable a disabled expression, just delete the last character of the modifier parameter. Here is an example of a search expression with its two parameters:
3 SRC_=_http://+_ _++
This expression has priority 51 (the ASCII code of 3), its WWC is _ and its AWC is +, and it is disabled. With the default wildcard characters it would read SRC^=^http://*^.
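Since neither the expression nor the modifier contains a space, a complete line can be split into its parts on the space character. The Python sketch below does only that; the function name parse_line is hypothetical, and the meaning of the individual modifier characters is deliberately left uninterpreted.

```python
def parse_line(line):
    """Split 'priority expression [modifier]' into its fields (sketch only)."""
    fields = line.split(" ")
    priority, expression = fields[0], fields[1]
    # The modifier is optional, so it may be missing altogether.
    modifier = fields[2] if len(fields) > 2 else ""
    return {
        "priority": priority,
        "level": ord(priority),  # ASCII code; lower means executed earlier
        "expression": expression,
        "modifier": modifier,
    }

parsed = parse_line("3 SRC_=_http://+_ _++")
print(parsed)
```

For the example line this gives priority level 51, the expression SRC_=_http://+_ and the modifier _++.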
As one can see, search expressions are well suited to searching for tagged text and parameter definitions in an HTML document, and they are easy to use.
Marcel St-Amant
BigFeet Software
bigfeet@videotron.ca