aboutsummaryrefslogtreecommitdiff
path: root/core/text/regex/optimizer/doc.odin
blob: 279cbafba9e5836c97ee53ceb3be89f145acaff0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
/*
An optimizer for the regular expression AST.

Acts upon the AST of a parsed regular expression pattern, transforming it in-place without moving to a
compilation step.

Where possible, it aims to reduce branching as much as possible in the
expression by reducing usage of `|`.


Here is a summary of the optimizations that it will do:

* Class Simplification               : `[aab]` => `[ab]`
                                       `[aa]`  => `[a]`

* Class Reduction                    : `[a]`    => `a`
* Range Construction                 : `[abc]`  => `[a-c]`
* Rune Merging into Range            : `[aa-c]` => `[a-c]`

* Range Merging                      : `[a-cc-e]` => `[a-e]`
                                       `[a-cd-e]` => `[a-e]`
                                       `[a-cb-e]` => `[a-e]`

* Alternation to Optional            : `a|`  => `a?`
* Alternation to Optional Non-Greedy : `|a`  => `a??`
* Alternation Reduction              : `a|a` => `a`
* Alternation to Class               : `a|b` => `[ab]`
* Class Union                        : `[a0]|[b1]` => `[a0b1]`
                                       `[a-b]|c`   => `[a-bc]`
                                       `a|[b-c]`   => `[b-ca]`

* Wildcard Reduction                 : `a|.`    => `.`
                                       `.|a`    => `.`
                                       `[ab]|.` => `.`
                                       `.|[ab]` => `.`

* Common Suffix Elimination : `blueberry|strawberry` => `(?:blue|straw)berry`
* Common Prefix Elimination : `abi|abe` => `ab(?:i|e)`

* Composition: Consume All to Anchored End
	`.*$` =>     <special opcode>
	`.+$` => `.` <special opcode>


Possible future improvements:

- Change the AST of alternations to be a list instead of a tree, so that
  constructions such as `(ab|bb|cb)` can be considered in whole by the affix
  elimination optimizations.

- Introduce specialized opcodes for certain classes of repetition.

- Add Common Infix Elimination.

- Measure the precise finite minimum and maximum of a pattern, if available,
  and check against that on any strings before running the virtual machine.

*/
package regex_optimizer