aboutsummaryrefslogtreecommitdiff
path: root/core/text/regex/doc.odin
blob: ba29659dc0057e73256cd08365f15609d4ec17b0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
/*
A complete suite for using Regular Expressions to match and capture text.

Regular expressions are used to describe how a piece of text can match to
another, using a pattern language.

Odin's regex library implements the following features:

	Alternation:           `apple|cherry`
	Classes:               `[0-9_]`
	Classes, negated:      `[^0-9_]`
	Shorthands:            `\d\s\w`
	Shorthands, negated:   `\D\S\W`
	Wildcards:             `.`
	Repeat, optional:      `a*`
	Repeat, at least once: `a+`
	Repetition:            `a{1,2}`
	Optional:              `a?`
	Group, capture:        `([0-9])`
	Group, non-capture:    `(?:[0-9])`
	Start & End Anchors:   `^hello$`
	Word Boundaries:       `\bhello\b`
	Non-Word Boundaries:   `hello\B`

These specifiers can be composed together, such as an optional group:
`(?:hello)?`

This package also supports the non-greedy variants of the repeating and
optional specifiers by appending a `?` to them.

Of the shorthand classes that are supported, they are all ASCII-based, even
when compiling in Unicode mode. This is for the sake of general performance and
simplicity, as there are thousands of Unicode codepoints which would qualify as
either a digit, space, or word character which could be irrelevant depending on
what is being matched.

Here are the shorthand class equivalencies:
	\d: [0-9]
	\s: [\t\n\f\r ]
	\w: [0-9A-Z_a-z]

If you need your own shorthands, you can compose strings together like so:
	MY_HEX :: "[0-9A-Fa-f]"
	PATTERN :: MY_HEX + "-" + MY_HEX

The compiler will handle turning multiple identical classes into references to
the same set of matching runes, so there's no penalty for doing it like this.



	``Some people, when confronted with a problem, think
	  "I know, I'll use regular expressions." Now they have two problems.''

	     - Jamie Zawinski


Regular expressions have gathered a reputation over the decades for often being
chosen as the wrong tool for the job. Here, we will clarify a few cases in
which RegEx might be good or bad.


**When is it a good time to use RegEx?**

- You don't know at compile-time what patterns of text the program will need to
  match when it's running.
- As an example, you are making a client which can be configured by the user to
  trigger on certain text patterns received from a server.
- For another example, you need a way for users of a text editor to compose
  matching strings that are more intricate than a simple substring lookup.
- The text you're matching against is small (< 64 KiB) and your patterns aren't
  overly complicated with branches (alternations, repeats, and optionals).
- If none of the above general impressions apply but your project doesn't
  warrant long-term maintenance.

**When is it a bad time to use RegEx?**

- You know at compile-time the grammar you're parsing; a hand-made parser has
  the potential to be more maintainable and readable.
- The grammar you're parsing has certain validation steps that lend itself to
  forming complicated expressions, such as e-mail addresses, URIs, dates,
  postal codes, credit cards, et cetera. Using RegEx to validate these
  structures is almost always a bad sign.
- The text you're matching against is big (> 1 MiB); you would be better served
  by first dividing the text into manageable chunks and using some heuristic to
  locate the most likely location of a match before applying RegEx against it.
- You value high performance and low memory usage; RegEx will always have a
  certain overhead which increases with the complexity of the pattern.


The implementation of this package has been optimized, but it will never be as
thoroughly performant as a hand-made parser. In comparison, there are just too
many intermediate steps, assumptions, and generalizations in what it takes to
handle a regular expression.

*/
package regex