Changes to /doc-src/manual/offside.rst: 2a75d04bd087 vs. d93e5d662c5d (revision d93e5d662c5d shown below)
.. index:: offside rule, whitespace sensitive parsing
.. _offside:

Line--Aware Parsing and the Offside Rule
========================================

From release 3.3 Lepl includes support to simplify parsing text where newlines
and leftmost whitespace are significant. For example, in both Python and
Haskell, the relative indentation of lines changes the meaning of a program.
At the end of this section is an :ref:`example <python_example>` that handles
indentation in a similar way to Python.

Lepl also supports many simpler cases where a matcher should be applied to a
single line (or several lines connected with a continuation character).

There is nothing special about spaces and newlines, of course, so in principle
it was always possible to handle these in Lepl, but in practice doing so was
sometimes frustratingly complex. The extensions described here make things
much simpler.

Note that I use the phrase "offside rule" in a general way (only) to describe
indentation--aware parsing. I am not claiming to support the exact parsing
used in any one language, but instead to provide a general toolkit that should
make a variety of different syntaxes possible.

.. warning::

  This has changed significantly in Lepl 5. It is now implemented by adding
  additional tokens into the token stream. It also has new configuration
  options and slightly changed matchers. For more details of the changes see
  :ref:`Lepl 5 <lepl5>`.

.. index:: lines(), LineStart(), LineEnd(), Line()

Simple Line--Aware Parsing (Lines Only)
---------------------------------------

If line-aware parsing is enabled using `.config.lines()
<api/redirect.html#lepl.core.config.ConfigBuilder.lines>`_ (with no
parameters) then two tokens will be added to each line: ``LineStart()`` at the
beginning and ``LineEnd()`` at the end. Neither token will return any result,
but they must both be matched for the line as a whole to parse correctly.

For example, to split input into lines you might use::

  >>> contents = Token(Any()[:,...]) > list
  >>> line = LineStart() & contents & LineEnd()
  >>> lines = line[:]
  >>> lines.config.lines()
  >>> lines.parse('line one\nline two\nline three')
  [['line one\n'], ['line two\n'], ['line three']]

Since you will often want to define lines, the ``Line()`` matcher simplifies
this a little::

  >>> contents = Token(Any()[:,...]) > list
  >>> line = Line(contents)
  >>> lines = line[:]
  >>> lines.config.lines()
  >>> lines.parse('line one\nline two\nline three')
  [['line one\n'], ['line two\n'], ['line three']]

.. note::

  The contents of the ``Line()`` matcher should be tokens (they can, of
  course, be specialised, as described in :ref:`lexer`).

.. index:: ContinuedLineFactory(), Extend()

Continued and Extended Lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sometimes you may want to have a matcher that continues over multiple lines.
You can do this by combining ``Line()`` matchers, but there is also a matcher
for the common case of a "continuation character". For example, if ``\`` is
used to mark a line that continues then::

  >>> contents = Token('[a-z]+')[:] > list
  >>> CLine = ContinuedLineFactory(r'\\')
  >>> line = CLine(contents)
  >>> lines = line[:]
  >>> lines.config.lines()
  >>> lines.parse('line one \\\nline two\nline three')
  [['line', 'one', 'line', 'two'], ['line', 'three']]

The idea is that you make your own replacement for ``Line()`` that works
similarly, but can be continued if it ends in the right character (the
continuation character is actually a regular expression which is why it's
written as ``r'\\'`` --- the backslash must be escaped).
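
Because the factory simply takes a regular expression, the same pattern works
for other continuation markers. The following sketch is not from the original
manual; it mirrors the example above (so the expected result is by analogy),
using ``...`` as the marker instead of a backslash::

  >>> contents = Token('[a-z]+')[:] > list
  >>> CLine = ContinuedLineFactory(r'\.\.\.')   # "..." marks a continued line
  >>> line = CLine(contents)
  >>> lines = line[:]
  >>> lines.config.lines()
  >>> lines.parse('line one ...\nline two\nline three')
  [['line', 'one', 'line', 'two'], ['line', 'three']]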

Another common use case is that some matching should ignore lines. For this
you can use ``Extend()``::

  >>> contents = Token('[a-z]+')[:] > list
  >>> parens = Token('\(') & contents & Token('\)') > list
  >>> line = Line(contents & Optional(Extend(parens)))
  >>> lines = line[:]
  >>> lines.config.lines()
  >>> lines.parse('line one (this\n extends to line two)\nline three')
  [['line', 'one'], ['(', ['this', 'extends', 'to', 'line', 'two'], ')'], ['line', 'three']]

.. _blocks:
.. index:: Block(),

Offside Parsing (Blocks of Lines)
---------------------------------

This extends the line--aware parsing above. In broad terms:

* Any space at the start of the line is included in the ``LineStart()``
  token.

* The ``Block()`` matcher will check the start of the first line and set a
  "global" variable to that indentation level.

* Each ``LineStart()`` will check the variable set by ``Block()`` and only
  match if the indentation level agrees with the space at the start of that
  line.

Together these modifications mean that all the ``LineStart()`` tokens in a
single block must have the same indentation. In other words, all lines in
a ``Block()`` are indented the same.

Since ``Line()`` continues to work as before, using the modified
``LineStart()`` described above, we can think of the text as being structured
like this::

  Block(Line()
        Line()
        Block(Line()
              Line()
              Block(Line()
                    Line())
              Line()
              Block(Line()))
        Line())

Each line is a separate ``Line()`` and groups of indented lines are collected
inside ``Block()``.

Configuration
~~~~~~~~~~~~~

To enable the block--based parsing specify the ``block_policy`` or
``block_indent`` parameters in `.config.lines() <api/redirect.html#lepl.core.config.ConfigBuilder.lines>`_.

The ``block_policy`` decides what indentations are acceptable. The default,
``constant_indent()``, expects each block to be indented an additional, fixed
number of spaces relative to previous lines. Other options include
``explicit()``, which will accept any indent (and so is typically used
following a line with a special syntax, like ending in ``":"``) and
``to_right()``, which will accept any indent as long as it is larger than what
went before.

The ``block_indent`` is used with the default ``constant_indent()`` policy and
sets the indentation amount.

A ``tabsize`` parameter can also be specified --- any tab at the start of the
line is replaced with this many spaces.
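
To make this concrete, here is a rough sketch (not from the original manual)
of how a matcher called ``program`` might be configured with these parameters;
the integer and ``explicit`` forms mirror the examples later in this section,
and ``tabsize`` is the parameter described just above::

  >>> program.config.lines(block_policy=4)           # constant indent: each block 4 spaces deeper
  >>> program.config.lines(block_indent=4)           # another way to give the constant indent
  >>> program.config.lines(block_policy=explicit)    # any indent; blocks introduced by the grammar
  >>> program.config.lines(block_policy=explicit, tabsize=8)   # also expand leading tabs to 8 spaces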

Example
~~~~~~~

Because blocks can be nested we typically have a recursive grammar. For
example::

  >>> introduce = ~Token(':')
  >>> word = Token(Word(Lower()))

  >>> statement = Delayed()

  >>> simple = Line(word[:])
  >>> empty = Line(Empty(), indent=False)
  >>> block = Line(word[:] & introduce) & Block(statement[:])

  >>> statement += (simple | empty | block) > list
  >>> program = statement[:]

  >>> program.config.lines(block_policy=2)
  >>> parser = program.get_parse_string()

  >>> parser('''
  ... abc def
  ... ghijk:
  ...   mno pqr:
  ...     stu
  ...   vwx yz
  ... ''')
  [[],
   ['abc', 'def'],
   ['ghijk',
    ['mno', 'pqr',
     ['stu']],
    ['vwx', 'yz']]]

The core of the parser above is the three uses of ``Line()``. The first,
``simple``, is a statement that fits in a single line. The next, ``empty``,
is an empty statement (this has ``indent=False`` because we don't care about
the indentation of empty lines). Finally, ``block`` defines a block statement
as one that is introduced by a line that ends in ":" and then contains a
series of statements that are indented relative to the first line.

So you can see that the `Block()
<api/redirect.html#lepl.offside.matchers.Block>`_ matcher's job is to collect
together lines that are indented relative to whatever came just before. This
works with ``Line()`` which matches a line if it is indented at the correct
level.

.. _python_example:

Continued and Extended Lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As with simple line--aware parsing, we would sometimes like a line to continue
over several lines if it ends with a certain matcher. We can make a similar
matcher to `Line() <api/redirect.html#lepl.offside.matchers.Line>`_ that
continues over multiple lines using `ContinuedLineFactory()
<api/redirect.html#lepl.offside.matchers.ContinuedLineFactory>`_.

It is also possible to use ``Extend()`` to allow some matchers to ignore line
breaks.

Using these two matchers we can write a simple, Python--like language:

* Blocks are defined by relative indentation
* The ``\`` marker indicates that a line extends past a line break
* Some constructions (like parentheses) automatically allow a line
  to extend past a line break
* Comments can have any indentation

(To keep the example simple there's only minimal parsing apart from the
basic structure - a useful Python parser would obviously need much more work).

::

  word = Token(Word(Lower()))
  continuation = Token(r'\\')
  symbol = Token(Any('()'))
  introduce = ~Token(':')
  comma = ~Token(',')
  hash = Token('#.*')

  CLine = ContinuedLineFactory(continuation)

  statement = word[1:]
  args = Extend(word[:, comma]) > tuple
  function = word[1:] & ~symbol('(') & args & ~symbol(')')

  block = Delayed()
  blank = ~Line(Empty(), indent=False)
  comment = ~Line(hash, indent=False)
  line = Or((CLine(statement) | block) > list,
            blank,
            comment)
  block += Line((function | statement) & introduce) & Block(line[1:])

  program = (line[:] & Eos())
  program.config.lines(block_policy=explicit)
  parser = program.get_parse_string()

When applied to input like::

  # this is a grammar with a similar
  # line structure to python

  if something:
      then we indent
  else:
    something else
    # note a different indent size here

  def function(a, b, c):
    we can nest blocks:
      like this
    and we can also \
        have explicit continuations \
      with \
           any \
    indentation

  same for (argument,
            lists):
    which do not need the
    continuation marker
    # and we can have blank lines inside a block:

    like this
      # along with strangely placed comments
    but still keep blocks tied together

The following structure is generated::

  [
    ['if', 'something',
      ['then', 'we', 'indent']
    ],
    ['else',
      ['something', 'else'],
    ],
    ['def', 'function', ('a', 'b', 'c'),
      ['we', 'can', 'nest', 'blocks',
        ['like', 'this']
      ],
      ['and', 'we', 'can', 'also', 'have', 'explicit', 'continuations',
       'with', 'any', 'indentation'],
    ],
    ['same', 'for', ('argument', 'lists'),
      ['which', 'do', 'not', 'need', 'the'],
      ['continuation', 'marker'],
      ['like', 'this'],
      ['but', 'still', 'keep', 'blocks', 'tied', 'together']
    ]
  ]

The important thing to notice here is that the nesting of lists in the final
result matches the indentation of the original source.