For the past couple of months, I have been learning about parsing text files into abstract syntax trees (ASTs). I started with the language I am most comfortable with, Python. I wrote three separate parsers using the SAME technique every time. Of course, when I say "technique," I'm not actually referring to a parsing strategy, but to a data structure: how my AST nodes are structured and how they communicate information about themselves.
Important Concepts
As someone with no formal education on this subject, I started how anyone else would: reading Wikipedia pages on parsing. Then getting extremely confused by the academic language that gets thrown around at all times for some reason. So, rather than actually try to understand everything in those pages, I figured out that I needed four different concepts to continue.
Lexical Analysis
This is probably the simplest concept in all of the research I've done. Although the name is kinda fancy, all it means is converting a word (or symbol or whatever), some text, into a token.
Something that your parser can read! It's the difference between + and Operator::Plus, or func and Keyword::Function. It splits up our data into something easy to work with.
Let's look at an example with English: the word crazy, in written text, is not just a combination of letters, but something more meaningful. It is an adjective. Not just any adjective, it's a specific word, which our brain then parses into whatever we think it means.
Your active work in reading is the lexical analysis, while the meaning comes from the parsing your brain does. I don't know how much this actually helps but I found it fun to think about.
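To make the token idea concrete, here's a minimal lexer sketch. It's a naive whitespace-splitting version I wrote for illustration (the tag names like "Keyword" are my own, not from any library):

```python
def lex(source):
    """Turn raw text into a list of (tag, value) tokens the parser can read."""
    tokens = []
    for word in source.split():  # naive: split on whitespace only
        if word.isnumeric():
            tokens.append(("Literal", int(word)))
        elif word in "+-*/":
            tokens.append(("Operator", word))
        elif word == "func":
            tokens.append(("Keyword", "Function"))
        else:
            tokens.append(("Identifier", word))
    return tokens

print(lex("5 + 2"))  # [('Literal', 5), ('Operator', '+'), ('Literal', 2)]
```

Real lexers track positions and handle text that isn't conveniently space-separated, but the job is the same: raw text in, tagged tokens out.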
Very simple, I wouldn't worry about it too much. Let's move on to the parsing concepts.
LL(k) Grammars
An LL grammar is a grammar that can be parsed by a left-to-right reading parser which can look ahead by k tokens. This is a strategy of top-down parsing, and one of the most widely used. In fact, these grammars can also be parsed by the MOST widely used strategy, recursive descent parsing.
In case you really want to know why it's called an LL parser: "It parses the input from Left to right, performing Leftmost derivation of the sentence." At least, that's just a sentence from Wikipedia with the bold Ls... no sources on that...
Now, let's say we have a very simple language, one which only accepts expressions, operators, and literals. Here it is in a nonsense grammar-construction language (metasyntax):
literal  = [0-9]+   # 1 or more digits
operator = + or - or * or /
expr     = <literal>
        or <literal> <operator> <expr>
Now that's an LL(1) language! How do I know? Well, while parsing an expression, if we look ahead and see an operator, we know exactly what the rest of the sentence SHOULD look like; otherwise, we STILL know what it is. But LL parsing by itself isn't exactly what we want to do.
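Here's a tiny sketch of what k = 1 lookahead means in practice (the function and production names are my own, purely for illustration): after a literal, peeking a single token is enough to decide which expr production we're in.

```python
def choose_production(tokens, pos):
    """Decide which expr rule applies by peeking k=1 tokens past the literal.

    Assumes tokens is a list of strings, e.g. "5 + 2".split().
    """
    lookahead = tokens[pos + 1] if pos + 1 < len(tokens) else None
    if lookahead is not None and lookahead in "+-*/":
        return "literal operator expr"  # the recursive production
    return "literal"                    # the base production

print(choose_production("5 + 2".split(), 0))  # 'literal operator expr'
print(choose_production("7".split(), 0))      # 'literal'
```

One peek, no backtracking: that's all the "1" in LL(1) buys us, and it's all this grammar needs.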
Recursive Descent Parsing
Another top-down parser, but one which uses recursion. This is a concept that I feel is better learned in code. So, let's do it! Using that simple language, we can already see expr is self-referential, or recursive. Now let's parse it.
program = "5 + 2".split(' ')  # naive lexing
cur_index = 0  # start here
symbol = program[cur_index]

# move program forward
def accept():
    global cur_index, symbol
    if cur_index + 1 < len(program):
        cur_index += 1
        symbol = program[cur_index]
    else:
        symbol = None  # end of input

def expr():
    # an expression can ONLY start with a literal
    if symbol is not None and symbol.isnumeric():
        literal = symbol
        accept()
        # lookahead: if it's an operator, we expect another expression
        if symbol is not None and symbol in "+-*/":
            operator = symbol
            accept()
            right_expr = expr()  # recursion is here
            return (literal, operator, right_expr)
        else:
            return literal
    raise Exception("Not a valid expression!")

def parse():
    expression = expr()
    return expression
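To see the shape of the data that falls out, here's the same parser restated in a compact, self-contained form, wrapped in a class so the snippet doesn't need global state (the class name and layout are mine, just for demonstration):

```python
class Parser:
    """Compact restatement of the recursive descent parser above."""

    def __init__(self, source):
        self.tokens = source.split()  # naive lexing
        self.index = 0

    def symbol(self):
        # current token, or None once we run out of input
        return self.tokens[self.index] if self.index < len(self.tokens) else None

    def accept(self):
        self.index += 1  # move program forward

    def expr(self):
        sym = self.symbol()
        if sym is not None and sym.isnumeric():
            literal = sym
            self.accept()
            sym = self.symbol()
            if sym is not None and sym in "+-*/":  # lookahead
                operator = sym
                self.accept()
                return (literal, operator, self.expr())  # recursion
            return literal
        raise Exception("Not a valid expression!")

print(Parser("5 + 2").expr())      # ('5', '+', '2')
print(Parser("1 + 2 + 3").expr())  # ('1', '+', ('2', '+', '3'))
```

Notice how the recursion makes longer inputs nest to the right: each recursive call produces the right-hand side of the tuple above it.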
Hopefully that demonstrates how recursive descent parsing is done. But look at how I'm handling my data... I either have a numeric string "5" or a tuple ("5", "+", "2").
Let's see how we can handle it better.
Abstract Syntax Tree (AST)
This is everyone's favorite talking point about compilers, at least when you're starting out. When I started, this intimidated me. But, it's essentially just linking together different data structures that are necessary for your language. Meaning the "abstract" part is YOUR abstraction, what the language looks like to you and to your compiler.
Using the program example above, our AST would look like:

      Operator("plus")
       /            \
  Literal(5)     Literal(2)

That's it! Or, written in (good) Python:
from dataclasses import dataclass
from typing import Union

@dataclass
class Literal:
    number: int

@dataclass
class Operator:
    kind: str
    lhs: "Expression"  # forward reference, Expression is defined below
    rhs: "Expression"

@dataclass
class Expression:
    data: Union[Operator, Literal]
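One way to bridge the two representations is a small converter from the parser's nested tuples into these nodes. This is a sketch; the function name and the kind-name mapping are my own invention, and the dataclasses are repeated so the snippet runs on its own:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Literal:
    number: int

@dataclass
class Operator:
    kind: str
    lhs: "Expression"
    rhs: "Expression"

@dataclass
class Expression:
    data: Union[Operator, Literal]

# map operator symbols to readable names for Operator.kind (my own convention)
KINDS = {"+": "plus", "-": "minus", "*": "times", "/": "divide"}

def to_ast(node):
    """Convert parser output ('5' or ('5', '+', '2')) into an Expression."""
    if isinstance(node, tuple):
        literal, operator, right = node
        return Expression(Operator(KINDS[operator], to_ast(literal), to_ast(right)))
    return Expression(Literal(int(node)))

print(to_ast(("5", "+", "2")))
```

Because dataclasses generate __eq__ and __repr__ for free, the result is easy to inspect and compare in tests.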
Filled up:

Expression(data=Operator(
    kind="plus",
    lhs=Expression(data=Literal(5)),
    rhs=Expression(data=Literal(2)),
))

Once we get to this step, we can then use this data to generate any code we want. For example, in LLVM IR:
define i32 @main() #0 {
start:
  %1 = add i32 5, 2
  ret i32 %1
}

or maybe even generate Python:
5 + 2

Outro
This article is just to show how easy it is to parse arbitrary languages. Use this to your advantage whenever it feels like an input would be better served as a full-fledged language.
Or, you know, use YACC if you need something robust.