Building a Lexer with Flex, Part 1

24 Sep 2017

In the previous post, I described what lexical analysis is, what a token is, and at a broad level, what sort of API a lexer provides. In this post, I’ll describe how to build a lexer in C/C++ using Flex, the Fast Lexical Analyzer Generator.

Lex and Flex

Recall from the previous post that every token type corresponds to a pattern. In our JavaScript example, the token type we called T_VAR always corresponded to the lexeme var, while the token type T_IDENTIFIER corresponded to any identifier (a string of one or more characters, where (1) every character is either a letter, decimal digit, underscore, or dollar sign; (2) the first character is not a decimal digit; and (3) the string is not a reserved word like if or while).

In general, a lexical analyzer can be described by a list of rules, something like this.

When you see this pattern… Take this action…
(whitespace) Skip it
= It is a T_EQ token
== It is a T_EQ_EQ token
break It is a T_BREAK token

It’s not too difficult to hand-write a lexer. However, several decades ago, computer scientists realized that source code for a lexical analyzer could be generated from a description like this. This has several advantages: the description is generally short, easy to write, and easy to understand; and the generated code can be exceptionally fast, often much faster than a hand-written lexer.

The idea of generating lexers originated with a tool called Lex, developed at Bell Labs in the 1970s. Since then, numerous Lex-like tools have been written. Lex has become a standard Unix development tool, and Lex-like tools have emerged for nearly every major programming language.

In this and the following blog posts, I will focus on a Lex clone called Flex. The original Lex tool is no longer in use (it was from 1975!); Flex is a modern successor that is standard on Linux and macOS. All of the code in these posts will be compatible with Flex 2.5.35. That version is old, but it’s still the version that ships with macOS. Fortunately, newer versions of Flex are backward-compatible, so all of the code in this post will work with newer versions of Flex as well.

Demo #1: Hello, Flex

The idea behind Flex is simple. A Flex input file contains a list of regular expressions, each of which describes the pattern for a particular token type. After that regular expression is a snippet of C (or C++) code that is executed when that pattern is matched. Generally, it will return a value to the parser indicating what token type was matched.

The flex tool reads this input file and generates C (or C++) code for a lexical analyzer.

As a concrete example, let’s start with the following Flex input file. (A “real” Flex lexer will look quite different from this; this example is only to help explain how Flex works.)

/* Flex Demo 1 - lexer.l */

%option reentrant
%option noyywrap
%option warn nodefault

%%

[\t\v\f \n\r]*  { return 1; }
"="             { return 2; }
"=="            { return 3; }
"break"         { return 4; }
.               { return 5; }

%%

Before explaining what this input file means, exactly, let’s turn it into a working program. When we run the flex tool on this input file…

flex -olexer.cpp --header-file=lexer.hpp lexer.l

…it will generate two new files: lexer.cpp and lexer.hpp. Let’s write a main program that uses the Flex-generated lexer to tokenize some strings. (Just skim it for now; I’ll explain the details later.)

// Flex Demo 1 - main.cpp

#include "lexer.hpp" // yy* types and methods

#include <cassert>  // assert
#include <cstdio>   // perror
#include <cstdlib>  // exit
#include <iostream> // cout, endl

namespace {
void describe_tokens(const char *string) {
  // Create a new lexer
  yyscan_t scanner;
  if (yylex_init(&scanner)) {
    perror("Error initializing lexer");
    exit(1);
  }

  // Copy the string to scan into a buffer
  // (note: copying may not be desirable in production)
  YY_BUFFER_STATE buf = yy_scan_string(string, scanner);

  std::cout << "Tokens for \"" << string << "\" are:" << std::endl;

  // Repeatedly get the next token from the lexer, stopping after the
  // end-of-input token
  int tokenNumber;
  do {
    tokenNumber = yylex(scanner);
    std::cout << tokenNumber << ' ';
  } while (tokenNumber != 0);

  std::cout << std::endl << std::endl;

  // Destroy the buffer and the lexer
  yy_delete_buffer(buf, scanner);
  yylex_destroy(scanner);
}
} // namespace

int main(int argc, char **argv) {
  describe_tokens("");
  describe_tokens("break===\n\nx");
  describe_tokens("=====");
  describe_tokens("breakbreak");
  describe_tokens("\x84"); // Non-ASCII
  return 0;
}

Now, let’s compile this into a program called demo1:

g++ -odemo1 main.cpp lexer.cpp

When we run it, we get the following output, which I will explain momentarily:

Tokens for "" are:
0 

Tokens for "break===

x" are:
4 3 2 1 5 0

Tokens for "=====" are:
3 3 2 0 

Tokens for "=
=break" are:
2 1 2 4 0 

Tokens for "?" are:
5 0 

The C++ driver code (main.cpp)

First, let me describe the C++ driver code.

The main function calls describe_tokens on several example strings. The describe_tokens function receives a string, creates a lexer, uses the lexer to tokenize the string, and displays some information about what tokens were found.

Of course, the interesting part is how we tokenize the string in the describe_tokens function.

  • First, we create an instance of the lexer. We do this by declaring a variable of type yyscan_t (remember, the words “lexer” and “scanner” are often used interchangeably) and calling yylex_init to initialize it. If yylex_init fails, it sets the global errno variable, so we can invoke perror (from <cstdio>) to display a readable error message, which is most likely “out of memory.”
  • Next, we need to tell the lexer what input we want to tokenize. The text to be tokenized must be placed in a buffer that (1) can be overwritten as the input is processed and (2) ends with two NUL characters (ASCII 0). The easiest way to create such a buffer is to call yy_scan_string, which creates a new buffer and copies a string into it. If yy_scan_string fails (e.g., if the process is out of memory), the program terminates. (As you can imagine, we probably won’t use yy_scan_string in production; a no-copy alternative is sketched just after this list.)
  • We call the function yylex repeatedly to tokenize the input. Each time yylex is called, the lexer identifies the next token in the input and returns an integer, which is intended to represent the token type. When no tokens remain, yylex returns 0.
  • Finally, we must call yy_delete_buffer and yylex_destroy to free the memory for the buffer and the scanner, respectively.
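
As an aside: if the text to tokenize is already sitting in a writable buffer that ends in two NUL bytes (say, because we read a file directly into such a buffer), Flex also provides yy_scan_buffer, which scans the buffer in place without copying. Here is a minimal, hypothetical sketch of what the yy_scan_string call in describe_tokens could become (error handling and cleanup abbreviated):

// Sketch only: scan a writable, double-NUL-terminated buffer in place.
// The size passed to yy_scan_buffer must include the two NUL bytes.
char buffer[] = "break==\0"; // 9 bytes: "break==" + '\0' + implicit '\0'
YY_BUFFER_STATE buf = yy_scan_buffer(buffer, sizeof buffer, scanner);
if (buf == NULL) {
  // yy_scan_buffer returns NULL if the buffer is not double-NUL terminated
}
// ... call yylex(scanner) as before, then yy_delete_buffer(buf, scanner) ...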

The Flex input file (lexer.l)

Now, let’s look at the Flex input file and try to understand what our lexer is doing. Look again at the five rules in our Flex input file:

[\t\v\f \n\r]*  { return 1; }
"="             { return 2; }
"=="            { return 3; }
"break"         { return 4; }
.               { return 5; }

Each rule describes the pattern for a particular token. This is followed by C code in curly braces. As we saw in the previous section, the driver calls yylex repeatedly to tokenize an input string; the return statement in curly braces indicates what value yylex will return for that pattern.

The first pattern matches whitespace characters: tab, vertical tab, form feed, space, newline, and carriage return. The * means, “Match a string of zero or more such characters.” For example, this pattern will match a single space, or two spaces, or \r\n, or several spaces followed by a form feed and more spaces, etc. The next three patterns are self-explanatory: they match a particular string. The . pattern matches any character except newline (\n, ASCII 10).

Now, recall this part of our program’s output:

Tokens for "break===

x" are:
4 3 2 1 5 0

When our lexer tokenized the string break===\n\nx, it produced five tokens: the first matched the rule with the code return 4, the second matched the rule with return 3, etc. (What about the 0 at the end of the output? Remember that yylex returned 0 to indicate “end of input.”) Looking at the rules above, the return values 4 3 2 1 5 correspond to this tokenization: break (4), then == (3), then = (2), then the whitespace \n\n (1), then x (5).

It’s important to realize that Flex matches its input character by character, not line by line. In this example, the two newlines were grouped into a single token. In fact, if there were a space between the last = and the first newline, the whitespace token would have consisted of three characters: space, newline, newline. In other words, there’s nothing special about newline characters – they’re matched just like any other character.

Now, the tokenization we just saw – 4 3 2 1 5 – seems reasonable. However, notice that the last pattern (.) matches any single, non-newline character. Couldn’t break be five tokens, each a single character? Also, the second rule matches the character =, but . also matches =. Why did the lexer choose the “=” rule instead of the . rule? To answer these questions, we need to understand how Flex matches its input.

How a Flex-generated lexer matches input

When yylex is called for the first time, it begins reading characters from the beginning of the input. It identifies a token as follows:

  • Starting from the beginning, it finds the longest prefix that matches any rule in the grammar. Then, it executes the C code associated with that rule.
  • If the longest prefix matches more than one rule, it chooses whichever rule appears first.

So what does the C code in curly braces do?

  • A return statement in the C code causes the yylex function to return that value. The next time yylex is called, it will start matching from the next character in the input.
  • If the C code does not return, then the current token is discarded, and the lexer searches for a new token beginning with the next character in the input.

To make this clearer, let’s work through some examples.

Example 1. Continuing the example from earlier, consider how break===\n\nx is tokenized:

  • The driver calls yylex for the first time. The lexer tries to find the longest prefix that matches a rule.
    • break===\n\nx does not match any rule.
    • break===\n\n does not match any rule.
    • break===\n does not match any rule.
    • break=== does not match any rule.
    • break== does not match any rule.
    • break= does not match any rule.
    • break matches the rule for “break”, so the C code is executed, and yylex returns 4.
  • The driver calls yylex again. The remaining input is ===\n\nx. The lexer tries to find the longest prefix that matches a rule.
    • The longest prefix that matches a rule is ==. This matches the rule for “==”, and the lexer returns 3.
  • The driver calls yylex again. The remaining input is =\n\nx. The lexer tries to find the longest prefix that matches a rule.
    • The longest prefix that matches any rule is =. However, = matches the rule for “=” as well as the rule for . (any non-newline character). Since the rule for “=” is listed before the rule for ., the lexer chooses the rule for “=” and returns 2.
  • The driver calls yylex again. The remaining input is \n\nx. The lexer tries to find the longest prefix that matches a rule.
    • \n\n matches the rule for [\t\v\f \n\r]*, so yylex returns 1.
  • The driver calls yylex again. The remaining input is x. The lexer tries to find the longest prefix that matches a rule.
    • x matches the rule for . (any non-newline character), so the lexer returns 5.
  • The driver calls yylex again. No input remains, so it returns 0.

Now, we have the answers to the questions I asked earlier. Why is break not five tokens? Flex always matches the longest possible string. Since the first five characters (break) match a rule, it prefers that over matching just the first character, b.

Example 2. Consider how ===== is tokenized:

  • The lexer matches == and returns 3.
  • The lexer matches == and returns 3.
  • The lexer matches = and returns 2.

Remember: Flex always matches the longest prefix possible.

Example 3. The string breakbreak is tokenized as two break tokens. This is fine for now. In a later post, we will add a rule for JavaScript identifiers; at that point, we’ll want this to tokenize as a single identifier breakbreak instead of two break tokens.
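
To preview how that will work, here is a hypothetical rules section with an identifier rule added after the break rule. (The real pattern for JavaScript identifiers will come in a later post, and 6 is an arbitrary token number for this sketch.)

"break"                   { return 4; }
[$A-Za-z_][$A-Za-z0-9_]*  { return 6; /* hypothetical T_IDENTIFIER */ }

With these two rules, breakbreak tokenizes as a single identifier, because the identifier pattern matches all ten characters and Flex prefers the longest match. A lone break still tokenizes as a break token: both rules match the same five characters, and in a tie, Flex chooses the rule that appears first.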

Flex input files: The structure

Before going further, let’s explain the format of a Flex input file.

A Flex input file is divided into three sections. The sections are separated by the delimiter %% on a line by itself.

Definitions
%%
Rules
%%
User Code

The Definitions section

In our example above, the Definitions section contains two things.

The first line is a comment. Note that this comment starts at the beginning of the line. Unindented comments in the Definitions section are copied directly into the output file (in our case, lexer.cpp).

The next lines contain %option directives. Each directive can specify one or more options; it doesn’t matter whether the options are given on the same line or on separate lines, nor in what order they are specified. We used the following options:

  • %option reentrant creates a reentrant lexer, so lexers can be created and used concurrently on several threads. This ensures that the generated lexer does not write to any global variables. (A two-thread sketch follows this list.)
  • %option noyywrap tells Flex that our lexer does not require a yywrap function. This function is used to supply more input when the end-of-input is reached; we don’t need to do this.
  • %option warn and %option nodefault turn on warnings and disable the default rule. Together, these options guarantee that Flex will issue a warning if the list of rules is incomplete – i.e., if there are inputs that will not match any rule. (We’ll discuss this more in a later post on patterns in Flex.)
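
To make “reentrant” concrete, here is a minimal, hypothetical sketch of a main that tokenizes two strings on two threads at once, each with its own yyscan_t, reusing the describe_tokens function from main.cpp above. (The std::cout output of the two threads may interleave.)

// Sketch only: two independent scanners running concurrently.
// Requires %option reentrant; link with -pthread on Linux.
#include <thread>

int main() {
  std::thread t1(describe_tokens, "break==");
  std::thread t2(describe_tokens, "=====");
  t1.join();
  t2.join();
  return 0;
}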

There are a few other things that can appear in the Definitions section; we’ll see some of them in later posts.

The Rules section

Each line of the rules section has the form

        pattern        action

The pattern must not be indented, and the action must begin on the same line as the pattern. C-style comments can be included in the rules section, but they cannot appear at the beginning of a line (they must be indented).

Our example above contained the rule

"break"         { return 4; }

where, according to our terminology here, "break" is the pattern and { return 4; } is the action.

We’ve seen a few examples of patterns already. I’ll give the full syntax for patterns in a later post, where I can describe them in more detail.

The User Code section

The User Code section is copied verbatim into the output file (lexer.cpp). In our example, this section was empty, although it is sometimes used to define utility functions used by the code in the Rules section. (A main function could also be defined here.)
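
For example, here is a hypothetical skeleton (not from our demo) in which a helper used by an action is declared in the Definitions section and defined in the User Code section:

%{
#include <stdio.h>
static void log_match(const char *text); /* declared for the actions below */
%}

%option reentrant noyywrap

%%

[\t\v\f \n\r]*  { }
"break"         { log_match(yytext); return 4; }
.               { return 5; }

%%

/* User Code: copied verbatim to the bottom of lexer.cpp */
static void log_match(const char *text) {
  fprintf(stderr, "matched: %s\n", text);
}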

Demo #2: Symbolic names for token types, and skipping whitespace

In Demo #1 above, the yylex function returned 1 when it matched whitespace, 2 when it matched “=”, etc. Let’s create a header file where we define symbolic names for those token types. While we’re at it, I’m going to change the token numbers: instead of 2, 3, and 4, we’ll use 257, 258, and 259. I’ll explain why shortly.

// Flex Demo 2 - lexer-decls.hpp

#ifndef LEXER_DECLS_HPP
#define LEXER_DECLS_HPP

enum {
  END_OF_INPUT = 0,
  T_EQ = 257,
  T_EQ_EQ = 258,
  T_BREAK = 259,
};

#endif

To use those names in our Flex input file, we need to make sure that header is included in the generated lexer.cpp. In the first section of the Flex input file, we will add an #include directive, enclosed in %{   %} delimiters. Any code in such a block will be inserted into the generated lexer.cpp, close to the top of the file.

While we’re at it, let’s make two more changes.

  • We’ll change the action for the whitespace rule to an empty block, { }.
  • We’ll change the action for the . rule to return yytext[0].

/* Flex Demo 2 - lexer.l */
%{
#include "lexer-decls.hpp"
%}

%option reentrant
%option noyywrap nounput
%option warn nodefault

%%

[\t\v\f \n\r]*  { }
"="             { return T_EQ; }
"=="            { return T_EQ_EQ; }
"break"         { return T_BREAK; }
.               { return yytext[0]; }

%%

Now, what have we done?

If a pattern has an empty action, those tokens are discarded

When the action for a pattern is empty, Flex will discard the matched token. In other words, when the lexer matches that pattern, yylex will not return it to the caller. Instead, the lexer will skip over the matched text and start searching for the next token.

For example, the string break===\n\nx will now tokenize as break (T_BREAK), == (T_EQ_EQ), = (T_EQ), and x (returned as its character code); the whitespace \n\n is matched but discarded.

yytext[0] is the first character of the token text

We changed the action for . to return yytext[0]. What is this?

Inside an action, yytext is a char * pointing to the text of the current token (and yyleng is an int containing its length, although we didn’t use that). So, return yytext[0] means, “Return the ASCII code of the first character of this token’s text.” Since the last two rules only match a single character, this means yylex will return that character’s ASCII code.
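
As a hypothetical illustration (not part of our demo), an action could use both to log each match for debugging, assuming <cstdio> is pulled in via the %{   %} block:

"=="  {
        /* yytext is the matched text, yyleng its length */
        fprintf(stderr, "matched '%s' (%d chars)\n", yytext, (int)yyleng);
        return T_EQ_EQ;
      }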

Putting this together: (1) when a single character is matched by the last rule, yylex will return its (extended) ASCII code – a value between 1 and 255 – and (2) when “=”, “==”, or “break” is matched, yylex will return T_EQ (257), T_EQ_EQ (258), or T_BREAK (259). (One caveat: if char is a signed type, as it often is, bytes above 127 come back negative from yytext[0]; casting to unsigned char avoids that, though plain yytext[0] is fine for ASCII input.)

This is why we gave our symbolic constants (T_EQ_EQ, etc.) values larger than 255.

(But why 257? Later, I’ll talk about the parser generator Yacc and its modern successor, Bison. When Bison generates its token numbers, they always start at 257: 0 is the end-of-input token, 1 through 255 are used for single characters (as above), and 256 is an “error” token. So, custom tokens are numbered from 257. We don’t have to follow this convention, since we’re not using Bison – we could have started from 256 – but I followed it anyway.)

Updating main.cpp

Now, we can also use our symbolic names in main.cpp. Let’s add a function that returns a readable description of each token type.

// (describe uses std::ostringstream and std::string, so main.cpp now
// needs #include <sstream> and #include <string> as well)
std::string describe(int tokenType) {
  switch (tokenType) {
  case END_OF_INPUT:
    return "(end of input)";
  case T_EQ:
    return "=";
  case T_EQ_EQ:
    return "==";
  case T_BREAK:
    return "break";
  default: {
    std::ostringstream out;
    out << static_cast<char>(tokenType);
    return out.str();
  }
  }
}

Now we can change the output loop so that it prints this description rather than the token number.

  // Repeatedly get the next token from the lexer, stopping after the
  // end-of-input token
  int tokenType;
  do {
    tokenType = yylex(scanner);
    std::cout << describe(tokenType) << ' ';
  } while (tokenType != END_OF_INPUT);

Now, our program produces the following output:

Tokens for "" are:
(end of input) 

Tokens for "break===

x" are:
break == = x (end of input) 

Tokens for "=====" are:
== == = (end of input) 

Tokens for "breakbreak" are:
break break (end of input) 

Tokens for "?" are:
? (end of input) 

Up Next: Text, Line Numbers, Patterns, and a JavaScript Lexer

So far, we’ve seen how Flex can match some simple tokens, but we haven’t seen enough to build a “real” lexer yet. But what we have so far is a good foundation. After just a few more posts, we’ll have covered enough about Flex to build a lexer for JavaScript.

Download the source code

Demo #1

Source code: lexer.l (15 lines), main.cpp (48 lines); total 63 lines.
Makefile: Makefile (GNU Make on Linux/macOS)

Demo #2

Source code: lexer.l (18 lines), lexer-decls.hpp (13 lines), main.cpp (69 lines); total 100 lines.
Makefile: Makefile (GNU Make on Linux/macOS)

References

The Flex Manual is the official documentation for Flex and is quite good.

Lex was introduced in M.E. Lesk and E. Schmidt, “Lex – A Lexical Analyzer Generator,” Computing Science Technical Report 39, AT&T Bell Laboratories, Murray Hill, NJ (1975). Available online.

POSIX-compliant operating systems are required to include a lex tool. The POSIX.1-2008/IEEE 1003.1-2008 specification for lex is available online. (Note: The demonstration code in this blog post is Flex-specific; it will not work with POSIX lex unless it is modified to eliminate some Flex-specific functionality, like %option reentrant.)
