This is a toy problem to show basic Data Flow Analysis (DFA) techniques.

You need to write an analyzer that receives as its input a program similar to the following example:

public class Main {
    public static void method(boolean... conditions) {
        int x;
        x = 1;
        if (conditions[0]) {
            x = 2;
            if (conditions[1]) {
                x = 3;
            }
            x = 4;
            if (conditions[2]) {
                x = 5;
            }
        }
        if (conditions[3]) {
            x = 6;
        }
        System.out.println(x);
    }
}

A program always consists of a single Main class that contains a single method.

The first line of the method’s body is always an x declaration and the last line is a println statement.

The rest of the body consists of ifs and x assignments.

The analyzer needs to print all possible values of x at the end of the program. For the example above the possible values are: 1, 4, 5, 6.

Tools

Tools used:

  • Visual Studio 2017.
  • ANTLR 4 as the parser generator.
  • NUnit as the testing framework.

Analysis

This problem illustrates several basic parts of a compiler pipeline, and might be fun for anyone who wants to understand how an analyzer, compiler, or interpreter works.

The input program needs to go through several transformations before the data flow analysis can be performed:

  1. Initial textual form.
  2. The parser will transform the text into a Parse Tree.
  3. The Parse Tree will be converted into an assembly-like intermediate representation (IR).
  4. The IR will be partitioned into basic blocks, from which a Control Flow Graph (CFG) will be constructed.
  5. The CFG is finally a structure that can be analyzed using Data Flow Analysis methods.

Each of the steps has a huge amount of theory behind it; even scratching the surface here is infeasible. Compiler theory is well developed and covered in many books and academic papers. What I want to do instead is focus on a very simple practical application of that theory.

For more information on compiler theory, please consult some classic texts, such as the Dragon Book or Engineering a Compiler. The ANTLR book is also a practical and fun introduction to parsers.

Parser

Any non-trivial program is actually a tree. Many constructs in a program have nested children. A program might contain classes, which contain methods, which contain statements, which contain expressions, which contain sub-expressions, and so on. There is often no limit to the height of the tree.

The problem is that a program in its textual form is just an array of characters. There is no explicit tree that can be manipulated programmatically. The tree structure is encoded in the text using conventions such as matching pairs of braces, operator precedence and associativity for expressions, and sometimes indentation for blocks.

To make a program amenable to programmatic manipulation, it needs to be parsed into a parse tree data structure first.

The conversion of text into a parse tree is usually split into two phases: lexing (aka tokenizing) and parsing. The lexer converts the raw text into a stream of tokens. Each token is a number of adjacent characters grouped together. A keyword is a token, a number is a token, a semicolon is also a token.

Here’s the example program converted into a stream of tokens:

public class Main { public static void method ( boolean... conditions ) { int x ; x = 1 ; if ( conditions [ 0 ] ) { x = 2 ; if ( conditions [ 1 ] ) { x = 3 ; } x = 4 ; if ( conditions [ 2 ] ) { x = 5 ; } } if ( conditions [ 3 ] ) { x = 6 ; } System . out . println ( x ) ; } } <EOF>

Each token has a name (a number) and a range of characters to which it corresponds in the input program. There are usually no tokens corresponding to whitespace or comments, as they are meaningless to the compiler (but not to a pretty printer or a refactoring tool).

A token might be as simple as a keyword, or pretty complex, such as a floating point literal, say, 4e-1 or -13.29f. Complex tokens are recognized using regular expressions of some sort.
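
To make this more concrete, here is a minimal, deliberately simplified regex-based lexer in Java. It is a sketch, not the ANTLR-generated lexer: it folds all punctuation into a single PUNCT token, doesn’t distinguish keywords from identifiers, and all names in it are illustrative.

import java.util.*;
import java.util.regex.*;

class MiniLexer {
    // A token: a name plus the range of characters it corresponds to.
    record Token(String name, int start, int end, String text) {}

    // First-match-wins patterns, in the spirit of the grammar below.
    private static final LinkedHashMap<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WS", Pattern.compile("[ \\t\\r\\n]+"));
        RULES.put("NUM", Pattern.compile("[0-9]+"));
        RULES.put("ID", Pattern.compile("[a-zA-Z]+"));
        RULES.put("PUNCT", Pattern.compile("[.;={}()\\[\\]]"));
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        outer:
        while (pos < input.length()) {
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                if (m.lookingAt()) {
                    if (!rule.getKey().equals("WS")) // whitespace is skipped, like -> skip below
                        tokens.add(new Token(rule.getKey(), pos, m.end(), m.group()));
                    pos = m.end();
                    continue outer;
                }
            }
            throw new IllegalArgumentException("Unexpected character at position " + pos);
        }
        return tokens;
    }
}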

The lexer used to produce the above stream is described using the following (ANTLR) formal grammar, i.e. a formalized specification of the lexer:

VARARG: TYPE '...';
DOT: '.';
SEMICOLON: ';';
EQUALS: '=';
IF: 'if';
LEFT_BRACE: '{';
RIGHT_BRACE: '}';
LEFT_PAREN: '(';
RIGHT_PAREN: ')';
LEFT_SQ_BRACKET: '[';
RIGHT_SQ_BRACKET: ']';
TYPE: 'int' | 'boolean' | 'void';
CLASS: 'class';
VISIBILITY: 'public';
STATIC: 'static';
NUM: [0-9]+;
ID: [a-zA-Z]+;
WS: [ \t\r\n]+ -> skip;

On the left side of the : is a token name, while on the right side is a rule, similar to a regular expression, that specifies the token.

A lexer is limited to non-recursive rules. It preprocesses the input to make it easier and faster for the parser to handle.

The parser operates on the token stream and produces a parse tree. Just like lexers, parsers are described using a formal grammar. The difference is that a parser grammar is specified in terms of tokens, not characters.

class_definition: VISIBILITY CLASS ID LEFT_BRACE method RIGHT_BRACE;
method: VISIBILITY STATIC TYPE ID LEFT_PAREN VARARG ID RIGHT_PAREN body;
body: statement
    | LEFT_BRACE statement* RIGHT_BRACE;
statement: var_definition
         | assignment
         | if_op
         | method_call;
method_call: ID (DOT ID)* LEFT_PAREN ID RIGHT_PAREN SEMICOLON;
if_op: IF LEFT_PAREN expr RIGHT_PAREN body;
var_definition: TYPE ID SEMICOLON;
assignment: ID EQUALS expr SEMICOLON;
expr: NUM | array_access;
array_access: ID LEFT_SQ_BRACKET NUM RIGHT_SQ_BRACKET;

Here the lower-case names are parser rules, while upper-case names are tokens. | means or and * means zero or more.
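
For reference, driving the ANTLR-generated lexer and parser from Java looks roughly like this. I assume here that the grammar file is called Dfa.g4, so that ANTLR emits DfaLexer and DfaParser; the grammar name is an assumption of this sketch.

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

class ParseDriver {
    static ParseTree parse(String source) {
        // Generated class names follow the grammar name, assumed to be "Dfa".
        DfaLexer lexer = new DfaLexer(CharStreams.fromString(source));
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        DfaParser parser = new DfaParser(tokens);
        return parser.class_definition(); // the start rule from the grammar above
    }
}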

The complete parse tree is pretty big even for the small example program, so I only show a parse tree for the following part:

if (conditions[0]) {
    x = 2;
    if (conditions[1]) {
        x = 3;
    }
    x = 4;
    if (conditions[2]) {
        x = 5;
    }
}

Here it is:

[parse tree diagram]

Normally, a parse tree would be converted into an Abstract Syntax Tree (AST), which would be smaller, with less noise. But in this case, I decided to stick with the parse tree. Mostly because ANTLR generates handy visitors for its parse trees, and I wasn’t concerned with performance and memory usage.

The parse tree differs from an AST in that it contains now-unnecessary tokens such as braces, commas, and semicolons: the tokens that specified the tree structure in the one-dimensional textual form, and that aren’t needed anymore once the tree has been explicitly constructed.

Parse/syntax trees are great for text manipulation such as refactorings or pretty printing. But they don’t provide an explicit view of the control flow of the program. And the control flow graph is required for many kinds of optimizations and diagnostics.

Intermediate representation

An intermediate representation is an assembly-like language: a series of simple instructions, including branching instructions. It doesn’t have to be a full-blown virtual machine with registers, a complete memory model, and exception handling. Instead, it is a convenient representation of a control flow graph. It is low-level enough to be source-language-agnostic: many high-level languages can be translated into a single intermediate representation.

Intermediate representations are used in managed environments such as the Java VM (Java bytecode) and the .NET Framework (the Intermediate Language, IL), and also in compilers, most notably LLVM (the aptly named Intermediate Representation, IR) and GCC (GENERIC and GIMPLE).

The majority of compile-time optimizations are applied to the intermediate representation. Only a few instructions are sufficient to represent our input programs (a sketch of them as data types follows the list):

  • Assign <num>. For x = <num> expressions.
  • Compare <num>. For conditions[<num>] expressions.
  • Branch <label>. Transfers control to a label, depending on the result of the previous Compare instruction. The input programs only have ifs to change the control flow, but this instruction could serve other kinds of control-flow operators as well, including loops.
  • Label <num>. The destination of a branch instruction. It doesn’t need to be a separate instruction, as it merely marks the next adjacent instruction. But it simplifies the intermediate representation.
  • Call <name>. For the System.out.println(x); statement.
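
As a sketch, these five instructions map naturally onto a handful of Java record types. The names are illustrative, not the actual implementation:

// Illustrative IR data types; one record per instruction kind.
sealed interface Instruction permits Assign, Compare, Branch, Label, Call {}
record Assign(int value) implements Instruction {}           // x = <num>
record Compare(int conditionIndex) implements Instruction {} // conditions[<num>]
record Branch(int targetLabel) implements Instruction {}     // conditional jump
record Label(int number) implements Instruction {}           // branch destination
record Call(String name) implements Instruction {}           // the println call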

The intermediate representation is built directly from the parse tree. Here it is for the example program:

x = 1
Compare 0
Branch L0
x = 2
Compare 1
Branch L1
x = 3
L1:
x = 4
Compare 2
Branch L2
x = 5
L2:
L0:
Compare 3
Branch L3
x = 6
L3:
Call System.out.println(x)

The straightforward translation of the parse tree left some adjacent labels (L2: immediately followed by L0:), but that doesn’t really matter. See the translation algorithm here.
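
Without reproducing the linked algorithm, the heart of the lowering can be sketched as follows, reusing the record types above. For if (conditions[i]) { ... } the branch is taken when the comparison fails, jumping over the body; Lowering and lowerIf are hypothetical names.

import java.util.*;

class Lowering {
    private int nextLabel = 0; // fresh label numbers: L0, L1, ...

    // Lower `if (conditions[conditionIndex]) { body }` into IR.
    List<Instruction> lowerIf(int conditionIndex, List<Instruction> body) {
        int label = nextLabel++;
        List<Instruction> out = new ArrayList<>();
        out.add(new Compare(conditionIndex));
        out.add(new Branch(label)); // taken when the condition is false: skips the body
        out.addAll(body);
        out.add(new Label(label));
        return out;
    }
}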

Control flow graph

The program in its intermediate representation is actually a graph. But before we can build such a graph explicitly, as a data structure, we need to extract basic blocks from the intermediate representation.

Basic blocks are sequences of instructions that are always executed together, one after another, without interruption. In other words, a basic block may contain at most one Label instruction, which must be its first instruction, and at most one Branch instruction, which is always its last.
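
Stated as code, a hypothetical validator for these two invariants might look like this (assuming the IR types sketched earlier):

import java.util.List;

class BasicBlockInvariants {
    static boolean isBasicBlock(List<Instruction> block) {
        for (int i = 0; i < block.size(); i++) {
            if (block.get(i) instanceof Label && i != 0) return false;                 // a Label may only come first
            if (block.get(i) instanceof Branch && i != block.size() - 1) return false; // a Branch may only come last
        }
        return true;
    }
}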

Below is the complete control-flow graph for the example program. Each node is a basic block. Notice that not all blocks end with a branch or start with a label. All the branches are conditional, meaning that, depending on the preceding comparison, a branch is either taken or not.

[control-flow graph diagram]

The numbers in brackets are the possible values of x at the end of the block from which the corresponding edge originates.

This control-flow graph is finally suitable for the analysis we wanted to implement all along.

The algorithm constructing this graph is based on finding leader instructions: the first instruction of the program, the targets of branches, and the instructions that immediately follow branches. Once all the leaders are discovered, the program can be split into basic blocks, where each basic block starts with a leader. Here is the implementation.
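
A minimal sketch of that splitting, again assuming the IR types from above (the linked implementation is the authoritative version):

import java.util.*;

class BlockSplitter {
    // Leaders: the first instruction, every Label (a branch target),
    // and every instruction immediately following a Branch.
    static List<List<Instruction>> split(List<Instruction> ir) {
        SortedSet<Integer> leaders = new TreeSet<>();
        if (!ir.isEmpty()) leaders.add(0);
        for (int i = 0; i < ir.size(); i++) {
            if (ir.get(i) instanceof Label) leaders.add(i);
            if (ir.get(i) instanceof Branch && i + 1 < ir.size()) leaders.add(i + 1);
        }
        List<Integer> starts = new ArrayList<>(leaders);
        List<List<Instruction>> blocks = new ArrayList<>();
        for (int b = 0; b < starts.size(); b++) {
            int end = (b + 1 < starts.size()) ? starts.get(b + 1) : ir.size();
            blocks.add(new ArrayList<>(ir.subList(starts.get(b), end)));
        }
        return blocks;
    }
}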

Data Flow Analysis

Remember that the reason we did all of the above is to calculate all the possible values of x at the end of the program, no matter what values the conditions array contains.

We are interested in the possible values of x at the beginning and at the end of each basic block.

Note the two properties of each basic block:

  1. If a block assigns to the x variable, then at the end of the block x equals the last assigned value, irrespective of the possible values of x at the beginning of the block.
  2. If a block doesn’t assign to x, then its list of possible values is inherited from all the immediate predecessors of the block (see the sketch after this list).
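
Together these two properties form the transfer function of a block, which might be sketched like this (a hypothetical helper, reusing the IR types above):

import java.util.*;

class Transfer {
    // Possible values of x at the end of a block, given the possible values at its start.
    static Set<Integer> outValues(List<Instruction> block, Set<Integer> inValues) {
        Integer lastAssigned = null;
        for (Instruction ins : block)
            if (ins instanceof Assign a) lastAssigned = a.value(); // property 1
        return (lastAssigned != null) ? Set.of(lastAssigned) : inValues; // property 2
    }
}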

So each basic block has an associated list of the possible values of x at the end of that block. This type of analysis is called reaching definitions analysis: it answers the question of which definitions can reach a particular point in the program.

As each basic block depends on its predecessors, this analysis lends itself to a recursive algorithm. Notice, however, that our input programs never contain loops, only ever assign constants to x, and, thankfully, never evaluate the conditional expressions, so every branch must be assumed to go either way. If they did any of these things, the algorithm would become more complicated, as would the intermediate representation. But the architecture outlined above would remain the same.

The resulting algorithm is very short and can be found here. It caches the results of recursive calls so that the same values don’t get recomputed over and over again.
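
Without reproducing the linked code, the shape of such a memoized recursion might be sketched as follows. The Block type with a list of predecessors is an assumption of this sketch; wiring up the predecessor edges is the job of the CFG construction step.

import java.util.*;

class ReachingDefinitions {
    // Illustrative CFG node: its instructions plus edges to its predecessors.
    static final class Block {
        final List<Instruction> instructions;
        final List<Block> predecessors = new ArrayList<>();
        Block(List<Instruction> instructions) { this.instructions = instructions; }
    }

    private final Map<Block, Set<Integer>> cache = new HashMap<>();

    // All possible values of x at the end of the given block.
    Set<Integer> valuesAtExit(Block block) {
        Set<Integer> cached = cache.get(block);
        if (cached != null) return cached;
        Set<Integer> in = new HashSet<>();
        for (Block pred : block.predecessors)
            in.addAll(valuesAtExit(pred)); // terminates, since the input programs have no loops
        Set<Integer> out = Transfer.outValues(block.instructions, in);
        cache.put(block, out);
        return out;
    }
}

Calling valuesAtExit on the final block, the one containing the Call instruction, yields the answer: {1, 4, 5, 6} for the example program.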