Railroad diagrams

Introduction

While I was studing the source code of colm, I found it hard to wrap my mind around it syntax and structure. The best guide that I could find was the colm.lm file. I recalled that the DB2 SQL reference guide had these very nice railroad diagrams that were very intuitive. I decided to make them for colm as well with colm.lm as the starting point.

The goal (not smart)

Make :

  • o a maximum coverage of the syntax,
  • o in a nice diagram,
  • o with a minimum set of tools,
  • o with a minimum amound of work,
  • o with minium manual modifications,
  • o with a fast transformation time.

I modified colm.lm just a tiny bit to make it a bit easier:

  • the lex blocks are all at the start of the document.
  • the patterns of the tokens are all put on one line.

The tools:

The transformation tools

1. generate to svg

Yeah, that would be great, but I do not have so much time. I decided to use a existing generator, and just transform colm.lm to whatever input would be needed.

I decided to userailroad-diagram-generator because it is is a commandline tool, GPL-ed, and looked very structured and small.

2. colm, obviously,

As colm is a transformation language, it makes sense to use that as dogfood. :-) So I tried, and tried and tried. For several days, it got me nothing but practice in frustration management and persistance. In the end I decided to cut my losses and accept that I am still a very NOOB when it comes to colm. I leared a lot, but I do not want to show you all the dumb experiments that I did.

3. Transform

I moved all the 'lex' blocks together, that way I could make the parsing a bit easier

First: bash and sed

We need to process the defs as these are the EBNF ``rules`.

grep -n 'def '  colm.lm \
    | head -n 1 \
    | cut -d':' -f1

That give us the clue that it starts at line 206. With tail we can extract the rest of the file.

tail -n +206 colm.lm
Next step: cleanup /transfom

Because I am very familiar with regular expressions, I achieved quite a bit with sed in a short time.

    "s/commit$//g"                  # remove commit if it is at the end of a line
    "s/ $//g"                       # remove space at the end of a line 
    "s/ :[A-Za-z]*$//g"             # remove a type at the end of a line
    "s/[A-Za-z]*: //g"              # remove a type for a symbol
    "s/\[\]/empty/g"                # hack to replace []
    "s/\[//g"                       # remove the [
    "s/ $//g"                       # remove space at the end of a line
    "s/\]//g"                       # remove the ]
    "/^#/D"                         # delete lines with comments
    "s/^\t\([a-zA-Z]*\)/\t\t\1/g"   # indent
    "s/^|/\t\t\|/g"                 # indent for the or
    "s/^def \(.*\)/\1\t::=\t/g"     # change 'def xxx' into 'xxx       ::= '
Next step: substitute literals

The next step was to substitute literals with their characters. A literal like TILDE should be in the EBNF file as "~". That way the railroad diagram can display it as a terminal instead of a symbol.

Getting the pairs was quite easy:

grep 'token [A-Z_]* ' $LM |sed -e "s/ ni$//g" -e "s/token *\([A-Z_]*\) *\/ *\(.*\) *\/ *$/('\1', r\"\2\"), /g"

But when it came to quoting the backreferences for the final substition of literals, I encountered problems. I tried a lot of things:

  • arrays in bash
  • IFS and set in bash

Step by step, hack after hack, I got further and further but I feld that is was like i wass putting bandaid upon bandaid upon bandaid. After an hour I decided to take another approach.

4. bash + sed + python

Python has a lot of powers to transform strings: unicode, raw, "double", """tripple double""", 'single', '''tripple''', \qouting, and the full power of regex.

Once more, there is plenty room for improvement, but I used pragmatic solutions to bypass the problems. The newlines, '"', "'" and "|" were especially tricky.

The result

That script generates this ebnf file, that the railroad-diagram-generator will use as input. The identation can be improved but it works! Overall, I was quite happy with the generated railroad diagrams, it realy shows how structured the colm language is.

file: script.sh

#!/bin/bash

LM=colm.lm
EBNF=colm.ebnf

position=`grep -n 'def ' $LM | head -n 1 | cut -d':' -f1`
tail -n +${position} $LM | \
    cat  |\
    sed \
    -e "s/commit$//g" \
    -e "s/ $//g" \
    -e "s/ :[A-Za-z]*$//g" \
    -e "s/[A-Za-z]*: //g" \
    -e "s/\[\]/empty/g" \
    -e "s/\[//g" \
    -e "s/ $//g" \
    -e "s/\]//g" \
    -e "/^#/D" \
    -e "s/^\t\([a-zA-Z]*\)/\t\t\1/g" \
    -e "s/^|/\t\t\|/g" \
    -e "s/^def \(.*\)/\1\t::=\t/g" \
     > $EBNF

tuple_list=`grep 'token [A-Z_]* ' $LM |sed -e "s/ ni$//g" -e "s/token *\([A-Z_]*\) *\/ *\(.*\) *\/ *$/('\1', r\"\2\"), /g"`
python3 - << EOF
import re
with open("$EBNF", 'r+', encoding="utf-8") as file_h:
    content = file_h.read()
    tuple_list = [ $tuple_list ]
    tuple_list.sort(key=lambda tup: (len(tup[0]), tup[0]), reverse=True)
    orred = re.compile("' +\| +'")
    corrections = {
            'LIT_DQ': """'"'""",
            'DQ': """'"'""",
            'CONS_SQ': '''"'"''',
            'CONS_SQ_NL': 'NEWLINE',
            'SQ': '''"'"''',
            'LIT_DQ_NL': 'NEWLINE',
            'TILDE_NL': '"~" NEWLINE',
            }
    for token, characters in tuple_list:
        pattern = re.compile(r"\b%s\b" % (token))
        if token in corrections:
            characters = corrections[token]
        if ' . ' in characters or ' .. ' in characters:
            characters = '"' + characters + '"'
        if orred.search(characters):
            characters = re.sub(orred, '" | "', characters)
            characters = '( "' + characters + '" )'
            content = pattern.sub(" " + characters + " ", content)
        else:
            content = pattern.sub('' + characters + '', content)
    file_h.seek(0)
    file_h.write(content)
EOF

Navigating through the generated image was quite a burdon. Therefor I decided to go the extra mile and make a change to the railroad-diagram-generator, to enable navigation from the symbol-term to the rule-term.

The final result is available here!

Next steps

  1. Do this exersize again, but only with colm instead of bash, sed, python
  2. Do something clever about EMPTY, NEWLINE, etc.
  3. Document the railroad diagrams also in colms documentation.