Railroad diagrams
Introduction
While I was studing the source code of colm, I found it hard to wrap my mind around it syntax and structure.
The best guide that I could find was the colm.lm
file. I recalled that the DB2 SQL reference guide had these very nice railroad diagrams that were very intuitive. I decided to make them for colm as well with colm.lm
as the starting point.
The goal (not smart)
Make :
- o a maximum coverage of the syntax,
- o in a nice diagram,
- o with a minimum set of tools,
- o with a minimum amound of work,
- o with minium manual modifications,
- o with a fast transformation time.
I modified colm.lm just a tiny bit to make it a bit easier:
- the
lex
blocks are all at the start of the document. - the patterns of the tokens are all put on one line.
The tools:
The transformation tools
1. generate to svg
Yeah, that would be great, but I do not have so much time.
I decided to use a existing generator, and just transform colm.lm
to whatever input would be needed.
- http://wiki.tcl.tk/21708 Need to install tcl/tk tooling
- https://github.com/marcioAlmada/railroad-diagram-generator looks structured and very small, allthough it is c++
- http://www.graphviz.org/ I know graphiv, but railroad diagrams will look very ugly
- http://dotnet.jku.at/applications/Visualizer/ windows and gui
- http://homepages.inf.ed.ac.uk/wadler/realworld/syngen.html looks abandoned
- https://github.com/tabatkins/railroad-diagrams browser only
- https://github.com/katef/kgt could not get pmake to compile it
I decided to userailroad-diagram-generator
because it is is a commandline tool, GPL-ed, and looked very structured and small.
2. colm, obviously,
As colm is a transformation language, it makes sense to use that as dogfood. :-) So I tried, and tried and tried. For several days, it got me nothing but practice in frustration management and persistance. In the end I decided to cut my losses and accept that I am still a very NOOB when it comes to colm. I leared a lot, but I do not want to show you all the dumb experiments that I did.
3. Transform
I moved all the 'lex' blocks together, that way I could make the parsing a bit easier
First: bash and sed
We need to process the def
s as these are the EBNF ``rules`.
grep -n 'def ' colm.lm \ | head -n 1 \ | cut -d':' -f1
That give us the clue that it starts at line 206. With tail we can extract the rest of the file.
tail -n +206 colm.lm
Next step: cleanup /transfom
Because I am very familiar with regular expressions, I achieved quite a bit with sed in a short time.
"s/commit$//g" # remove commit if it is at the end of a line "s/ $//g" # remove space at the end of a line "s/ :[A-Za-z]*$//g" # remove a type at the end of a line "s/[A-Za-z]*: //g" # remove a type for a symbol "s/\[\]/empty/g" # hack to replace [] "s/\[//g" # remove the [ "s/ $//g" # remove space at the end of a line "s/\]//g" # remove the ] "/^#/D" # delete lines with comments "s/^\t\([a-zA-Z]*\)/\t\t\1/g" # indent "s/^|/\t\t\|/g" # indent for the or "s/^def \(.*\)/\1\t::=\t/g" # change 'def xxx' into 'xxx ::= '
Next step: substitute literals
The next step was to substitute literals with their characters.
A literal like TILDE
should be in the EBNF file as "~"
. That way the railroad diagram can display it as a terminal instead of a symbol.
Getting the pairs was quite easy:
grep 'token [A-Z_]* ' $LM |sed -e "s/ ni$//g" -e "s/token *\([A-Z_]*\) *\/ *\(.*\) *\/ *$/('\1', r\"\2\"), /g"
But when it came to quoting the backreferences for the final substition of literals, I encountered problems. I tried a lot of things:
arrays
in bashIFS
andset
in bash
Step by step, hack after hack, I got further and further but I feld that is was like i wass putting bandaid upon bandaid upon bandaid. After an hour I decided to take another approach.
4. bash + sed + python
Python has a lot of powers to transform strings: unicode
, raw
, "double"
, """tripple double"""
, 'single'
, '''tripple'''
, \qouting
, and the full power of regex
.
Once more, there is plenty room for improvement, but I used pragmatic solutions to bypass the problems. The newlines, '"', "'" and "|" were especially tricky.
The result
That script generates this ebnf file, that the railroad-diagram-generator will use as input. The identation can be improved but it works! Overall, I was quite happy with the generated railroad diagrams, it realy shows how structured the colm language is.
file: script.sh
#!/bin/bash LM=colm.lm EBNF=colm.ebnf position=`grep -n 'def ' $LM | head -n 1 | cut -d':' -f1` tail -n +${position} $LM | \ cat |\ sed \ -e "s/commit$//g" \ -e "s/ $//g" \ -e "s/ :[A-Za-z]*$//g" \ -e "s/[A-Za-z]*: //g" \ -e "s/\[\]/empty/g" \ -e "s/\[//g" \ -e "s/ $//g" \ -e "s/\]//g" \ -e "/^#/D" \ -e "s/^\t\([a-zA-Z]*\)/\t\t\1/g" \ -e "s/^|/\t\t\|/g" \ -e "s/^def \(.*\)/\1\t::=\t/g" \ > $EBNF tuple_list=`grep 'token [A-Z_]* ' $LM |sed -e "s/ ni$//g" -e "s/token *\([A-Z_]*\) *\/ *\(.*\) *\/ *$/('\1', r\"\2\"), /g"` python3 - << EOF import re with open("$EBNF", 'r+', encoding="utf-8") as file_h: content = file_h.read() tuple_list = [ $tuple_list ] tuple_list.sort(key=lambda tup: (len(tup[0]), tup[0]), reverse=True) orred = re.compile("' +\| +'") corrections = { 'LIT_DQ': """'"'""", 'DQ': """'"'""", 'CONS_SQ': '''"'"''', 'CONS_SQ_NL': 'NEWLINE', 'SQ': '''"'"''', 'LIT_DQ_NL': 'NEWLINE', 'TILDE_NL': '"~" NEWLINE', } for token, characters in tuple_list: pattern = re.compile(r"\b%s\b" % (token)) if token in corrections: characters = corrections[token] if ' . ' in characters or ' .. ' in characters: characters = '"' + characters + '"' if orred.search(characters): characters = re.sub(orred, '" | "', characters) characters = '( "' + characters + '" )' content = pattern.sub(" " + characters + " ", content) else: content = pattern.sub('' + characters + '', content) file_h.seek(0) file_h.write(content) EOF
Navigating through the generated image was quite a burdon. Therefor I decided to go the extra mile and make a change to the railroad-diagram-generator, to enable navigation from the symbol-term to the rule-term.
The final result is available here!
Next steps
- Do this exersize again, but only with colm instead of bash, sed, python
- Do something clever about EMPTY, NEWLINE, etc.
- Document the railroad diagrams also in colms documentation.