Getting started with colm

History

Colm has its origin as a data transformation language. It started as the PhD project of Adrian Thurston and is the basis of the Ragel State Machine Compiler. It is both a tool and a scripting language.

As a scripting language it is influenced by TXL, but it feels rather Pythonic and C-like when you use it.

As a tool, it is quite strange:

  • you define your script-snippets next to the language definition
  • it needs a compiler (gcc) when it runs the script
  • its output is a binary (or code) that you can link to in your code.

The resulting binary can then convert an input stream according to the script snippets.

Think sax parsing, but very meta.

It is written in C++ and built upon AAPL, a cross-platform C++ library.

FAQ

What is it?

A transformation language.

How cool is it?

Extremely.

Competitors?

TXL, ANTLR, ...

When to use it?

When you need to transform some kind of input into some kind of output. Duh. It is really convenient if you know the rules of the input language. As the transformation is done by the compiler, it can generate very fast code. Oh, and it can work with streams and files.

Is it dead?

Nope, but development moves at a rather slow pace. This is probably due to its niche and its complexity.

Where is it?

http://www.colm.net/open-source/colm/

How to use it

The documentation is quite sparse. Some bits are documented in the /doc directory as AsciiDoc. Other bits and pieces can be found in the test directory.

The thesis also gives good insight into the design of the language. The language has evolved quite a bit since then, and some of its code snippets no longer work.

By skimming through the thesis, I found that colm is built around a virtual machine that parses input and can modify the parse tree on the go. Colm is the language that instructs the VM what to do.

Download

file: 1_download.sh

#!/bin/bash

wget -q http://www.colm.net/files/colm/thurston-phdthesis.pdf
sudo apt-get install make libtool gcc g++ autoconf automake
mkdir -p tmp
cd ./tmp
git clone git://git.colm.net/colm.git
cd colm
./autogen.sh
./configure
make

When we look a little closer, we see that colm:

  • can be built as a static and/or shared library.
  • is licensed under GPL 2
  • is equipped with a Vim syntax highlighting file
  • uses the AAPL library (licensed under LGPL 2.1) from Adrian Thurston (Ragel uses this library as well).
  • is documented via the thesis, or by studying the code

vim syntax highlighting

To activate colm syntax highlighting in Vim:

cp /path_to_extracted_files/colm.vim ~/.vim/syntax

Then add the following lines to your ~/.vimrc:

" Work with colm
au BufRead,BufNewFile *.lm set filetype=colm

First impressions

./src/colm

Gives us:

Error: colm: colm: no input file given
./src/colm --help
usage: colm [options] file
general:
  -h, -H, -?, --help   print this usage and exit
   -v --version         print version information and exit
   -b <file>            write binary to <file>
   -o <file>            write object to <file>
   -e <file>            write C++ export header to <file>
   -x <file>            write C++ export code to <file>
   -m <file>            write C++ commit code to <file>
   -a <file>            additional code file to include in output program
   -E N=V               set a string value availabe in the program
   -I <path>            additional include path for the compiler
   -i                   activate branchpoint information
   -L <path>            additional library path for the linker
   -l                   activate logging
   -c                   compile only (don't produce binary)
   -V                   print dot format (graphiz)
   -d                   print verbose debug information

This gives us some more insight: it reads a colm file and creates an object file, possibly along with C++ export header and code files. What the colm file should look like, what happens and why is not clear yet. :-)

There is one file in the repository that stands out: 'colm.lm'. Its syntax looks like the colm language that is presented in the thesis.

file: fizzbuzz.lm

int modulo( value:int, div:int) {
    times:int = value / div
    return value - ( times * div )
}

i:int = 0
while( i < 20 ) {
    mod5:int = modulo( i, 5 )
    mod3:int = modulo( i, 3 )
    if ( mod5 == 0 && mod3 == 0 ) {
        print( "FizzBuzz\n" )
    } elsif( mod5 == 0 ) {
        print( "Buzz\n" )
    } elsif( mod3 == 0 ) {
        print( "Fizz\n" )
    } else {
        print( i, "\n" )
    }
    i = i + 1
}

Some things stand out:

  • there are functions, with a return type just as in C.
  • it has an 'identifier:type' declaration format

Colm 101

Hello world 1

file: hello_world_001.lm

print "hello world\n"

./tmp/colm/src/colm hello_world_001.lm

This creates a few files:

ls -1 hello_world_001*
hello_world_001
hello_world_001.c
hello_world_001.lm

When we execute this file:

./hello_world_001
hello world

Amazing!

Hello world again

file: hello_world_002.lm


print( "hello ""world" "\r\n" )
    print 'hello ' "\"world\"" "\n"

hello world
hello "world"

We can now start experimenting with stuff. We see that:

  • 'print' can also be called as a function
  • single and double quotes can be used
  • there is no need for a concat operator
  • whitespace is not significant
  • newlines '\n' appear to be '\r\n'

Hello Figure 4.4

Browsing through the thesis, it seems that colm is a genuine scripting language. A kind of fizzbuzz is described in the thesis on page 87 in Figure 4.4. Unfortunately this does not work with the current version. I guess that the language has evolved since 2008. With a little bit of fiddling we can get it to work.

file: figure_4_4.lm

#Slightly modified example of Figure 4.4, page 87
i: int = 0
j: int = i

while ( i < 10 ) {
    if ( i *( 10 - i) < 20 ) {
        print ( "hello ", i, ' ', j , '\n' )
        j = j+ 1
    }
    i = i + 1
}

./tmp/colm/src/colm figure_4_4.lm
./figure_4_4
hello 0 0
hello 1 1
hello 2 2
hello 8 3
hello 9 4

We see that it is a C-like language in which:

  • the variables are typed (i:int)
  • there appears to be no postfix increment operator (i = i + 1)
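To see why only i = 0, 1, 2, 8 and 9 show up, the loop can be mirrored in another language. The following is a minimal Python sketch of the same control flow. It is not colm, and the function name figure_4_4 is made up for this illustration.

```python
def figure_4_4():
    """Mirror the loop of figure_4_4.lm and collect the lines it would print."""
    lines = []
    i = 0
    j = i
    while i < 10:
        # i * (10 - i) is 0, 9, 16, 21, 24, 25, 24, 21, 16, 9 for i = 0..9,
        # so the condition only holds at both ends of the range.
        if i * (10 - i) < 20:
            lines.append(f"hello {i} {j}")
            j = j + 1
        i = i + 1
    return lines


for line in figure_4_4():
    print(line)
```

Because j is only incremented inside the 'if', it simply counts the printed lines, which is why it reads 3 when i jumps to 8.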

Hello Fizzbuzz

Now let us try to make a real fizzbuzz program, just to get familiar with the language.

file: fizzbuzz.lm

int modulo( value:int, div:int) {
    times:int = value / div
    return value - ( times * div )
}

i:int = 0
while( i < 20 ) {
    mod5:int = modulo( i, 5 )
    mod3:int = modulo( i, 3 )
    if ( mod5 == 0 && mod3 == 0 ) {
        print( "FizzBuzz\n" )
    } elsif( mod5 == 0 ) {
        print( "Buzz\n" )
    } elsif( mod3 == 0 ) {
        print( "Fizz\n" )
    } else {
        print( i, "\n" )
    }
    i = i + 1
}

Please note:

  • The '&&' operator works.
  • The return type is needed, but if there is no return statement, 'nil' is returned.
  • It appears that there is no modulo operator '%' as is common in other languages. Therefore we resort to a function.
  • Writing a function seems rather straightforward.
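The modulo workaround relies on integer division dropping the remainder: value - (value / div) * div is what is left over. Here is a small Python sketch of the same arithmetic. It is not colm. It assumes that '/' on two colm ints behaves like integer division for the non-negative values used here, which the program output suggests but which I did not verify for negative numbers.

```python
def modulo(value, div):
    """Remainder via integer division, mirroring modulo() in fizzbuzz.lm."""
    times = value // div          # assumption: stands in for colm's int '/'
    return value - times * div


def fizzbuzz(limit=20):
    """Return the lines the fizzbuzz loop prints for 0 <= i < limit."""
    lines = []
    for i in range(limit):
        mod5 = modulo(i, 5)
        mod3 = modulo(i, 3)
        if mod5 == 0 and mod3 == 0:
            lines.append("FizzBuzz")
        elif mod5 == 0:
            lines.append("Buzz")
        elif mod3 == 0:
            lines.append("Fizz")
        else:
            lines.append(str(i))
    return lines


print("\n".join(fizzbuzz()))
```

Note that the loop starts at 0, and 0 is divisible by both 3 and 5, so the very first line is "FizzBuzz".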

./tmp/colm/src/colm ./doc/code/fizzbuzz.lm
ls -ltr ./doc/code/fizzbuz*
./fizzbuzz
FizzBuzz
1
2
Fizz
4
Buzz
Fizz
7
8
Fizz
Buzz
11
Fizz
13
14
FizzBuzz
16
17
Fizz
19

We can strip the executable to make it smaller still:

strip ./fizzbuzz 
ls -l ./fizzbuzz 
-rwxr-xr-x 1 peter peter 14520 Jan 18 15:57 ./doc/code/fizzbuzz

Hello scope

The thesis mentions that variables live either in the global scope or in the function scope. Let's try that.

file: global_scope.lm

str d (where:str) {
    print( "in D ", where, "\n")
    where = "d"
    print( "in D ", where, "\n")
}

str c ( ) {
    print( "in C ", where_g, "\n")
    where_g = "c"
    print( "in C ", where_g, "\n")
}

str b ( where:str ) {
    print( "in B ", where, "\n")
    where = "b" 
    print( "in B ", where, "\n")
}

str a( where:str ) {
    print( "in A ", where, "\n")
    where = "a" 
    b( where )
    print( "in A ", where, "\n")
}

where: str =  "global"
print( "in global ", where, "\n")
a( where )
print( "in global ", where, "\n")
global where_g:str
c( )
print( "in global ", where_g, "\n")

The thesis also mentions that the handling of variables in nested block-level scopes might change in the future.

file: nested_scope.lm


str a( where:str ) {
    print( "before block1 ", where, "\n" )
    while(true) {
        where = "block1"
        print( "in block1 ", where, "\n" )
        i:int = 0
        while( true ) {
            where =  where + "a"
            print( "in loop ", where, "\n" )
            break
        }
        print( "in block1 ", where, "\n" )
        break
    }
    print( "in A ", where, "\n" )
    return where
}

where: str =  "global"
print( "in global ", where, "\n" )
a( where )
print( "in global ", where, "\n" )

It seems that this is still the case.

Hello reference

The thesis also mentions that variables can be passed by reference instead of by value.

file: by_reference.lm

str sa( where:ref < str > ) {
    print( "in SA ", where, "\n" )
    where = "sa"
    print( "in SA ", where, "\n" )
}

where: str =  "global"
print( "in global ", where, "\n" )
sa( where )
print( "in global ", where, "\n" )

It appears that we can change strings this way, but not integers or bools.

Hello extended types

We can build extended types with structs, lists and maps, and we can iterate over them.

file: extended_type.lm

alias Value_t map<int,str> 
values:Value_t = new Value_t()

values->insert(0, "Ace")
values->insert(1, "1")
values->insert(2, "2")
values->insert(3, "3")
values->insert(4, "4")
values->insert(5, "5")
values->insert(6, "6")
values->insert(7, "7")
values->insert(8, "8")
values->insert(9, "9")
values->insert(10, "Ten")
values->insert(11, "Jack")
values->insert(12, "Queen")
values->insert(13, "King")

alias Suit_t map<int,str> 
suit:Suit_t = new Suit_t()
suit->insert(1, "hearts")
suit->insert(2, "spades")
suit->insert(3, "diamonds")
suit->insert(4, "clubs")


struct Card_t
    s:int
    v:int
end

alias Hand_t list<Card_t>

struct Person_t
    name:str
    age:int
    hand:Hand_t
end

john:Person_t

john = new Person_t()
john->name = "john"
john->age = 18
john->hand = new Hand_t()

card:Card_t = new Card_t()
card->s = 2
card->v = 13
john->hand->push(card)

print("ok ", john->name, " ", john->age, "\n")
for card:Card_t in john->hand {
    print("\n\t", suit->find(card->s), " ", values->find(card->v), "\n")
}

Hello stdin

The documentation gives us a more practical example of how we can transform input.

file: assign.lm

lex start
    token id / ('a' .. 'z' | 'A' .. 'Z' ) + /
    token value / ( 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' )+ /
    literal `= `;
    ignore / [ \t\n] /
end

def assignment
    [ id `=  value `;]

def assignment_list
    [assignment assignment_list]
|	[assignment]
|	[]

parse Simple: assignment_list[ stdin ]

for I:assignment in Simple {
    print( I.id, "->", I.value, "\n" )
}

We can read from a stream, and naturally the following streams are available:

  • stdin
  • stdout
  • stderr

./tmp/colm/src/colm assign.lm
echo -e 'b=3;a=1;\n c=2;' | ./assign
b->3
a->1
c->2
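To make concrete what assign.lm accepts and prints, here is a rough Python equivalent built on a regular expression. It is only a sketch of the observable behaviour, not of how colm parses. The name parse_assignments and the choice to raise ValueError on leftover input are mine, since I did not check how the colm program reacts to bad input.

```python
import re

# One assignment: id '=' value ';'. Spaces, tabs and newlines between
# tokens are skipped, like the 'ignore' rule in the lex block.
WS = r"[ \t\n]*"
ASSIGNMENT = re.compile(
    WS + r"([A-Za-z]+)" + WS + r"=" + WS + r"([A-Za-z0-9]+)" + WS + r";"
)


def parse_assignments(text):
    """Return a list of (id, value) pairs found in text."""
    pairs = []
    pos = 0
    while True:
        match = ASSIGNMENT.match(text, pos)
        if match is None:
            break
        pairs.append((match.group(1), match.group(2)))
        pos = match.end()
    # Anything left besides ignorable whitespace is treated as an error
    # (an assumption of this sketch, not observed colm behaviour).
    if text[pos:].strip(" \t\n"):
        raise ValueError(f"unexpected input at offset {pos}")
    return pairs


for name, value in parse_assignments("b=3;a=1;\n c=2;"):
    print(f"{name}->{value}")
```

The colm version gets the same result from a grammar plus a three-line loop, and the grammar can grow to nested structures where a regular expression cannot follow.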

Hello DNS Parsing

Writing a fizzbuzz is one thing, parsing a file is something else. Parsing a binary file is quite a different sport. Parsing a binary stream is a completely different level.

Have a look at the DNS parsing example in test/binary1.lm.

Hello XML parsing

There is also access to file streams through the 'open' function, which returns a stream.

A reworked version of figure 4.18 from the thesis could look like this.

file: figure_4_18.lm

lex
    token id /[a-zA-Z_][a-zA-Z0-9_]*/
    literal `= `< `> `/
    ignore /[ \t\n\r\v]+/
end


def attr
    [id `= id]

def open_tag
    [`< id attr* `>]

def close_tag
    [`< `/ id `>]

def tag
    [open_tag item* close_tag]

def item
    [tag]
|	[id]


print("start", "\n")

for arg:list_el<str> in argv {
    filename:str = arg->value
    print("processing ", filename, "\n")
    stream: stream = open( filename, 'r' )
    Tag:tag = parse tag [ stream ]
    match Tag ["<person name=" Val1:id attr* `> item* "</person>" ]
    stream->close( )
}

print("end", "\n")

file: blablabla.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
    <person name=first>john</person>
    <person name=last>do</person>
</root>

./tmp/colm/src/colm figure_4_18.lm
./figure_4_18 blablabla.xml

Bummer. This segfaults, not sure why...

Further reading:

We have only seen the tip of the iceberg here. When you browse the colm.lm file and the files in the doc and test directories, and read the thesis, you will find a lot more insight.

This HN thread is also quite good.

If you are stuck, the mailing list might also help you.

FAQ

Q: I get this error:

/hello_world_001: error while loading shared libraries: libcolm-0.13.0.4.so: cannot open shared object file: No such file or directory

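A: You probably configured and installed colm with the '--prefix' argument, so the shared library is not in a directory that the loader searches. Linking it into /usr/lib fixes that:

sudo updatedb
sudo ln -s `locate libcolm-0.13.0.4.so` /usr/lib/libcolm-0.13.0.4.so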