colm

Code spelunking on a niche language

Introduction

Data conversion is my bread and butter, be it mangling a CSV file, querying SQL servers, extracting lists from a SharePoint server, moving ICS files into an Oracle database, reading output via SNMP or pivoting something in Excel.

I like to transform data from one form into another, and I am good at it. But I am not doing this just for fun. I am doing this because data by itself is useless: it must be presented as information to the people who need it as the basis for sound decision making.

The tools that I use depend on the situation:

  • if it is on a Linux server, it could be bash tools;
  • if I have access to a webserver in the same network, I might fall back to Python or even PHP;
  • if the user is clever and needs to dig further into the raw data, then plain Excel or SQLite might be suitable;
  • if it is a one-off task, I can use low-level tools and tricks.

However, with the advent of big data, and ever more and ever larger logfiles, I started to realize that a more structured approach could be helpful. The pros use dedicated parsers.

To transform data, the first step is to get and parse the data. Apparently the average speed to process a CSV file with libcsv is roughly 200 MB per second. This is an order of magnitude lower than the speed of plain disk reads.
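A rough way to get a feel for such numbers on your own machine is to time a parser on data that is already in memory, so that disk speed does not play a role. The sketch below is only an illustration: it uses Python's built-in csv module rather than libcsv, and the function name `measure_csv_throughput` and the sample data are made up for this example. The figure it prints depends on the machine and says nothing about libcsv itself.

```python
import csv
import io
import time


def measure_csv_throughput(text):
    """Parse CSV text held in memory; return (row_count, megabytes_per_second)."""
    size_mb = len(text.encode("utf-8")) / 1_000_000
    start = time.perf_counter()
    # Walk through every row so the parser does all of its work.
    row_count = sum(1 for _ in csv.reader(io.StringIO(text)))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    rate = size_mb / elapsed if elapsed > 0 else float("inf")
    return row_count, rate


# Hypothetical sample data: one header line plus 100,000 identical rows.
sample = "id,name,value\n" + "42,example,3.14\n" * 100_000
rows, rate = measure_csv_throughput(sample)
print(f"{rows} rows parsed at roughly {rate:.0f} MB/s")
```

Comparing the printed rate with the raw read speed of your disk gives a first impression of how much time goes into parsing rather than reading.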

When I looked into parsers I was overwhelmed by all the variants: LALR, PEG, RAT, Pratt, etc. It is all so abstract, and not compatible with my must-get-hands-dirty (TM) lifestyle.

Enter Ragel

I found Zed Shaw's article delightful, but never got around to looking further into Ragel. When I found out about the Rosie language and Ohm, I thought that the time had come to upgrade my tools to the next level. So I decided that it was about time to take a deeper look at Ragel.

Ragel was created by Dr. Adrian Thurston, who is a guru in the field of parsers. Ragel is software that can generate a parser based on a language definition. He created Ragel and colm during his PhD.

Here are some interesting links from the official website and twitter:

Enter colm

Although I am not paranoid, I jumped to conclusions:

  • Ragel is widely appreciated and brings value to the table
  • Ragel will be built upon colm
  • Ragel has got some improvements because of colm
  • Adrian will give colm more focus

So this got me interested: what is this colm? There is no manual, almost no documentation, and no good README in the code. That could mean that colm is the secret weapon that Paul Graham talks about.

And thus started the quest to learn as much as possible about this weapon, to harness its power for my own benefit. :-)