Nicolò Andronio

Full-stack developer, computer scientist, engineer
Evil Genius in his spare time

Developing a custom domain-specific language with Jison

We are continuously bombarded with a plethora of new languages blossoming out of thin air every week. It seems as if every grandma and her dog is trying to create the next big general-purpose language. While the wild growth of short-lived, ill-conceived languages can get annoying, the idea behind building one always looks appealing and interesting. How many times have you thought of writing your own language? The tools that make such a feat possible are out there, waiting to be used.

When to create a new language?

Given the premise of this article, it is necessary to put some assumptions in place to limit the scope and breadth of the topic. Take it as friendly advice: if you are thinking of creating yet another general-purpose programming language, it is probably not worth it. So many languages already exist, boasting impressive code bases, tons of libraries and wildly active communities. Sometimes it's useful to step back and avoid reinventing the wheel. Unless you find it fun: then go for it. Personal enjoyment should never be subject to the rules of time economy and efficiency.

However, it must be noted that other types of languages exist besides general-purpose programming languages: scripting languages, markup languages, definition languages, and so on. In most cases, it is possible to take one of the existing languages out of the toolbox and bend it to a specific task without too much trouble, saving time and money. For example, let's say we want to define a way to specify how geometric figures are positioned within a picture and the colors used to draw them. First, we should look for a tool that already solves this problem: luckily it exists and it's called Scalable Vector Graphics; but let's assume for the sake of argument that no one has invented it yet. We could come up with an entirely new language, write an interpreter, test it and finally put it to work, a week later. Or we could simply use JSON or YAML with a predefined schema and integrate any one of the thousand ready-made parsers out there to get up and running in less than an hour.
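To make the comparison concrete, here is a minimal sketch of the JSON route. The schema below (figure types and field names) is entirely made up for illustration; the point is that the parsing itself comes for free:

```javascript
// Figures described as plain JSON data: no custom parser needed.
const source = `{
  "figures": [
    { "type": "circle", "center": { "x": 10, "y": 20 }, "radius": 5, "color": "red" },
    { "type": "rect", "origin": { "x": 0, "y": 0 }, "width": 30, "height": 15, "color": "#00ff00" }
  ]
}`;

// The "parser" is the built-in, battle-tested JSON.parse.
const scene = JSON.parse(source);
for (const figure of scene.figures) {
  console.log(`${figure.type} drawn in ${figure.color}`);
}
```

All that remains is validating the data against the schema and interpreting it, which is a far smaller job than building a language from scratch.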

As you can see, it is often more convenient to build on what we already have rather than create something new. Nevertheless, there are cases in which it is useful (or fun) to start anew. That is especially true for domain-specific languages, which are used to solve problems in a very narrow niche. In this article, I will demonstrate how to create a domain-specific language for writing simple music. Of course, similar and more powerful languages already exist. My goal is merely to illustrate how it can be done, regardless of the specific utility of the language itself.

How to create a new language?

Creating a language is easy. Just think about what you want it to look like, write down some examples and a minimal set of rules, and you have yourself a broad overview. That's the creative part, and it is pretty much the shortest bit of the whole process. In order to have a proper interpreter or compiler for it, there are many steps to go through:

Even though up until now I have always referred to the textual input as "the program", don't be fooled into thinking that the output must be some kind of executable. Produced artefacts can be anything: images, videos, music, code, etc. For example, VHDL is a language used to specify circuitry: its output, in a broad sense, is the blueprint to create physical hardware. Domain-specific languages are especially prone to having alternative kinds of output, hence my choice of toy project for this article.

Enter the parser generators

As you may have noticed, creating a whole toolchain to deal with a language is no easy task. It involves several steps, each requiring a lot of attention. Luckily, some of these steps can be automated, in particular steps 1 and 3, as creating a lexer and a parser for a given grammar is a well-known problem with many available solutions. A parser generator is a program that takes a grammar definition as input and produces a parser for that grammar (and/or an accompanying lexer). In the JavaScript world, one of the most popular parser generators is Jison, which aims at replicating the functionality that Bison provides for the C language. It also accepts the same grammar format as input, making it easier for people to migrate from one tool to the other.

Writing the grammar

So, we now know the tools we can use; we still need the core idea of the language. Here is a sample I came up with to broadly sketch how I wanted it to look:

section "Intro"
tempo: 120 bpm
time_signature: 2/4
---
measure: rest/4 B4/16 A4/16 G#4/16 A4/16
measure: C5/8 rest/8 D5/16 C5/16 B4/16 C5/16
measure: E5/8 rest/8 F5/16 E5/16 D#5/16 E5/16
measure: B5/16 A5/16 G#5/16 A5/16 B5/16 A5/16 G#5/16 A5/16
end

section "Finale"
tempo: 120 bpm
time_signature: 2/4
---
measure: C6/4 A5/8 C6/8
measure: B5/8 A5/8 G5/8 A5/8
measure: B5/8 A5/8 G5/8 A5/8
measure: B5/8 A5/8 G5/8 F#5/8
measure: E5/4

@ Ignore the rest here, just an example
measure: import 1 to 4 from "Intro"
measure: import 1 to 5 from "Finale"
end

song "Rondo alla Turca"
section "Intro"
section "Finale"
end

And the semantic rules I associated with it:

How do we transition from that script to a full grammar specification? First, we must start with the lexemes. To get them, we must break the language syntax down into its atomic components:

The lexicon specifies how to build these lexemes starting from a sequence of characters. In Jison, as in many other parser generators, lexemes are specified using regular expressions. The generated lexer goes through the list of regular expressions and attempts to match them against the input text. If a match is found, the associated token is returned. This process repeats until all the input has been consumed or until some text is found that cannot be matched by any expression: in that case, the lexer stops and throws an error. It is very important to note that:

The lexer attempts to find the earliest, longest matches first

Therefore, the order in which regular expressions for tokens are written is extremely important. Using the standard Bison format, here is the lexicon for our grammar:

/* lexical grammar */
%lex

%%
\s*\@[^\n\r]* /* skip line comments */
\s+ /* skip whitespace */

[A-G](\#|b)?[1-8]\/(128|64|32|16|8|4|2|1)(\.{0,3}) return 'NOTE';
rest\/(128|64|32|16|8|4|2|1)(\.{0,3}) return 'REST';

[0-9]+\b return 'NUMBER';
\-{3,} return 'BODY_SEPARATOR';
\"[^"\n]+\" return 'STRING';

"section" return 'SECTION';
"song" return 'SONG';
"end" return 'END';
"tempo" return 'TEMPO';
"time_signature" return 'TIME_SIGNATURE';
"measure" return 'MEASURE';
"import" return 'IMPORT';
"from" return 'FROM';
"to" return 'TO';
"bpm" return 'BPM';
":" return ':';
"/" return '/';

/lex
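To make the matching process tangible, here is a heavily simplified sketch, in plain JavaScript, of how an ordered list of rules drives tokenization. It is an illustration only: the real generated lexer is more sophisticated (notably, it prefers the longest match and uses rule order to break ties), and the rule set below is a tiny subset of the lexicon above:

```javascript
// A toy lexer: try the rules in order, consume the first match, repeat.
const rules = [
  { regex: /^[A-G](#|b)?[1-8]\/(128|64|32|16|8|4|2|1)\.{0,3}/, token: 'NOTE' },
  { regex: /^rest\/(128|64|32|16|8|4|2|1)\.{0,3}/, token: 'REST' },
  { regex: /^[0-9]+\b/, token: 'NUMBER' },
  { regex: /^\s+/, token: null }, // skip whitespace, emit nothing
];

function tokenize(input) {
  const tokens = [];
  while (input.length > 0) {
    const rule = rules.find((r) => r.regex.test(input));
    if (!rule) throw new Error(`Cannot match input: ${input}`);
    const text = input.match(rule.regex)[0];
    if (rule.token) tokens.push({ token: rule.token, text });
    input = input.slice(text.length); // consume the matched text
  }
  return tokens;
}

console.log(tokenize('A4/16 rest/8 42'));
// → [ { token: 'NOTE', text: 'A4/16' },
//     { token: 'REST', text: 'rest/8' },
//     { token: 'NUMBER', text: '42' } ]
```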

%lex, %% and /lex are special markers used to separate the lexicon from the actual grammar in a Bison file; you can disregard them as mandatory fluff. The rest of the script contains many lines of definitions, each consisting of two parts: a regular expression and a return statement. The former is the expression the lexer will attempt to match, whereas the latter indicates an arbitrary name that will later be used to refer to tokens. Lines without a return statement, like the first two, simply do not produce any token and are thus ignored (in compliance with our assumptions). Two things are worth noticing:

Before writing the full grammar, we can try it out with a very simple production. Save the following in a file named one-rule-grammar.jison:

Grammar proof of concept:
/* lexical grammar */
%lex

%%
\s*\@[^\n\r]* /* skip line comments */
\s+ /* skip whitespace */

[A-G](\#|b)?[1-8]\/(128|64|32|16|8|4|2|1)(\.{0,3}) return 'NOTE';
rest\/(128|64|32|16|8|4|2|1)(\.{0,3}) return 'REST';

[0-9]+\b return 'NUMBER';
\-{3,} return 'BODY_SEPARATOR';
\"[^"\n]+\" return 'STRING';

"section" return 'SECTION';
"song" return 'SONG';
"end" return 'END';
"tempo" return 'TEMPO';
"time_signature" return 'TIME_SIGNATURE';
"measure" return 'MEASURE';
"import" return 'IMPORT';
"from" return 'FROM';
"to" return 'TO';
"bpm" return 'BPM';
":" return ':';
"/" return '/';

/lex

/* operator associations and precedence */

%left ':'
%left '/'

%start expressions

%% /* language grammar */

expressions
: song
{ return { song: $1 }; }
;

song
: SONG STRING END
{ $$ = { name: $2 } }
;

Then run npm install -g jison and execute from the command line:

$ jison one-rule-grammar.jison
$ node
> const parser = require('./one-rule-grammar').parser
undefined
> parser.parse('song "A good song" end')
{ song: { name: '"A good song"' } }

As you can see, generating a parser is as simple as executing jison from the command line, which creates a new JavaScript file containing the parser code. That file can then be imported into Node.js like any other module to parse input source code. The return value of the parser depends on what we write in the grammar definition. In our case:

%start expressions

%% /* language grammar */

expressions
: song
{ return { song: $1 }; }
;

song
: SONG STRING END
{ $$ = { name: $2 } }
;

%start defines the root production of the grammar. After that, single rules can be written to further outline the desired behaviour. Each rule is composed of three parts: the name of the non-terminal being defined, the sequence of tokens and non-terminals that produces it, and an optional semantic action enclosed in braces, whose result is assigned to $$.

Since the test input above was song "A good song" end, the lexer produced three tokens: SONG, STRING and END. These in turn match one of our grammar rules, so the input program is valid. If you truncate the input string, you get a parsing error:

> parser.parse('song "A good song"')
Thrown:
{ Error: Parse error on line 1:
song "A good song"
------------------^
Expecting 'END', got '1'
at Parser.parseError (...:100:21)
at Parser.parse (...:167:22)
hash:
{ text: '',
token: 1,
line: 0,
loc:
{ first_line: 1, last_line: 1, first_column: 5, last_column: 18 },
expected: [ "'END'" ] } }

In fact, the only production in our grammar expects to receive those three tokens, but only two were found; therefore the program is incomplete.
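Since the generated parser signals failures by throwing, application code will typically want to wrap the call. Here is a hedged sketch of such a wrapper; the fakeParser below is a hypothetical stand-in that merely mimics the generated module's throw-on-error behaviour, so the example runs without Jison installed:

```javascript
// Wrap any throwing parser so callers get a result object instead.
function safeParse(parser, source) {
  try {
    return { ok: true, value: parser.parse(source) };
  } catch (error) {
    return { ok: false, message: error.message };
  }
}

// Hypothetical stand-in mimicking the generated parser's behaviour.
const fakeParser = {
  parse(source) {
    if (!source.trim().endsWith('end')) throw new Error("Expecting 'END'");
    return { song: { name: source.split('"')[1] } };
  },
};

console.log(safeParse(fakeParser, 'song "A good song" end'));
// → { ok: true, value: { song: { name: 'A good song' } } }
console.log(safeParse(fakeParser, 'song "A good song"'));
// → { ok: false, message: "Expecting 'END'" }
```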

Putting it together

By slowly separating each syntax artifact into small grammar rules, it's fairly straightforward to construct the final grammar. You can take a peek at the complete file in the project repository. You will notice several new things I introduced in the grammar definition:

list
: list item
{ $$ = $1.concat([$2]) }
| item
{ $$ = [$1] }
;
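To see what the semantic actions compute, we can replay by hand the reductions the parser performs while recognizing three consecutive items (plain JavaScript, not generated code):

```javascript
// Each line mirrors one reduction of the list rule above.
let list = ['a'];           // list: item        →  $$ = [$1]
list = list.concat(['b']);  // list: list item   →  $$ = $1.concat([$2])
list = list.concat(['c']);  // list: list item   →  $$ = $1.concat([$2])
console.log(list); // → [ 'a', 'b', 'c' ]
```

The left-recursive form (list item rather than item list) is the idiomatic choice here, since LR parsers like the one Jison generates can reduce it without growing the parse stack for each item.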

This concludes the first part of the article. In the next part, we will build on this grammar by:

If you don’t want to wait for the next article, the final project is already available on my personal Bitbucket repository here.