Developing a custom domain-specific language with Jison

Published on Sunday, April 21st 2019 at 21:01
Last updated on Thursday, March 21st 2024 at 15:58

We are continuously bombarded with a plethora of new and different languages blossoming out of thin air every week. It looks like every grandma and her dog is trying to create the next big general purpose language. While a wild growth of different short-lived ill-conceived languages may become annoying, the idea behind the construction of them always look appealing and interesting. How many times did you think of writing your own language? The tools that allow such a feat are there and waiting to be used.

When to create a new language?

Given the premise of this articule, it is necessary to put some assumptions in place to limit the scope and breadth of the current topic. Take it as a friendly advice. If you are thinking of creating yet another general purpose programming language, it is probably not worth it. There are so many languages that already exist, which boast impressive code bases, tons of libraries and wildly active comunities. Sometimes, it’s useful to step back and avoid reinventing the wheel. Unless you find it fun: then go for it. Personal enjoyment should never be subject to the rules of time economy and efficiency.

However, it must be noted that other types of languages exist, other than general purpose programming languages. We have scripting languages, markup languages, definition languages, and so on and so forth… In most cases, it is possible to take one of the existing languages out of the toolbox and bend it to a specific task without too much trouble, saving time and money. For example, let’s say we want to define a way to specify how geometric figures are positioned within a picture and the colors used to draw them. First, we should look for a tool that already solves this problem: luckily it already exists and it’s called scalable vector graphics; but let’s assume for the sake of argument that no one invented it yet. We could come up with an entire new language, write an interpreter, test it and finally put it to work, a week later. Or, we could simply use json or yaml with a predefined schema and integrate any one of the thousand ready-made parsers out there to get up and running in less than an hour.

As you can see, it is often more convenient to build on what we already have rather than creating something new. Nevertheless, there are cases in which it is useful (or fun) to start anew. That is especially true for domain-specific languages, which are used to solve problems in a very narrow niche. In this article, I will exemplify how to create a domain specific language to write simple music. Of course, other similar more powerful languages already exist. My goal is to merely illustrate how that can be achieved, regardless of the specific utility of the language itself.

How to create a new language?

Creating a language is easy. Just think about what you want it to look like, write down some examples and a minimal set of rules, and you got yourself a broad overview. That’s the creative part, and it is pretty much the shortest bit of the whole process. In order to have a proper interpreter or compiler for it, there are many steps to go through:

First, it is necessary to specify the building blocks of the language. In layman terms, they are the smallest bits of syntax with meaning. For example, keywords (if, else, while, etc…), operators (+, =, !, etc…), comments (/*any word between these asterisks*/). In jargon, they are called lexemes - or more simply just tokens - and the component that produces them is called a lexer.
Language rules should be specified in a formal way, so that it is possible for a machine to grasp them. The conceptual tool used to express them is called grammar. A grammar specifies how lexems come together to form more complex rules (in jargon, productions) that carry higher level meanings (for example, the whole if then else block). There’s a whole field of computer science that deals with grammars and how they evolve into formal languages with different expressive power. If you are a CS graduate, you may still have nightmares about it. However, I will not delve further into the topic here.
A tool is now needed to crunch textual input, process it according to the grammar, and produce an abstract representation of the program (in jargon, an abstract syntax tree). This component is called parser. Among its functions, it also provides feedback about any mistakes that the input contains (for example, missing semicolons or unclosed braces), i.e. any erroneous artifact that prevents the input from satisfying all the rules imposed by the grammar.
Although the output of the parser may represent a valid program with respect to the given grammar, it may not be correct. There are properties that are simply not possible to check during parsing. For example, consider the snippet below. a cannot be assigned to b, because one is a string the other is a number. Types are not compatible. A semantic analyzer verifies that the program also satisfies these higher-level rules, which are not part of the grammar but still contribute in defining the overall language behaviour.
1
2
const a: string = "Hello!";
const b: number = a;

Now that the textual input has been transformed into a correct abstract representation of a program, the possibilities are endless. For programming languages, the abstract syntax tree can be scanned in order to directly produce a result by emulating instructions inscribed in its contents: in this case, the program is interpreted. Alternatively, the tree can be used to create an executable file in a lower level language, like a machine language (like assembly) or an intermediate language (like java bytecode): in this case, the program is compiled. It is also possible to just translate the input into another high-level language (like javascript), which is referred to as transpilation.

Even though up until now I always referred to the textual input as “the program”, don’t be fooled into thinking that the output must be some kind of executable. Produced artefacts can be anything: images, videos, music, code, etc… For example, VHDL is a language used to speficy circuitry: its output, in a broad sense, is the blueprint to actually create physical hardware. Domain-specific languages are especially prone to have alternative kinds of output, hence my choice of toy project for this article.

Enter the parser generators

As you may have noticed, creating whole toolchain to deal with a language is no easy task. It involes several steps, each requiring a lot of attention. Luckily, some of these steps can be automated. In particular, steps 1 and 3, as creating a lexer and a parser for a given grammar is a well-known problem with many available solutions. A parser generator is a program that takes a grammar definition in input and produces a parser for that grammar (and/or a lexer included with it). In the javascript world, one of the most used parser generators is Jison, which aims at replicating the same functionality that Bison had for the C language. It also accepts the same grammar format in input, making it easier for people to migrate from one tool to the other.

Writing the grammar

So, we now have the knowledge of the tools we can use. Still need the core idea of the language. Here is a sample I came up with to broadly sketch how I wanted it to look like:

section "Intro"
    tempo: 120 bpm
    time_signature: 2/4
    ---
    measure: rest/4 B4/16 A4/16 G#4/16 A4/16
    measure: C5/8 rest/8 D5/16 C5/16 B4/16 C5/16
    measure: E5/8 rest/8 F5/16 E5/16 D#5/16 E5/16
    measure: B5/16 A5/16 G#5/16 A5/16 B5/16 A5/16 G#5/16 A5/16
end

section "Finale"
    tempo: 120 bpm
    time_signature: 2/4
    ---
    measure: C6/4 A5/8 C6/8
    measure: B5/8 A5/8 G5/8 A5/8
    measure: B5/8 A5/8 G5/8 A5/8
    measure: B5/8 A5/8 G5/8 F#5/8
    measure: E5/4

    @ Ignore the rest here, just an example
    measure: import 1 to 4 from "Intro"
    measure: import 1 to 5 from "Finale"
end

song "Rondo alla Turca"
    section "Intro"
    section "Finale"
end

And the semantic rules I associated with it:

A song is a list of sections;
Each section must have a tempo and a time signature indication. It also must have a non-empty list of measures;
A measure can be a list of notes and rests, which are written in a format similar to the GUIDO notation. It can also be a reference to any other sublist of measures from an existing section;
Imports must only refer to measures with notes in them.

How do we transition from that script to a full grammar specification? First we must start with the lexemes. To get them, we must break the language syntax down to its atomic components:

Comments, i.e. everything starting with a @;
Whitespaces, which are ignored;
Notes and rests;
Numbers (used when specifying tempo and time signature);
The “body separator” ---, which separates the declaration of a section’s attributes from the list of its measures;
Strings, i.e. everything in between quotes. They are used as names for sections and songs;
All the keywords of the language, for example section, song, tempo, etc…
Finaly, the “operators” we use, i.e. : and /. They are not technically operators, but they are symbols we can treat as such. Just keep in mind that their role is more that of a separator, similar to the body separator listed above.

The lexicon specifies how to build these lexemes starting from a sequence of characters. In Jison and in many other parser generators, lexemes are usually specified using regular expressions. The generated lexer will go through the list of all given regular expressions and attempt to match them to the text in input. If a match is found, the associated token is returned. This process repeats until all the text in input has been exhausted or until some text is found that cannot be matched to any expression: in this case, the lexer stops and throws an error. It is very important to note that:

The lexer attempts to find the earliest, longest matches first

Therefore, the order in which regular expressions for tokens are written is extremely important. Using the standard Bison format, here is the lexicon for our grammar:

/* lexical grammar */
%lex

%%
\s*\@[^\n\r]*      /* skip line comments */
\s+                /* skip whitespace */

[A-G](\#|b)?[1-8]\/(128|64|32|16|8|4|2|1)(\.{0,3})  return 'NOTE';
rest\/(128|64|32|16|8|4|2|1)(\.{0,3})               return 'REST';

[0-9]+\b           return 'NUMBER';
\-{3,}             return 'BODY_SEPARATOR';
\"[^"\n]+\"        return 'STRING'

"section"          return 'SECTION';
"song"             return 'SONG';
"end"              return 'END';
"tempo"            return 'TEMPO';
"time_signature"   return 'TIME_SIGNATURE';
"measure"          return 'MEASURE';
"import"           return 'IMPORT';
"from"             return 'FROM';
"to"               return 'TO';
"bpm"              return 'BPM';
":"                return ':';
"/"                return '/';

/lex

%lex, %% and /lex are special codes used to separate the lexicon from the actual grammar in a Bison file. You can disregard them as mandatory fluff. The rest of the script contains many lines of definitions, each consisting of two parts: a regular expression and a return statement. The former is the expression that the lexer will attempt to match, whereas the latter indicates an arbitrary name that will be used later to refer to tokens. Lines without any return statement, like the first two, will simply not produce any token and thus will be ignored (in compliance with our assumptions). Two things are worth noticing:

First, the usage of return. Bison grammar files are hybrid. They contain lexer and grammar rules, but they also contain code. The latter will be executed in tandem with the parsing logic, allowing developers to create complex interactions between the source and its environment. If the presence of code is limited here, it will become more and more apparent as soon as we start to write grammar rules.
Tokens produced by the lexer are not single atomic units. They are a pair, which contains the token name (assigned by the return) and the actual portion of text that matched the corresponding regular expression. For example, if the string rest/4 is given in input, it will produce a token named REST whose string value is rest/4. Names are used to easily refer to tokens in grammar rules, while their values can be manipulated further or carried over by any other logic we wish to add later.

Before writing the full grammar, we can try it out with a very simple production. Save the following in a file named one-rule-grammar.jison:

Grammar proof of conceptview raw

/* lexical grammar */
%lex

%%
\s*\@[^\n\r]*      /* skip line comments */
\s+                /* skip whitespace */

[A-G](\#|b)?[1-8]\/(128|64|32|16|8|4|2|1)(\.{0,3})  return 'NOTE';
rest\/(128|64|32|16|8|4|2|1)(\.{0,3})               return 'REST';

[0-9]+\b           return 'NUMBER';
\-{3,}             return 'BODY_SEPARATOR';
\"[^"\n]+\"        return 'STRING'

"section"          return 'SECTION';
"song"             return 'SONG';
"end"              return 'END';
"tempo"            return 'TEMPO';
"time_signature"   return 'TIME_SIGNATURE';
"measure"          return 'MEASURE';
"import"           return 'IMPORT';
"from"             return 'FROM';
"to"               return 'TO';
"bpm"              return 'BPM';
":"                return ':';
"/"                return '/';

/lex

/* operator associations and precedence */

%left ':'
%left '/'

%start expressions

%% /* language grammar */

expressions
    : song
        { return { song: $1 }; }
    ;

song
    : SONG STRING END
        { $$ = { name: $2 } }
    ;

Then npm install jison -g and execute from the command line:

$ jison one-rule-grammar.jison
$ node
> const parser = require('./one-rule-grammar').parser
undefined
> parser.parse('song "A good song" end')
{ song: { name: '"A good song"' } }

As you can see, generating a parser is as simple as executing jison from the command line, which creates a new javascript file containing the parser code. The latter can then be imported into node like any other module to parse source code in input. The return value of the parser depends on what we write in the grammar definition. In our case:

%start expressions

%% /* language grammar */

expressions
    : song
        { return { song: $1 }; }
    ;

song
    : SONG STRING END
        { $$ = { name: $2 } }
    ;

%start defines the root production of the grammar. After that, single rules can be written to further outline the desired behaviour. Each rule is composed by three parts:

The head or name of the rule, for example expressions or song. These symbols are often referred to as non-terminals in jargon;
The body of the rule, which is one or more sequences of tokens or non-terminals. For example, expressions is the root rule, therefore the whole input must satisfy it. Its body is simply song, which means that the parser must try to match another rule named song in order to fulfill the expressions structure. In turn, song is only satisfied if the input text produced three specific tokens in succession;
The code associated to the rule, detailed between a pair of curly braces. Literally any valid javascript code can go in here. Usually it is better to keep it short and write any helper functions in another part of the file, delimited by a curly brace/percent block. In this section, developers have access to special variables provided by the parser, namely $$, $1, $2, etc… $$ represents the output of the rule, which can be any valid javascript expression. In this case we chose to return objects with properties inside them. Variables named $n contain the value of the n-th matched symbol. If it is a token, it contains the value of the token: you can now understand why I explained earlier that tokens have both a name (used to compose rules) and a value (used by rule code to build the parser output). Otherwise, if the symbol is a non-terminal, i.e. it refers to another grammar rule, $n will contain the output of the code associated to that rule.

Since the test input above was song "A good song" end, the lexer produced three tokens: SONG, STRING and END. They in turn match one of our grammar rules, thus the input program is correct. If you try to truncate the input string, you will get a parsing error:

> parser.parse('song "A good song"')
Thrown:
{ Error: Parse error on line 1:
song "A good song"
------------------^
Expecting 'END', got '1'
    at Parser.parseError (...:100:21)
    at Parser.parse (...:167:22)
  hash:
   { text: '',
     token: 1,
     line: 0,
     loc:
      { first_line: 1, last_line: 1, first_column: 5, last_column: 18 },
     expected: [ "'END'" ] } }

In fact the only production in our grammar expects to receive those three tokens but only two were found, therefore the program is incomplete.

Putting it together

By slowly separating each syntax artifact into small grammar rules, it’s fairly straightfoward to construct the final grammar. You can take a peek at the complete file on the project repository. You can notice several new things that I introduced in the grammar definition:

A section dedicated to helper functions, written in javascript. This section can contain arbitrary code to be used further down by grammar rules during the parsing process. You may wonder why I used helper functions to parse notes and rests given that we are supposedly delegating those actions to an external tool. Simply put, I wanted those entities to be atomic. If I let the parser generator handle them, I’d have to define some additional lexemes and rules to ensure that no space is included within them. It seemed too much effort.
Alternative rule choices, denoted by the | character. Choices allow the final language to be more flexible by permitting various options for any construct: in our example, measures can either contain a list of notes/rests or be imported from another section.
Lists of things, which are expressed using a recurrent pattern. You may find this useful if you want to experiment with grammars. Essentially, the rule is recursive. It says “a list is either a single item, or a smaller list plus one item”. Notice that this particular rule does not allow empty lists.

list
    : list item
        { $$ = $1.concat([$2]) }
    | item
        { $$ = [$1] }
    ;

This concludes the first part of the article. In the next part, we will build on this grammar by:

writing a semantic analyzer for a parsed program
implementing a way to “run” the program, by playing the encoded song out with Tone.js
visualizing the source code with custom syntax highlighting using CodeMirror.

If you don’t want to wait for the next article, the final project is already available on my personal bitbucket repository here.