[Home]

Table of contents


Wisent parser generator

For this tutorial you have to already know about BNF grammars and tools like flex and bison or yacc. If you want to learn about those, then read my tutorial here.

Emacs has its own lexical analyser and parser generator. These are like flex and bison written in emacs lisp, and available as part of the emacs package cedet. By the way, cedet is not installed by default when you install emacs. You can download the package manually from github. I refrain from providing a URL here, as these things keep on changing abruptly. Do a web search. You'll possibly find many github links all promising to provide cedet. Many of them will be mirrors of each other, while some will contain only a subset of the files. Just remember that the cedet that we shall need must contain the following forder hierarchy:
lisp/cedet/semantic/wisent
Then inside wisent there should be a number of .wy files. Just download such a version of cedet, unzip it and remember the name of the folder. No need to install anything right now. We shall use load-file as an when necessary.

Our grammar

We shall work with the following grammar:
line : NUM
line : line NUM

Here the terminal NUM denotes a number, and the nonterminal line, as you can see from the grammar itself, is just a list of a positive number of NUM's. Just to give a purpose to this we shall try to find the sum of all the numbers.

Specifying the grammar

This we can do in two ways: In this tutorial we shall learn only the low level method. Even though the high level method is admittedly easier to use, an acquaintance with the low level method will help us to understand (and debug) the high level method. Here is our grammar in the low level method:
( 
  (number)
1
  () 
2
  (line 
3
     ( (number) (string-to-number $1) )
4
     ( (line number) (+ $1 (string-to-number $2) )
  )
)
5

Explanation of the code:

1:
First you give a list of terminals your grammar is going to use. Here we have only one: number. Notice that we have changed the name of the terminal from NUM to number. This is to take advantage of already existing lexical analysis features of semantic, where a number terminal is called number. Of course, it is possible (and easy) to use custom names like NUM, but in this first example we stick to the names provided by semantic.
2:
In bison you can specify things like associativity and precedence of the operators as well as types of the terminals and nonterminals. Well, you can do the same here. This list is precisely for that purpose. In our simple example we just leave it empty. Caution: you cannot skip the list completely, the empty list must be present as a placeholder!
3:
Now we are about to specify a rule for the nonterminal line
4:
Here we specify the first RHS, i.e., number. The action follows. Here the action just converts the attribute (which is a string) of $1 (i.e., the 1-th symbol in the RHS) to a number (and makes it the attribute of the LHS nonterminal).
5:
Very similar to the last line.

Let us save it to a variable.

(setq g 
   '((number) () (line ( (number)  (string-to-number $1))
                            ( (line number) (+ $1
   (string-to-number $2)))))
)

Don't forget the little single quote character just before the grammar. It prevents lisp from trying to evaluate the grammar list.

Creating the parser

Now we have to apply the LALR(1) algo to it to create the action and goto tables. For this we shall use the wisent-grammar-compile function that is defined in the file cedet/semantic/wisent/comp.el. At this point I urge you browse through the directory hierarchy of the files you have downloaded, and manually load the file using M-x load-file.

Let's apply wisent-grammar-compile:
(setq p (wisent-compile-grammar g))
The action and goto tables (plus sundry other relevant pieces of information) are stored in lisp format inside the variable p. If you type p in the buffer and simply hit C-x C-e you'll see the value of p in the minibuffer. Not that it will help to look at it, but you get some satisfaction that wisent-compile-grammar is doing something!

Well, that takes care of the parser. Now we need a lexer to feed inputs to the parser.

Creating the lexer

This is slightly trickier, because emacs already contains a suit of lexical analysis tools, and we want to make use of them. We first use the define-lex function that comes with semantic-mode. Issue M-x semantic-mode for this.
(define-lex lx "" 
  semantic-lex-ignore-newline  
  semantic-lex-ignore-whitespace
  semantic-lex-number
  semantic-lex-default-action)
Eventually we are going to use the function semantic-lex-buffer, which will make use of the lx that we created just now. The question is how do we pass lx to semantic-lex-buffer? The technique is to use a global variable semantic-lex-analyzer:
(setq  semantic-lex-analyzer 'lx)
Now, emacs has the concept of a syntax table that tells it about the category of each character in a buffer. While doing lexical analysis emacs allows you the luxury of using any syntax table of our choice. This allows analysing a buffer in a way other than what its current mode dictates. So we need to specify the syntax table:
(setq semantic-lex-syntax-table (standard-syntax-table))
Now we are all set and can apply semantic-lex-buffer to extract a sequence of tokens. We want to save this sequence in a variable where wisent can find it.
  (setq wisent-lex-istream (semantic-lex-buffer))
Finally we have to do the parsing:
(wisent-parse p 'wisent-lex)
Here wisent-lex is a built-in function that reads the tokens one by one from the variable wisent-lex-istream that we created just now.

The wisent-parse function will return the attribute associated with the start symbol. Thus, in our example, the return value is the sum, which we can pretty-print like this:
(insert (format "Total is %s\n" (wisent-parse p 'wisent-lex))))
Nothing complicated so far! Now we encounter a little difficulty. We have made use of different global variables to interface our code with existing wisent code. All these global variables are what emacs calls buffer local, which means each buffer has its own copies of these variables. Even if you use setq to change the value the changes are not visible in a different buffer. Let's understand why this is going to pose a difficulty for us. We want to parse a buffer. Call it the target buffer. Our compiler code will reside in another buffer (compiler buffer, say). So all the above setq's will be inside the compiler buffer. The moment you visit the target buffer to start the parsing, all the globals will forget the values that you setq-ed into them in the compiler buffer! So it looks like we need to have the setq's inside the target buffer. But that does not look good either: that buffer should just contain the stuff that we want to parse.

The solution to this dilemma utilises a behaviour of emacs: If you call a function while visiting a buffer, and that function sets the value of a buffer local variable, then the new value remains visible in that buffer. So we must pack all our setq's that deal with the buffer local variables, into a function, and call that function (interactively) from the target buffer:
(defun doparse () 
  (interactive)
  (setq  semantic-lex-analyzer 'lx)
  (setq semantic-lex-syntax-table (standard-syntax-table))
  (setq wisent-lex-istream (semantic-lex-buffer))

  (insert (format "Total is %s\n" (wisent-parse p 'wisent-lex))))
Now we are all set. Create a buffer containing a list of numbers like
34 3.9 -89 
Visit that buffer, and issue the command M-x doparse. You will immediately see the output: "Total is -51.1" written where your cursor is.

But...

...you are not happy with all this, because...
  1. ...you want to have more exciting actions instead of writing the sum of the numbers right into the very same buffer!
  2. ...you hate the idea of writing your grammar as a lisp structure, and want to avoid so much boilerplate code. Also you want to use associativity, precedence etc.
  3. ...you want to integrate the entire activity into emacs, like add a menu item with names of functions, provide auto-completion or such goodies.
Let us take up these issues one by one. To have exciting actions you have to know how to achieve those action using emacs-lisp. Then just replace the simple actions that I have shown. Your imagination and grasp of emacs lisp are the only two limits on what you can achieve.

It is possible to specify the grammar much like .y file. Typically the extension .wy is used. Then you generate most of the code by just hitting a single key (assuming that lots of things are in their proper places, of course!).

Integration with the rest of emacs may be done in two ways. The first is using suitable action codes. For example, you may like to create menu that lists all the inner classes in your java file. Then your parser code can accumulate all the inner classes and you can create a menu with the result returned by wisent-parse. If your requirement is not very exotic, then you might be better off to allow semantic to work with your parser. Semantic already has built-in tools for various useful activities like tag completion, tag highlighting etc.

HTML Comment Box is loading comments...