[Home]

More lexical analysis

The comment-stripping example that we did in the last page is a special case of lexical analysis. In general, a text file consists of a sequence of characters. Thus, the statement

if(flag == 0)
   b = 67.5;

consists of the following sequence of characters:

i, f, (, f, l, a, g, =,
=, &lt;blank&gt;, 0, ), &lt;newline&gt;,
&lt;tab&gt;, b, =, 6, 7, ., 5, ;

But surely we do not want to think of the code snippet as merely a list of 21 characters. We want to think of the code as made up of the following 10 "high-level" building blocks:

if
(
flag
==
0
)
b
=
67.5
;

In particular we want to

ignore white-space characters like <blank> etc. However, we do need the white-space characters to delimit the other entities, since
```
      if (a == 0) then b = 9 
      
```
is not the same as
```
      if (a == 0) thenb = 9 
      
```
collect the different tokens and know which token is which. For instance in ForTran we have a token then, which has a special meaning in the language. But in C it is not a reserved word. So in C the token then will be treated a variable name (or identifier).

The analysis of a text-file to achieve these two ends is called Lexical Analysis. We shall consider one simple example of how to recognize identifiers in a program file.

What is an identifier? It is the name of any variable or function or any other object defined by the user. Thus in the following code snippet

void updateFile(str,FILE *f)
{
  if(!f)
    {
      fprintf(stderr,"Invalid file\n");
      exit(1);
    }
  else
    {
      fprintf(f,"%s\n",str);
    }
}

we have 6 identifiers :

updateFile
str
f
fprintf
stderr
exit

Most programming language textbooks specify the permissible identifiers as follows. They first list all the reserved words in the language. Then they say that any string other than these, starting with an alphbetic character (a to z or A to Z or _) and followed by any number of alphanumeric characters is an identifier. Some languages also put a bound on the length of the identifiers.

We shall see how to extract identifier names from a preogram files using a finite automaton. Our description will be deliberately informal, so that the reader is led to draw the transition diagram himself or herself.

Finite automaton to extract identifiers

The input set is of size three. The first input occurs when we read any alphabetic character (one may use the isalpha() function in C). The second input corresponds to reading any digit character, the third input means that the character is neither alphabetic nor numeric.

We shall need two states:

Outside an identifier
After reading the first character of an identifier

We shall not specify the output set. The reader is to fill in that blank with his or her programming common sense. When the automaton is in state 1, it continues to remain there until it meets an input of type 1. Then it moves into state 2. It remains in that state until it gets an input of type 3, which switches the automaton back to state 1.

Finite automata as Recognizers

The lexical analysis example leads to one concept of great practical and theoretical importance, viz., language recognition. In finite automata nomenclature the word language is defined as follows.

Suppose we have a finite (nonempty) set of symbols. This set is called our alphabet, D. Each symbol is called a letter. Consider the set D* consisting of all (finite) strings of letters (of whatever length, including 0 length). The zero-length string is just 'nothing', and is included in the set for some technical convenience. Each member of D* is called a word. Here is an example.

If D is the English alphabet (lower case, upper case and numerals) then here are some members of D*

a b who 456 robo89 45typ MyGOODNESS

By a language, L, we mean a subset of D*. Thus, in the English example, the set of all grammatically correct words form a language (which is our familiar English language.)

By language recognition we mean contructing some gadget which will take an element of D* and tell us whether it belongs to L or not. Finite automata (and variants thereof) are highly suited to this purpose. We shall discuss below finite automata for recognizing some languages.

In the comment-stripping example from the last page our alphabet consisted of all the allowable characters in a C program. The words were code snippets. Our language consisted of comments ie strings starting with /*, ending with */, and having in the middle any nonnegative number letters such that there is no */ in it.

In the identifier extracting example we had again the same alphabet. But this time our language consisted of words that started with an alphabetic character, and then had any nonnegative number of alphanumeric characters.

Consider the similarity between the two examples. In both cases, the language specification consists of specifying part of the words concretely (eg, must start with /* or alphabetic letter) and the rest of it is described as any nonnegative number of some partical type of letters. Any language that is specified by these two methods can be easily recognized by a finite automaton. If the reader has followed the last two examples (comment-stripping and identifier-extracting) well then constructing a recognizer for any such language should be easy. We take up this issue below with some rigor.

Details

Let D be our alphabet. The language to be recognized is L, a subset of D*. An automaton M for recognizing this language has input set consisting of all the letters in D alongwith one more special symbol (call it EOF). The output set is of size three. The outputs are : "yes" , "no" and "nothing". The last output is just to allow the automaton to refrain from showing any visible output. We shall say that M recognizes L if for any input string of the form d EOF, where d is a word in D*, the automaton produces the output "yes" finally if and only if d is a word in L. We do not care about the outputs it produces when it is somewhere halfway through the input string.

It is not true that any language L can be recognized by a finite automaton. The languages that can be recognized by a finite automaton are called regular expressions.

May be you already have an idea about what regular expressions are. They are used widely in editors and search engines. Here are a few examples:

grep/egrep: In Unix envioronments, one uses the grep or egrep utility to search inside files. Thus, for instance, to see if a file named foo.txt has any of the words "romeo" and "Romeo", we use the command
$ grep [rR]omeo foo.txt
Here the [rR] mean "either r or R". In general, square brackets in a regular expression denote one of the letters enclosed.
To see if foo.txt contains any word made up of only the letters "a", "g", "t" and "c" the command is
$ grep ([agtc])+ foo.txt
The symbol (...)+ in a regular expression means any positive number of repetition of the thing inside the parantheses, in this case [agtc].
The C identifiers considered above are of the form
[a-z]([0-9a-z])*
Here [a-z] is an abbreviation for [abc...wxyz], and similarly for [0-9a-z]. The (...)* means any nonnegative number of repetitions of the thing inside the parantheses.

It should not be too hard for the reader to adapt the argument given for recognizing C identifiers to recognize any regular expression. We shall not go into the details here, since more interesting things are awaiting us in the next page!

Table of contents

More lexical analysis

Finite automaton to extract identifiers

Finite automata as Recognizers

Details