Lexical Analysis

Literal Symbols

If the generated processor includes a parser (see Top of Syntactic Analysis), then Eli will extract the descriptions of any literal terminal symbols from the context-free grammar defining that parser and add them to the specifications provided by type-`gla' files. For example, consider the following context-free grammar:

Program: Expression .
Expression: Evaluation / Binding .
Evaluation:
  Constant / BoundVariable /
  '(' Expression '+' Expression ')' /
  '(' Expression '*' Expression ')' .
Binding: 'let' BoundVariable '=' Evaluation 'in' Expression .

This grammar has nine terminal symbols. Two (Constant and BoundVariable) are given by identifiers, and the other seven ((, ), +, *, let, = and in) are given by literals.

Only the character sequences to be classified as Constant or BoundVariable, and those to be classified as comments, need be defined by type-`gla' files. Descriptions of the symbols given as literals will be automatically extracted from the grammar by Eli. Thus the lexical analyzer for this language might be described by a single type-`gla' file containing the following:

Constant:      PASCAL_INTEGER
BoundVariable: PASCAL_IDENTIFIER
               PASCAL_COMMENT

Overriding the Default Treatment of Literal Symbols

By default, a literal terminal symbol specified in a context-free grammar supplied to Eli will be recognized as though it had been specified by the appropriate regular expression. Thus the literal symbols '+' and 'let' will be recognized as though the following specifications had been given by the user:

Plus:  $\+
Let:   $let

(Here Plus and Let are arbitrary identifiers describing the initial classifications of the literal symbols. No such identifiers are actually supplied by Eli, but the literal symbols are not initially classified as comments.)

In some situations it is useful to carry out more complex operations at the time the literal symbol is recognized. In this case, the user must do two things:

  1. Mark the literal symbol as being a special case.

  2. Provide a specification for the literal symbol.

As a concrete example, suppose that %% were used as a major separator in the input text and could appear either once or twice. Assume that the first occurrence is required, and the second is optional. All text following the second occurrence is to be ignored.

One approach to this problem would be to count the number of occurrences of the literal symbol %%, advancing to the end of the input text after the second. This could be done by an auxiliary scanner (see Auxiliary Scanners) that either returns a pointer to the character following the %% or a pointer to the ASCII NUL terminating the input text, and a token processor (see Token Processors) that reclassifies the second occurrence of %% as a comment. The grammar would specify only the required first occurrence of %%.

In order to mark the literal symbol %% as a special case that should not receive the default treatment, the user must supply a type-`delit' file specifying that symbol as a regular expression. The entry in the type-`delit' file also needs to define an identifier to represent the classification:

$%%  PercentPercent

Each line of a type-`delit' file consists of a regular expression and an identifier, separated by white space. The regular expression must describe a literal symbol appearing in a context-free grammar supplied to Eli. That literal symbol will not be incorporated automatically into the generated lexical analyzer; it must be specified explicitly by the user. The identifier will be given the appropriate value by an Eli-generated #define directive in file `litcode.h'.

In our example, %% could be specified by the following line of a type-`gla' file:

  $%%  (SkipOrNot) [CommentOrNot]

Initially, the separator will be classed as a comment because there is no identifier preceding the regular expression. SkipOrNot will use a state variable to decide whether or not to skip text (see Building scanners), while CommentOrNot will use the same state variable to decide whether or not to change the classification to PercentPercent (see Building processors):

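#include <fcntl.h>
#include "source.h"
#include "litcode.h"

static int Second = 0;  /* nonzero once the first %% has been seen */

/* Auxiliary scanner: decide where scanning resumes after a %%. */
char *
SkipOrNot(char *start, int length)
{ if (!Second) return start + length;  /* first %%: continue normally */
  /* Later %%: close the current input and read from an empty file. */
  finlBuf();
  initBuf("/dev/null", open("/dev/null", O_RDONLY));
  return TEXTSTART;
}

/* Token processor: reclassify only the first %% as PercentPercent. */
void
CommentOrNot(char *start, int length, int *syncode, int *intrinsic)
{ if (!Second) { Second++; *syncode = PercentPercent; }
}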

The remainder of the text is skipped by closing the current input file and opening an empty file to read (see Text Input of Library Reference Manual). Since %% is initially classified as a comment, its classification must be changed only on the first occurrence.

File `fcntl.h' defines open and O_RDONLY, `source.h' defines initBuf, finlBuf and TEXTSTART, and `litcode.h' defines PercentPercent.
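
The interplay between the scanner and the processor can be tried outside Eli with a small stand-alone sketch. Everything in it is hypothetical: the names skip_or_not, comment_or_not and count_separators, the two classification codes, and the driver loop are illustrations only, not part of Eli. Instead of swapping input files it uses the simpler variant mentioned earlier, returning a pointer to the NUL that terminates the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification codes; in Eli, litcode.h supplies the real one. */
enum { COMMENT = 0, PERCENT_PERCENT = 1 };

static int seen_first = 0;  /* plays the role of Second in the text */

/* Scanner step: where does scanning resume after the %% at start? */
const char *skip_or_not(const char *start, size_t length)
{
    if (!seen_first) return start + length;  /* first %%: continue normally */
    return start + strlen(start);            /* later %%: jump to the NUL */
}

/* Processor step: only the first %% is reclassified. */
void comment_or_not(int *syncode)
{
    if (!seen_first) { seen_first = 1; *syncode = PERCENT_PERCENT; }
}

/* Toy driver: returns how many %% tokens are delivered as PERCENT_PERCENT
   and stores the number of ordinary characters visited in *scanned. */
int count_separators(const char *text, size_t *scanned)
{
    int delivered = 0;
    const char *p = text;

    seen_first = 0;
    *scanned = 0;
    while (*p != '\0') {
        if (p[0] == '%' && p[1] == '%') {
            int syncode = COMMENT;           /* initial classification */
            const char *next = skip_or_not(p, 2);
            comment_or_not(&syncode);        /* runs after the scanner */
            if (syncode == PERCENT_PERCENT) delivered++;
            p = next;
        } else {
            (*scanned)++;
            p++;
        }
    }
    return delivered;
}
```

Because the scanner step runs before the processor step, the state variable is still zero while the first %% is being scanned, so that occurrence is scanned normally and then reclassified. Every later occurrence stays a comment and discards the rest of the text.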

Using Literal Symbols to Represent Other Things

In some cases the phrase structure of a language depends upon layout cues rather than visible character sequences. For example, indentation is used in Occam2 to indicate block structure: If the first non-blank character of a line is indented further than the first non-blank character of the line preceding it, then the new line begins a new block. If the first non-blank character of a line is not indented as far as the first non-blank character of the line preceding it, then the old line ends one or more blocks depending on the difference in indentation. If the first non-blank characters of two successive lines are indented by the same amount, then the lines simply contain adjacent statements of the same block.

Layout cues can be represented by literal symbols in the context-free grammar that describes the phrase structure. The processing needed to recognize the layout cues can then be described in any convenient manner, and the sequence of white space characters implementing those cues can be classified as the appropriate literal symbol.

Suppose that the beginning of a block is represented in the Occam2 grammar by the literal symbol '{', the statement separator by ';', and the end of a block by '}'. In the input text, blocks and statement separators are defined by layout cues as described above. A type-`delit' file marks the literal symbols as requiring special recognition and associates an identifier with each:

$\{  Initiate
$;  Separate
$\}  Terminate

Indentation can be specified as white space following a new line:

  $\n[\t\040]*  [OccamIndent]

The token processor OccamIndent would carry out all of the processing necessary to determine the meaning of the indentation. This processing is complex, involving interactions with several other components of the generated lexical analyzer (see An Example of Interface Usage). It constitutes an operational definition of the meaning of indentation in Occam2.
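
The core decision that such a processor has to make can be sketched with a stack of indentation widths. This is not Eli's OccamIndent, which must also cooperate with the rest of the generated lexical analyzer; the type and function names below (indent_state, indent_init, indent_classify, the LAYOUT_ codes and MAX_DEPTH) are assumptions made for illustration, and the sketch treats indentation as a plain column count.

```c
#define MAX_DEPTH 64

/* Hypothetical result codes standing in for Initiate, Separate, Terminate. */
enum layout { LAYOUT_INITIATE, LAYOUT_SEPARATE, LAYOUT_TERMINATE, LAYOUT_ERROR };

struct indent_state {
    int widths[MAX_DEPTH];  /* indentation of each open block */
    int depth;              /* index of the innermost open block */
};

void indent_init(struct indent_state *s)
{
    s->widths[0] = 0;       /* the outermost block starts at column 0 */
    s->depth = 0;
}

/* Classify the indentation of a new line. On LAYOUT_TERMINATE, *closed
   holds the number of blocks that the preceding line ended. */
enum layout indent_classify(struct indent_state *s, int width, int *closed)
{
    *closed = 0;
    if (width > s->widths[s->depth]) {           /* deeper: new block */
        if (s->depth + 1 >= MAX_DEPTH) return LAYOUT_ERROR;
        s->widths[++s->depth] = width;
        return LAYOUT_INITIATE;
    }
    if (width == s->widths[s->depth])            /* same: adjacent statement */
        return LAYOUT_SEPARATE;
    while (s->depth > 0 && width < s->widths[s->depth]) {
        s->depth--;                              /* shallower: close blocks */
        (*closed)++;
    }
    if (width != s->widths[s->depth])            /* matches no open block */
        return LAYOUT_ERROR;
    return LAYOUT_TERMINATE;
}
```

A line indented further opens one block, a line at the same width separates two statements, and a shallower line closes as many blocks as it takes to reach a matching width. A real token processor would then have to deliver one Terminate symbol per closed block, which is part of what makes the actual processing complex.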