------------------------------- Page    i -------------------------------

                  Yacc - Yet Another Compiler-Compiler




                                                 Stephen C. Johnson



                                                 Edited for UTS

                                                 February 16, 1981

------------------------------- Page   ii -------------------------------

                            TABLE OF CONTENTS


Abstract  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1

Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1

1.    Basic Specifications  . . . . . . . . . . . . . . . . . . . . .   4

2.    Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . .   6

3.    Lexical Analysis  . . . . . . . . . . . . . . . . . . . . . . .   9

4.    How the Parser Works  . . . . . . . . . . . . . . . . . . . . .  10

5.    Ambiguity and Conflicts . . . . . . . . . . . . . . . . . . . .  16

6.    Precedence  . . . . . . . . . . . . . . . . . . . . . . . . . .  20

7.    Error Handling  . . . . . . . . . . . . . . . . . . . . . . . .  23

8.    The Yacc Environment  . . . . . . . . . . . . . . . . . . . . .  25

9.    Hints for Preparing Specifications  . . . . . . . . . . . . . .  27

10.   Input Style . . . . . . . . . . . . . . . . . . . . . . . . . .  27

11.   Left Recursion  . . . . . . . . . . . . . . . . . . . . . . . .  27

12.   Lexical Tie-ins . . . . . . . . . . . . . . . . . . . . . . . .  29

13.   Reserved Words  . . . . . . . . . . . . . . . . . . . . . . . .  30

Appendix A.    A Simple Example . . . . . . . . . . . . . . . . . . .  31

Appendix B.    Yacc Input Syntax  . . . . . . . . . . . . . . . . . .  34

Appendix C.    An Advanced Example  . . . . . . . . . . . . . . . . .  36

Appendix D.    Old Features Supported but not Encouraged  . . . . . .  43


                                                            Last Page  43

-------------------------------- Page  1 --------------------------------

ABSTRACT

Computer program input generally has some structure; in fact, every  com-
puter program that  does input can  be thought of  as defining an  'input
language' which it accepts.   An input language  may be as  complex as  a
programming language,  or as  simple as  a sequence  of numbers.   Unfor-
tunately, usual input facilities are limited, difficult to use, and often
are lax about checking their inputs for validity.

Yacc provides a general tool for describing the input to a computer  pro-
gram.  The Yacc user specifies the structures of his input, together with
code to be invoked as each such structure is recognized.  Yacc turns such
a specification into a  subroutine that handles  the input process;  fre-
quently, it is convenient  and appropriate to  have most of  the flow  of
control in the user's application handled by this subroutine.

The input subroutine produced  by Yacc calls  a user-supplied routine  to
return the next basic input  item.  Thus, the user can specify his  input
in terms of individual input characters, or in terms of higher level con-
structs such as names  and numbers.  The  user-supplied routine may  also
handle idiomatic features such as  comment and continuation  conventions,
which typically defy easy grammatical specification.

Yacc is written in portable C.  The class of specifications accepted is a
very general one: LALR(1) grammars with disambiguating rules.

In addition to compilers for C, APL, Pascal, RATFOR, etc., Yacc has  also
been used for  less conventional  languages, including a  phototypesetter
language, several desk calculator languages, a document retrieval system,
and a Fortran debugging system.




INTRODUCTION

Yacc provides a general  tool for imposing  structure on the  input to  a
computer program.  The Yacc  user prepares a  specification of the  input
process; this includes rules describing  the input structure, code to  be
invoked when these rules  are recognized, and  a low-level routine to  do
the basic input.  Yacc  then generates  a function to  control the  input
process.  This function,  called a parser,  calls the user-supplied  low-
level input routine (the "lexical analyzer")  to pick up the basic  items
(called tokens)  from  the  input  stream.  These  tokens  are  organized
according to the input structure rules, called "grammar rules"; when  one
of these  rules has  been recognized,  then user code  supplied for  this
rule, an action, is invoked;  actions have the  ability to return  values
and make use of the values of other actions.

-------------------------------- Page  2 --------------------------------

Yacc is written in a  portable dialect of C and  the actions, and  output
subroutine, are in C  as well.  Moreover,  many of the syntactic  conven-
tions of Yacc follow C.

The heart of the input  specification is a  collection of grammar  rules.
Each rule  describes an  allowable structure  and gives it  a name.   For
example, one grammar rule might be

        date  :  month_name  day  ','  year   ;

Here, date, month_name, day, and year represent structures of interest in
the input  process; presumably,  month_name,  day, and  year are  defined
elsewhere.  The comma ',' is enclosed in single quotes; this implies that
the comma is to appear  literally in the input.  The colon and  semicolon
merely serve as punctuation in the rule, and have no significance in con-
trolling the input.  Thus, with proper definitions, the input

        July  4, 1776

might be matched by the above rule.

An important part  of the  input process is  carried out  by the  lexical
analyzer.  This  user routine  reads  the input  stream, recognizing  the
lower level structures, and communicates these tokens to the parser.  For
historical reasons,  a structure  recognized by the  lexical analyzer  is
called a "terminal symbol", while the structure recognized by the  parser
is called a "nonterminal  symbol".  To avoid confusion, terminal  symbols
will usually be referred to as tokens.

There is considerable leeway in deciding whether to recognize  structures
using the lexical analyzer or grammar rules.  For example, the rules

        month_name  :  'J' 'a' 'n'   ;
        month_name  :  'F' 'e' 'b'   ;

                 . . .

        month_name  :  'D' 'e' 'c'   ;

might be used in the above example.  The lexical analyzer would only need
to recognize individual  letters, and month_name  would be a  nonterminal
symbol.  Such low-level rules tend to waste time and space, and may  com-
plicate the specification beyond  Yacc's ability to  deal with it.   Usu-
ally, the lexical analyzer would recognize the month names, and return an
indication that a month_name was seen; in this case, month_name  would be
a token.

Literal characters such as ','  must also be  passed through the  lexical
analyzer, and are also considered tokens.

-------------------------------- Page  3 --------------------------------

Specification files are very flexible.   It is realively  easy to add  to
the above example the rule

        date  :  month '/' day '/' year   ;

allowing

        7 / 4 / 1776

as a synonym for

        July 4, 1776

In most cases, this new  rule could be 'slipped in'  to a working  system
with minimal effort, and little danger of disrupting existing input.

The input being read may not conform to the specifications.  These  input
errors are detected as  early as is  theoretically possible with a  left-
to-right scan; thus, not only is the chance of reading and computing with
bad input data  substantially reduced,  but the bad  data can usually  be
quickly found.  Error handling, provided as part of the input  specifica-
tions, permits the reentry of bad data, or the continuation  of the input
process after skipping over the bad data.

In some cases, Yacc fails to produce a parser when given a set of specif-
ications.  For example, the specifications may be self contradictory,  or
they may require a more  powerful recognition mechanism than that  avail-
able to Yacc.  The former cases represent design errors; the latter cases
can often be corrected by making  the lexical analyzer more powerful,  or
by rewriting some  of the  grammar rules.  While  Yacc cannot handle  all
possible specifications, its power compares  favorably with similar  sys-
tems; moreover, the constructions which are difficult for Yacc to  handle
are also frequently  difficult for  human beings to  handle.  Some  users
have reported that  the discipline of  formulating valid Yacc  specifica-
tions for their input revealed  errors of conception  or design early  in
the program development.

The theory underlying Yacc has  been described elsewhere.  Yacc has  been
extensively used in numerous practical applications, including lint,  the
Portable C Compiler, and a system for typesetting mathematics.

The next several sections describe the basic process of preparing a  Yacc
specification; Section 1 describes the preparation of grammar rules, Sec-
tion 2 the preparation of the user supplied actions associated with these
rules, and Section  3 the  preparation of lexical  analyzers.  Section  4
describes the operation of the parser.  Section 5 discusses various  rea-
sons why Yacc may be unable to produce a parser from a specification, and
what to do about it.  Section 6 describes a simple mechanism for handling
operator precedences  in  arithmetic expressions.   Section  7  discusses
error  detection  and  recovery.   Section  8  discusses  the   operating

-------------------------------- Page  4 --------------------------------

environment and special features of the parsers Yacc produces.  Section 9
gives some suggestions which should  improve the style and efficiency  of
the specifications, and section 10 discusses some advanced topics.   Sec-
tion 11 discussed left recursion, section 12 talks about lexical tie-ins,
and section 13 argues in favor of reserved words.

Appendix A has a  brief example, and  Appendix C gives  a summary of  the
Yacc input syntax.  Appendix  C gives an  example using some of the  more
advanced features of Yacc, and, finally, Appendix D describes  mechanisms
and syntax no longer actively supported, but provided for historical con-
tinuity with older versions of Yacc.




1.    BASIC SPECIFICATIONS

Names refer to either tokens or nonterminal symbols.  Yacc requires token
names to be declared as such.  In addition, for reasons discussed in Sec-
tion 3, it is often desirable to include the lexical analyzer as part  of
the specification file;  it may  be useful to  include other programs  as
well.  Thus, every  specification file  consists of  three sections:  the
declarations,  "(grammar)  rules",  and   programs.   The  sections   are
separated by double percent  '%%' marks.  (The  percent '%' is  generally
used in Yacc specifications as an escape character.)

In other words, a full specification file looks like

        declarations
        %%
        rules
        %%
        programs

The declaration section may be empty.  Moreover, if the programs  section
is omitted, the second  %% mark may  be omitted also; thus, the  smallest
legal Yacc specification is

        %%
        rules

Blanks, tabs, and newlines are ignored except that they may not appear in
names or multi-character reserved symbols.  Comments may appear  wherever
a name is legal; they are enclosed in /* . . . */, as in C and PL/I.

The rules section is  made up of  one or more  grammar rules.  A  grammar
rule has the form:

-------------------------------- Page  5 --------------------------------

        A  :  BODY  ;

A represents a nonterminal name, and  BODY represents a sequence of  zero
or more names and literals.   The colon and the semicolon are Yacc  punc-
tuation.

Names may be of arbitrary length, and may be made up of letters, dot '.',
underscore '_', and non-initial digits.  Upper and lower case letters are
distinct.  The names used  in the body  of a grammar  rule may  represent
tokens or nonterminal symbols.

A literal consists of a character enclosed  in single quotes '''.  As  in
C, the backslash '\' is an escape character within literals,  and all the
C escapes are recognized.  Thus

        '\n' newline
        '\r' return
        '\''        single quote '''
        '\\'        backslash '\'
        '\t' tab
        '\b' backspace
        '\f' form feed
        '\xxx'       'xxx' in octal

For a number of technical reasons, the  NUL character ('\0' or 0)  should
never be used in grammar rules.

If there are  several grammar  rules with the  same left  hand side,  the
vertical bar '|' can be  used to avoid rewriting the left hand side.   In
addition, the semicolon  at the end  of a  rule can be  dropped before  a
vertical bar.  Thus the grammar rules

        A       :       B  C  D   ;
        A       :       E  F   ;
        A       :       G   ;

can be given to Yacc as

        A       :       B  C  D
                |       E  F
                |       G
                ;

It is not necessary that all grammar rules with the same left side appear
together in the grammar rules  section, although it makes the input  much
more readable, and easier to change.

If a nonterminal symbol matches the  empty string, this can be  indicated
in the obvious way:

-------------------------------- Page  6 --------------------------------

        empty :   ;

Names representing tokens must be declared;  this is most simply done  by
writing

        %token   name1  name2 . . .

in the declarations section.  (See  Sections 3 , 5, and  6 for much  more
discussion).  Every  name  not defined  in  the declarations  section  is
assumed to represent a nonterminal symbol.  Every nonterminal symbol must
appear on the left side of at least one rule.

Of all the nonterminal symbols, one, called the "start symbol", has  par-
ticular importance.  The parser is  designed to recognize the start  sym-
bol; thus, this  symbol represents  the largest,  most general  structure
described by the grammar rules.  By default, the start symbol is taken to
be the left hand side of the first grammar rule in the rules section.  It
is possible, and in  fact desirable, to  declare the start symbol  expli-
citly in the declarations section using the %start keyword:

        %start   symbol

The end of the input to the parser is signaled by a special token, called
the endmarker.  If  the tokens  up to, but  not including, the  endmarker
form a  structure which  matches the  start symbol,  the parser  function
returns to its caller after the endmarker is seen; it  accepts the input.
If the endmarker is seen in any other context, it is an error.

It is the job of  the user-supplied lexical analyzer  to return the  end-
marker when  appropriate; see  section 3, below.   Usually the  endmarker
represents some reasonably obvious I/O  status, such as 'end-of-file'  or
'end-of-record'.




2.    ACTIONS

With each grammar rule, the  user may associate  actions to be  performed
each time the rule is recognized in the input process.  These actions may
return values, and may  obtain the values  returned by previous  actions.
Moreover, the lexical analyzer can return values for tokens, if desired.

An action is an arbitrary C statement, and as such can do input and  out-
put, call  subprograms, and  alter  external vectors  and variables.   An
action is specified by one or  more statements, enclosed in curly  braces
'{' and '}'.  For example,

-------------------------------- Page  7 --------------------------------

        A       :       '('  B  ')'
                                {       hello( 1, "abc" );  }

and

        XXX     :       YYY  ZZZ
                                {      printf("a message\n");
                                       flag = 25;   }

are grammar rules with actions.

To facilitate easy communication between the actions and the parser,  the
action statements are altered slightly.  The symbol 'dollar sign' '$'  is
used as a signal to Yacc in this context.

To return a value, the action  normally sets the pseudo-variable '$$'  to
some value.   For example,  an action  that does nothing  but return  the
value 1 is

                {  $$ = 1;  }

To obtain  the  values  returned  by previous  actions  and  the  lexical
analyzer, the action may  use the pseudo-variables  $1, $2, . . .,  which
refer to the values  returned by the  components of the  right side of  a
rule, reading from left to right.  Thus, if the rule is

        A       :       B  C  D   ;

for example,  then $2  has the  value returned  by C,  and $3  the  value
returned by D.

As a more concrete example, consider the rule

        expr    :       '('  expr  ')'   ;

The value returned  by this  rule is  usually the  value of  the expr  in
parentheses.  This can be indicated by

        expr    :        '('  expr  ')'             { $$ = $2 ; }

By default, the value of a rule is the value  of the first element in  it
($1).  Thus, grammar rules of the form

        A       :       B    ;

frequently need not have an explicit action.

In the examples above, all  the actions came at the  end of their  rules.
Sometimes, it is desirable to get control before a rule  is fully parsed.
Yacc permits an action to be written in the middle  of a rule as well  as

-------------------------------- Page  8 --------------------------------

at the end.  This rule  is assumed to return a value, accessible  through
the usual mechanism by the actions to the right  of it.  In turn, it  may
access the values returned by the symbols to its left.  Thus, in the rule

        A       :       B
                                {  $$ = 1;  }
                        C
                                {   x = $2;   y = $3;  }
                ;

the effect is to set x to 1, and y to the value returned by C.

Actions that do  not terminate  a rule are  actually handled  by Yacc  by
manufacturing a new nonterminal symbol name, and a new rule matching this
name to the empty string.   The interior action  is the action  triggered
off by recognizing this added rule.  Yacc actually treats the above exam-
ple as if it had been written:

        $ACT    :       /* empty */
                                {  $$ = 1;  }
                ;

        A       :       B  $ACT  C
                                {   x = $2;   y = $3;  }
                ;

In many applications, output is not done directly by the actions; rather,
a data structure,  such as a  parse tree, is  constructed in memory,  and
transformations are  applied to  it before  output is  generated.   Parse
trees are particularly  easy to  construct, given routines  to build  and
maintain the tree structure desired.  For  example, suppose there is a  C
function node, written so that the call

        node( L, n1, n2 )

creates a node with label L, and descendants  n1 and n2, and returns  the
index of the newly created node.  Then parse tree can be built by supply-
ing actions such as:

        expr    :       expr  '+'  expr
                                { $$ = node( '+', $1, $3 ); }

in the specification.

The user may define other variables to be used by the actions.   Declara-
tions and definitions can appear in the declarations section, enclosed in
the marks '%{' and '%}'.  These declarations and definitions have  global
scope, so  they  are  known to  the  action statements  and  the  lexical
analyzer.  For example,

-------------------------------- Page  9 --------------------------------

        %{   int variable = 0;   %}

could be placed in the  declarations section, making variable  accessible
to all  of the actions.   The Yacc  parser uses only  names beginning  in
'yy'; the user should avoid such names.

In these examples, all the values are integers: a discussion of values of
other types will be found in Section 10.




3.    LEXICAL ANALYSIS

The user must supply a lexical analyzer to read the input stream and com-
municate tokens (with  values, if  desired) to the  parser.  The  lexical
analyzer is  an  integer-valued  function  called  yylex.   The  function
returns an integer, the  "token number", representing  the kind of  token
read.  If  there is  a value  associated with  that token,  it should  be
assigned to the external variable yylval.

The parser and the lexical analyzer must agree on these token numbers  in
order for communication between them  to take place.  The numbers may  be
chosen by Yacc, or chosen  by the user.  In either  case, the '#  define'
mechanism of  C is used  to allow  the lexical analyzer  to return  these
numbers symbolically.  For example, suppose that the token name DIGIT has
been defined in the declarations section of the Yacc specification  file.
The relevant portion of the lexical analyzer might look like:

        yylex(){
                extern int yylval;
                int c;
                . . .
                c = getchar();
                . . .
                switch( c ) {
                        . . .
                case '0':
                case '1':
                  . . .
                case '9':
                        yylval = c-'0';
                        return( DIGIT );
                        . . .
                        }
                . . .

-------------------------------- Page 10 --------------------------------

The intent is to return a token number of DIGIT, and a value equal to the
numerical value of the digit.  Provided that the lexical analyzer code is
placed in the programs section of the specification file, the  identifier
DIGIT will  be defined  as  the token  number associated  with the  token
DIGIT.

This mechanism leads  to clear,  easily modified  lexical analyzers;  the
only pitfall is the  need to avoid using  any token names in the  grammar
that are reserved or significant in C or the parser; for example, the use
of token names if or  while will almost certainly cause severe  difficul-
ties when the  lexical analyzer  is compiled.   The token  name error  is
reserved for error handling, and should not be used naively  (see Section
7).

As mentioned above, the  token numbers may  be chosen by  Yacc or by  the
user.  In the  default situation,  the numbers are  chosen by Yacc.   The
default token number for a  literal character is  the numerical value  of
the character in the local character set.  Other names are assigned token
numbers starting at 257.

To assign  a token  number to  a token  (including literals),  the  first
appearance of the token name  or literal in the declarations section  can
be immediately followed by a nonnegative integer.  This integer is  taken
to be the token  number of the name  or literal.  Names and literals  not
defined by this mechanism retain their default definition.  It is  impor-
tant that all token numbers be distinct.

For historical reasons, the endmarker must  have token number 0 or  nega-
tive.  This token number cannot be redefined by the user; thus, all lexi-
cal analyzers  should be  prepared to  return 0  or negative  as a  token
number upon reaching the end of their input.

A very useful tool for constructing lexical analyzers is the Lex program.
These lexical analyzers are designed  to work in close harmony with  Yacc
parsers.  The  specifications for  these  lexical analyzers  use  regular
expressions instead of grammar rules.  Lex can be easily used  to produce
quite complicated  lexical analyzers,  but  there remain  some  languages
(such as FORTRAN) which do  not fit any theoretical framework, and  whose
lexical analyzers must be crafted by hand.




4.    HOW THE PARSER WORKS

Yacc turns the  specification file  into a  C program,  which parses  the
input according to  the specification  given.  The algorithm  used to  go
from the  specification  to  the  parser is  complex,  and  will  not  be

-------------------------------- Page 11 --------------------------------

discussed here  (see the  references for more  information).  The  parser
itself, however, is relatively  simple, and understanding  how it  works,
while not strictly necessary,  will nevertheless make treatment of  error
recovery and ambiguities much more comprehensible.

The parser produced by  Yacc consists of  a finite state  machine with  a
stack.  The parser is  also capable of  reading and remembering the  next
input token (called the lookahead token).  The "current state" is  always
the one on the top of the stack.  The states  of the finite state machine
are given small integer labels; initially, the machine is in state 0, the
stack contains only state 0, and no lookahead token has been read.

The machine has only four actions available to it, called shift,  reduce,
accept, and error.  A move of the parser is done as follows:

1.   Based on its current state,  the parser decides  whether it needs  a
     lookahead token to decide  what action should  be done; if it  needs
     one, and does not have one, it calls yylex to obtain the next token.

2.   Using the  current state,  and the  lookahead token  if needed,  the
     parser decides on its next action, and carries it out.

This may result in states being pushed onto  the stack, or popped off  of
the stack, and in the lookahead token being processed or left alone.

The shift action is the most common action the parser takes.  Whenever  a
shift action is taken, there  is always a lookahead token.  For  example,
in state 56 there may be an action:

                IF      shift 34

which says, in state 56, if the lookahead token is IF, the current  state
(56) is pushed down on the stack, and state 34  becomes the current state
(on the top of the stack).  The lookahead token is cleared.

The reduce action keeps  the stack from  growing without bounds.   Reduce
actions are appropriate when the parser has seen the right hand side of a
grammar rule, and is prepared to announce that it has seen an instance of
the rule, replacing the right hand side by the left hand side.  It may be
necessary to consult the lookahead token to decide whether to reduce, but
usually it is not; in fact, the default action (represented  by a '.') is
often a reduce action.

Reduce actions  are associated  with individual  grammar rules.   Grammar
rules are also given  small integer numbers,  leading to some  confusion.
The action

                . reduce 18

refers to "grammar rule" 18, while the action

-------------------------------- Page 12 --------------------------------

                IF      shift 34

refers to state 34.

Suppose the rule being reduced is

        A       : x  y  z    ;

The reduce action depends on the left hand  symbol (A in this case),  and
the number of symbols  on the right hand  side (three in this case).   To
reduce, first pop off the  top three states from  the stack (In  general,
the number of  states popped equals  the number of  symbols on the  right
side of the  rule).  In effect,  these states  were the ones  put on  the
stack while recognizing x, y, and z, and no longer  serve any useful pur-
pose.  After popping these  states, a  state is uncovered  which was  the
state the parser was in before beginning to process the rule.  Using this
uncovered state, and the  symbol on the  left side of  the rule,  perform
what is in effect a shift of A.  A new state is obtained, pushed onto the
stack, and parsing continues.  There are significant differences  between
the processing of the left hand symbol and an ordinary  shift of a token,
however, so this action is called a goto action.  In particular, the loo-
kahead token is cleared  by a shift, and is  not affected by a goto.   In
any case, the uncovered state contains an entry such as:

                A       goto 20

causing state 20  to be  pushed onto the  stack, and  become the  current
state.

In effect, the reduce action 'turns back the clock' in the parse, popping
the states off the  stack to go  back to the  state where the right  hand
side of the rule was  first seen.  The parser then  behaves as if it  had
seen the left side at  that time.  If the right hand side of the rule  is
empty, no states are popped off of the  stack: the uncovered state is  in
fact the current state.

The reduce action  is also  important in the  treatment of  user-supplied
actions and values.  When a  rule is reduced, the code supplied with  the
rule is executed before the stack is adjusted.  In addition to the  stack
holding the states, another stack, running in parallel with it, holds the
values returned from the lexical analyzer and the actions.  When a  shift
takes place, the external variable yylval is copied onto the value stack.
After the return from the user code, the reduction is carried out.   When
the goto action is done,  the external variable yyval is copied onto  the
value stack.   The pseudo-variables  $1,  $2, etc.,  refer to  the  value
stack.

The other two parser actions  are conceptually much simpler.  The  accept
action indicates that the entire input has been seen and  that it matches
the specification.  This action appears only when the lookahead token  is

-------------------------------- Page 13 --------------------------------

the endmarker, and indicates  that the parser  has successfully done  its
job.  The error action, on the other  hand, represents a place where  the
parser can  no longer  continue parsing according  to the  specification.
The input tokens it has seen,  together with the lookahead token,  cannot
be followed by anything that  would result in a legal input.  The  parser
reports an error, and attempts to recover the situation and resume  pars-
ing: the error recovery  (as opposed to  the detection of error) will  be
covered in Section 7.

It is time for an example!  Consider the specification

        %token  DING  DONG  DELL
        %%
        rhyme   :       sound  place
                ;
        sound   :       DING  DONG
                ;
        place   :       DELL
                ;

When Yacc is invoked with the -v option,  a file called y.output is  pro-
duced, with  a human-readable  description of the  parser.  The  y.output
file corresponding to the  above grammar (with  some statistics  stripped
off the end) is:

-------------------------------- Page 14 --------------------------------

        state 0
                $accept  :  _rhyme  $end

                DING  shift 3
                .  error

                rhyme  goto 1
                sound  goto 2

        state 1
                $accept  :   rhyme_$end

                $end  accept
                .  error

        state 2
                rhyme  :   sound_place

                DELL  shift 5
                .  error

                place   goto 4

        state 3
                sound   :   DING_DONG

                DONG  shift 6
                .  error

        state 4
                rhyme  :   sound  place_    (1)

                .   reduce  1

        state 5
                place  :   DELL_    (3)

                .   reduce  3

        state 6
                sound   :   DING  DONG_    (2)

                .   reduce  2

Notice that,  in addition  to the  actions  for each  state, there  is  a
description of the parsing  rules being processed  in each state.  The  _
character is used  to indicate what  has been  seen, and what  is yet  to
come, in each rule.  Suppose the input is

        DING  DONG  DELL

-------------------------------- Page 15 --------------------------------

It is instructive to follow the steps of the parser while processing this
input.

Initially, the current state is  state 0.  The parser  needs to refer  to
the input in order to decide between the actions available in state 0, so
the first token, DING, is read, becoming the lookahead token.  The action
in state 0 on DING is is 'shift 3', so  state 3 is pushed onto the stack,
and the lookahead token is cleared.   State 3 becomes the current  state.
The next token, DONG, is read, becoming the lookahead token.   The action
in state 3 on the token DONG is 'shift 6', so state 6 is pushed onto  the
stack, and the lookahead is cleared.  The stack now contains 0, 3, and 6.
In state 6, without even consulting the lookahead, the parser reduces  by
rule 2.

                sound  :   DING  DONG

This rule has two symbols on the right hand side, so two states, 6 and 3,
are popped off of the stack, uncovering state 0.  Consulting the descrip-
tion of state 0, looking for a goto on sound,

                sound   goto 2

is obtained; thus state 2 is pushed onto the stack, becoming the  current
state.

In state 2, the next token, DELL, must be read.  The action is 'shift 5',
so state 5 is pushed onto the stack, which now has 0, 2, and 5 on it, and
the lookahead token is cleared.  In state 5, the only action is to reduce
by rule 3.  This has one symbol on the right  hand side, so one state, 5,
is popped off, and state 2 is uncovered.  The  goto in state 2 on  place,
the left side of rule  3, is state 4.  Now, the stack contains 0, 2,  and
4.  In state 4, the  only action is to reduce  by rule 1.  There are  two
symbols on the right,  so the top  two states are popped off,  uncovering
state 0 again.  In state 0, there is a  goto on rhyme causing the  parser
to enter  state 1.   In  state 1,  the input  is read;  the endmarker  is
obtained, indicated by '$end' in the y.output file.  The action in  state
1 when the endmarker is seen is to accept, successfully ending the parse.

The reader is urged to consider how the parser works when confronted with
such incorrect strings as "DING DONG DONG", "DING DONG", "DING  DONG DELL
DELL", etc.  A few minutes spend with this and other simple examples will
probably be repaid when problems arise in more complicated contexts.

-------------------------------- Page 16 --------------------------------

5.    AMBIGUITY AND CONFLICTS

A set of grammar rules  is ambiguous if there is  some input string  that
can be structured in two or more different ways.  For  example, the gram-
mar rule

        expr    :       expr  '-'  expr

is a natural way of expressing the fact that one way of forming an arith-
metic expression is to  put two other  expressions together with a  minus
sign between them.  Unfortunately, this grammar rule does not  completely
specify the way that all complex inputs should be structured.   For exam-
ple, if the input is

        expr  -  expr  -  expr

the rule allows this input to be structured as either

        (  expr  -  expr  )  -  expr

or as

        expr  -  (  expr  -  expr  )

(The first is called "left association", the second "right association").

Yacc detects such ambiguities when it is attempting to build the  parser.
It is instructive to consider the problem that confronts the  parser when
it is given an input such as

        expr  -  expr  -  expr

When the parser has read the second expr, the input that it has seen:

        expr  -  expr

matches the  right side  of the  grammar rule  above.  The  parser  could
reduce the  input by  applying this  rule; after applying  the rule;  the
input is reduced to expr(the  left side of the  rule).  The parser  would
then read the final part of the input:

        -  expr

and again reduce.  The  effect of this  is to take  the left  associative
interpretation.

Alternatively, when the parser has seen

        expr  -  expr

-------------------------------- Page 17 --------------------------------

it could defer the immediate application of the rule, and continue  read-
ing the input until it had seen

        expr  -  expr  -  expr

It could then apply  the rule  to the rightmost  three symbols,  reducing
them to expr and leaving

        expr  -  expr

Now the rule can be  reduced once more; the effect  is to take the  right
associative interpretation.  Thus, having read

        expr  -  expr

the parser can do two  legal things, a shift or  a reduction, and has  no
way of  deciding between them.   This is  called a "shift  / reduce  con-
flict".  It may also  happen that the  parser has a  choice of two  legal
reductions; this is called a "reduce / reduce conflict".  Note that there
are never any 'Shift/shift' conflicts.

When there are shift/reduce or  reduce/reduce conflicts, Yacc still  pro-
duces a parser.  It does  this by selecting one of the valid steps  wher-
ever it has a choice.  A rule describing which choice to make in a  given
situation is called a "disambiguating rule".

Yacc invokes two disambiguating rules by default:

1.   In a shift/reduce conflict, the default is to do the shift.

2.   In a reduce/reduce conflict, the default is to reduce by the earlier
     grammar rule (in the input sequence).

Rule 1 implies that reductions are  deferred whenever there is a  choice,
in favor of shifts.  Rule 2 gives the user rather  crude control over the
behavior of the  parser in  this situation,  but reduce/reduce  conflicts
should be avoided whenever possible.

Conflicts may arise because of mistakes in input or logic, or because the
grammar rules, while consistent, require a more complex parser than  Yacc
can construct.  The use of actions within rules can also cause conflicts,
if the action must  be done before the parser  can be sure which rule  is
being recognized.   In these  cases,  the application  of  disambiguating
rules is inappropriate, and leads to an incorrect parser.  For  this rea-
son, Yacc always  reports the  number of  shift/reduce and  reduce/reduce
conflicts resolved by Rule 1 and Rule 2.

In general, whenever it is possible to apply disambiguating rules to pro-
duce a correct parser, it  is also possible to rewrite the grammar  rules
so that the same inputs  are read but there are  no conflicts.  For  this

-------------------------------- Page 18 --------------------------------

reason, most previous parser  generators have considered conflicts to  be
fatal errors.  Our experience has suggested that this rewriting is  some-
what unnatural,  and produces  slower  parsers; thus,  Yacc will  produce
parsers even in the presence of conflicts.

As an example of the power  of disambiguating rules, consider a  fragment
from a programming language involving an 'if-then-else' construction:

        stat    :       IF  '('  cond  ')'  stat
                |       IF  '('  cond  ')'  stat  ELSE  stat
                ;

In these rules,  IF and  ELSE are tokens,  cond is  a nonterminal  symbol
describing conditional (logical) expressions,  and stat is a  nonterminal
symbol describing statements.  The first rule will be called the  simple-
if rule, and the second the if-else rule.

These two rules form an ambiguous construction, since input of the form

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be structured according to these rules in two ways:

        IF  (  C1  )  {
                IF  (  C2  )  S1
                }
        ELSE  S2

or

        IF  (  C1  )  {
                IF  (  C2  )  S1
                ELSE  S2
                }

The second interpretation is the one given in most programming  languages
having this construct.  Each ELSE  is associated with the last  preceding
'un-ELSE'd' IF.  In this example, consider the situation where the parser
has seen

        IF  (  C1  )  IF  (  C2  )  S1

and is looking at the ELSE.  It  can immediately reduce by the  simple-if
rule to get

        IF  (  C1  )  stat

and then read the remaining input,

        ELSE  S2

-------------------------------- Page 19 --------------------------------

and reduce

        IF  (  C1  )  stat  ELSE  S2

by the if-else rule.  This leads to the  first of the above groupings  of
the input.

On the other hand, the ELSE may be shifted,  S2 read, and then the  right
hand portion of

        IF  (  C1  )  IF  (  C2  )  S1  ELSE  S2

can be reduced by the if-else rule to get

        IF  (  C1  )  stat

which can be reduced by the simple-if rule.  This leads to the second  of
the above groupings of the input, which is usually desired.

Once again the parser can do two valid  things - there is a  shift/reduce
conflict.  The application of disambiguating  rule 1 tells the parser  to
shift in this case, which leads to the desired grouping.

This shift/reduce conflict arises only when there is a particular current
input symbol, ELSE, and particular inputs already seen, such as

        IF  (  C1  )  IF  (  C2  )  S1

In general, there may be many conflicts, and each one will be  associated
with an input symbol and a set of previously read inputs.  The previously
read inputs are characterized by the state of the parser.

The conflict messages of Yacc are  best understood by examining the  ver-
bose (-v) option output file.   For example, the output corresponding  to
the above conflict state might be:

        23: shift/reduce conflict (shift 45, reduce 18) on ELSE

        state 23

                  stat  :  IF  (  cond  )  stat_         (18)
                  stat  :  IF  (  cond  )  stat_ELSE  stat

                 ELSE     shift 45
                 .        reduce 18


The first line describes  the conflict,  giving the state  and the  input
symbol.  The ordinary state description follows, giving the grammar rules
active in the state, and the  parser actions.  Recall that the  underline

-------------------------------- Page 20 --------------------------------

marks the portion of the grammar rules which has been  seen.  Thus in the
example, in state 23 the parser has seen input corresponding to

        IF  (  cond  )  stat

and the two grammar rules shown are active at this time.  The parser  can
do two possible things.  If  the input symbol is ELSE, it is possible  to
shift into state 45.  State 45 will have, as part of its description, the
line

        stat  :  IF  (  cond  )  stat  ELSE_stat

since the ELSE will have been shifted in  this state.  Back in state  23,
the alternative action, described by '.', is to be done if the input sym-
bol is not mentioned explicitly in the above actions; thus, in this case,
if the input symbol is not ELSE, the parser reduces by grammar rule 18:

        stat  :  IF  '('  cond  ')'  stat

Once again, notice that the  numbers following 'shift' commands refer  to
other states,  while the  numbers  following 'reduce'  commands refer  to
grammar rule numbers.  In the y.output file, the rule numbers are printed
after those rules which can  be reduced.  In most one states, there  will
be at most  reduce action possible  in the  state, and this  will be  the
default command.  The  user who  encounters unexpected shift/reduce  con-
flicts will probably want to look at the verbose output to decide whether
the default actions  are appropriate.   In really tough  cases, the  user
might need to know more about the behavior and construction of the parser
than can be covered  here.  In this  case, one of the theoretical  refer-
ences might be  consulted; the  services of a  local guru  might also  be
appropriate.




6.    PRECEDENCE

There is one common situation where  the rules given above for  resolving
conflicts are  not  sufficient; this  is  in the  parsing  of  arithmetic
expressions.  Most  of the  commonly  used constructions  for  arithmetic
expressions can be naturally described by the notion of precedence levels
for operators, together  with information  about left  or right  associa-
tivity.  It turns out  that ambiguous grammars with appropriate  disambi-
guating rules can be used to create parsers that are faster and easier to
write than  parsers  constructed from  unambiguous grammars.   The  basic
notion is to write grammar rules of the form

        expr  :  expr  OP  expr

-------------------------------- Page 21 --------------------------------

and

        expr  :  UNARY  expr

for all binary and unary operators desired.  This creates a very  ambigu-
ous grammar, with many  parsing conflicts.  As disambiguating rules,  the
user specifies the precedence, or binding strength, of all the operators,
and the associativity of the binary operators.  This information is  suf-
ficient to allow Yacc to resolve the parsing conflicts in accordance with
these rules, and construct a parser that realizes the desired precedences
and associativities.

The precedences  and  associativities  are  attached  to  tokens  in  the
declarations section.  This is done by a series of lines beginning with a
Yacc keyword: %left, %right, or %nonassoc, followed by a list of  tokens.
All of the  tokens on  the same line  are assumed to  have the same  pre-
cedence level  and  associativity;  the  lines are  listed  in  order  of
increasing precedence or binding strength.  Thus,

        %left  '+'  '-'
        %left  '*'  '/'

describes the precedence and associativity of the four arithmetic  opera-
tors.  Plus and  minus are  left associative, and  have lower  precedence
than star and slash, which are also left associative.  The keyword %right
is used to describe right associative operators, and the keyword  %nonas-
soc is used to  describe operators,  like the operator  .LT. in  Fortran,
that may not associate with themselves; thus,

        A  .LT.  B  .LT.  C

is illegal in Fortran, and such an  operator would be described with  the
keyword %nonassoc  in  Yacc.  As  an  example of  the behavior  of  these
declarations, the description

        %right  '='
        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '='  expr
                |       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       NAME
                ;

might be used to structure the input

-------------------------------- Page 22 --------------------------------

        a  =  b  =  c*d  -  e  -  f*g

as follows:

        a = ( b = ( ((c*d)-e) - (f*g) ) )

When this mechanism is used, unary operators must, in general, be given a
precedence.  Sometimes a unary  operator and a  binary operator have  the
same symbolic representation, but different  precedences.  An example  is
unary and binary '-'; unary minus may be given the  same strength as mul-
tiplication, or even higher, while binary minus has a lower strength than
multiplication.  The keyword, %prec, changes the precedence level associ-
ated with a particular grammar rule.  %prec appears immediately after the
body of the grammar rule, before the action or closing  semicolon, and is
followed by a token  name or literal.   It causes the  precedence of  the
grammar rule to become that of the following token name  or literal.  For
example, to make unary minus  have the same precedence as  multiplication
the rules might resemble:

        %left  '+'  '-'
        %left  '*'  '/'

        %%

        expr    :       expr  '+'  expr
                |       expr  '-'  expr
                |       expr  '*'  expr
                |       expr  '/'  expr
                |       '-'  expr      %prec  '*'
                |       NAME
                ;

A token declared by %left, %right, and %nonassoc need not be, but may be,
declared by %token as well.

The precedences and associativities are  used by Yacc to resolve  parsing
conflicts; they give rise  to disambiguating rules.  Formally, the  rules
work as follows:

1.   The precedences and  associativities are recorded  for those  tokens
     and literals that have them.

2.   A precedence and associativity is associated with each grammar rule;
     it is the precedence and associativity of the last token  or literal
     in the body  of the  rule.  If the  %prec construction  is used,  it
     overrides this default.  Some  grammar rules may have no  precedence
     and associativity associated with them.

3.   When there is a reduce/reduce  conflict, or there is a  shift/reduce
     conflict and  either the  input symbol  or the grammar  rule has  no

-------------------------------- Page 23 --------------------------------

     precedence and  associativity,  then the  two  disambiguating  rules
     given at the beginning  of the section  are used, and the  conflicts
     are reported.

4.   If there is a shift/reduce conflict,  and both the grammar rule  and
     the input  character have  precedence  and associativity  associated
     with them, then  the conflict  is resolved  in favor  of the  action
     (shift or  reduce) associated  with the higher  precedence.  If  the
     precedences are the same, then the associativity is used; left asso-
     ciative implies reduce, right associative implies shift, and  nonas-
     sociating implies error.

Conflicts resolved  by  precedence  are  not counted  in  the  number  of
shift/reduce and reduce/reduce  conflicts reported  by Yacc.  This  means
that mistakes in the specification of precedences may disguise errors  in
the input grammar; it is a good idea to be  sparing with precedences, and
use them in an essentially 'cookbook' fashion, until some experience  has
been gained.  The y.output  file is very  useful in deciding whether  the
parser is actually doing what was intended.




7.    ERROR HANDLING

Error handling is an extremely difficult  area, and many of the  problems
are semantic ones.  When an error is found, for example, it may be neces-
sary to reclaim parse tree storage, delete or alter symbol table entries,
and, typically, set switches to avoid generating any further output.

It is seldom acceptable to stop all processing when an error is found; it
is more  useful to  continue scanning  the input to  find further  syntax
errors.  This  leads to  the problem  of getting  the parser  'restarted'
after an error.  A general  class of algorithms to do this involves  dis-
carding a  number of  tokens from  the input  string, and  attempting  to
adjust the parser so that input can continue.

To allow the user some control over this process, Yacc provides a simple,
but reasonably general, feature.  The token name 'error' is reserved  for
error handling.  This name can  be used in grammar  rules; in effect,  it
suggests places where errors are expected, and recovery might take place.
The parser pops its stack until it enters a state where the token 'error'
is legal.  It then behaves as if the token 'error'  were the current loo-
kahead token, and performs the  action encountered.  The lookahead  token
is then reset to  the token that caused  the error.  If no special  error
rules have  been  specified,  the  processing  halts  when  an  error  is
detected.

-------------------------------- Page 24 --------------------------------

In order  to prevent  a  cascade of  error messages,  the  parser,  after
detecting an error, remains in  error state until three tokens have  been
successfully read and shifted.  If an  error is detected when the  parser
is already in error  state, no message is  given, and the input token  is
quietly deleted.

As an example, a rule of the form

        stat    :       error

would, in effect, mean that on a syntax error the parser would attempt to
skip over the statement in which the error was seen.  More precisely, the
parser will scan ahead, looking for three tokens that might legally  fol-
low a  statement, and  start processing  at the  first of  these; if  the
beginnings of statements are not sufficiently distinctive, it may make  a
false start in the middle  of a statement, and end up reporting a  second
error where there is in fact no error.

Actions may be used with these special error rules.  These actions  might
attempt to reinitialize tables, reclaim symbol table space, etc.

Error rules such as the above are very general, but difficult to control.
Somewhat easier are rules such as

        stat    :       error  ';'

Here, when there is an error, the parser attempts to skip over the state-
ment, but will do so  by skipping to the next ';'.  All tokens after  the
error and before the next ';' cannot be shifted, and are discarded.  When
the ';'  is seen, this  rule will  be reduced, and  any 'cleanup'  action
associated with it performed.

Another form of error rule  arises in interactive applications, where  it
may be desirable to permit a line to be reentered after an error.  A pos-
sible error rule might be

        input   :     error '\n' { printf("Reenter line: "); } input
                                {       $$  =  $4;  }

There is one  potential difficulty  with this approach;  the parser  must
correctly process  three  input  tokens  before it  admits  that  it  has
correctly resynchronized after the error.  If the reentered line contains
an error  in  the first  two  tokens, the  parser deletes  the  offending
tokens, and gives  no message;  this is clearly  unacceptable.  For  this
reason, there is  a mechanism  that can be  used to  force the parser  to
believe that an error has been fully recovered from.  The statement

        yyerrok ;

in an action resets the parser to its  normal mode.  The last example  is

-------------------------------- Page 25 --------------------------------

better written

        input   :       error  '\n'
                                {       yyerrok;
                                        printf("Reenter line: ");   }
                        input
                                {       $$  =  $4;  }
                ;

As mentioned above, the token  seen immediately after the 'error'  symbol
is the input token at which the error was discovered.  Sometimes, this is
inappropriate; for  example, an  error recovery  action might  take  upon
itself the job  of finding the  correct place to  resume input.  In  this
case, the previous lookahead token must be cleared.  The statement

        yyclearin ;

in an action  will have  this effect.   For example,  suppose the  action
after error were  to call  some sophisticated resynchronization  routine,
supplied by the user, that attempted to  advance the input to the  begin-
ning of the  next valid  statement.  After this  routine was called,  the
next token returned by  yylex would presumably  be the first  token in  a
legal statement; the old, illegal token must be discarded, and  the error
state reset.  This could be done by a rule like

        stat    :       error
                                {       resynch();
                                        yyerrok ;
                                        yyclearin ;   }
                ;

These mechanisms are admittedly crude, but do allow for a simple,  fairly
effective recovery of the parser from many errors; moreover, the user can
get control to deal with the error actions required by other portions  of
the program.




8.    THE YACC ENVIRONMENT

When the user inputs a specification to Yacc, the  output is a file of  C
programs, called y.tab.c on most  systems (due to local file system  con-
ventions, the names may differ  from installation to installation).   The
function produced  by Yacc  is called  yyparse; it is  an integer  valued
function.  When it is called, it in turn repeatedly calls yylex, the lex-
ical analyzer  supplied  by the  user  (see Section  3) to  obtain  input
tokens.  Eventually, either an error  is detected, in  which case (if  no

-------------------------------- Page 26 --------------------------------

error recovery is possible) yyparse  returns the value 1, or the  lexical
analyzer returns the  endmarker token  and the parser  accepts.  In  this
case, yyparse returns the value 0.

The user must provide a certain amount of environment for this parser  in
order to obtain a working program.  For example, as with every C program,
a program called main must be defined, that eventually calls yyparse.  In
addition, a routine called yyerror  prints a message when a syntax  error
is detected.

These two routines must be supplied in one  form or another by the  user.
To ease the  initial effort of  using Yacc, a  library has been  provided
with default versions of main and yyerror.   The name of this library  is
system dependent; on many systems the library is accessed by  a -ly argu-
ment to the loader.  To  show the triviality  of these default  programs,
the source is given below:

        main(){
                return( yyparse() );
                }

and

        # include <stdio.h>

        yyerror(s) char *s; {
                fprintf( stderr, "%s\n", s );
                }

The argument to yyerror is a string containing an error message,  usually
the string  'syntax error'.   The  average application  will want  to  do
better than this.  Ordinarily, the program should keep track of the input
line number, and print it  along with the message when a syntax error  is
detected.  The external  integer variable yychar  contains the  lookahead
token number at  the time  the error was  detected; this  may be of  some
interest in giving better diagnostics.   Since the main program is  prob-
ably supplied by the user  (to read arguments, etc.) the Yacc library  is
useful only in small projects, or in the earliest stages of larger ones.

The external integer variable yydebug is normally set to 0.  If it is set
to a nonzero value, the  parser will output a verbose description of  its
actions, including a discussion  of which input  symbols have been  read,
and what the parser actions are.  Depending on the operating environment,
it may be possible to set this variable by using a debugging system.

-------------------------------- Page 27 --------------------------------

9.    HINTS FOR PREPARING SPECIFICATIONS

This section contains miscellaneous hints on preparing efficient, easy to
change, and clear specifications.  The individual subsections are more or
less independent.




10.   INPUT STYLE

It is difficult to provide rules with substantial actions and still  have
a readable specification  file.  The  following style hints  owe much  to
Brian Kernighan.

a.   Use all capital letters for token names, all lower case letters  for
     nonterminal names.  This  rule comes under  the heading of  'knowing
     who to blame when things go wrong.'

b.   Put grammar rules and actions on separate lines.  This allows either
     to be changed without an automatic need to change the other.

c.   Put all rules with the same left  hand side together.  Put the  left
     hand side in  only once, and  let all following  rules begin with  a
     vertical bar.

d.   Put a semicolon  only after  the last rule  with a  given left  hand
     side, and put  the semicolon  on a separate  line.  This allows  new
     rules to be easily added.

e.   Indent rule bodies by two tab stops, and action bodies by three  tab
     stops.

The example in Appendix  A is written  following this style,  as are  the
examples in the text of this paper (where space permits).   The user must
make up his own mind about  these stylistic questions; the central  prob-
lem, however, is to make  the rules visible through the morass of  action
code.




11.   LEFT RECURSION

The algorithm used by the Yacc  parser encourages so called 'left  recur-
sive' grammar rules: rules of the form

-------------------------------- Page 28 --------------------------------

        name    :       name  rest_of_rule  ;

These rules frequently arise when writing specifications of sequences and
lists:

        list    :       item
                |       list  ','  item
                ;

and

        seq     :       item
                |       seq  item
                ;

In each of these cases, the first rule will be reduced for the first item
only, and the second rule will be reduced for the second and all succeed-
ing items.

With right recursive rules, such as

        seq     :       item
                |       item  seq
                ;

the parser  would be  a bit  bigger, and  the items  would be  seen,  and
reduced, from right to  left.  More seriously,  an internal stack in  the
parser would be in  danger of overflowing  if a very  long sequence  were
read.  Thus, the user should use left recursion wherever reasonable.

It is worth considering  whether a  sequence with zero  elements has  any
meaning, and if so, consider  writing the sequence specification with  an
empty rule:

        seq     :       /* empty */
                |       seq  item
                ;

Once again, the first rule would  always be reduced exactly once,  before
the first item was read,  and then the second rule would be reduced  once
for each item read.  Permitting empty sequences often leads to  increased
generality.  However, conflicts might  arise if Yacc  is asked to  decide
which empty sequence it has seen, when it hasn't seen enough to know!

-------------------------------- Page 29 --------------------------------

12.   LEXICAL TIE-INS

Some lexical  decisions depend  on  context.  For  example,  the  lexical
analyzer might  want to  delete blanks  normally, but  not within  quoted
strings.  Or names might be entered into a symbol table in  declarations,
but not in expressions.

One way of handling  this situation is  to create a  global flag that  is
examined by the lexical analyzer, and set by actions.  For  example, sup-
pose a program consists of 0 or more declarations, followed by 0 or  more
statements.  Consider:

        %{
                int dflag;
        %}
          ...  other declarations ...

        %%

        prog    :       decls  stats
                ;

        decls   :       /* empty */
                                {       dflag = 1;  }
                |       decls  declaration
                ;

        stats   :       /* empty */
                                {       dflag = 0;  }
                |       stats  statement
                ;

            ...  other rules ...

The flag  dflag is  now 0  when reading  statements, and  1 when  reading
declarations, except for the  first token in  the first statement.   This
token must be seen by the parser before it can tell that the  declaration
section has ended  and the  statements have begun.   In many cases,  this
single token exception does not affect the lexical scan.

This kind of 'backdoor' approach can  be elaborated to a noxious  degree.
Nevertheless, it represents a  way of doing  some things that are  diffi-
cult, if not impossible, to do otherwise.

-------------------------------- Page 30 --------------------------------

13.   RESERVED WORDS

Some programming languages permit the user to use words like 'if',  which
are normally reserved, as label or variable names, provided that such use
does not conflict with the  legal use of these  names in the  programming
language.  This is extremely hard  to do in the framework of Yacc; it  is
difficult to pass information  to the lexical  analyzer telling it  'this
instance of 'if'  is a keyword,  and that instance  is a variable'.   The
user can make a  stab at it,  using the mechanism  described in the  last
subsection, but it is difficult.

A number of ways of making this easier are under advisement.  Until then,
it is better that the keywords be reserved; that is, be forbidden for use
as variable names.  There are  powerful stylistic reasons for  preferring
this, anyway.

-------------------------------- Page 31 --------------------------------

APPENDIX A.    A SIMPLE EXAMPLE

This example gives the complete Yacc specification for a small desk  cal-
culator; the desk calculator has  26 registers, labeled 'a' through  'z',
and accepts arithmetic expressions made up of the operators +, -, *, /, %
(mod operator), & (bitwise and),  | (bitwise or), and assignment.  If  an
expression at the top level is an  assignment, the value is not  printed;
otherwise it  is.  As  in  C, an  integer that  begins with  0 (zero)  is
assumed to be octal; otherwise, it is assumed to be decimal.

As an example of a Yacc specification, the desk calculator does a reason-
able job of showing how precedences and ambiguities are used,  and demon-
strating simple error recovery.  The  major oversimplifications are  that
the lexical analysis phase  is much simpler  than for most  applications,
and the output is produced immediately, line by line.  Note the way  that
decimal and octal integers are read in by the grammar  rules; This job is
probably better done by the lexical analyzer.


%{
#  include  <stdio.h>
#  include  <ctype.h>

int  regs[26];
int  base;

%}

%start  list

%token  DIGIT  LETTER

%left  '|'
%left  '&'
%left  '+'  '-'
%left  '*'  '/'  '%'
%left  UMINUS      /*  supplies  precedence  for  unary  minus  */

%%      /*  beginning  of  rules  section  */

list  :     /*  empty  */
        |   list  stat  '\n'
        |   list  error  '\n'
                        {     yyerrok;  }
        ;

stat  :     expr
                        {     printf( "%d\n", $1 );  }
        |   LETTER  '='  expr
                        {     regs[$1]  =  $3;  }

-------------------------------- Page 32 --------------------------------

        ;

expr  :     '('  expr  ')'
                        {     $$  =  $2;  }
        |   expr  '+'  expr
                        {     $$  =  $1  +  $3;  }
        |   expr  '-'  expr
                        {     $$  =  $1  -  $3;  }
        |   expr  '*'  expr
                        {     $$  =  $1  *  $3;  }
        |   expr  '/'  expr
                        {     $$  =  $1  /  $3;  }
        |   expr  '%'  expr
                        {     $$  =  $1  %  $3;  }
        |   expr  '&'  expr
                        {     $$  =  $1  &  $3;  }
        |   expr  '|'  expr
                        {     $$  =  $1  |  $3;  }
        |   '-'  expr        %prec  UMINUS
                        {     $$  =  -  $2;  }
        |   LETTER
                        {     $$  =  regs[$1];  }
        |   number
        ;

number      :     DIGIT
                        {     $$ = $1;    base  =  ($1==0)  ?  8  :  10;  }
        |   number  DIGIT
                        {     $$  =  base * $1  +  $2;  }
        ;

%%      /*  start  of  programs  */

yylex() {
/*
 *  lexical analysis routine
 *  returns LETTER for a lower case letter, yylval = 0 through 25
 *  return DIGIT for a digit, yylval = 0 through 9
 *  all other characters are returned immediately
 */

        int  c;

        while(  (c=getchar())  ==  ' '  )  {/*  skip  blanks  */  }

        /*  c  is  now  nonblank  */

        if(  islower(  c  )  )  {
                yylval  =  c  -  'a';
                return  (  LETTER  );

-------------------------------- Page 33 --------------------------------

                }
        if(  isdigit(  c  )  )  {
                yylval  =  c  -  '0';
                return(  DIGIT  );
                }
        return(  c  );
        }

-------------------------------- Page 34 --------------------------------

APPENDIX B.    YACC INPUT SYNTAX

This Appendix  has a  description of  the Yacc  input syntax,  as a  Yacc
specification.  Context dependencies, etc.,  are not considered.   Ironi-
cally, the Yacc input specification language is most naturally  specified
as an LR(2) grammar; the sticky part comes when an  identifier is seen in
a rule, immediately following an action.  If this identifier is  followed
by a colon, it is the start of the next rule; otherwise it is a continua-
tion of the current rule, which just  happens to have an action  embedded
in it.  As implemented, the lexical analyzer looks ahead after  seeing an
identifier, and decide whether the next token (skipping blanks, newlines,
comments, etc.) is a  colon.  If so,  it returns the token  C_IDENTIFIER.
Otherwise, it  returns IDENTIFIER.   Literals (quoted  strings) are  also
returned as IDENTIFIERS, but never as part of C_IDENTIFIERs.


            /*  grammar  for  the  input  to  Yacc  */

        /*  basic  entities  */
%token  IDENTIFIER   /* includes identifiers and literals              */
%token  C_IDENTIFIER /* identifier (but not literal) followed by colon */
%token  NUMBER       /* [0-9]+                                         */

        /* reserved words: %type => TYPE, %left => LEFT, etc.          */

%token LEFT  RIGHT  NONASSOC  TOKEN  PREC  TYPE  START  UNION

%token MARK   /*  the  %%  mark  */
%token LCURL  /*  the  %{  mark  */
%token RCURL  /*  the  %}  mark  */

        /*  ascii  character  literals  stand  for  themselves  */

%start spec

%%

spec   :      defs  MARK  rules  tail
        ;

tail    :       MARK    { In this action, eat up the rest of the file }
        |     /*  empty:  the  second  MARK  is  optional  */
        ;

defs   :      /*  empty  */
        |     defs  def
        ;

def    :      START  IDENTIFIER
        |       UNION  {  Copy union definition to output }

-------------------------------- Page 35 --------------------------------

        |       LCURL  {  Copy C code to output file }  RCURL
        |     ndefs  rword  tag  nlist
        ;

rword  :      TOKEN
        |     LEFT
        |     RIGHT
        |     NONASSOC
        |     TYPE
        ;

tag    :      /*  empty:  union  tag  is  optional  */
        |     '<'  IDENTIFIER  '>'
        ;

nlist  :      nmno
        |     nlist  nmno
        |     nlist  ','  nmno
        ;

nmno    :       IDENTIFIER          /* NOTE: literal illegal with %type */
        |       IDENTIFIER  NUMBER  /* NOTE: illegal with %type         */
        ;

        /*  rules  section  */

rules   :       C_IDENTIFIER  rbody  prec
        |     rules  rule
        ;

rule    :       C_IDENTIFIER  rbody  prec
        |     '|'  rbody  prec
        ;

rbody  :      /*  empty  */
        |     rbody  IDENTIFIER
        |     rbody  act
        ;

act     :       '{'  {  Copy  action,  translate  $$,  etc.  }  '}'
        ;

prec   :      /*  empty  */
        |     PREC  IDENTIFIER
        |     PREC  IDENTIFIER  act
        |     prec  ';'
        ;

-------------------------------- Page 36 --------------------------------

APPENDIX C.    AN ADVANCED EXAMPLE

This Appendix gives an example  of a grammar using  some of the  advanced
features discussed in Section 10.  The desk calculator example in  Appen-
dix A is modified to provide a  desk calculator that does floating  point
interval arithmetic.   The  calculator understands  floating  point  con-
stants, the arithmetic operations  +, -, *,  /, unary -,  and =  (assign-
ment), and has 26 floating  point variables, 'a' through 'z'.   Moreover,
it also understands intervals, written

                ( x , y )

where x is less than or equal to y.   There are 26 interval valued  vari-
ables 'A' through  'Z' that may also  be used.  The  usage is similar  to
that in Appendix A; assignments return no value, and print nothing, while
expressions print the (floating or interval) value.

This example explores a  number of  interesting features of  Yacc and  C.
Intervals are  represented by  a structure,  consisting of  the left  and
right endpoint values,  stored as  double's.  This structure  is given  a
type name, INTERVAL,  by using  typedef.  The Yacc  value stack can  also
contain floating  point scalars,  and integers  (used to  index into  the
arrays holding the  variable values).  Notice  that this entire  strategy
depends strongly on being able to assign structures and unions in C.   In
fact, many of the actions call functions that return structures as well.

It is also worth noting  the use of YYERROR  to handle error  conditions:
division by an interval  containing 0, and  an interval presented in  the
wrong order.  In effect, the error recovery mechanism of Yacc is used  to
throw away the rest of the offending line.

In addition to the mixing of types on the value stack, this grammar  also
demonstrates an interesting use of syntax to keep track of the type (e.g.
scalar or interval) of intermediate expressions.  Note that a scalar  can
be automatically promoted to an interval if the context demands an inter-
val value.  This causes a large number  of conflicts when the grammar  is
run through Yacc: 18 Shift/Reduce and 26 Reduce/Reduce.  The problem  can
be seen by looking at the two input lines:

                2.5 + ( 3.5 - 4. )

and

                2.5 + ( 3.5 , 4. )

Notice that the 2.5 is to be used in an interval valued expression in the
second example, but this fact is not known until the ',' is read; by this
time, 2.5 is finished, and the parser cannot go back and change its mind.
More generally, it might be  necessary to look ahead an arbitrary  number
of tokens to decide  whether to convert  a scalar to  an interval.   This

-------------------------------- Page 37 --------------------------------

problem is evaded  by having  two rules for  each binary interval  valued
operator: one when the left  operand is a scalar, and  one when the  left
operand is an interval.  In the second case, the right operand must be an
interval, so the conversion will be applied automatically.  Despite  this
evasion, there are still many  cases where the conversion may be  applied
or not, leading to the above conflicts.  They are resolved by listing the
rules that yield scalars  first in the  specification file; in this  way,
the conflicts will be resolved in the direction of keeping scalar  valued
expressions scalar valued until they are forced to become intervals.

This way of handling  multiple types  is very instructive,  but not  very
general.  If there were many  kinds of expression types, instead of  just
two, the number of rules needed would increase dramatically, and the con-
flicts even more dramatically.  Thus, while this example is  instructive,
it is better practice in  a more normal programming language  environment
to keep the type information as part of the value, and not as part of the
grammar.

Finally, a word about the lexical analysis.  The only unusual feature  is
the treatment of floating point constants.  The C library routine atof is
used to do the actual conversion from a character string to a double pre-
cision value.  If the lexical  analyzer detects an error, it responds  by
returning a token  that is  illegal in  the grammar,  provoking a  syntax
error in the parser, and thence error recovery.

-------------------------------- Page 38 --------------------------------

%{

#  include  <stdio.h>
#  include  <ctype.h>

typedef  struct  interval  {
        double  lo,  hi;
        }  INTERVAL;

INTERVAL  vmul(),  vdiv();

double  atof();

double  dreg[ 26 ];
INTERVAL  vreg[ 26 ];

%}

%start    lines

%union    {
        int  ival;
        double  dval;
        INTERVAL  vval;
        }

%token  <ival>  DREG  VREG    /* indices into dreg, vreg arrays */

%token  <dval>  CONST         /* floating point constant        */

%type  <dval>  dexp           /* expression                     */

%type  <vval>  vexp           /* interval expression            */

        /*  precedence  information  about  the  operators  */

%left   '+'  '-'
%left   '*'  '/'
%left   UMINUS        /*  precedence  for  unary  minus  */

%%

lines   :       /*  empty  */
        |       lines  line
        ;

line    :       dexp  '\n'
                        {    printf(  "%15.8f\n",  $1  );  }
        |       vexp  '\n'

-------------------------------- Page 39 --------------------------------

                        {    printf("(%15.8f,%15.8f)\n",$1.lo,$1.hi);}
        |       DREG  '='  dexp  '\n'
                        {    dreg[$1]  =  $3;  }
        |       VREG  '='  vexp  '\n'
                        {    vreg[$1]  =  $3;  }
        |       error  '\n'
                        {    yyerrok;  }
        ;

dexp    :       CONST
        |       DREG
                        {       $$  =  dreg[$1];  }
        |       dexp  '+'  dexp
                        {       $$  =  $1  +  $3;  }
        |       dexp  '-'  dexp
                        {       $$  =  $1  -  $3;  }
        |       dexp  '*'  dexp
                        {       $$  =  $1  *  $3;  }
        |       dexp  '/'  dexp
                        {       $$  =  $1  /  $3;  }
        |       '-'  dexp       %prec  UMINUS
                        {       $$  =  - $2;  }
        |       '('  dexp  ')'
                        {       $$  =  $2;  }
        ;

vexp    :       dexp
                        {       $$.hi  =  $$.lo  =  $1;  }
        |       '('  dexp  ','  dexp  ')'
                        {
                        $$.lo  =  $2;
                        $$.hi  =  $4;
                        if(  $$.lo  >  $$.hi  ){
                                printf( "interval out of order\n" );
                                YYERROR;
                                }
                        }
        |       VREG
                        {       $$  =  vreg[$1];    }
        |       vexp  '+'  vexp
                        {       $$.hi  =  $1.hi  +  $3.hi;
                                $$.lo  =  $1.lo  +  $3.lo;    }
        |       dexp  '+'  vexp
                        {       $$.hi  =  $1  +  $3.hi;
                                $$.lo  =  $1  +  $3.lo;    }
        |       vexp  '-'  vexp
                        {       $$.hi  =  $1.hi  -  $3.lo;
                                $$.lo  =  $1.lo  -  $3.hi;    }
        |       dexp  '-'  vexp
                        {       $$.hi  =  $1  -  $3.lo;

-------------------------------- Page 40 --------------------------------

                                $$.lo  =  $1  -  $3.hi;    }
        |       vexp  '*'  vexp
                        {       $$  =  vmul( $1.lo, $1.hi, $3 ); }
        |       dexp  '*'  vexp
                        {       $$  =  vmul( $1,  $1,  $3  );  }
        |       vexp  '/'  vexp
                        {       if(  dcheck(  $3  )  )  YYERROR;
                                $$  =  vdiv( $1.lo, $1.hi, $3 ); }
        |       dexp  '/'  vexp
                        {       if(  dcheck(  $3  )  )  YYERROR;
                                $$  =  vdiv(  $1,  $1,  $3  );  }
        |       '-'  vexp       %prec  UMINUS
                        {       $$.hi = -$2.lo; $$.lo = -$2.hi; }
        |       '('  vexp  ')'
                        {       $$  =  $2;  }
        ;

%%

#  define  BSZ  50   /* buffer size for floating point numbers */

        /*  lexical  analysis  */

yylex(){
        register  c;

        while( (c=getchar()) == ' ' ){ /* skip over blanks */ }

        if(  isupper(  c  )  ){
                yylval.ival  =  c  -  'A';
                return(  VREG  );
                }
        if(  islower(  c  )  ){
                yylval.ival  =  c  -  'a';
                return(  DREG  );
                }

        if(  isdigit(  c  )  ||  c=='.'  ){
                /*  gobble  up  digits,  points,  exponents  */

                char  buf[BSZ+1],  *cp  =  buf;
                int  dot  =  0,  exp  =  0;

                for(  ;  (cp-buf)<BSZ  ;  ++cp,c=getchar()  ){

                        *cp  =  c;
                        if(  isdigit(  c  )  )  continue;
                        if(  c  ==  '.'  ){
                                if(  dot++  ||  exp  )
                                       /* will cause syntax error */

-------------------------------- Page 41 --------------------------------

                                       return(  '.'  );
                                continue;
                                }

                        if(  c  ==  'e'  ){
                                if(exp++)
                                       /* will cause syntax error */
                                       return('e');
                                continue;
                                }

                        /*  end  of  number  */
                        break;
                        }
                *cp  =  '\0';
                if((cp-buf) >= BSZ)
                        printf("constant too long: truncated\n");
                else
                        /* push back last char read */
                        ungetc(c, stdin);
                yylval.dval  =  atof(  buf  );
                return(  CONST  );
                }
        return(  c  );
        }

INTERVAL  hilo(  a,  b,  c,  d  )  double  a,  b,  c,  d;  {
/*
 *  returns the smallest interval containing a, b, c, and d
 *  used  by  *,  /  routines
 */
        INTERVAL  v;

        if(  a>b  )  {  v.hi  =  a;    v.lo  =  b;  }
        else  {  v.hi  =  b;    v.lo  =  a;  }

        if(  c>d  )  {
                if(  c>v.hi  )  v.hi  =  c;
                if(  d<v.lo  )  v.lo  =  d;
                }
        else  {
                if(  d>v.hi  )  v.hi  =  d;
                if(  c<v.lo  )  v.lo  =  c;
                }
        return(  v  );
        }

INTERVAL  vmul(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
        return(  hilo(  a*v.hi,  a*v.lo,  b*v.hi,  b*v.lo  )  );
        }

-------------------------------- Page 42 --------------------------------

dcheck(  v  )  INTERVAL  v;  {
        if(  v.hi  >=  0.  &&  v.lo  <=  0.  ){
                printf(  "divisor  interval  contains  0.\n"  );
                return(  1  );
                }
        return(  0  );
        }

INTERVAL  vdiv(  a,  b,  v  )  double  a,  b;    INTERVAL  v;  {
        return(  hilo(  a/v.hi,  a/v.lo,  b/v.hi,  b/v.lo  )  );
        }

-------------------------------- Page 43 --------------------------------

APPENDIX D.    OLD FEATURES SUPPORTED BUT NOT ENCOURAGED

This Appendix mentions synonyms and features which are supported for his-
torical continuity, but, for various reasons, are not encouraged.

1.   Literals may also be delimited by double quotes '"'.

2.   Literals may be more than one character long.  If all the characters
     are alphabetic, numeric,  or _,  the type number  of the literal  is
     defined, just as if the literal did  not have the quotes around  it.
     Otherwise, it is difficult to find the value for such literals.

     The  use of  multi-character  literals is  likely to  mislead  those
     unfamiliar with Yacc,  since it  suggests that Yacc  is doing a  job
     which must be actually done by the lexical analyzer.

3.   Most places where % is legal, backslash '\' may be used.  In partic-
     ular, \\ is the same as %%, \left the same as %left, etc.

4.   There are a number of other synonyms:

        %< is the same as %left
        %> is the same as %right
        %binary and %2 are the same as %nonassoc
        %0 and %term are the same as %token
        %= is the same as %prec

5.   Actions may also have the form

        ={ . . . }

     and the curly  braces can be  dropped if  the action is  a single  C
     statement.

6.   C code between %{  and %} used to  be permitted at  the head of  the
     rules section, as well as in the declaration section.

-------------------------------- The End --------------------------------
