LL(1) Parsing

The LL(1) method of parsing is a special case of LL(k) parsing that restricts the lookahead to a single token at a time. This restriction enables the efficient parsing of an LL(1) language, compared to the algorithms that must be used to parse LL(k) languages for values of k greater than 1. Because of the wide gap in efficiency between the LL(1) and the more general LL(k) methods, only LL(1) parsers are generally used.

The LL(1) parsing methodology consists of two parts. The first part is the grammar analysis technique that is used to construct an LL(1) parser for a particular grammar. The second part is the algorithm that is used to parse sentences in the language of the LL(1) grammar. There are two primary parsing algorithms that are applied to LL(1) grammars: recursive descent parsing and table-driven LL(1) parsing. Both of these algorithms rely on the same type of grammar analysis to construct the parser. The difference between the two methods is primarily implementational, in that both rely on a stack of grammar symbols to drive the parse. In recursive descent, the stack is implicit in the procedure-call mechanism of the implementation language, while the table-driven method maintains an explicit stack. The main focus here is on the table-driven form of LL(1) parsing.

Definition of LL(1)

The class of LL(1) languages is composed of exactly those languages that can be deterministically parsed from the top down using only a single token of lookahead at a time to make each parsing decision. The class of LL(1) grammars consists of the grammars that can be used to generate the LL(1) languages. In effect, the LL(1) grammars are defined by the parsing method that is associated with them. A more specific definition of the LL(1) grammars will be given, but first some relevant sets that apply to the LL(1) grammars must be defined.

    Definition:  The FIRST set of a string of symbols in a grammar 
    is a subset of the terminal symbols of the grammar that may begin a 
    sentential form derivable from the string. The FIRST set can also 
    include epsilon if it is derivable from the string.
    More specifically, for a grammar G = ( N, T, P, S ):

    FIRST ( alpha )  =  { a  |  alpha ==>*  a beta } union
                        { epsilon if alpha ==>* epsilon }

    where

        alpha  in ( N union T )+
        a  in T
        beta  in ( N union T )*

    For the trivial cases of a terminal symbol or an empty string:

        FIRST ( a )  = { a }
        FIRST ( epsilon )  = { epsilon }
    

    Definition:  The FOLLOW set of a string of symbols in a grammar 
    is a subset of the terminal symbols of the grammar that may follow the 
    string of symbols in some sentential form derivable in the grammar.
    More specifically, for a grammar G = ( N, T, P, S ):

    FOLLOW ( beta ) = 
        { b  |  S  ==>* alpha beta gamma and  b in FIRST ( gamma ) }
        union
        { epsilon  if  S  ==>*  alpha beta }

    where

        alpha, gamma  in ( N union T )*
        beta  in ( N union T )+
        b  in T
    

The FIRST and FOLLOW sets are used to define the LL(1) grammars in the following:

    Definition:  A grammar is said to be LL(1) if for every nonterminal "A" in
    the grammar, and for every pair of A-productions in the grammar,
    the following holds:

    For all distinct A-productions

        A  -->  alpha
        A  -->  beta

    FIRST ( alpha FOLLOW (A) )  intersection  FIRST ( beta FOLLOW (A) )  =  Ø
    

There are several implications of the definition of the LL(1) grammars. Some of these are given in figure 3.1.


    

1.  No two A-productions in the grammar can have derivations that begin
    with the same terminal symbol.

2.  No more than one A-production in the grammar can be a null
    production or a nullable production.

3.  If an A-production in the grammar is null or nullable, then its
    FOLLOW set must be disjoint with the FIRST sets of all other
    A-productions in the grammar.

Figure 3.1 Implications of the LL(1) Definition


The LL(1) property of a grammar can also be expressed in terms of a restriction on the derivations that are possible in the grammar, as follows.

    1.  S  ==>lm*  x1 A delta1  ==>lm  x1 omega1 delta1  ==>lm*  x1 y1
    2.  S  ==>lm*  x2 A delta2  ==>lm  x2 omega2 delta2  ==>lm*  x2 y2
    3.  FIRST ( y1 )  =  FIRST ( y2 )

    Whenever 1, 2, and 3 above hold, it implies that

        omega1  =   omega2

This is a more direct way of specifying that no two A-productions in an LL(1) grammar can have derivations that begin with the same terminal symbol.

The LL(1) Parsing Algorithm

Perhaps the most important implication of the definition of the LL(1) grammars is that it indicates a method by which the strings in an LL(1) language can be parsed. The single-lookahead specification indicates that the string may be parsed by examining each token one at a time. There is no need to consider blocks of tokens, since parsing decisions can be made based on the single next terminal symbol that is encountered in the parsing process. Since no two A-productions in an LL(1) grammar may have derivations that begin with the same terminal symbol, it is possible to predict which of the alternate start symbol productions begins the parse, given only the knowledge of the first token of the string to be parsed. Each of the grammar symbols on the right hand side of the start production can be considered in turn from left to right as the parse proceeds. If the symbol is a terminal symbol, then clearly it should match the next input symbol if the input string is a valid sentence in the language of the grammar. If the symbol is a nonterminal symbol, it can be expanded into another production just as the start symbol was rewritten as the right hand side of the start production.

Since more than one production in the grammar may be being processed at the same time, some method is needed to keep track of the order in which the productions are used in the parsing process. The algorithm indicates that the productions must be processed on a last-in first-out basis. A stack of productions is the obvious way to handle this part of the algorithm. However, the symbols within each production are processed from left to right, indicating a first-in first-out organization of the symbols within each production. A single stack of grammar symbols can be used to keep track of both the productions and the symbols within the productions if each production is pushed onto the stack in reverse order as it is encountered in the parse. That is, the symbols on the right hand side of the production are pushed in the right to left order of their occurrence in the production.

At this point, all of the components that are needed to parse the LL(1) grammars are known. These are summarized in figure 3.2.


    

1.  A mapping of the nonterminals and the terminals in the grammar onto
    the productions of the grammar as follows.

        N x T  -->  P

2.  A table of the right hand sides of the productions in the grammar.
    The table is indexed by production number. Each production is in
    reverse order to simplify pushing it onto the parse stack.

3.  A simple stack-based parsing algorithm.

Figure 3.2 The Elements of LL(1) Parsing


The parsing algorithm itself is shown in figure 3.3.


    
    
    Get the first input token
    Push the start symbol of the grammar onto the stack

    While the stack is not empty and input remains
        Pop a grammar symbol from the stack.
        If the symbol is a terminal
            Match it against the lookahead token
            Get the next input token
        Else if the symbol is a nonterminal
            Get the next production number from the PREDICT function
            Push that production onto the stack
        Else if the symbol is an action
            Perform the action
    End while

Figure 3.3 The LL(1) Parsing Algorithm

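The loop of figure 3.3 can be sketched directly in executable form. The following is a minimal Python sketch, using a hypothetical two-production toy grammar (S --> a S | b) and a hand-built PREDICT table, since Grammar 3.1 and its table appear later; semantic actions are omitted.

```python
# A sketch of the table-driven LL(1) loop of figure 3.3, using a
# hypothetical toy grammar S -> a S | b and a hand-built PREDICT table.
# Semantic actions are omitted; "$" is the end-of-input marker.

NONTERMINALS = {"S"}

# (nonterminal, lookahead) -> right hand side of the predicted production
PREDICT = {
    ("S", "a"): ["a", "S"],
    ("S", "b"): ["b"],
}

def parse(tokens):
    """Return True if tokens (a string ending in '$') is a sentence."""
    pos = 0
    stack = ["$", "S"]                    # start symbol on top of the end marker
    while stack and pos < len(tokens):
        symbol = stack.pop()
        if symbol in NONTERMINALS:
            rhs = PREDICT.get((symbol, tokens[pos]))
            if rhs is None:
                return False              # no prediction: syntax error
            stack.extend(reversed(rhs))   # push the production in reverse order
        else:
            if symbol != tokens[pos]:
                return False              # terminal mismatch
            pos += 1                      # match: consume the input token
    return not stack and pos == len(tokens)

print(parse("aab$"))   # True: S => aS => aaS => aab
print(parse("ba$"))    # False
```

Note how pushing each right hand side in reverse order lets a single stack serve both as the last-in first-out record of pending productions and as the left-to-right record of symbols within each production.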

The LL(1) PREDICT Function

The LL(1) PREDICT function is the most important component of the LL(1) parsing method. Not surprisingly, the analysis of a grammar in order to construct an LL(1) parser for it revolves around the construction of the PREDICT function. The general approach that is taken is to construct the FIRST sets for every symbol in the grammar. Next, the FOLLOW sets are constructed for all of the nonterminal symbols. Finally, the PREDICT sets for each production in the grammar are constructed using the FIRST and the FOLLOW sets. The order of construction of these sets is important because each set is dependent on the previously-constructed set or sets.

The LL(1) PREDICT function is a partial mapping of the nonterminals and the terminals in the grammar onto the productions of the grammar. Not every combination of a nonterminal and a terminal symbol results in a valid mapping, so the PREDICT function is a partial function. The PREDICT function is constructed by combining all of the PREDICT sets of a grammar in a way that produces the desired mapping. In order for the PREDICT function to be valid, the PREDICT sets for all of the A-productions in a grammar must be disjoint. This applies for all of the nonterminals in the grammar.

FIRST Set Construction

A good way to construct the FIRST set for each symbol in a grammar is to start with the construction of the obvious sets, and then to build on that construction iteratively until all sets are complete. The FIRST set of each terminal symbol is trivially the set of the symbol itself. Since a terminal symbol cannot be nullable, this completes construction of the FIRST sets for all of the terminal symbols in the grammar. The FIRST sets of the terminal symbols provide the basis for constructing the FIRST sets of the nonterminal symbols. In the first iteration, any productions of the following two forms are used:

                        A  -->  b  beta 
                        A  -->

In the first case, b is added to the FIRST (A). In the second case, epsilon is added to the FIRST (A). Now the closure of the nonterminal FIRST sets can be computed by considering productions of the following form:

                        A  -->  X1 X2... Xn

In this production, the FIRST (X1) is added to the FIRST (A). If X1 is nullable, then the FIRST (X2) is added to the FIRST (A), and so on as long as the current right hand side symbol is nullable. A symbol is nullable if epsilon is in the FIRST set of that symbol. If all of the right hand side symbols are nullable, then epsilon is also added to FIRST (A). This process is repeated for all productions in the grammar as long as a symbol can be added to some FIRST set. This is necessary because the FIRST sets of the symbols on the right hand sides of the productions may change during an iteration over the grammar. When a full iteration produces no changes in any FIRST set, the closure is complete. An algorithm for computing the FIRST sets of a grammar is given in figure 3.4.


    

    for each terminal symbol b in the grammar
        FIRST (b) = { b }
    end for

    do
        added = false
        for each production in the grammar
            for each symbol X on the right hand side of the A-production
                ( symbols X taken in left to right order )
                FIRST(A) = FIRST(A) union ( FIRST(X) - epsilon )
                if a symbol was added to FIRST(A)
                    added = true
                if X is not nullable
                    break
            end for
            if the right hand side of the A-production is nullable
                FIRST(A) = FIRST(A) union { epsilon }
        end for
    while added is true

Figure 3.4 Algorithm For Computing FIRST Sets

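Rendered in Python, the fixpoint of figure 3.4 can be sketched as follows. This is an illustrative sketch, assuming productions are stored as (left hand side, right hand side tuple) pairs, with "eps" standing for epsilon.

```python
# A sketch of the FIRST set fixpoint of figure 3.4, run on Grammar 3.1.
# Productions are (lhs, rhs-tuple) pairs; "eps" stands for epsilon.

EPS = "eps"
TERMINALS = {"a", "b", "c", "d", "e", "f"}

GRAMMAR = [
    ("S", ("A", "B", "A")),
    ("A", ("C", "D")), ("A", ("a",)),
    ("B", ("E", "F")), ("B", ("b",)),
    ("C", ("c",)),     ("C", ()),
    ("D", ("d",)),
    ("E", ("e", "E")), ("E", ()),
    ("F", ("f", "F")), ("F", ()),
]

def first_sets(grammar):
    # FIRST of a terminal is trivially the terminal itself.
    first = {t: {t} for t in TERMINALS}
    for lhs, _ in grammar:
        first.setdefault(lhs, set())
    changed = True
    while changed:                        # iterate until no set grows
        changed = False
        for lhs, rhs in grammar:
            before = len(first[lhs])
            nullable = True
            for x in rhs:                 # scan the rhs left to right
                first[lhs] |= first[x] - {EPS}
                if EPS not in first[x]:
                    nullable = False
                    break                 # a non-nullable symbol stops the scan
            if nullable:                  # the whole rhs derives epsilon
                first[lhs].add(EPS)
            if len(first[lhs]) != before:
                changed = True
    return first

first = first_sets(GRAMMAR)
print(sorted(first["B"]))   # ['b', 'e', 'eps', 'f'], matching figure 3.5
```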

As an example, consider the following grammar:

                        S  -->  A  B  A                    Grammar 3.1
                        A  -->  C  D
                        A  -->  a
                        B  -->  E  F
                        B  -->  b
                        C  -->  c
                        C  -->
                        D  -->  d
                        E  -->  e  E
                        E  -->
                        F  -->  f  F
                        F  -->
It can be seen from Grammar 3.1 for instance that the FIRST set of nonterminal B will be constructed in order as follows.

                        FIRST (B) = { b, e, epsilon, f }

The FIRST sets for all of the nonterminal symbols of Grammar 3.1 are given in figure 3.5.


    

    FIRST (S) = { a, c, d }
    FIRST (A) = { a, c, d }
    FIRST (B) = { b, e, epsilon, f }
    FIRST (C) = { c, epsilon }
    FIRST (D) = { d }
    FIRST (E) = { e, epsilon }
    FIRST (F) = { f, epsilon }

Figure 3.5 FIRST Sets of Grammar 3.1


FOLLOW Set Construction

The FOLLOW sets of a grammar are constructed in a manner that is similar to the way that the FIRST sets are constructed. However, there is a major difference in the content of the two types of sets. The FOLLOW sets are not allowed to contain the null string, even though clearly the null string follows the start symbol of all LL(1) grammars. For the purpose of FOLLOW set construction, it is convenient to augment the terminal set of a grammar with a special terminal symbol that represents the end of an input string in the language of the grammar. The dollar symbol ($) is usually used to represent the end-of-input marker in a grammar. This grammar transformation is equivalent to augmenting the grammar with a new start symbol, S', and a new start production as follows.

                        S'  -->  S  $

This transformation is necessary in order for the LL(1) parsing algorithm to be able to detect that the end of the input string has been reached. All LL grammars will be assumed to be augmented in this way without explicitly listing the extra production or the extra terminal symbol.

The FOLLOW sets of the nonterminal symbols in a grammar are constructed by first setting the FOLLOW (S) to be the end marker for the grammar. Closures of all of the FOLLOW sets of the grammar are then constructed by processing all of the productions in the grammar until no changes to any FOLLOW set can be made. Actions are taken for each production that has the following form:

                        A  -->  alpha B beta

In this case, everything in the FIRST (beta ) is added to the FOLLOW (B), except for epsilon as noted above. If beta is null or nullable, then the FOLLOW (A) is also added to the FOLLOW (B). The string of symbols, beta is nullable if all of its elements are nullable. The FIRST set of a string of symbols is constructed in the obvious way. That is, the FIRST set of the first symbol in the string is always part of FIRST (beta ) for any string of symbols, beta. If the first symbol in beta is nullable, then the FIRST set of the next symbol in the string is added to FIRST (beta ). This continues as long as each successive symbol in the string is nullable, and ends if a non-nullable symbol is encountered in the string. If all of the elements of the string are nullable, then epsilon is in FIRST (beta ). An algorithm for computing the FOLLOW sets of a grammar is given in figure 3.6.


    

    FOLLOW (S) = { $ }

    do
        added = false
        for each production in the grammar
            for each nonterminal B on the right hand side of the A-production
                for the string beta following B
                    FOLLOW(B) = FOLLOW(B) union ( FIRST(beta) - epsilon )
                    if beta is null or nullable
                        FOLLOW(B) = FOLLOW(B) union FOLLOW(A)
                    if a symbol was added to FOLLOW(B)
                        added = true
                end for
            end for
        end for
    while added is true

Figure 3.6 Algorithm For Computing FOLLOW Sets

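The fixpoint of figure 3.6 can also be sketched in Python. In this sketch the FIRST sets are hard-coded from figure 3.5 to keep the example short; "eps" stands for epsilon and "$" is the end marker.

```python
# A sketch of the FOLLOW set fixpoint of figure 3.6, run on Grammar 3.1.
# The FIRST sets are hard-coded from figure 3.5; "eps" is epsilon.

EPS = "eps"

GRAMMAR = [
    ("S", ("A", "B", "A")),
    ("A", ("C", "D")), ("A", ("a",)),
    ("B", ("E", "F")), ("B", ("b",)),
    ("C", ("c",)),     ("C", ()),
    ("D", ("d",)),
    ("E", ("e", "E")), ("E", ()),
    ("F", ("f", "F")), ("F", ()),
]

# FIRST sets from figure 3.5, plus the trivial terminal sets.
FIRST = {
    "S": {"a", "c", "d"},      "A": {"a", "c", "d"},
    "B": {"b", "e", "f", EPS}, "C": {"c", EPS},
    "D": {"d"},                "E": {"e", EPS},
    "F": {"f", EPS},
    "a": {"a"}, "b": {"b"}, "c": {"c"},
    "d": {"d"}, "e": {"e"}, "f": {"f"},
}

def first_of_string(beta):
    """FIRST of a string of symbols, built left to right as in the text."""
    result = set()
    for x in beta:
        result |= FIRST[x] - {EPS}
        if EPS not in FIRST[x]:
            return result                 # a non-nullable symbol stops the scan
    result.add(EPS)                       # every symbol was nullable
    return result

def follow_sets(grammar, start="S"):
    follow = {lhs: set() for lhs, _ in grammar}
    follow[start].add("$")                # the end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            for i, b in enumerate(rhs):
                if b not in follow:       # only nonterminals have FOLLOW sets
                    continue
                before = len(follow[b])
                beta_first = first_of_string(rhs[i + 1:])
                follow[b] |= beta_first - {EPS}
                if EPS in beta_first:     # beta is null or nullable
                    follow[b] |= follow[lhs]
                if len(follow[b]) != before:
                    changed = True
    return follow

follow = follow_sets(GRAMMAR)
print(sorted(follow["E"]))   # ['a', 'c', 'd', 'f'], matching figure 3.7
```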

As an example of the way that the FOLLOW sets are constructed, consider the following production from Grammar 3.1:

                        B  -->  E  F

Since nonterminal F follows nonterminal E on the right hand side of this production, the FIRST (F) minus epsilon is added to the FOLLOW (E). Since nonterminal F is nullable, the FOLLOW (B) set is also added to FOLLOW (E). The full FOLLOW set of nonterminal E is:

                        FOLLOW (E) = { f, a, c, d }

All of the FOLLOW sets for Grammar 3.1 are given in figure 3.7.


    

    FOLLOW (S) = { $ }
    FOLLOW (A) = { a, c, d, b, e, f, $ }
    FOLLOW (B) = { a, c, d }
    FOLLOW (C) = { d }
    FOLLOW (D) = { a, c, d, b, e, f, $ }
    FOLLOW (E) = { f, a, c, d }
    FOLLOW (F) = { a, c, d }

Figure 3.7 FOLLOW Sets of Grammar 3.1


PREDICT Set Construction

A PREDICT set is the set of terminal symbols that can begin any partial string in the language that is derivable from the current point in the parse by the application of the predicted production to expand the instance of its left hand side nonterminal that is currently being parsed. The PREDICT set for each production in a grammar is easily constructed from the FIRST and FOLLOW sets of the grammar by applying the following definition:

    Definition:  The LL(1) PREDICT set for a production in a grammar 
    is a set of terminal symbols as follows.

    PREDICT ( A --> alpha ) =
        if epsilon in FIRST ( alpha )
            ( FIRST ( alpha ) - epsilon )  union  FOLLOW ( A )
        otherwise
            FIRST ( alpha )
    

An algorithm for computing the PREDICT sets of a grammar is given in figure 3.8.


    

    for each production A --> alpha in the grammar
        if alpha is null or nullable
            PREDICT(A --> alpha) = ( FIRST(alpha) - epsilon ) union FOLLOW(A)
        else
            PREDICT(A --> alpha) = FIRST(alpha)
    end for

Figure 3.8 Algorithm For Computing PREDICT Sets

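As a sketch, the rule of figure 3.8 can be written as a small Python function. Here the FIRST and FOLLOW sets are hard-coded from figures 3.5 and 3.7 so the example stays self-contained; "eps" stands for epsilon.

```python
# A sketch of the PREDICT computation of figure 3.8 for Grammar 3.1.
# FIRST and FOLLOW are hard-coded from figures 3.5 and 3.7.

EPS = "eps"

FIRST = {
    "S": {"a", "c", "d"},      "A": {"a", "c", "d"},
    "B": {"b", "e", "f", EPS}, "C": {"c", EPS},
    "D": {"d"},                "E": {"e", EPS},
    "F": {"f", EPS},
    "a": {"a"}, "b": {"b"}, "c": {"c"},
    "d": {"d"}, "e": {"e"}, "f": {"f"},
}

FOLLOW = {
    "S": {"$"},
    "A": {"a", "b", "c", "d", "e", "f", "$"},
    "B": {"a", "c", "d"},
    "C": {"d"},
    "D": {"a", "b", "c", "d", "e", "f", "$"},
    "E": {"a", "c", "d", "f"},
    "F": {"a", "c", "d"},
}

def first_of_string(alpha):
    """FIRST of a string of symbols; epsilon marks a nullable string."""
    result = set()
    for x in alpha:
        result |= FIRST[x] - {EPS}
        if EPS not in FIRST[x]:
            return result
    result.add(EPS)
    return result

def predict(lhs, rhs):
    f = first_of_string(rhs)
    if EPS in f:                          # the right hand side is null or nullable
        return (f - {EPS}) | FOLLOW[lhs]
    return f

print(sorted(predict("A", ("C", "D"))))   # ['c', 'd'], as computed in the text
print(sorted(predict("E", ())))           # ['a', 'c', 'd', 'f'], per figure 3.9
```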

If this algorithm is applied to the second production of Grammar 3.1, the resulting PREDICT set is the following:

                        PREDICT ( A --> C  D )  =  { c, d }

The algorithm first adds FIRST (C) to the PREDICT set; this is terminal symbol c. Since C is nullable, FIRST (D), which is symbol d, is also added to the PREDICT set. Since nonterminal D is not nullable, the algorithm terminates at this point. The rest of the PREDICT sets for Grammar 3.1 are given in figure 3.9.


   
                        
        PREDICT ( S  -->  A  B  A ) = { a, c, d }
        PREDICT ( A  -->  C  D ) = { c, d }
        PREDICT ( A  -->  a ) = { a }
        PREDICT ( B  -->  E  F ) = { a, c, d, e, f }
        PREDICT ( B  -->  b ) = { b }
        PREDICT ( C  -->  c ) = { c }
        PREDICT ( C  --> ) = { d }
        PREDICT ( D  -->  d ) = { d }
        PREDICT ( E  -->  e  E ) = { e }
        PREDICT ( E  --> ) = { a, c, d, f }
        PREDICT ( F  -->  f  F ) = { f }
        PREDICT ( F  --> ) = { a, c, d }
             
    Figure 3.9  PREDICT Sets of Grammar 3.1
    

The algorithm in figure 3.8 is very efficient because it takes full advantage of the precomputed FIRST and FOLLOW sets to reduce the amount of work that would otherwise need to be done. This is of course desirable if the goal is simply to compute the PREDICT sets for use in a parsing application. However, a thorough understanding of what the PREDICT sets really represent requires a more in-depth examination of the PREDICT set construction.

An alternate way to construct the PREDICT set without using either the FIRST or the FOLLOW sets would be to examine the grammar directly in a manner similar to the algorithms that are used in the construction of the FIRST and FOLLOW sets. The goal is to find each of the first terminal symbols that can result from the application of the production in some parse derivation in the grammar. Consider again production four of Grammar 3.1:

            B  -->  E  F

What terminal symbols can begin a string that results from the application of this production? Since nonterminal E begins the production, E-productions should be considered. One of these productions begins with terminal symbol e, so e should be included in the PREDICT set. Since E is nullable, it can disappear so the next symbol F should be considered. One of the F-productions begins with terminal symbol f, so f is added to the PREDICT set. Since F is also nullable, the entire B-production also is nullable. This means that productions that have a B on the right hand side should be considered because if B goes to null in one of those productions, what follows B in that production also predicts this B-production.

Production one is the only one that contains a B on the right hand side. The symbol following that instance of B is A, so the A-productions must be considered. A-production two begins with nonterminal C, and a C-production begins with terminal symbol c, so c is added to the PREDICT set. Nonterminal C is nullable, so the next symbol D in production two is checked. The only D-production begins with terminal symbol d, so d is added to the PREDICT set. Since D is not nullable, consideration of production two stops at this point. The other A-production begins with terminal symbol "a", so "a" is also added to the PREDICT set. Since nonterminal A is not nullable, it is not necessary to look at the productions that have an S on the right hand side, even if there were any such productions. There are no more relevant productions to consider, so the search terminates having accumulated the following set:

            PREDICT ( B  -->  E  F ) = { a, c, d, e, f }

This is the same PREDICT set that was computed using the FIRST and FOLLOW sets and the algorithm in figure 3.8. The difference is that the manual computation of the PREDICT set makes visible the underlying algorithm that is hidden in the traditional PREDICT set computation method. The common feature of both algorithms is that they identify relevant derivations in the grammar, and proceed with those derivations until a 1-length terminal string is identified. Those strings comprise the terminal symbols that are the PREDICT set for the production in question.

The preceding manual computation of the PREDICT set for production four consisted of two main parts that correspond to the two parts of the standard PREDICT set construction algorithm given in figure 3.8. Since the right hand side of the production is nullable, the rule from figure 3.8 that applies to this production is:

            PREDICT (B --> E F) = ( FIRST (E F) - epsilon )  union FOLLOW (B)

The two parts of this rule are:

            1.  FIRST (E F) - epsilon
            2.  FOLLOW (B)

Since the manual computation of the PREDICT set does not allow the use of the precomputed FIRST and FOLLOW sets, individual FIRST and FOLLOW sets are computed directly in the context of this production using the underlying notions of FIRST and FOLLOW that are inherent in their definitions. The FIRST (E F) set is computed by applying a variation of the standard FIRST set construction algorithm to the string "E F" only. Whenever a FIRST set for some nonterminal symbol is required, it is computed directly at that point in the algorithm. The FIRST set of a terminal symbol is trivially the symbol itself. At any point that epsilon would be added to the FIRST set by the standard algorithm, it is simply ignored. A new iterative algorithm for directly computing the FIRST set of a string of symbols of a grammar is given in figure 3.10.


 

    First ( alpha )
    {
        variables:
            strings_set = set of alpha
            nonterminals_set = empty
            first_set = empty

        do
            added = false
            for each beta in the strings_set
                for each symbol X in beta from left to right
                    if X is a nonterminal symbol
                        if X is not in the nonterminals_set
                            add X to the nonterminals_set
                            added = true
                        end if
                        for each production X --> gamma
                            if gamma is not null
                                if gamma is not in the strings_set
                                    add gamma to the strings_set
                        end for
                    end if
                    if X is not nullable
                        break
                end for
            end for
        while added is true

        for each symbol B in the nonterminals_set
            for each production B --> beta b gamma
                if beta is null or nullable
                    add b to the first_set

        return first_set
    }

Figure 3.10 Algorithm For Directly Computing FIRST Set of a String


The algorithm in figure 3.10 is split into two main parts. The first part computes the closure of the set of nonterminal symbols that may possibly contribute to the final FIRST set for the string of symbols. This is similar to the closure of the FIRST sets that is computed by the more general FIRST set computation algorithm given in figure 3.4. The difference is that here the context is limited to just a single string of grammar symbols. This means that only the relevant FIRST sets are constructed, instead of building all of the FIRST sets for the entire grammar as is done in figure 3.4. This is why the set of relevant nonterminal symbols is assembled first, and the actual FIRST set is then computed using only the lowest level productions in the grammar. Those are the productions that can begin with a terminal symbol because the preceding string of symbols in the production is null or nullable.

If the grammar in question can be assumed not to be left-recursive, the algorithm in figure 3.10 can be replaced with a simpler recursive version. This new algorithm is given in figure 3.11. The recursive version may not terminate if the grammar is left-recursive. However, if any left-recursion has been removed from the grammar prior to running this algorithm, it must terminate with the same result as that produced by the iterative version.


    

    First ( alpha )
    {
        variable:
            first_set = empty

        for each symbol X in alpha from left to right
            if X is a terminal symbol
                first_set = first_set union { X }
                break
            else if X is a nonterminal symbol
                for each production X --> beta
                    if beta is not null
                        first_set = first_set union First ( beta )
                end for
            end if
            if X is not nullable
                break
        end for

        return first_set
    }

Figure 3.11 Recursive Algorithm For Directly Computing FIRST Set of a String

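The recursive version of figure 3.11 translates naturally into Python. This is a sketch for Grammar 3.1, which is not left-recursive, so the recursion terminates; the set of nullable nonterminals is precomputed here for brevity, and epsilon is never added, as in the figure.

```python
# A sketch of the recursive First of figure 3.11 for Grammar 3.1.
# The grammar is not left-recursive, so the recursion terminates.
# Nullable nonterminals are precomputed; epsilon is never added.

GRAMMAR = [
    ("S", ("A", "B", "A")),
    ("A", ("C", "D")), ("A", ("a",)),
    ("B", ("E", "F")), ("B", ("b",)),
    ("C", ("c",)),     ("C", ()),
    ("D", ("d",)),
    ("E", ("e", "E")), ("E", ()),
    ("F", ("f", "F")), ("F", ()),
]

TERMINALS = {"a", "b", "c", "d", "e", "f"}
NULLABLE = {"B", "C", "E", "F"}   # the nonterminals of Grammar 3.1 that derive epsilon

def first(alpha):
    """FIRST of a string of symbols, computed directly by recursion."""
    first_set = set()
    for x in alpha:
        if x in TERMINALS:
            first_set.add(x)
            break                  # a terminal ends the scan
        for lhs, rhs in GRAMMAR:   # expand each non-null X-production
            if lhs == x and rhs:
                first_set |= first(rhs)
        if x not in NULLABLE:
            break                  # a non-nullable symbol ends the scan
    return first_set

print(sorted(first(("E", "F"))))   # ['e', 'f']: the FIRST part of PREDICT(B -> E F)
```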

The second phase of directly computing the PREDICT ( B --> E F ) set is to directly compute FOLLOW (B). The FOLLOW (B) is computed by applying a slight variation of the standard algorithm to the relevant productions in the grammar. As was the case with the direct FIRST set computation, the direct FOLLOW set computation must proceed in stages because only a single FOLLOW set of the grammar is being created. This new algorithm is given in figure 3.12.


    

    Follow ( B )
    {
        variables:
            nonterminals_set = set of B
            follow_set = empty

        do
            added = false
            for each C in the nonterminals_set
                for each production A --> alpha C beta
                    if beta is null or nullable
                        if A is not in the nonterminals_set
                            add A to the nonterminals_set
                            added = true
                        end if
                    end if
                end for
            end for
        while added is true

        for each C in the nonterminals_set
            for each instance of C as in A --> alpha C beta
                if beta is not null
                    follow_set = follow_set union First ( beta )

        if S is in the nonterminals_set
            follow_set = follow_set union { $ }

        return follow_set
    }

Figure 3.12 Algorithm For Directly Computing a FOLLOW Set


The algorithm in figure 3.12 is in two parts. The first part of the algorithm accumulates all of the nonterminal symbols whose occurrence on the right hand side of productions needs to be considered in the construction of the FOLLOW set. The second part examines all occurrences of those nonterminal symbols on the right hand side of the productions of the grammar. If a string of symbols follows the nonterminal symbol, then the FIRST set of that string is added to the FOLLOW set for the nonterminal in question. Since the algorithm does not use precomputed FIRST sets, the algorithm from figure 3.10 or from figure 3.11 can be used at the point where a FIRST set is required. This allows the computation of a FOLLOW set for a single nonterminal symbol in the grammar without the need to compute all FOLLOW sets as is the case in the algorithm in figure 3.6.

The new First and Follow algorithms given above enable the construction of any single PREDICT set without having to precompute all of the FIRST and FOLLOW sets of a grammar. This ability is useful in cases where it is necessary to analyze a part of a grammar in a specific context, such as in the context of a single production. The new algorithm for directly computing a PREDICT set of a grammar is given in figure 3.13.


    
        
    Predict ( A --> alpha )
    {
        variable:
            predict_set = empty

        if alpha is not null
            predict_set = First ( alpha )
        if alpha is null or nullable
            predict_set = predict_set union Follow ( A )

        return predict_set
    }

    Figure 3.13  Algorithm For Directly Computing a PREDICT Set
    

Parse Table Construction

A parse table is the usual implementation of the LL(1) PREDICT function. An LL(1) parse table is constructed from the PREDICT sets that are taken from the FIRST and FOLLOW sets of the symbols in a grammar. As implied in the definition of the FIRST set of a string of symbols, the FIRST set for a single symbol is the set of all terminal symbols that may begin a sentential form that can be derived from that symbol. The FOLLOW set is the set of all terminals that can follow the symbol in some sentential form derivable in the grammar. The parse table maps a nonterminal symbol and a terminal symbol, called the lookahead token, to a unique production in the language. A grammar is LL(1) if and only if the mapping is one to one. The LL(1) parsing method can be used only for LL(1) grammars. Thus the construction of an LL(1) parse table can be used as a means to test a grammar for the LL(1) property.

In practice, the PREDICT sets are not constructed explicitly in the LL(1) grammar analysis method. Generally the LL(1) parse table is constructed directly from the FIRST and FOLLOW sets of the symbols in the grammar. The PREDICT set construction is implicit in this process. Each row in the parse table corresponds to a union of all of the PREDICT sets for a nonterminal symbol. That is, each production in the grammar has a PREDICT set associated with it. The PREDICT sets for a nonterminal symbol are the PREDICT sets for all of the productions in the grammar that have that nonterminal as the left hand side symbol. If any two A-productions in the grammar have overlapping PREDICT sets, then the A-row in the parse table will have multiple entries in the columns corresponding to the terminal symbols that are present in at least two of the A-PREDICT sets. These entries will be the production numbers of the A-productions that have overlapping PREDICT sets.

The LL(1) parse table is constructed in a manner that is quite similar to the construction of a PREDICT set, as might be expected. Each production in the grammar is processed in turn to add entries to a row in the parse table. For each terminal symbol in the FIRST set of the right hand side of the production, the production number is added to the row and column of the parse table corresponding to the left hand side nonterminal and the terminal symbol, respectively. If the production is null or nullable, then the FOLLOW set of the nonterminal is used in the same way to add production numbers to the row. The algorithm for constructing an LL(1) parse table is given in figure 3.14.


    
    ConstructParseTable ( grammar G = ( N, T, P, S ) )
    {
        variable:
            parse_table [ |N| ][ |T| ]

        initialize all parse_table entries to ERROR
        for each production A --> alpha in the grammar
            for each b in ( FIRST ( alpha ) - epsilon )
                parse_table [ A ][ b ] = A --> alpha
            if alpha is null or nullable
                for each c in FOLLOW ( A )
                    parse_table [ A ][ c ] = A --> alpha
        end for
        return parse_table
    }

    Figure 3.14  Algorithm For Constructing an LL(1) Parse Table
    

Consider Grammar 3.1 again as an example. Nonterminal C in this grammar has the two productions that follow:

                        C  -->  c
                        C  -->

Since the FIRST (c) is just terminal symbol c, the entry in the parse table for (C,c) is production number 6. This production is not nullable, so no more entries are added for it. Production 7 is a null production, so the FOLLOW (C) must be processed. It consists of only terminal symbol d, so the entry in the parse table for (C,d) is set to 7. The rest of the productions in the grammar are processed similarly to give the full parse table for Grammar 3.1 that is shown in figure 3.15. The blank entries represent error entries in the table.


    

                              Terminals

                       a    b    c    d    e    f    $
                     -----------------------------------
                S  |   1         1    1
                A  |   3         2    2
                B  |   4    5    4    4    4    4
 Nonterminals   C  |             6    7
                D  |                  8
                E  |  10        10   10    9   10
                F  |  12        12   12        11

Figure 3.15 LL(1) Parse Table for Grammar 3.1


Complexity of LL(1)

Sippu and Soisalon-Soininen (1988) have shown that an LL(1) parser can be constructed in time

                        O ( |T| · |G| ) = O ( n² )

where n is the size of the input, which in this case is the grammar G itself. Intuitively, the complexity bound for constructing a table-driven LL(1) parser cannot be less than this, because the method requires the construction of a parse table of size:

                        |N| · |T|

Since the sizes of both N and T are bounded by |G|, time O(n²) may be required to fill in the entries of the parse table in the worst case.

One way to view the complexity of analyzing a reduced context-free grammar is to look at the work done in constructing the FIRST, FOLLOW, and PREDICT sets. The algorithms for each of these are similar, so the overall order of complexity is essentially that of any one of them. The FIRST sets for a grammar are constructed by iterating over the entire grammar some number of times until a closure is achieved. The number of iterations required depends on the grammar, which complicates the analysis. However, if the grammar is known not to be left-recursive, the direct method of constructing PREDICT sets that was given in figure 3.13 can be used to analyze the grammar. The essence of this method is:

        For each production in the grammar
            Find all 1-length derivations resulting from an application
            of the production to its nonterminal anywhere in the grammar

Finding the 1-length derivations in a recursive manner corresponds to a depth-first search of some subset of the grammar. The constraint that the grammar is not left-recursive ensures this. In the worst case, the search cannot cover more than the entire grammar, so the overall complexity would be:

                O ( |P| · |G| ) = O ( n² )
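The per-production work in this bound is essentially the PREDICT computation itself. As a point of reference, PREDICT for a single production can be sketched in Python as follows, assuming FIRST, FOLLOW, and the nullable set have already been computed (the sample values at the bottom mirror the C-productions of Grammar 3.1, where FOLLOW(C) = {d}):

```python
# A sketch of the PREDICT set for one production A -> rhs, assuming
# FIRST, FOLLOW, and the nullable set are precomputed elsewhere.

def predict(A, rhs, first, follow, nullable):
    """PREDICT(A -> rhs) = FIRST(rhs), plus FOLLOW(A) if rhs is nullable."""
    out = set()
    for sym in rhs:
        out |= first.get(sym, {sym})    # a terminal's FIRST is itself
        if sym not in nullable:
            return out                  # prefix not nullable: FIRST only
    return out | follow[A]              # whole rhs nullable: add FOLLOW(A)

# Sample values mirroring the C-productions of Grammar 3.1:
follow = {"C": {"d"}}
p_c = predict("C", ["c"], {}, follow, set())    # the production C --> c
p_null = predict("C", [], {}, follow, set())    # the null production
```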

Parser Complexity

In most parsing applications, the complexity of the parser is of more interest than that of the grammar analysis required to build it, because the parser is constructed once and then used many times. A compiler is a good example: its performance is very important to its users, while the amount of time required to build the compiler itself is of no concern to the end user.

The analysis of the time complexity of an LL(1) parser is straightforward. Since it does no backtracking over the input string being parsed, the complexity is always linear, or:

                        O ( n )
The parser does a variable amount of work in processing each input token, but that amount of work is a function of the grammar, and the grammar is not part of the input to the parser. The parse proceeds by consuming the input string one token at a time, and completes when the input string is fully consumed.
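The linear behavior is easy to see in the parse loop itself. Below is a Python sketch of the table-driven driver, using a hypothetical grammar (S --> C d, C --> c, C --> null) with its parse table hard-coded. Each iteration either consumes one input token or expands one nonterminal, and the number of expansions per token is bounded by a constant that depends only on the grammar.

```python
# A sketch of the table-driven LL(1) parse loop for the hypothetical grammar
#     1: S --> C d      2: C --> c      3: C --> (null)
# The stack makes the recursion explicit; every token is consumed exactly once.

PRODUCTIONS = {1: ("S", ["C", "d"]),
               2: ("C", ["c"]),
               3: ("C", [])}
TABLE = {("S", "c"): 1, ("S", "d"): 1,   # (nonterminal, terminal) -> production
         ("C", "c"): 2, ("C", "d"): 3}
NONTERMINALS = {"S", "C"}

def parse(tokens):
    """Return True if tokens form a sentence of the grammar."""
    stack = ["$", "S"]                   # end marker below the start symbol
    tokens = list(tokens) + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        look = tokens[pos]               # single token of lookahead
        if top in NONTERMINALS:
            entry = TABLE.get((top, look))
            if entry is None:            # blank entry: syntax error
                return False
            rhs = PRODUCTIONS[entry][1]
            stack.extend(reversed(rhs))  # predict: replace with right hand side
        elif top == look:
            pos += 1                     # match: consume one input token
        else:
            return False                 # terminal mismatch
    return pos == len(tokens)
```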

Of major concern is the size of the LL(1) parser. The bulk of an LL(1) parser is usually the two tables that it contains:

                        1.  Productions table 
                        2.  Parse table

The space required for the parsing routine is generally insignificant in relation to the space required for the tables. The productions table is just the grammar itself, so its size is |G|. The parse table is a table of nonterminals by terminals, so the total size required for the LL(1) tables is:

                        |G|  +  |N| · |T|

Since the parse table is generally very sparse, much attention has been focused on table compaction methods that can be applied to it. Several general methods are applicable to the LL(1) parse table. In addition, there are some LL(1)-specific table compaction methods that make use of a knowledge of the parsing algorithm. These methods are summarized in Fischer and LeBlanc (1991).

LL(1) Grammar Properties

The predictive nature of the LL(1) grammars places some restrictions on their format that do not apply to other classes of grammar. In particular, the LL(1) grammars must have the following properties:

                        Not left-recursive
                        No common prefixes

A grammar is left-recursive if some nonterminal in the grammar is left-recursive. A production is left-recursive if its left hand side nonterminal appears as the first symbol on its right hand side. A nonterminal is left-recursive if it has a left-recursive production, or if some sequence of applications of productions yields an indirect left-recursion. The nonterminal B is left-recursive if a derivation of the following form is possible in the grammar:

            B  ==>+  B gamma

Left-recursion causes a problem for predictive parsers because it invariably leads to an infinite loop in the parsing process. If the nonterminal being expanded is replaced by a right hand side that begins with that same nonterminal, the same expansion will recur indefinitely, since the next symbol to be expanded is always the same as the one just expanded. Fortunately, left-recursion is easily detected in a reduced context-free grammar, and there are efficient algorithms available to remove it.

No two A-productions in an LL(1) grammar can have a common prefix. This restriction follows directly from the definition of LL(1). If two A-productions begin with the same nonterminal symbol, their PREDICT sets are certain to overlap by at least the entire FIRST set of that nonterminal. In the clearest case, if two A-productions begin with the same terminal symbol, there is no way to predict which production to use based on that symbol. The essence of predictive parsing is that the alternative A-productions must be distinguishable by the next input token; the FIRST and FOLLOW sets extend that notion to the appropriate derivations in the grammar.
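The left-recursion detection mentioned above can be sketched as a cycle search over a "can-appear-leftmost" graph. The sketch below is a minimal illustration, assuming the grammar is given as a dictionary of alternatives and that the nullable set is precomputed:

```python
# A sketch of left-recursion detection in a reduced grammar, given as a
# dictionary {nonterminal: [alternative right hand sides]}. An edge A -> B
# is drawn when B can appear leftmost in a string derived from A; any
# nonterminal that can reach itself in that graph is left-recursive.

def leftmost_graph(productions, nullable):
    edges = {A: set() for A in productions}
    for A, alts in productions.items():
        for rhs in alts:
            for sym in rhs:
                if sym in productions:       # nonterminal: may appear leftmost
                    edges[A].add(sym)
                if sym not in nullable:      # later symbols cannot be leftmost
                    break
    return edges

def left_recursive(productions, nullable=frozenset()):
    """Return the set of left-recursive nonterminals."""
    edges = leftmost_graph(productions, nullable)
    result = set()
    for start in edges:
        seen, frontier = set(), set(edges[start])
        while frontier:                      # reachability search from start
            node = frontier.pop()
            if node == start:
                result.add(start)            # start reaches itself: recursive
                break
            if node not in seen:
                seen.add(node)
                frontier |= edges[node]
    return result
```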

In most cases, productions that have a common prefix can be converted to LL(1) format by a technique known as left-factoring. Consider the following example:

                        A  -->  a  b  c                    Grammar 3.2
                        A  -->  a  c  d

The common prefix "a" can be factored out of these productions to give the following LL(1) grammar:

                        A  -->  a  M                       Grammar 3.3
                        M  -->  b  c
                        M  -->  c  d
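The transformation from Grammar 3.2 to Grammar 3.3 can be sketched in Python. The sketch below assumes, for simplicity, that all of a nonterminal's alternatives share the common prefix; a full algorithm would group the alternatives by first symbol and repeat until no common prefix remains. The fresh-name scheme A'1 is illustrative only, playing the role of M in Grammar 3.3.

```python
# A minimal left-factoring sketch: factor the longest common prefix of a
# group of A-productions into a new nonterminal.

from itertools import count

def common_prefix(seqs):
    """Longest prefix shared by every sequence in seqs."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:           # symbols differ at this position
            break
        prefix.append(symbols[0])
    return prefix

def left_factor(productions):
    """One round of left-factoring over {A: [rhs, ...]} productions."""
    ids = count(1)
    result = {}
    for A, alts in productions.items():
        prefix = common_prefix(alts)
        if len(alts) < 2 or not prefix:
            result[A] = alts                 # nothing to factor
            continue
        M = f"{A}'{next(ids)}"               # fresh nonterminal, e.g. "A'1"
        result[A] = [prefix + [M]]           # A --> prefix M
        result[M] = [rhs[len(prefix):] for rhs in alts]  # M --> the suffixes
    return result

g32 = {"A": [["a", "b", "c"], ["a", "c", "d"]]}   # Grammar 3.2
g33 = left_factor(g32)
```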

As with all other types of deterministic context-free grammar, the LL(1) grammars cannot be ambiguous. Put in predictive terms: if the same string can be derived in two different ways in the grammar, how could it be possible to predict which derivation to use, given only the single next input token? The problems associated with ambiguity and other forms of nondeterminism are the main reason that most parsing methods are designed around deterministic subclasses of the context-free grammars.

Summary

The LL(1) class of context-free grammars and the methods used to parse them were considered. LL(1) parsing consists of two phases. The first phase is the analysis of the grammar to produce a parser. The second phase is the actual parsing of sentences in the language of the grammar. Whether a grammar is LL(1) can be decided by constructing a parse table for it: if the parse table has no multiply-defined entries, the grammar is LL(1); otherwise it is not. The parse table is constructed directly from the FIRST and FOLLOW sets of the symbols of the grammar. These sets make it possible to decide which alternative production to use during the parse, given only the next token encountered in the input string. This decision can be thought of as a prediction, so LL(1) parsing is also referred to as predictive parsing.

References

Fischer, Charles N. and Richard J. LeBlanc, Jr. (1991). Crafting a Compiler with C. Benjamin/Cummings, Menlo Park, CA.

Sippu, S. and E. Soisalon-Soininen (1988). Parsing Theory, Volume I: Languages and Parsing. Springer-Verlag, New York, NY.





Copyright © 2001-2009 SLK Systems