********************** Syntax and Semantics ********************** ================================================== Brief introduction to the formal language theory ================================================== Definition of formal languages ------------------------------ :index:`alphabet` a set of characters :index:`sentence` a sequence of characters :index:`grammar` a set of rules to determine if a sentence is correct or not :index:`language` a set of correct sentences .. index:: single: token We often use *tokens* instead of alphabets because of simplicity. A token is something like a word composed of several letters. .. admonition:: An example of tokens Sentence ``total = total + x;`` can be decomposed into a sequence of characters ``t, o, t, a, l, space, =, space, t, o, ...``, which is cumbersome. Instead, we usually use a sequence of tokens ``total, =, total, +, x, ;``. `Generative Grammar `_ defines a language by generating correct sentences in some manner. :index:`terminal` symbols Tokens used in languages :index:`non-terminal` symbols Symbols representing structures of sentences. One of them is designated as the initial symbol. production rules Rules to rewrite some non-terminal symbols to other symbols. Starting from the initial symbol, we can generate a correct sentence by repeatedly applying rules. .. admonition:: Simple grammar * Terminal: I, you, love * Non-terminal: S, N, V * Initial: S * Rule: S→NVN, N→I, N→you, V→love There are several classes of grammars such as *context-free grammar* and *regular grammar*. Most of programming languages can be defined using `context-free grammar `_. .. _bnf: BNF Notation ------------ :index:`BNF` (Backus-Naur Form) is a convenient notation to write a context-free grammar for practical use. * Terminal symbols are written as they are. * Non-terminal symbols are enclosed by < >. * Production rules are in the form *non-terminal* ``::=`` *symbols* where we can write multiple alternatives by separating with ``|`` in the right side. .. admonition:: Simple arithmetic expression Terminal: ``A, B, C, +, *, (, )`` Non-terminal: ``, , `` Initial: ```` Rule: .. code-block:: bnf ::= A | B | C ::= | ( ) ::= | + | * Parse tree ---------- .. index:: single: parse single: parse tree To *parse* is to find how a sentence is generated by production rules. A *parse tree* is a tree structure that represents the result of parsing. .. admonition:: Generation of ``A*(B+C)`` ```` → ``*`` → ``*`` → ``*`` → ``A*`` → ``A*()`` → ``A*(+)`` → ``A*(+)`` → ``A*(+)`` → ``A*(B+)`` → ``A*(B+)`` → ``A*(B+C)`` .. figure:: parse-tree.png :scale: 70% :align: center Parse tree of ``A*(B+C)`` .. admonition:: Exercise 1 :class: exercise Draw a parse tree of ``A+B*C`` with the grammar shown above. :ref:`[Answer] ` .. admonition:: Exercise 2 :class: exercise In the grammar shown above, there is no precedence between + (add) and * (multiply). Change the grammar so that * (multiply) takes precedence over + (add). For example, when parsing ``A+B*C``, make ``B*C`` be a subtree instead of ``A+B``. Hint: introduce a new non-terminal symbol and a new rule so that + (add) and * (multiply) are parsed in separated levels. :ref:`[Answer] ` Ambiguous grammar ----------------- We say a grammar is :index:`ambiguous` if more than one parsing is possible for one sentence. .. admonition:: In a natural language `Structural ambiguity in English word-formation `_ .. admonition:: An example of ambiguous grammar We have two different parse trees for ``A*B+C`` with the following grammar. .. code-block:: bnf ::= A | B | C ::= | + | * .. admonition:: Dangling-else Languages such as C and Java have the *dangling-else problem* since ``else`` is optional in ``if`` statements. .. code-block:: bnf ::= if ( ) | if ( ) else .. admonition:: Exercise 3 :class: exercise Explain the dangling-else problem in the following statement. .. code-block:: c if ( x > 0) if ( y > 0 ) z = 1; else z = 2; :ref:`[answer] ` .. admonition:: Exercise 4 :class: exercise Explain that the following grammar is ambiguous. .. code-block:: bnf ::= | A :ref:`[answer] ` ================================ Glance at the semantics theory ================================ It is very difficult to define semantics of languages precisely. In C and Java, we usually say the meaning of ``++`` operator is "first evaluate the variable, then increment it". However, we cannot figure out the result of ``x = 1; x = x++;`` with this definition. As long as we use a natural language, ambiguity is inevitable. Semantics of languages are largely depend on their paradigms. Functional languages have lambda calculus, and logical languages have first-order predicate logic as the theoretical foundation of their semantics. We have no obvious theoretical foundation for imperative (procedural) languages. However, there are several formal semantics theories that can give rigorous definition of semantics of imperative languages. Operational semantics --------------------- The `operational semantics `_ describes how a program is executed as a sequence of computational steps. * A computational step is a change of state in the computer. * The state in the computer is, for example, a mapping from variables to values (a model of the memory device). * Using logical inference rules, we can define what kind of computational steps are possible. Axiomatic semantics ------------------- The axiomatic semantics is based on `Hoare logic `_. The meaning of a sentence is determined by a pair of precondition (which is true before the execution of the statement) and postcondition (which is true after the execution). It is closely related to verification of a program. We can say a program is *correct* if it satisfies its specification. A specification is a pair of the initial condition and the final condition. Using Hoare Logic, we can prove a program satisfies its specification by chaining preconditions and postconditions of sentences. Denotational semantics ---------------------- The `denotational semantics `_ is based on recursive functions and domain theory. It constructs mathematical objects that describe the meaning of a program. The interesting part is it includes an object that denotes an infinite loop. Unfortunately, the constructed objects are often very hard to read (cf. `Revised^5 Report on the Algorithmic Language Scheme `_).