CSC402
THE SYNTAX OF A PROGRAMMING LANGUAGE
The syntax of a programming language
can be defined as the set of rules and writing
conventions that allow the formation of correct programs in a language.
This clearly implies that the syntax has nothing to do with
the meaning or runtime behaviors of a program.
A program can be syntactically
correct, but yet meaningless.
Example, in Pascal,
Program computer(input,
output);
Begin
.
End.
This program is
syntactically correct, but of no meaning.
The syntax of a language is however a prerequisite to meaningful expression. A
programming language must have a good syntactic definition before it can
properly support the development of meaningful programs.
The approaches to syntactic
description can be viewed from two major perspectives; the language designers
and the programmer’s perspectives. The language designer aims to achieve a
clean implementation of the language, while the programmer wants maximum
expresivity and convenience in a particular programming domain.
Good languages however tend to incorporate these two aims
when they are complementary, and to compromise them when they are not.
Language Elements
There are two basic elements that
constitute the construction of any programming language. These elements generally
determine the capability of the language to cope with the challenges of any
programming task.
These include:
a.
The
character set
b.
The
Vocabulary
Character Set
The character set of a
language is that set of symbols from which all programs are composed. Several different approaches to
character set have been taken by different language. In the 1950’s and 1960’s,
ALGOL and FORTRAN were among the most visible languages. The FORTRAN
designers felt that a minimum functional character set – only uppercase letters,
digits and 13 special symbols – was appropriate for programming. The main
reason for this decision at that time was the need for FORTRAN to conform
in its typing to the limits of the ‘STANDARD’
programming tools of the day; the IBM 026 keypunch machine.
The ALGOL designers felt on the
other hand, that the character set should support the publication of
programs in journals and books. Thus, the ALGOL character set includes
upper and lower case letters and an extensive complement of special symbols for
punctuation and mathematical expression.
Because the character set is such a
fundamental element of programming language design and more generally, of data
representation in computers, strong efforts were made to develop an
international standard character set and internal machine representation. As a
result, two alphabets have emerged and become the defacto standards for the
industry. They are:
a.
ASCII
(American Standard Code for Information Interchange)
b.
EBCDIC(Extended
Binary Coded Decimal Interchange Code)
Most current programming language
designers conform in their character sets to one or both of these standards,
yet some do not conform to either.
Vocabulary
The vocabulary of a
programming language is that set of characters and words from which programs
are constructed.
Example, in the following Pascal
program;
Program P;
var x, y : integer
Begin
readln(x)
y
:= x + 2.5;
write(y)
End.
The vocabulary include:
Program var integer begin end P x y
readln 2.5 ; := . , ( ) +
The elements
of a language’s vocabulary are called its tokens. These tokens are
usually divided into classes according to their role in the language. For
instance, we have tokens, which are “identifiers” (p, x, y,
read, and write), “constants”
(2.5), “operators”
(:=, +) and “delimiters”
(program, var, :, ;, (, ), ., begin, end).
The set of tokens for a programming
language is defined in such a way that permits efficient recognition by a
compiler and a clear expression by a programmer.
Problems of Describing Syntax
Languages whether natural such as
English or artificial such as BASIC are sets of strings of characters from some
alphabet. The strings of a language are called sentences or
statements. The syntax of a language specify which strings of characters from
the language’s alphabet are legal and are thus in the language.
Language syntax descriptions for
simplicity’s sake, often do not include descriptions of the lowest level
language units. These small syntactic units are called lexemes. The description
of lexemes can be given by a lexical specification, which can be separated from
the syntactic description of the language. English for instance, has a
complex and lengthy collection of rules for specifying the syntax of its
sentences, by comparison, even the largest programming languages are
syntactically very simple.
Language Recognizers
Languages can be formally defined in
two distinct ways: by recognition and by generation.
Suppose we have a language L that uses the alphabet A of
characters, to formally define
L using the recognition method, we would need to construct a mechanism R capable of
inputting strings of characters from the alphabet A. R would need to be
designed so that it should be able to indicate that a given input string was or
was not in L. in effect, R would either accept or reject the given string. Such
devices are like filters, separating correct sentences from those that are
incorrectly formed. If R, when fed all possible strings of characters from A,
accepts exactly those that are in L, then R is a description of L.
The syntax analysis portion of a
compiler is a recognizer for the language the compiler translates.
Language Generators
A language generator is a device
that can be used to generate the sentences of a language. One can think of a
generator as having a button that when pressed, produces a sentence of the
language. Because it is unclear which sentences the generator will generate
when the button is pressed, the generator seems like a much less useful
language descriptor.
FORMAL METHODS OF DESCRIBING SYNTAX
The BNF (Backus-Naur Form) has
remained the predominant and most popular method of concisely describing
programming language syntax. It is remarkable that the BNF is nearly identical
to Chomskey’s generative devices from
context free grammars. BNF is a meta-language (a language that is used to
describe another language). It uses abstractions for syntactic structures.
A Pascal assignment statement for
example might be represented by the abstraction <assign>.
Pointed brackets are often used to
delimit names of abstractions.
The actual definition of <assign> may be given by:
<assign> => <var> :=
<expression>
The symbols on the left side of the
arrow is the abstraction
being defined. The text to the right of the arrow is the definition of
the LHS. It is called the RHS and consists of tokens, lexemes, references to
other abstractions, or some mixture of these. Altogether, the definition is called a rule or production.
In the example rule just given, the
abstraction <var> and <expression> must be defined before the
<assign> definition becomes useful.
This particular rule specifies that
the abstraction <assign> is defined as the sequence of symbols: an
instance of the abstraction <var>, followed by the lexeme :=, followed by
an instance of the abstraction <expression>. One example of the abstract
syntactic structure described by the rule is:
total := sub1 + sub2
The abstractions in a BNF
description or grammar are often called non-terminal symbols or simply
non-terminals, while the lexemes and tokens of the rules are called terminal
symbols or terminals.
A grammar is a collection of rules,
where each non-terminal symbol represents a syntactic structure in the
language. Non-terminal symbols can have two or more distinct definitions, or
forms reflecting two or more possible syntactic forms in the language.
In BNF, we use the following
meta-linguistic symbols as shorthand notations:
1. := “is defined as”
2. {} “any sequence of 0 or more of the
items enclosed”
3. [
] “either 0 or more occurrence of
the items enclosed”
4. | “or” in the exclusive sense
In BNF, multiple definitions can be
written as a single rule, with the different definitions separated by the
symbol |,
representing OR.
For example, a Pascal if-statement can be described as:
<if_stmt> => if
<logic_expr> then <stmt>
<if_stmt> => if
<logic_expr> then <stmt> else <stmt>
OR
<if_stmt> => if
<logic_expr> then <stmt>
| if <logic_expr> then <stmt> else
<stmt>
ALSO,
<letter> => a
|b
| c | d | e |…. |z
|A |B |
C |….
|Z
which is an English interpretation
of a letter is defined as either “a” or “b” or “c” or …..”z” or “A”, “B”, “C”,
……”Z”.
<digit> =>
0|1|2|3|4|5|6|7|8|9
Having these elementary definitions,
we can build new rules which are based on them, such as:
<identifier> =>
<letter> {letter| digit}
GRAMMAR AND DERIVATIONS
BNF is a generative tool for
defining languages. The sentences of the language are generated through
repeated application of the rules, beginning with a special non-terminal of the grammar called the
start symbol.
Such a generation is called a derivation.
In a grammar for a complete
language, the start symbol represents a complete program, and is usually named <Program>.
The following simple grammar is used
to illustrate derivations:
<program> => begin
<stmt_list> end
<stmt_list> => <stmt>
| <stmt>;
<stmt_list>
<stmt> => <var>
:= <expr>
<var> => a|b|c
<expr> => <var>
+ <var>
| <var>
- <var>
| <var>
From the above set of rules, the
following can be deduced:
1.
A
program consists of the special word “begin”.
2.
followed
by a list of statements separated by semicolons
3.
followed
by a special word “end”
4.
The
expressions allow either a single variable or two variables and either a –
(minus) or a + (plus) operator.
5.
The
only allowable variable names are a, b or c.
The derivation of a program in this
language follows:
<program> => begin
<stmt_list> end
ð
begin
<stmt>; <stmt_list> end
ð
begin
<var> := <var> + <var>; <stmt_list> end
ð
begin
a := <var> + <var>; <stmt_list> end
ð
begin
a := b + <var>; <stmt_list> end
ð
begin
a := b + c; <stmt_list> end
ð
begin
a := b + c; <stmt> end
ð
begin
a := b + c; <var> := <var> end
ð
begin
a := b + c; b := <var> end
ð
begin
a := b + c; b := c end
This derivation, like all
derivations begins with the start symbol, in this case <program>. The
symbol => is read “produces” or “derives”.
Each successive string in the
sequence is derived from the previous string by replacing one of the
non-terminals with one of the definitions. Each
of the strings in the derivation, including <program> is called a
sentential form.
In this derivation, the replaced non-terminal is always
the leftmost non-terminal in the previous sentential from. Derivations that use
this order of replacement are called leftmost derivations. The derivation
continues until the sentential from contains no non-terminals. The sentential
form consisting of only terminal symbols is the generated sentence, which is in
the language.
A sample grammar for
simple assignment statement
<assign> => <id>
:= <expr>
<id> => A | B
| C
<expr> => <id>
+ <expr>
| <id> * <expr>
| (<expr>)
| <id>
This grammar describes assignment
statements whose right sides are arithmetic expressions with multiplication and
addition operators and parenthesis. E.g the statement A := B * (A + C) can be
generated by the derivation:
<assign> => <id> := <expr>
=> A := <expr>
=> A := <id> * <expr>
=> A := B * <expr>
=> A := B * (<expr>)
=> A := B * (<id> + <expr>)
=> A := B * (A + <expr>)
=> A := B * (A + <id>)
=> A := B * (A + C)
No comments:
Post a Comment