CSC402

THE SYNTAX OF A PROGRAMMING LANGUAGE

The syntax of a programming language can be defined as the set of rules and writing conventions that allow the formation of correct programs in a language. This clearly implies that the syntax has nothing to do with the meaning or runtime behaviors of a program.

A program can be syntactically correct, but yet meaningless.

Example, in Pascal,

Program computer(input, output);

Begin

End.

This program is syntactically correct, but of no meaning. The syntax of a language is however a prerequisite to meaningful expression. A programming language must have a good syntactic definition before it can properly support the development of meaningful programs.

The approaches to syntactic description can be viewed from two major perspectives; the language designers and the programmer’s perspectives. The language designer aims to achieve a clean implementation of the language, while the programmer wants maximum expresivity and convenience in a particular programming domain.

Good languages however tend to incorporate these two aims when they are complementary, and to compromise them when they are not.

Language Elements

There are two basic elements that constitute the construction of any programming language. These elements generally determine the capability of the language to cope with the challenges of any programming task.

These include:

a. The character set

b. The Vocabulary

Character Set

The character set of a language is that set of symbols from which all programs are composed. Several different approaches to character set have been taken by different language. In the 1950’s and 1960’s, ALGOL and FORTRAN were among the most visible languages. The FORTRAN designers felt that a minimum functional character set – only uppercase letters, digits and 13 special symbols – was appropriate for programming. The main reason for this decision at that time was the need for FORTRAN to conform in its typing to the limits of the ‘STANDARD’ programming tools of the day; the IBM 026 keypunch machine.

The ALGOL designers felt on the other hand, that the character set should support the publication of programs in journals and books. Thus, the ALGOL character set includes upper and lower case letters and an extensive complement of special symbols for punctuation and mathematical expression.

Because the character set is such a fundamental element of programming language design and more generally, of data representation in computers, strong efforts were made to develop an international standard character set and internal machine representation. As a result, two alphabets have emerged and become the defacto standards for the industry. They are:

a. ASCII (American Standard Code for Information Interchange)

b. EBCDIC(Extended Binary Coded Decimal Interchange Code)

Most current programming language designers conform in their character sets to one or both of these standards, yet some do not conform to either.

Vocabulary

The vocabulary of a programming language is that set of characters and words from which programs are constructed.

Example, in the following Pascal program;

Program P;

var x, y : integer

Begin

readln(x)

y := x + 2.5;

write(y)

End.

The vocabulary include:

Program var integer begin end P x y

readln 2.5 ; := . , ( ) +

The elements of a language’s vocabulary are called its tokens. These tokens are usually divided into classes according to their role in the language. For instance, we have tokens, which are “identifiers” (p, x, y, read, and write), “constants” (2.5), “operators” (:=, +) and “delimiters” (program, var, :, ;, (, ), ., begin, end).

The set of tokens for a programming language is defined in such a way that permits efficient recognition by a compiler and a clear expression by a programmer.

Problems of Describing Syntax

Languages whether natural such as English or artificial such as BASIC are sets of strings of characters from some alphabet. The strings of a language are called sentences or statements. The syntax of a language specify which strings of characters from the language’s alphabet are legal and are thus in the language.

Language syntax descriptions for simplicity’s sake, often do not include descriptions of the lowest level language units. These small syntactic units are called lexemes. The description of lexemes can be given by a lexical specification, which can be separated from the syntactic description of the language. English for instance, has a complex and lengthy collection of rules for specifying the syntax of its sentences, by comparison, even the largest programming languages are syntactically very simple.

Language Recognizers

Languages can be formally defined in two distinct ways: by recognition and by generation.

Suppose we have a language L that uses the alphabet A of characters, to formally define L using the recognition method, we would need to construct a mechanism R capable of inputting strings of characters from the alphabet A. R would need to be designed so that it should be able to indicate that a given input string was or was not in L. in effect, R would either accept or reject the given string. Such devices are like filters, separating correct sentences from those that are incorrectly formed. If R, when fed all possible strings of characters from A, accepts exactly those that are in L, then R is a description of L.

The syntax analysis portion of a compiler is a recognizer for the language the compiler translates.

Language Generators

A language generator is a device that can be used to generate the sentences of a language. One can think of a generator as having a button that when pressed, produces a sentence of the language. Because it is unclear which sentences the generator will generate when the button is pressed, the generator seems like a much less useful language descriptor.

FORMAL METHODS OF DESCRIBING SYNTAX

The BNF (Backus-Naur Form) has remained the predominant and most popular method of concisely describing programming language syntax. It is remarkable that the BNF is nearly identical to Chomskey’s generative devices from context free grammars. BNF is a meta-language (a language that is used to describe another language). It uses abstractions for syntactic structures.

A Pascal assignment statement for example might be represented by the abstraction <assign>.

Pointed brackets are often used to delimit names of abstractions. The actual definition of <assign> may be given by:

The symbols on the left side of the arrow is the abstraction being defined. The text to the right of the arrow is the definition of the LHS. It is called the RHS and consists of tokens, lexemes, references to other abstractions, or some mixture of these. Altogether, the definition is called a rule or production.

In the example rule just given, the abstraction <var> and <expression> must be defined before the <assign> definition becomes useful.

This particular rule specifies that the abstraction <assign> is defined as the sequence of symbols: an instance of the abstraction <var>, followed by the lexeme :=, followed by an instance of the abstraction <expression>. One example of the abstract syntactic structure described by the rule is:

total := sub1 + sub2

The abstractions in a BNF description or grammar are often called non-terminal symbols or simply non-terminals, while the lexemes and tokens of the rules are called terminal symbols or terminals.

A grammar is a collection of rules, where each non-terminal symbol represents a syntactic structure in the language. Non-terminal symbols can have two or more distinct definitions, or forms reflecting two or more possible syntactic forms in the language.

In BNF, we use the following meta-linguistic symbols as shorthand notations:

1. := “is defined as”

2. {} “any sequence of 0 or more of the items enclosed”

3. [ ] “either 0 or more occurrence of the items enclosed”

4. | “or” in the exclusive sense

In BNF, multiple definitions can be written as a single rule, with the different definitions separated by the symbol |, representing OR.

For example, a Pascal if-statement can be described as:

<if_stmt> => if <logic_expr> then <stmt>

<if_stmt> => if <logic_expr> then <stmt> else <stmt>

<if_stmt> => if <logic_expr> then <stmt>

| if <logic_expr> then <stmt> else <stmt>

ALSO,

<letter> => a |b | c | d | e |…. |z |A |B | C |…. |Z

which is an English interpretation of a letter is defined as either “a” or “b” or “c” or …..”z” or “A”, “B”, “C”, ……”Z”.

<digit> => 0|1|2|3|4|5|6|7|8|9

Having these elementary definitions, we can build new rules which are based on them, such as:

<identifier> => <letter> {letter| digit}

GRAMMAR AND DERIVATIONS

BNF is a generative tool for defining languages. The sentences of the language are generated through repeated application of the rules, beginning with a special non-terminal of the grammar called the start symbol. Such a generation is called a derivation.

In a grammar for a complete language, the start symbol represents a complete program, and is usually named <Program>.

The following simple grammar is used to illustrate derivations:

<program> => begin <stmt_list> end

<stmt_list> => <stmt>

| <stmt>; <stmt_list>

<var> => a|b|c

| <var> - <var>

| <var>

From the above set of rules, the following can be deduced:

1. A program consists of the special word “begin”.

2. followed by a list of statements separated by semicolons

3. followed by a special word “end”

4. The expressions allow either a single variable or two variables and either a – (minus) or a + (plus) operator.

5. The only allowable variable names are a, b or c.

The derivation of a program in this language follows:

<program> => begin <stmt_list> end

ð begin <stmt>; <stmt_list> end

ð begin <var> := <var> + <var>; <stmt_list> end

ð begin a := <var> + <var>; <stmt_list> end

ð begin a := b + <var>; <stmt_list> end

ð begin a := b + c; <stmt_list> end

ð begin a := b + c; <stmt> end

ð begin a := b + c; <var> := <var> end

ð begin a := b + c; b := <var> end

ð begin a := b + c; b := c end

This derivation, like all derivations begins with the start symbol, in this case <program>. The symbol => is read “produces” or “derives”.

Each successive string in the sequence is derived from the previous string by replacing one of the non-terminals with one of the definitions. Each of the strings in the derivation, including <program> is called a sentential form.

In this derivation, the replaced non-terminal is always the leftmost non-terminal in the previous sentential from. Derivations that use this order of replacement are called leftmost derivations. The derivation continues until the sentential from contains no non-terminals. The sentential form consisting of only terminal symbols is the generated sentence, which is in the language.

A sample grammar for simple assignment statement

<id> => A | B | C

| <id> * <expr>

| (<expr>)

| <id>

This grammar describes assignment statements whose right sides are arithmetic expressions with multiplication and addition operators and parenthesis. E.g the statement A := B * (A + C) can be generated by the derivation:

=> A := <expr>

=> A := <id> * <expr>

=> A := B * <expr>

=> A := B * (<expr>)

=> A := B * (<id> + <expr>)

=> A := B * (A + <expr>)

=> A := B * (A + <id>)

=> A := B * (A + C)

focus

Monday, September 3, 2012

Language Elements

Character Set

Vocabulary

Problems of Describing Syntax

Language Recognizers

Language Generators

FORMAL METHODS OF DESCRIBING SYNTAX

GRAMMAR AND DERIVATIONS

No comments:

Post a Comment

About Me

Blog Archive