UMUC Asia - CMSC430 - Theory of Language Translation

Configuring A Linux Environment - Installing Cygwin for 32-bit Windows 7
Week 1 Discussion
Week 2 Discussion
Week 2 Project 1 Discussion
Week 2 Project 1 Skeleton Code
Week 2 Project 1 Decoupling the Listing Code
Week 2 Test Data for Project 1
Week 3 Discussion
Week 4 Project 2 Submit Files
[C++ for Java Programmers] Inheritance in C++
[C++ for Java Programmers] Separating the Interface from the Implementation
[C++ for Java Programmers] Memory Allocation of Objects
[C++ for Java Programmers] Managing Include Preprocessor Commands

Configuring A Linux Environment - Installing Cygwin for 32-bit Windows 7 †

Access to http://www.cygwin.com/install.html
Download and run setup-x86.exe
Select Root Intall Directory -> C:\cygwin
Select Local Package Directory -> C:\Users\username\Downloads
Select Internet Connection -> Direct Connection
Choose A Download Site -> choose a mirror site closer（.ac.jp）
Select Packages

Devel (Cur)
- Bin
  - binutils: GNU assembler, linker, and similar utilities
  - gcc-core: GNU Compiler Collection (C, OpenMP)
  - gettext: GNU Internationalization development utilities
  - mingw-gcc-core: GNU Compiler Collection (C, OpenMP)
  - mingw-gcc-core-g++: GNU Compiler Collection (C, OpenMP)
  - mingw-runtime: MinGW.org MSVC & compiler runtime headers and runtime
- Bin/Src
  - bison: GNU yacc-compatible parser generator
  - flex: A fast lexical analyzer generator
  - gcc-g++: GNU Compiler Collection (C++)
  - make: The GNU version of the 'make' utility
Resolving Dependencies -> Select required packages
The installation is complete within 10 minutes

$ whereis gcc
gcc: /usr/bin/gcc.exe /usr/lib/gcc /usr/share/man/man1/gcc.1.gz /usr/src/gcc-4.9.2-1.src/gcc.cygport

$ whereis bison
bison: /usr/src/bison-3.0.2-1.src/bison.cygport

$ whereis flex
flex: /usr/src/flex-2.5.39-1/flex.mknetrel /usr/src/flex-2.5.39-1/flex.skl

$ whereis make
make: /usr/src/make-4.0-2/make.1 /usr/src/make-4.0-2/make.lnk /usr/src/make-4.0-2/make.mknetrel

Additional Packages
- Editors
  - vim: VI IMproved - enhanced vi editor
  - vim-common: VI IMproved - enhanced vi editor (common runtime)
- Net
  - bind: DNS server
  - curl: Multi-potocol file transfer tool
  - whois: GNU Whois
  - ping: A basic network tool to test IP network conectivity
- Web
  - wget: Utility to retrieve files from the WWW via HTTP and FTP

↑

Week 1 Discussion †

Question

Languages can be fully compiled, fully interpreted or partially compiled and then interpreted. Choose a language from the list of the top 20 currently most popular languages at http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html. Choose one that has not already been chosen by one of your classmates. Discuss which of the three translations mechanisms is used with that language. Offer some rationale for why that mechanism was chosen.

Answer

I chose Java. When you run a Java program, your source code is partially compiled by the Java compiler and then interpreted by Java Virtual Machine (JVM). The JVM allows you to develop programs without dependence on operating systems.

A major function of JVM is supporting garbage collection, which automatically frees dynamically allocated memory on the heap. When you create a instance of an object by using new operator, memory is automatically allocated on the heap. Therefore, you don't have to care about memory allocation and deallocation unlike C.

Professor's comment

What is the difference between the original JVMs and the newer JIT compilers?

My answer

Professor,

The original JVM was designed as an interpreter called the Java interpreter for the Java bytecode. The bytecode is an abstract machine language for the JVMs. It is analyzed, and then executed by byte by the interpreter. On the other hand, JIT compilers load the bytecode into memory and compile it to the native code by method before execution. Although a JIT compiler uses more memory because it stores the compiled native code in memory temporarily, it will enhance execution speed of the application.

Using Java HotSpot VM is a hybrid approach between the interpreter and JIT compilers. It compiles only repetitive methods to the native code.

↑

Week 2 Discussion †

Question

Define some language using a regular expression. The language must contain an infinite number of strings. The underlying alphabet must have at least three different characters. Draw a deterministic finite state automaton that accepts the strings of that language. Give two character strings that are accepted by that finite state automaton and two that are not.

There is a nice tool available for helping you understand regular expressions and draw finite state automata. It is called JFLAP. I would encourage you to download it and use it for this exercise.

↑

Week 2 Project 1 Discussion †

Be sure to read all the attached posts before beginning the first project. Also post any questions about the project as a response to this topic.

In particular you should be sure to read the following three topics I have posted:

Project 1 Skeleton Code
Test Data for Project 1
Decoupling the Listing Code

↑

Week 2 Project 1 Skeleton Code †

Project 1 involves writing the lexical analyzer for the compiler. It requires that you provide an input file for flex that generates the lexical analyzer. The main method can be included in that file. I would recommend that you provide a separate C++ file for the code that generates the compilation listing. I suggest also providing a file named tokens.h, which contains the token definitions as an enumerated type. In project 2, this file will be automatically generated by bison. You must also include a makefile that builds your program. To help you with this project, I have provided a skeleton of the flex input, a skeleton tokens.h file and a possible makefile. None of these are requirements, just some help getting you started if you need it:

scanner.l http://nova.umuc.edu/~jarc/cmsc430/scanner.html
tokens.h http://nova.umuc.edu/~jarc/cmsc430/tokens.html
makefile http://nova.umuc.edu/~jarc/cmsc430/makefile.html

In addition, here are examples of how the outputs should look for a correct program and one that contains two lexical errors:

successful compilation http://nova.umuc.edu/~jarc/cmsc430/success.html
lexical error http://nova.umuc.edu/~jarc/cmsc430/failure.html

↑

Week 2 Project 1 Decoupling the Listing Code †

As a matter of good object-oriented design, it is a good idea to decouple the code that displays the error messages from the flex source code. In the later stages of this project, there will be syntactic and semantic errors. The code for displaying and counting these errors should be separate from the flex and bison code since it will be called from both places.

In the skeleton code that I have provided you I am calling the functions as members of a Listing class. If defined as a class, this would really be defining a singleton object. One way to create a singleton is to make all the functions and data static. The standard practice in C++ is to put the class definition in the header file listing.h and the bodies of the member functions in a corresponding .cc file.

Because, unlike Java, C++ does not require all functions to be put in classes, another alternative would be to define these as ordinary functions that are not members of any class. In that case the function prototypes would still go in listing.h and the function bodies in the corresponding .cc file.

In the call to the appendError function I have passed in a parameter named LEXICAL. This is to designate that this error is a lexical error. My suggestion would be to define an enumerated type as follows:

enum ErrorType {LEXICAL, SYNTAX, SEMANTIC};

The listing.h file would be a good place to put this enumerated type definition. By supplying the error type, the Listing class can keep a count of the number of messages of each kind, which should be displayed when the end of the program is reached. Adding another function to the Listing class to display the error count is best way to accomplish this.

Finally, let me explain the purpose of the appendError function. It should queue up the error messages so they are displayed at the end of the line. All error messages that occurred on that line can then be displayed by the nextLine function once the line is complete.

↑

Week 2 Test Data for Project 1 †

Attached are the three test cases you should use for project 1. The first is an example of a correct program. The second is one that contains lexical errors. The third one is to test to be sure you are generating correct tokens.

test1.txt(191 Bytes) https://learn.umuc.edu/d2l/le/47318/discussions/posts/7416507/ViewAttachment?fileId=2017001
test2.txt(191 Bytes) https://learn.umuc.edu/d2l/le/47318/discussions/posts/7416507/ViewAttachment?fileId=2017002
test3.txt(313 Bytes) https://learn.umuc.edu/d2l/le/47318/discussions/posts/7416507/ViewAttachment?fileId=2017003

↑

Week 3 Discussion †

Question

Define a language that is not a regular language using a context-free grammar that is not ambiguous. The alphabet of the language must contain at least five characters. Describe in English the strings of the language. Choose any string that is in the language and demonstrate that it is in the language by showing the left-most derivation that derives it. Draw the parse tree that corresponds to the left-most derivation. Give an example of one string that is not in the language.

Answer

My language:

A -> I = E I -> E | x | y | z | E -> E + T | T T -> T * F | F F -> ( E ) | I

The left-most derivation:

A -> I = E

> x = E
> x = E + T
> x = T + T
> x = F + T
> x = ( E ) + T
> x = ( T ) + T
> x = ( T * F ) + T
> x = ( F * F ) + T
> x = ( I * F ) + T
> x = ( x * F ) + T
> x = ( x * I ) + T
> x = ( x * y ) + T
> x = ( x * y ) + F
> x = ( x * y ) + ( E )
> x = ( x * y ) + ( T )
> x = ( x * y ) + ( T * F )
> x = ( x * y ) + ( F * F )
> x = ( x * y ) + ( ( E ) * F )
> x = ( x * y ) + ( ( E + T ) * F )
> x = ( x * y ) + ( ( T + T ) * F )
> x = ( x * y ) + ( ( F + T ) * F )
> x = ( x * y ) + ( ( I + T ) * F )
> x = ( x * y ) + ( ( x + T ) * F )
> x = ( x * y ) + ( ( x + F ) * F )
> x = ( x * y ) + ( ( x + I ) * F )
> x = ( x * y ) + ( ( x + y ) * F )
> x = ( x * y ) + ( ( x + y ) * I )
> x = ( x * y ) + ( ( x + y ) * z )

An example of one string that is not in the language:

x = ( a * b ) + ( ( x + y ) * z )

↑

Week 4 Project 2 Submit Files †

The second project involves writing the syntactic analyzer for the compiler that was begun in the previous project. The grammar of the language is the following:

program:
　{function}

function:
　FUNCTION IDENTIFIER [parameters] RETURNS type ; body

parameters:
　parameter {, parameter}

parameter:
　IDENTIFIER : type

type:
　INTEGER | REAL | BOOLEAN

body:
　{variable} BEGIN statement END ;

variable:
　IDENTIFIER : type IS statement

statement:
　expression ; |
　IF expression THEN statement ELSE statement ENDIF ;

expression:
　IDENTIFIER |
　IDENTIFIER (expression {, expression}) |
　INT_LITERAL | REAL_LITERAL | BOOLEAN_LITERAL |
　NOT expression |
　expression operator expression |
　(expression)

operator: ADDOP | MULOP | RELOP | AND | OR

In the above grammar, the red symbols are nonterminals, the blue symbols are terminals and the black punctuation are EBNF metasymbols. The braces denote repetition 0 or more times and the brackets denote optional.

The grammar must be rewritten to eliminate the EBNF brace and bracket metasymbols and to incorporate the significance of parentheses, operator precedence and left associativity for all operators. Among arithmetic operators the multiplying operators have higher precedence than the adding operators. All relational operators have the same precedence. Among the binary logical operators, and has higher precedence than or. Of the categories of operators, the unary logical operator has highest precedence, the arithmetic operators have next highest precedence, followed by the relational operators and finally the binary logical operators. The directives to specify precedence and associativity, such as %prec and %left, may not be used

The syntactic analyzer should be created using bison. It should detect and recover from syntax errors to the extent possible. The semicolon should be used as the primary synchronization token. The compiler should produce a listing of the program with error messages included after the line in which they occur. A count of the number of lexical and syntactic errors and the number of total errors should follow the compilation listing.

Your parser should, however, be able to correctly parse any syntactically correct program without any problem.

You will lose points from the design portion of your grade if your bison input produces any shift/reduce or reduce/reduce errors.

The 40 points that you will receive for the functionality portion of your grade on this project will be based two criteria shown below: Parses all syntactically correct programs 22 points Detects and recovers from errors in the function header 3 points Detects and recovers from errors in variable declarations 3 points Detects and recovers from errors in conditional expressions 3 points Detects and recovers from errors in arithmetic expressions 3 points Detects and recovers from errors in the function body 3 points Detects and recovers from multiple errors 3 points

Test data will be provide to test each of the above cases.

The next two phases of the project do not require that you implement error checking.

↑

[C++ for Java Programmers] Inheritance in C++ †

Like Java, C++ is an object oriented language, however, there are some important differences. First when one class extends another class the syntax is a bit different as C++ does not have extends as a reserved word. It is written as follows:

class Derived: public Base ...

It is important to specify the base class as public to ensure the type of inheritance that is familiar to you in Java because having it public is not the default. The syntax for calling the constructor of a super class is different also. Unlike Java, C++ does not have the reserved word super.

class Derived: public Base
{
public:
     Derived(int value) : Base(value)
     { ...

C++ member functions (methods) are not bound dynamically by default like they are in Java. To ensure dynamic binding, a member function must be declared as virtual in the base class. Failing to declare it as such causes a member function in a derived class with the same prototype (signature) to be considered redefining the function, rather than overriding it.

C++ has the equivalent of abstract methods, but they are called pure virtual functions and are declared as follows:

class Base
{
public:
      virtual void abstract_method() = 0;

Interfaces can be created in C++, although C++ has no reserved word interface. To create an interface, we simply create a class that contains only pure virtual functions. To implement an interface, we use ordinary public class inheritance, as C++ has no reserved words implements. Because C++ allows multiple inheritance of classes, it is possible to extend a class and implement an interface.

↑

[C++ for Java Programmers] Separating the Interface from the Implementation †

Unlike, Java, which allows the complete class definition to be written together, it customary practice in C++ to put the class definition, which contains only the function prototypes (method signatures) and the definition of the member variables (instance variables), in a header file whose name is the name of the class with a .h extension. The bodies of the member functions are then placed in a source file, whose name is the name of the class with a .cpp extension. Because the member function bodies are declared outside the class definition, the function names must be prepended with the class name followed by the scope resolution operator (::) as follows:

void ClassName::functionName() ...

C++ does allow function bodies to be placed in the class definition in a fashion that resembles Java, however, the signficance of doing so is different. When the body is included in the class definition, it is an implicit recommendation to the compiler to place the function inline, wherever it is called. Consequently, it is best only to place short, one or two line functions, in the class definition

↑

[C++ for Java Programmers] Memory Allocation of Objects †

Unlike Java, C++ does not allocate objects on the heap. By default they are allocated on the compiler's run-time stack like ordinary variables of primitive types. In most cases, this is approach is adequate and because it is simplest since it does not require the instantiation step, so a declaration of an object in C++ looks as follows:

ClassName object;

The above declaration allocates space for that object on the stack. When calling the member functions (methods) of such objects, we use the dot operator just as we do in Java.

It is sometimes necessary or preferable to dynamically allocate objects on the heap. When that is the case, we must use a pointer and we must instantiate the object much like we do in Java. For example:

ClassName* objectPointer = new ClassName();

When we call a member function (method) of an object that has been allocated on the heap, we must use the arrow (->) operator rather than the dot operator.

↑

[C++ for Java Programmers] Managing Include Preprocessor Commands †

One of the greatest difficulties that Java programmers have when writing C++ programs is dealing with the preprocesssor include commands. Unlike Java, in C++ it standard practice to place the class definition in a header (.h) file of same name as the class and the bodies of the member functions (methods) in a corresponding source (.cpp) file. In order for one class to reference another class in any way, the client class must include the header file of the class that it is referencing. There are several approaches to managing include files.

The simplest approach is to place all the include commands in a single include file which is commonly referred to as stdafx.h, and have all the .cpp files include that header file. What is most important to understand is that the order in which the commands are listed is critically important. Every header file must be included after all other header files that it depends on. These dependencies are defined by whether a class corresponding to a header file references another class corresponding to another header file. Keep in mind that there are several ways that one class can reference another. It can be the result of calling a method in another class, containing an object whose type is another class, or finally because of inheritance, in which one class is derived from another class. If we view these dependencies as a directed graph, the order can be considered any topological order of the graph. Of course, such an order only exists when there are no circular dependencies. In that case, which occurs in project 2, an incomplete class definition is required to eliminate the inherent forward reference.

Although this approach is the simplest, one drawback to this approach is that whenever any header file is modified, the entire program must be recompiled. In the early days of C and C++, such an approach would have been unthinkable for large programs because compilers were much slower. Given the speed of computers and compilers today, this concern is much less important unless the program is very large.

Another approach is to put include commands only in the source files and to include only those header files that are needed. In this case, order is still vitally important. The advantage to this approach is that is minimizes recompilation. This is the approach I have used in my examples.

The final approach is to allow include commands in both source and other header file. Because a class definition can depend on another class, allowing header files to include other header file can simplify determining the dependencies. Like the second approach, this approach minimizes recompilation but it introduces the possibility of a header file being included more than once. There are two ways to circumvent this problem. The traditional approach was to use conditional compilation as follows:

#define A_CLASS
#ifndef A_CLASS

  The class definition

#endif

Many newer compilers provide a pragma for the approach, which is written as follows:

#pragma once

This pragma has the same effect as using conditional compilation.

You may choose whichever of these approaches that you prefer.