
Compilers

Purpose of Course

Because we have compiler programs, software developers often take the process of compilation for granted.  However, as a software developer, you should cultivate a solid understanding of how compilers work in order to write the strongest code possible and fully understand the language you are working in.  In addition, the compilation process comprises techniques that are applicable to the development of many software applications.  As such, this course will introduce you to the compilation process, present foundational topics on formal languages, and outline each of the essential compiler steps: scanning, parsing, translation and semantic analysis, code generation, and optimization.  By the end of the class, you will have a strong understanding of what it means to compile a program, what happens in the process of translating a higher-level language into a lower-level language, and the applicability of the steps of the compilation process to other applications.

Course Information

Welcome to CS304.  Below, please find some general information on the course and its requirements.

Course Designer: JM Perry

Primary Resources:

Note: The MIT, Stanford, and Berkeley resources overlap.  However, they reinforce each other, differ in presentation, examples, and topic emphasis, and provide the opportunity to review the material.  Moreover, they offer a choice of perspective and style, which some students may find appealing.  There are several ways to use these resources.  All students should use the Mogensen text.  Of the MIT, Stanford, and Berkeley resources, a student may 1) choose the one whose style is most appealing and stick with it, occasionally using the others where they are beneficial, 2) use all three resources in full, or 3) for each topic or subunit, select the resource that is most helpful.  Lastly, the Berkeley resource includes video lectures, which can be useful for explaining certain points of a topic.

Requirements for Completion: Passing grade on quizzes and the final exam; achieve the objectives of each unit and the goals of the course; ability to apply the concepts taught in the course.

Time Requirements: This course should take approximately 146 hours to complete.

Tips/Suggestions: As you study the topics in the course, keep in mind the application of compiler techniques and tools to the development of other types of software.



Learning Outcomes

Upon successful completion of this course, the student will be able to:

  • Describe the compilation process and explain the function of the components that comprise the structure of a compiler.

  • Apply concepts of formal languages and finite-state machines to the translation of computer languages.

  • Identify the compiler techniques, methods, and tools that are applicable to other software applications.

  • Describe the challenges and state-of-the-practice of compiler theory and practice.

Course Requirements

In order to take this course, you must:

√    Have access to a computer.

√    Have continuous broadband Internet access.

√    Have the ability/permission to install plug-ins or software (e.g., Adobe Reader or Flash).

√    Have the ability to download and save files and documents to a computer.

√    Have the ability to open Microsoft files and documents (.doc, .ppt, .xls, etc.).

√    Have competency in the English language.

√    Have read the Saylor Student Handbook.

√    Have a basic mastery of the material in CS101, CS102, CS201, CS202, and CS303.

Unit Outline


  • Unit 1: Introduction to Compilers  

    The compilation process is one of the steps in executing a program.  Understanding how compilers work and what goes on “behind the scenes” will help you get better at developing software.  This unit will first provide you with an introduction to the compiler, its history, compiler structure and design, and the types of compilers.  By the end of this unit, you will be able to describe the steps of the compilation process.

    Unit 1 Time Advisory
    Unit 1 Learning Outcomes
  • 1.1 System View of Compilers  

    Note: A compiler is a complex software system, and, as such, has a system life cycle that starts with a need, transitions to design and construction/implementation, undergoes testing, transitions to use in its intended environment, undergoes maintenance over its life, and ends with “disposal” and archival storage.  The activities, various work products, and documentation that comprise the processes used during the system’s life must be carefully planned.  The part of the system life cycle that deals with the design and construction of the compiler is called the development life cycle.  This course focuses on the development life cycle, but the overall system life cycle must be kept in mind. 

    The term “compiler process” is related to, but different from, the system (life cycle) process.  The “compiler process” refers to the processing that a compiler does when it performs its function of translating a source program into a target program.  It also refers to the approach taken to design a compiler.  Design, in fact, is one activity in the system (life cycle) process.

  • 1.2 Overview of Compilers  
  • 1.2.1 Compiler Definitions, Terminology, and Anatomy of a Compiler  
  • 1.2.2 History  
  • 1.2.2.1 Translation  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 1: Introduction”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 1: Introduction” (PDF)
       
      Instructions: Select the link and click “Get the book contents” to download a PDF version of the book.  Please read sections 1.1, 1.2, and 1.3. Note that “compile” means to translate from a high-level language, used by humans, to a low-level language used by a computer.  A high-level language uses concepts, objects, and tasks performed by a human, whereas a low-level or machine language uses concepts, objects, and tasks performed by machines.  This reading also applies to the subsections 1.2.2.1.1 – 1.2.2.1.3, 1.2.2.2, and to section 1.3 of this course.
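 
      To make the high-level/low-level distinction concrete, here is a small illustration in C; the pseudo-assembly in the comment is a hedged, machine-neutral sketch, not the output of any particular compiler:
 
          /* High-level source: one statement of programmer intent. */
          int celsius_to_fahrenheit(int c) {
              return c * 9 / 5 + 32;
          }

          /* A compiler might translate the body into low-level,
           * register-oriented instructions roughly like:
           *     load r1, c        ; fetch the argument
           *     mul  r1, r1, 9    ; r1 = c * 9
           *     div  r1, r1, 5    ; r1 = (c * 9) / 5
           *     add  r1, r1, 32   ; r1 = c * 9 / 5 + 32
           *     ret  r1           ; return the value in r1
           */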
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 1.2.2.1.1 Translate Source Language to Target Language  

    Note: The reading for this subsection is covered by the reading for 1.2.2.1 above.  The input language to a compiler, typically a programming language, is called the source language.  The output language of a compiler is called the target language or object code, typically an assembly language or machine language. 

  • 1.2.2.1.2 Object Code and Executables  

    Note: The reading for this subsection is covered by the reading for 1.2.2.1.

  • 1.2.2.1.3 Platform Independent Compilers  

    Note: The reading for this subsection is covered by the reading for 1.2.2.1 above.  In the textbook reading, the author describes seven phases of a compiler.  The middle phase is called Intermediate Code Generation.  The intermediate code is independent of a particular target machine.  One of the back-end phases is called Machine Code Generation, which translates the machine-independent code to machine-dependent code for a particular computer.

  • 1.2.2.2 Comparison between Compiler and Interpreter  

    Note: The reading for subsection 1.2.2.1 above covers this subsection.  

  • 1.2.3 Other Applications  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 1: Introduction”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 1: Introduction” (PDF)
       
      Instructions: Select the link and click “Get the book contents” to download a PDF version of the book.  Please read section 1.4, which gives reasons for studying compilers.  In the discussion, the author suggests other applications for compiler techniques and tools.  He specifically mentions domain-specific languages.  Other applications derive from techniques and methods used in compilers.  For example, if the input to a software system can be described using a formal grammar, then lexical- and syntax-phase techniques may be the most effective and efficient to use for inputting, verifying, and representing the data for processing.  Front-end techniques can also be applied via software tools to the enforcement of development and programming standards and procedures.  Knowledge of compiler techniques and methods helps in the design of new programming languages, new hardware architectures, generation of software applications, and the design of software tools. 
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 1.2.4 Hardware Compilation  
  • 1.3 Systems Process for Development of a Compiler  
  • 1.3.1 Need for Compilers  

    Note: The reading in section 1.2.2.1 above applies to this section.  The need for compilers arises from the need for high-level languages, which are more relevant for problem solving and software development by humans.  Moreover, high-level languages are machine independent.  

  • 1.3.2 Development Life Cycle Process for a Compiler  

    Note: Since development of a compiler is a relatively complex system-development effort, typically having many users and developers, and will be maintained over a life of many years, a formal process should be used for its development.  The development process should extend from requirements through verification and validation, and should include reviews, tests, analysis and measures, quality assurance, configuration control, and key documentation.  The development process is a part of the overall system life cycle process, which additionally includes deployment, maintenance, disposal, and archival storage.  The compiler development process should consist of procedures for writing and documenting the needs and requirements, the architecture and design, construction, integration, and verification and validation of the compiler.  Documentation should also include the formal foundations and techniques used, tradeoffs made, alternatives evaluated, and the chosen alternative for design or implementation.  Full coverage of all of these in detail is beyond the scope of this course.  As we proceed through this course, however, we include high-level needs, requirements, functions, performance considerations, and verification and validation issues for a compiler and its parts.

  • 1.4 Compiler Design Overview  
  • 1.4.1 One-Pass vs. Multi-Pass  
  • 1.4.1.1 One-Pass  

    Note: A “pass” refers to reading and processing the input, i.e., source or source representation.  The reading for 1.4.1 gives the rationale for a one-pass design—namely, a one-pass compiler is simpler to implement.  However, more sophisticated processing, such as optimizations, may require multiple passes over the source representation.  

  • 1.4.1.2 Multi-Pass  

    Note: The reading for 1.4.1 above includes information on a multi-pass design.  The compiler is composed of components, which perform the steps of the compilation process.  Some of the components may make a separate pass over the source representation—thus the name multi-pass.  Multi-pass compilers are used for translating more complex languages (for example, high-level language to high-level language), and for performing more complete optimizations.  Moreover, multi-pass compilers are easier to verify and validate, because testing and proof of correctness can be accomplished for the smaller components.  Multi-pass compilers are also used for translation to an assembler or machine language for a theoretical machine, and for translation to an intermediate representation such as bytecode.  Bytecode, or pcode (portable code), is a compact encoding that uses a one-byte instruction, and parameters for data or addresses, to represent the results of syntax and semantic analysis.  The bytecode or pcode is independent of a particular machine.  It is executed by a virtual machine residing on the target machine.

  • 1.4.2 Structure of a Compiler  
    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 01: “CS143 Course Overview”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 01: “CS143 Course Overview” (PDF)
       
      Instructions: Select the link and read the CS143 Course Overview handout (not the lecture).  You have studied the lecture “Introduction to Compilers” as part of section 1.2.1 above.  The handout gives an overview of the structure of a compiler with a good explanation.  This handout also applies to subsections 1.4.2.1 and 1.4.2.2 below.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 1.4.2.1 Front End  

    Note: Pages 2–5 of the Course Overview handout referenced in section 1.4.2 above, up to the end of Intermediate Code Generation, describe the front-end process, or analysis stage, of the compilation process.  The intermediate code used as an example is TAC, three-address code.  The front end, or analysis, consists of lexical analysis, syntax analysis or parsing, semantic analysis, and intermediate code generation.  Lexical analysis may be preceded by preprocessing, depending on the language and the development environment.
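
    As a simplified illustration of the analysis stage, the walkthrough below (written as a C comment) traces one assignment statement through the phases; the token names and temporaries are illustrative, not the handout's exact notation:

        /* Source statement fed to the front end:
         *     position = initial + rate * 60;
         *
         * 1. Lexical analysis produces a token stream:
         *     ID(position) ASSIGN ID(initial) PLUS ID(rate) TIMES NUM(60) SEMI
         *
         * 2. Parsing builds a syntax tree that reflects operator precedence:
         *     =(position, +(initial, *(rate, 60)))
         *
         * 3. Semantic analysis checks that the names are declared and that
         *    the types are compatible, inserting conversions where needed.
         *
         * 4. Intermediate code generation emits TAC, one operator per line:
         *     t1 = rate * 60
         *     t2 = initial + t1
         *     position = t2
         */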

  • 1.4.2.2 Back End  

    Note: The Course Overview handout, referenced in section 1.4.2 above, also applies to this subsection.  Read the handout pages 5–7, up to Historical Perspective.  These pages explain the back end, or synthesis stage, of the compilation process, which includes intermediate code optimization, object code generation, and object code optimization.  The symbol table and error-handling topics apply both to the front end and to the back end.  The section on one-pass versus multi-pass complements the previous explanation, in sections 1.4.1.1 and 1.4.1.2, of the number of “passes” that a compiler uses.

  • 1.4.2.3 Some More History  

    Note: The Course Overview handout has a very good section on history, which helps us understand why the structure of a compiler is as described in previous sections.  Read the Historical Perspective section, which shows the important influence of programming languages on the development of compilers. 

  • Unit 2: Formal Languages and Formal Grammar  

    Formal languages and formal grammars are the theoretical foundation for computer languages and the compilation process.  Formal languages are defined by their grammars, which specify the syntax of the languages.

    Unit 2 Time Advisory
    Unit 2 Learning Outcomes
  • 2.1 Motivation for Formal Languages  
  • 2.2 Formal Grammars  
  • 2.3 Syntax of Formal Languages  
  • 2.4 Semantics of Formal Languages  
  • Unit 3: Finite State Machines  

    Finite state machines (FSM), also called finite state automata (FSA), are conceptual models for recognizing, parsing, and generating strings in a formal language.  An FSM can be used to recognize (i.e., determine) whether a string adheres to the syntax of a language.  Moreover, an FSM can be used to build a syntax tree, which shows the derivation (i.e., how the string was constructed) of the string.  This unit introduces (or reviews) FSMs, which are covered in detail in other courses (for example, CS202: Discrete Structures).
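
    To make the recognizer idea concrete, here is a minimal sketch in C of a two-state DFA that accepts strings over {a, b} containing an even number of a's; the language and state names are illustrative, not from the readings:

        #include <stdio.h>

        /* DFA states: S0 = even number of 'a's seen (accepting), S1 = odd. */
        enum state { S0, S1 };

        /* Returns 1 if the input string (over {a, b}) has an even number of 'a's. */
        static int accepts(const char *input) {
            enum state s = S0;                         /* start state */
            for (const char *p = input; *p; p++) {
                switch (*p) {
                case 'a': s = (s == S0) ? S1 : S0; break;  /* 'a' flips the parity */
                case 'b': break;                           /* 'b' leaves the state unchanged */
                default:  return 0;                        /* symbol not in the alphabet */
                }
            }
            return s == S0;                            /* accept iff we end in an accepting state */
        }

        int main(void) {
            const char *tests[] = { "", "ab", "aab", "baba", "aaa" };
            int n = sizeof tests / sizeof tests[0];
            for (int i = 0; i < n; i++)
                printf("\"%s\" -> %s\n", tests[i], accepts(tests[i]) ? "accept" : "reject");
            return 0;
        }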

    Unit 3 Time Advisory
    Unit 3 Learning Outcomes
  • 3.1 Definitions  
  • 3.2 Some Results and Examples of FSAs  
  • 3.3 Some Applications of FSA  
    • Reading: University of California, Berkeley: Paul Hilfinger’s CS164 Programming and Compilers: Notes for Lecture 3 “FSA”

      Link: University of California, Berkeley: Paul Hilfinger’s CS164 Programming and Compilers: Notes for Lecture 3 “FSA” (PDF)

      Instructions: Select the PDF link for lecture notes 3 and study slides 8 to the end.  These notes repeat some of the material from section 3.2 and provide additional practice in applying FSAs.  

      Terms of Use: Please respect the copyright and terms of use displayed on the webpages above.

    • Assessment: MIT: S. Amarasinghe and M. Rinard, 6.035 Computer Language Engineering: “Practice Quiz 1”

      Link: MIT: S. Amarasinghe and M. Rinard, 6.035 Computer Language Engineering: “Practice Quiz 1” (PDF)
       
      Instructions: Select the PDF for Practice Quiz 1.  In this assessment, you will do extended versions of exercises 1 and 2, and complete exercise 3 as given.  For exercise 1: do the FSA part of this exercise.  This exercise gives you practice in forming an FSA (finite state automaton) to recognize a language.  The exercise takes the following steps:
       
                  regular language → regular expression → FSA that recognizes the given language
       
      For exercise 2, give the FSA for the regular expression of this exercise.   Then, complete exercise 3 as listed.
       
      The reference's solutions are given at the end of the PDF.  For the exercises on FSAs, you should use paper and pencil to draw the FSA as it is formed.  Additional exercise and solution information is available in The Saylor Foundation’s “Answer Key to CS304 Assessment 2” (PDF).

  • Unit 4: Scanning and Lexical Analysis  

    Lexical analysis is performed by a scanner, one of the front-end components of a compiler.  The foundation for lexical analysis is provided by regular grammars and finite state automata.  This unit studies scanners and lexical analysis in terms of development process products: requirements, functions, design, construction, and test.  The verification of a scanner is done through testing.  Validation is based on the programming language specifications, and operation of the scanner as a component of the compiler or application system that uses it.

    Unit 4 Time Advisory
    Unit 4 Learning Outcomes
  • 4.1 Lexical Analysis Introduction and Overview  
  • 4.2 Requirements for a Scanner  
  • 4.3 Functions of a Scanner  

    Note: The reading in 4.2 also applies to this subunit.  Pages 1 and 2 of the notes describe scanner functions.  Also, study lecture slides 13 – 46.

  • 4.4 Review of Regular Expressions, FSAs, and Regular Languages  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 2: Lexical Analysis”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 2: Lexical Analysis” (PDF)
       
      Instructions: Select the link and click “Get the book contents” to download a PDF version of the book.  Please look over Chapter 2 on Lexical Analysis.  If you are comfortable with this material just review it.  If you feel you are somewhat uncomfortable with this material, read or study it to become comfortable with it; it is the foundation for much of our work on compilers.  This reading applies to subsections 4.4.1 – 4.4.4.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 4.4.1 Regular Expressions to NFAs  

    Note: The reading of 4.4 applies to this subsection.  A regular expression has an associated non-deterministic finite automaton (NFA) that accepts the language it denotes.

  • 4.4.2 NFAs to DFAs  

    Note: The reading of 4.4 applies to this subsection.  An NFA can be converted to a DFA (deterministic finite automaton), in which the next state is completely determined by the current state and the input symbol.

  • 4.4.3 Space and Speed Optimization of DFAs  

    Note: The reading of 4.4 applies to this subsection.  The recognition of a regular expression in a regular language can be done in several ways: by operating on the expression directly, using an NFA, or using a DFA.  Also, an automaton can be transformed to an equivalent having fewer states.  The size (that is, the number of states) and the speed of the automaton determine the most efficient way to recognize a regular expression.  Note in section 2.7 of the reading the time estimates for processing an expression by a DFA and by an NFA.

  • 4.4.4 Power of Regular Languages  

    Note: The reading of 4.4 applies to this subsection also.  Regular languages include many languages and can be used for many applications, including scanners.  Further, given regular languages, their union, concatenation, repetition, set difference, and intersection are also regular languages—we say that regular languages are closed under these operations.  Regular languages can be expressed, equivalently, using regular expressions, NFAs, or DFAs.  A key limiting characteristic of regular languages is seen from DFAs.  A DFA is a finite automaton and thus can remember only a finite number of symbols.  Hence, no DFA can recognize the language of strings of the form a^n b c^n for arbitrary n, because matching the number of a’s and c’s for all n would require an unbounded number of states.  Most computer languages are not regular, and we will need a larger class of formal languages to parse them.
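
    For a language of this shape, the standard pigeonhole argument can be sketched in LaTeX notation as follows (a sketch, not a formal proof):

        L = \{\, a^n b\, c^n : n \ge 0 \,\}
        \text{Assume a DFA with } k \text{ states accepts } L.
        \text{Among the } k+1 \text{ prefixes } a^0, a^1, \dots, a^k,
        \text{two, say } a^i \text{ and } a^j \text{ with } i \ne j, \text{ reach the same state.}
        \text{The DFA then gives the same verdict on } a^i b c^i \in L
        \text{ and } a^j b c^i \notin L, \text{ a contradiction; hence } L \text{ is not regular.}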

  • 4.5 Design of a Scanner  
  • 4.6 Construction of a Scanner  
  • 4.7 Verification and Validation of a Scanner  

    Note: Scanners for production compilers or software systems must be verified (by proof, demonstration, review, or test, that shows that the requirements, design, and performance specifications are met) and validated (also by proof, demonstration, review, or test that shows that the scanner satisfies the needs of its users and its role in a larger containing system—either compiler or system application).  The amount of effort expended in verifying and validating a scanner is dependent on the purpose and intended use of the scanner.  For both verification and validation, the description of the input language must be shown to be correct.  If the scanner is generated, then the quality of the scanner depends on the reliability and effectiveness of the generator.  

    • Assessment: ePaperPress: Tom Niemann's Tutorial on LEX and YACC

      Link: ePaperPress: Tom Niemann's Tutorial on Lex and Yacc (PDF)
       
      Instructions: Click on the link above to access the ePaperPress website.  Then, click on the hyperlink for the PDF format.  In this exercise, you will be introduced to LEX and YACC, which stand for Lexer (short for lexical analyzer) and Yet Another Compiler-Compiler.  They are generators, i.e., programs that generate other programs, in this case a scanner and a parser.  Our focus in this Unit 4 Assessment will be on LEX (actually, on FLEX, which stands for Fast Lexer).  Read the entire PDF tutorial on LEX; it provides a concise description of LEX and YACC.  A generic skeleton of a LEX input file is shown after the exercises below.  Then, please follow these instructions for each exercise:
                 
      Exercise #1: Draw a high level block diagram depicting LEX and a C compiler, together with the input and output files for each of them.
       
      Exercise #2: Give the LEX input file for replacing 00 with the text 'error' in the language L of Exercise #1 from Assessment 1. 
       
      Exercise #3 (optional): For this exercise, we will be using several software tools: a generator for scanners (lexical analyzers), a C compiler, and a text editor for creating input files.  We will use a scanner generator called FLEX for creating scanners.  FLEX is based on the earlier, original LEX.  Several good C compilers are available, for example gcc (on Windows, through an environment such as Cygwin); we will use gcc.  There are also several editors for creating input files; we'll use Notepad++.
       
      a) Install FLEX on your computer.  Installation guidance is given below.
      b) If you do not have a C compiler, install one on your computer.  Instructions for installing the gcc compiler can be found on YouTube.
      c) Check your answer to Exercise 2 by using FLEX and your C compiler to create a scanner; test the scanner by running it on a collection of sample input strings that are in the language of Exercise #1 from Assessment 1.  Also, run it on some strings that are not in the language of Exercise #1.
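 
      For orientation (this is not a solution to Exercise #2), a LEX input file has three sections separated by %% lines.  Here is a minimal, hedged skeleton that tokenizes numbers, assuming FLEX and gcc; the patterns and messages are illustrative:
 
          %{
          /* Definitions section: C declarations copied into the generated scanner. */
          #include <stdio.h>
          %}

          %%
            /* Rules section: each rule is a regular-expression pattern and a C action. */
          [0-9]+      { printf("NUMBER(%s)\n", yytext); }
          [ \t\n]+    { /* skip whitespace */ }
          .           { printf("OTHER(%c)\n", yytext[0]); }
          %%

          /* User code section: a main program that drives the generated scanner. */
          int main(void)   { return yylex(); }
          int yywrap(void) { return 1; }   /* signal end of input */
 
      It can typically be built with flex scanner.l followed by gcc lex.yy.c -o scanner.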
       
      You can check your answers with The Saylor Foundation’s “Answer Key” (PDF).  This assessment should take approximately 3 hours to complete (not including installation of FLEX and the gcc compiler).
       
      Installation Notes for FLEX, Bison, and the gcc (GNU Compiler Collection) Compiler
       
      These notes give suggestions on the installation of FLEX and Bison (open-source versions of the original Lex and Yacc, respectively) and of the gcc compiler.
       
      To download and install FLEX (used for Exercise #2 and Exercise #3) and Bison (used for Exercise #3):

      1) Use your search engine to find a free software version of Flex and Bison to download; for example, type in 'free software foundation fast lexical analyzer' or 'free software foundation software for Yacc'; look over the results and select a copy of FLEX and Bison to investigate.  Be careful to avoid risky sites.
      Look for a specific version of FLEX, for example, search for 'FLEX for Windows.'

      OR
       
      2) Go to the location here for Flex, here for a Windows version for Flex, and here for Bison.
       
      3) After selecting a 'reliable' site, such as gnu.org, look for a version of FLEX for your operating system and follow the instructions for downloading and installing it, including documentation.  This should also give you Bison.  After downloading FLEX, if Bison is not included, repeat the steps 1), 2), 3) for Bison. 
       
      4) For a C compiler, you may already have one installed on your machine.  If not, use your search engine to search for “Cygwin C compiler” or “gcc C compiler” and look for a free software version that you can download.  Be cautious of risky websites.  For the gcc compiler, for example, search for “codeblocks.org gcc compiler”.  Some of the results may refer to MinGW (a minimal GNU environment for Windows that includes gcc), which will work.  Installation of a compiler can be involved, and the use of an installer is recommended.  A good tutorial on the installation of the gcc compiler can be viewed on YouTube.
       
      Make sure that the versions of FLEX or LEX and of YACC or Bison that you obtain are compatible.

       
      Terms of Use: Please respect the copyright and terms of use on the webpages displayed above.

  • Unit 5: Parsing and Syntax Analysis  

    The next step of the compilation process is parsing.  This step also has a foundation in formal languages and automata.  Parsing takes input from the Lexical Analysis step and builds a parse tree, which will be used in later steps to develop the machine code.  In this unit, we will define parsing and identify its uses.  We will also discuss two parsing strategies, Top-Down Parsing and Bottom-Up Parsing, examining what it means to approach parsing from each standpoint and taking a look at an example of each.  By the end of the unit, you will understand parsing techniques with regard to compilers and be able to discuss each of the two main approaches.
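
    As a preview of top-down parsing, here is a minimal recursive-descent sketch in C for the toy grammar E -> T { '+' T }, T -> F { '*' F }, F -> digit; the grammar and function names are illustrative, not from the readings, and for brevity the parser computes a value instead of building an explicit parse tree:

        #include <ctype.h>
        #include <stdio.h>
        #include <stdlib.h>

        static const char *input;   /* cursor into the token stream (here, raw characters) */

        static void fail(const char *msg) {
            fprintf(stderr, "parse error: %s at '%s'\n", msg, input);
            exit(1);
        }

        static int parse_factor(void) {                 /* F -> digit */
            if (isdigit((unsigned char)*input))
                return *input++ - '0';
            fail("expected a digit");
            return 0;                                    /* not reached */
        }

        static int parse_term(void) {                    /* T -> F { '*' F } */
            int v = parse_factor();
            while (*input == '*') { input++; v *= parse_factor(); }
            return v;
        }

        static int parse_expr(void) {                    /* E -> T { '+' T } */
            int v = parse_term();
            while (*input == '+') { input++; v += parse_term(); }
            return v;
        }

        int main(void) {
            input = "2+3*4";
            int v = parse_expr();
            if (*input != '\0') fail("unexpected trailing input");
            printf("accepted; value = %d\n", v);         /* prints 14 */
            return 0;
        }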

    Unit 5 Time Advisory
    Unit 5 Learning Outcomes
  • 5.1 Parser Introduction and Overview  
  • 5.2 Requirements of a Parser  
    • Reading: Wikipedia: “Parsing”

      Link: Wikipedia: “Parsing” (PDF)
       
      Instructions: Please read the following three sections: “Programming languages,” “Overview of the process,” and “Types of Parsers.”  Requirements of a parser include: building an internal representation of the input tokens (which come from the scanner); checking that the input string of tokens is legal in the language of the compiler (i.e., that it can be derived from the start symbol); and determining the derivation steps for the input string of tokens.  In addition, the parser should be reliable, efficient (how efficient depends on the intended use), user friendly (i.e., provide clear, accurate messages), and supportable (assuming that the parser will be used for a long time).
       
      Terms of Use: The article above is released under a Creative Commons Attribution-Share-Alike License 3.0 (HTML).  You can find the original Wikipedia version of this article here (HTML). 

  • 5.3 Functions of a Parser  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 3: Syntax Analysis”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 3: Syntax Analysis” (PDF)
       
      Instructions: Select the link and click “Get the book contents” to download a PDF version of the book.  Please read subsections 3.4 – 3.6.  The functions of a parser include: building an internal representation of the derivation tree and related parser information, and resolving ambiguities of the language pertaining to the input string of tokens. 
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Lecture 3: “Top-Down Parsing”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Lecture 3: “Top-Down Parsing” (PDF)
       
      Instructions: Select the PDF for Lecture 3 and read slide 4.  The first bullet is a requirement statement and the third bullet is a function statement.  An additional function is the output of meaningful and accurate messages, including error messages. 
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 5.4 Formal Language Considerations  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 3: Syntax Analysis”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 3: Syntax Analysis” (PDF)
       
      Instructions: Note that this subsection is just a review of what has been covered in Units 2 and 3.  Select the link and click “Get the book contents” to download a PDF version of the book.  Please read subsections 3.2 and 3.3.  The main idea is that context-free languages are used to build efficient parsers but are supplemented with special techniques to resolve ambiguities.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 5.5 Design of a Parser  
  • 5.5.1 Top-Down Parsers  
  • 5.5.2 Bottom-Up Parsers  
  • 5.6 Construction of a Parser  
  • 5.7 Verification and Validation of a Parser  

    Note: The note in section 4.7 on Verification and Validation (V&V) of a Scanner also applies to the V&V of a parser.  If a parser is constructed using a parser generator, V&V depends on the V&V of the generator and the correctness of the input to the generator.  Formal proof methods, reviews, and testing can be used to verify the correctness of a parser (i.e., that it produces the correct derivation for a legal program and that it outputs correct error messages for incorrect syntax in the program).  Formal proof methods alone are not practical for current programming languages, but in conjunction with testing can be used to verify the correctness of a parser.  Validation (i.e., satisfaction of the requirements of a parser) requires testing and operational use. 

    • Assessment: ePaperPress: Tom Niemann's “Tutorial on LEX and YACC”

      Link: ePaperPress: Tom Niemann's “Tutorial on LEX and YACC” (PDF)
       
      Instructions: Select the link above and choose the PDF format.  In this exercise, you will again work with LEX and YACC, the generators introduced in the Unit 4 assessment.  Our focus in this assessment will be on YACC.  Read the part of the tutorial on YACC; it explains YACC and how YACC and LEX interface.  LEX and YACC are the original programs, and just as FLEX is an open-source version of LEX, Bison is an open-source version of YACC.  A generic skeleton of a YACC input file is shown after the exercises below.

      Exercise #1: Add a block diagram depicting YACC and a C compiler, the connections with their input and output files, and the connections to LEX.
       
      Exercise #2: In Assessment 3, we developed a scanner for recognizing the language L of Exercise #1 of Assessment 1.  Here, instead of using LEX, we will use YACC.  YACC builds a context-free syntax analyzer, i.e., a parser, for the language we want to analyze; LEX builds a regular-expression analyzer.  Because the former is more powerful than the latter, whatever LEX can do, YACC should also be able to do.  For this exercise, describe L by a grammar that can be input to YACC to create a parser for recognizing the strings of L.
       
      Exercise #3 (Optional): For this exercise, we will build the parser for the language L, described in Exercise #2 above, to check our solution to Exercise #2.  We will use several software tools: a scanner generator (for lexical analyzers), a parser generator, a C compiler, and a text editor for creating input files.  We will use the scanner generator FLEX, based on the original LEX; the particular version of FLEX we will use is Win_Flex, a FLEX scanner generator that runs on Windows.  Several good C compilers are available; we will use gcc.  There are also several editors for creating input files; we will use Notepad++.  For the parser generator, we will use Bison, an open-source parser generator based on the original YACC.  The version of Bison that we will use is Win_Bison, a version that runs on Windows.  You may download the executables for Win_Flex and Win_Bison.
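 
      For orientation (this is not a solution to Exercise #2), a YACC input file has the same three-section layout as a LEX file.  Here is a minimal, hedged skeleton whose grammar accepts any nonempty string of ZERO/ONE tokens; the token names are illustrative, and the generated parser must be linked with a LEX-built scanner that returns them:
 
          %{
          /* Declarations copied into the generated parser. */
          #include <stdio.h>
          int yylex(void);                  /* provided by the LEX-built scanner */
          void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
          %}

          %token ZERO ONE                   /* token kinds shared with the scanner */

          %%
          string
              : string bit                  /* a recognized string extended by one bit */
              | bit
              ;
          bit
              : ZERO
              | ONE
              ;
          %%

          int main(void) { return yyparse(); }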
       
      You can check your answers with The Saylor Foundation’s “Answer Key” (PDF).  This assessment should take approximately 3 hours to complete (not including installation of Bison and the gcc compiler).
       
      Installation guidance for Flex, Bison, and the gcc compiler was given at the end of the problem statement for Assessment 3.  Additional guidance is included in the following problem statement.
       
      a) Install Bison on your computer.  (FLEX should already be installed from Assessment 3.)  Installation of Bison may have been done when you installed FLEX; if not, see the installation guidance in the instructions for Assessment 3.  The installation of Bison includes documentation.  Good manuals can be found here for FLEX for Windows and here for Bison for Windows.

      b) If you do not have a C compiler, install a C compiler on your computer.  Instructions for installing the gcc compiler can be found on YouTube.

      c) Check your answer to Exercise 2 by using Bison and your C compiler to generate a simple parser for language L – strings of 0's and 1's, not including 00, and ending in 1.  Test the parser by running it on a collection of sample input strings that are in the language L.  Also, run it on some strings that are not in L.
       
      Terms of Use: Please respect the copyright and terms of use on the webpages displayed above.

  • Unit 6: Syntax Directed Translation and Semantic Analysis  

    Semantic Analysis takes input from the parsing process and prepares the code for the code-generation step.  In this unit, we will discuss this process in detail, learning about scope and type checking, type expressions, type equations, and type inference.

    Unit 6 Time Advisory
    Unit 6 Learning Outcomes
  • 6.1 Syntax-Directed Translation and Attribute Grammars  
  • 6.2 Intermediate Representation  
  • 6.2.1 Representation of Program Data—Symbol Tables  
  • 6.2.2 Representation of Program Computation  

    Note: The reading of section 6.2, slides 68 to the end, covers this subsection's topic.

  • 6.3 Functions of Semantic Analysis  

    Note: Semantic analysis includes checking the scope of names in a program and checking for type errors.

  • 6.3.1 Scope Checking of Names in a Program  
  • 6.3.2 Scope Checking with Inheritance  
  • 6.3.3 Static vs. Dynamic Scope Checking  
  • 6.3.4 Type Checking  

    Note: Type checking is one of the static (i.e., compile-time) checks of semantic analysis.  Others are flow-of-control and uniqueness checks.  Checking is done when the intermediate representation is being created.  
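
    For example, a C compiler's semantic analysis accepts the implicit conversion below but would reject the commented-out line at compile time (an illustrative snippet, not from the readings):

        #include <stdio.h>

        int main(void) {
            int n = 3;
            double d = n;          /* legal: the type checker inserts an int-to-double conversion */
            /* double *p = &n; */  /* illegal if uncommented: 'int *' is not 'double *';
                                      semantic analysis rejects it before any code is generated */
            printf("%f\n", d);
            return 0;
        }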

  • 6.3.4.1 Type Expressions, Type Equivalence, Type Inference, and What to Check  
  • 6.3.4.2 Type Systems as Proof Systems—Type Checking as Proofs  
  • 6.3.4.3 Applications of Type Proofs  
  • 6.3.4.4 Type Equations, Unification and Binding of Type Expressions  
  • 6.4 Verification and Validation of Semantic Analysis  

    Note: The note in section 5.7 on Verification and Validation (V&V) of a Parser also applies to the V&V of semantic analysis.  By verification, we mean that the semantic analyzer performs a correct analysis relative to the programming language definition.  Formal proof methods using a type calculus, reviews, and testing can be used to verify the correctness of the semantic analysis.  Verification includes showing that the analyzer performs the required analysis, adheres to the programming language definition, and outputs correct error messages.  “Valid” means that the semantic analysis satisfies its end requirements in the overall compilation process.  Requirements include processing time and memory space targets, maintainability, and reliability.

    • Assessment: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 6: Type Checking”

      Link: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 6: Type Checking” (PDF)
       
      Instructions: Please click on the link above, and select “Get the Book Contents” to download the PDF file of Mogensen’s “Chapter 6: Type Checking.”  Complete the following:
       
      Exercise #1: Do Exercise 6.1 in Mogensen’s “Chapter 6 Type Checking” on page 143.
      Exercise #2: Do Exercise 6.2 and 6.3 in Mogensen’s “Chapter 6 Type Checking” on pages 143 and 144.
       
      These exercises will give you some practice with semantic analysis and are based on Chapter 6, which covers type checking.  Then, please check your answers against the solutions in the Saylor Foundation’s “Answer Key to Assessment 5.” (PDF)
       
      This assessment should take you approximately 3 hours to complete.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • Unit 7: Runtime Environment  

    Runtime environment considerations include organization of the compiled program, storage, building blocks of a compiled program, and different runtime configurations.  This unit has some overlap with Unit 8.  The emphasis in this unit is on representation of needed data structures and techniques for objects and inheritance.  The emphasis of Unit 8 is on instruction generation.  

    Unit 7 Time Advisory
    Unit 7 Learning Outcomes
  • 7.1 Overview of Runtime Environments  
    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 22: “Runtime Environments”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 22: “Runtime Environments” (PDF)
       
      Instructions: Select the PDF for Handout 22.  Read the handout, which discusses data representations and their organization in memory for a program.  This reading also applies to subunits 7.1.3 through 7.1.5 below.  After semantic analysis, intermediate representations are encoded.  This is the last step of the “front-end” of the compilation process.  The relationship of front ends to back ends can be many-to-one or one-to-many.  In the former case, a single back end is used for several languages, each handled by its own front end.  In the latter case, one front end handles the input, source language, and the back end is used for several target machines, each having its own back end.
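 
      As a small illustration of the storage classes a runtime environment must support, the C program below places data in the three conventional regions (region names are the standard ones; actual addresses vary by platform):
 
          #include <stdio.h>
          #include <stdlib.h>

          int global_counter = 0;   /* statically allocated: a fixed location for
                                       the program's whole lifetime */

          void show(void) {
              int local = 1;        /* stack allocated: lives in this call's
                                       activation record (stack frame) */
              int *heap = malloc(sizeof *heap);  /* heap allocated: lifetime is
                                                    controlled explicitly */
              *heap = 2;
              printf("global %p  stack %p  heap %p\n",
                     (void *)&global_counter, (void *)&local, (void *)heap);
              free(heap);
          }

          int main(void) { show(); return 0; }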
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 7.1.1 Functions and Requirements of Intermediate Representation (IR)  
  • 7.1.2 Design of the IR Step  

    Note: The reading of 7.1.1 applies to this subunit.

  • 7.1.3 Data Structure Encoding in Memory  

    Note: The reading of 7.1 applies to this subunit.  The encoding of primitive types and arrays is discussed.  From the 7.1.1 reading, see slides 19 through 37 for additional coverage.

  • 7.1.4 Scope and Extent of Variables  

    Note: The reading of 7.1 applies to this subunit.  Scope of a variable is the lexical area of a program in which the variable can be used.  Extent, or lifetime, is the period of time that a variable exists.  
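
    A compact C illustration of the difference: the variable below has function-local scope but whole-program extent:

        #include <stdio.h>

        int next_id(void) {
            static int counter = 0;   /* scope: visible only inside next_id;
                                         extent: exists for the entire run */
            return ++counter;
        }

        int main(void) {
            printf("%d\n", next_id());   /* prints 1 */
            printf("%d\n", next_id());   /* prints 2: the value survived the first call */
            return 0;
        }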

  • 7.1.5 Runtime Stack and Parameter Passing  

    Note: In addition to the following reading, the reading of 7.1 applies to this subunit.

  • 7.2 More Complicated Representations: Objects and Inheritance  
  • 7.3 Encoding of Exceptions and OOP (Object-Oriented Programs)  
  • Unit 8: Code Generation  

    This unit is closely related to Unit 7, where the emphasis was on representation of data structures needed for run-time.  While there will be some overlap, the emphasis in this unit is on instruction-level intermediate code generation and from intermediate code to target code.

    The last phase (or next-to-last phase, if there is a code optimization phase) of the compilation process is code generation, where the output from the previous steps is finally translated into machine code, ready to execute on the target platform.  In this unit, we will start with a discussion of code generation in general before moving on to a more detailed description of the code generation process.  This will include an in-depth discussion of three main areas: Instruction Selection, Instruction Scheduling, and Register Allocation.  By the end of this unit, you will have a firm understanding of the code generation process.

    Unit 8 Time Advisory
    Unit 8 Learning Outcomes
  • 8.1 Introduction to Intermediate Code Generation  
  • 8.1.1 Requirements and Overview  

    Note: The reading of 8.1 applies to this subunit.  Requirements of intermediate-code generation are included in sections 7.1 and 7.2.  

  • 8.1.2 Examples of Abstract Syntax Tree to Intermediate-Code Generation  
    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 23: “Intermediate Representation”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 23: “Intermediate Representation” (PDF)
       
      Instructions: Select the PDF for Handout 23.  Intermediate representation of a source program may be part of the front end or of the back end, depending on the design of the compiler and its intended use, or it may simply be called the “middle part” of the compiler, between the front end and the back end.  Handout 23 covers IR in the context of code generation.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above

  • 8.2 Detailed Example: Three Address Code (TAC)  
  • 8.2.1 TAC Language and Generation from an AST  
    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 24: “TAC Examples” and Lecture 13 “TAC”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 24: “TAC Examples” (PDF) and Lecture 13 “TAC” (PDF)
       
      Instructions: Select the PDF for Handout 24 and read it.  TAC is a generic assembly language.  Slides 1 through 149 give examples of TAC statements, function calls, and encoding of objects.

      Note: This reading also applies to 8.2.2.  Slides 150 to 173 give examples of generation of TAC from an AST (abstract syntax tree).  Select the link for the video, skim it and, if it is helpful to you, watch it.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpages above.

  • 8.2.2 Generation of TAC  

    Note: The reading of 8.2.1 applies to this subunit.  The lecture slides 150 through 173 describe the generation of TAC.
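
    A minimal sketch in C of the recursive idea behind AST-to-TAC generation; the node shape and temporary-numbering scheme are illustrative, not the lecture's exact scheme:

        #include <stdio.h>

        /* An expression AST node: a leaf naming a variable (op == 0),
         * or an interior node applying a binary operator to two subtrees. */
        typedef struct Node {
            char op;                  /* '+', '*', ...; 0 for a leaf */
            const char *name;         /* variable name if leaf */
            struct Node *left, *right;
        } Node;

        static int next_temp = 0;

        /* Emit TAC for the subtree; return the temporary holding its value. */
        static int gen(const Node *n) {
            if (n->op == 0) {                      /* leaf: copy variable into a temp */
                int t = next_temp++;
                printf("t%d = %s\n", t, n->name);
                return t;
            }
            int l = gen(n->left);                  /* code for operands first */
            int r = gen(n->right);
            int t = next_temp++;
            printf("t%d = t%d %c t%d\n", t, l, n->op, r);  /* one TAC instruction */
            return t;
        }

        int main(void) {
            /* AST for: a + b * c */
            Node a = {0, "a", 0, 0}, b = {0, "b", 0, 0}, c = {0, "c", 0, 0};
            Node mul = {'*', 0, &b, &c};
            Node add = {'+', 0, &a, &mul};
            gen(&add);
            /* prints:
               t0 = a
               t1 = b
               t2 = c
               t3 = t1 * t2
               t4 = t0 + t3 */
            return 0;
        }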

  • 8.3 Additional Intermediate Language Generation Examples  
  • 8.4 Generation of Machine Code  
  • 8.4.1 Introduction: Intermediate Code to Machine Code  
    • Reading: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 8: Machine-Code Generation”

      Link: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 8: Machine-Code Generation” (PDF)
       
      Instructions: Select the link and click “Get the book contents” to download a PDF version of the book.  Please read Chapter 8.  It describes the translation from a low-level intermediate language to machine code.  This translation addresses the handling of restrictions imposed by the machine language; for example, a finite number of registers is a restriction of a target machine and its machine language.  The intermediate language assumes an unlimited number of registers, i.e., a virtual register machine.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • 8.4.2 Details: Intermediate Code to Machine Code  
  • 8.5 Verification and Validation of Code Generation  

    Note: As with previous parts of a compiler and previous steps of the compilation process, we are interested in the verification (i.e., does the code generator work correctly) and the validation (i.e., does the code generator satisfy the requirements as part of the compiler).  The same kinds of techniques used for the previous parts of a compiler apply, as well, to the code generator. 

    If our code generator produces intermediate code and we have a known “correct” generator from intermediate code to machine code, then our V&V efforts can focus on the translation to intermediate code.  Moreover, if we have a known “correct” compiler from another source language that is similar to our source language and satisfies similar requirements, we could write a simple translator between them and show that the simple translator is correct.  Again, if we want to verify intermediate code to assembler code, and we have a known “correct” generator from the same intermediate language to another assembler language, we could write a simple translator between the assembler languages and then verify that translator. 

    The use of an intermediate language has another benefit: it can make the use of formal proofs of correctness easier than their use on AST to target language generation. Lastly, measurements of process time and memory usage will be needed to validate time and space requirements.  

    • Assessment: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 7: Intermediate Code Generation”

      Link: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 7: Intermediate-Code Generation” (PDF)

      Instructions: Please click on the link above, and then select “Get the book contents” to download a PDF version of the book.  Complete the following:

      Exercise #1: Do Exercise 7.1 in Mogensen’s “Chapter 7 Intermediate-Code Generation” on page 173.

      Exercise #2:  Do Exercise 8.2 in Mogensen’s “Chapter 8 Machine-Code Generation” on page 189.

      These exercises will give you some practice with code generation and are based on Chapter 7, which covers intermediate code generation, and Chapter 8, which covers machine code generation.  Intermediate code can be high level, i.e. close to the input language, or low level, i.e. close to the target machine language, and, hence, though Exercise #2 jumps to Chapter 8, the process of syntax directed translation is the same.  These exercises illustrate the process for syntax directed translation to intermediate code or machine code.  You can check your answers with the Saylor Foundation’s “Answer Key” (PDF).
       
      This assessment should take you approximately 3 hours to complete.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • Unit 9: Code Optimization  

    Simply compiling and executing a program is not enough to get the most out of your code.  It is the optimization process that allows your code to run as effectively and efficiently as possible.  In this unit, we will first take a look at optimization, learning what it is and why we are interested in it.  Next, we will review different optimization categories, including Peephole, Local, Loop, Language Dependent, and Machine Dependent.  We will conclude with a discussion of different optimization techniques.  By the end of this unit, you will have a basic understanding of a wide range of optimization techniques and how they improve the effectiveness of your program.

    Unit 9 Time Advisory
    Unit 9 Learning Outcomes
  • 9.1 The What and Why of Code Optimizations  
  • 9.2 Fundamentals of Code Optimization  
  • 9.2.1 Basic Blocks and CFG (Control Flow Graph)  

    Note: The reading for 9.2 applies to this subsection.

  • 9.2.2 Types of Optimizations  

    Note: The reading for 9.2 applies to this subsection.

  • 9.3 Local Intermediate Code Optimizations: Definitions and Examples  

    Note: The link for 9.2 applies to this subsection.  Read slides 9 through 24.  Slide 27 lists some challenges.  Slides 25 and 26 apply to subsection 9.5 below on Code Optimization.  Peephole optimization is a technique for improving assembly code.
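
    A small before/after illustration of one such local optimization, common subexpression elimination (shown at the source level for readability; a compiler performs it on the intermediate code):

        #include <stdio.h>

        /* Before: the subexpression a * b is computed twice. */
        int before(int a, int b, int c) {
            int x = a * b + c;
            int y = a * b - c;
            return x + y;
        }

        /* After common subexpression elimination: compute a * b once and reuse it. */
        int after(int a, int b, int c) {
            int t = a * b;   /* the shared value, held in a temporary */
            int x = t + c;
            int y = t - c;
            return x + y;
        }

        int main(void) {
            /* Both versions compute the same result; only the work differs. */
            printf("%d %d\n", before(2, 3, 4), after(2, 3, 4));
            return 0;
        }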

  • 9.4 Global Intermediate Code Optimizations: Definitions and Examples  
  • 9.5 Code Optimization  

    Note: Slides 25 and 26 of the reading for 9.3, on peephole optimization, apply to this section.  

  • 9.6 Verification and Validation of Code Optimization  
    • Reading: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 26: “Code Optimization” and Lecture 14: “IR Optimization”

      Link: Stanford University: Keith Schwarz’s CS143 Compilers: Handout 26: “Code Optimization” (PDF) and Lecture 14: “IR Optimization” (PDF)
       
      Instructions: Select the PDF for Lecture 14.  Read slides 18 through 20 and slides 24 through 32.  These slides essentially state that an optimization must not change the behavior of the generated code; in particular, it must not introduce errors or change the semantics.  The V&V (verification and validation) techniques mentioned for the other parts of a compiler also apply to optimization.  In addition, these slides suggest that the formalisms for optimization—e.g., soundness—are directly applicable.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpages above.

    • Assessment: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 11: Analysis and Optimization”

      Link: University of Copenhagen: Torben Ægidius Mogensen’s Basics of Compiler Design: “Chapter 11: Analysis and Optimization” (PDF)
       
      Instructions: Please click on the link above, and select “Get the book contents” to download a PDF version of the book.  Complete the following:
       
      Exercise #1: Do Exercise 11.1 parts a) and b) in Mogensen’s “Chapter 11 Analysis and Optimization” on page 254.
       
      This exercise will give you some practice with code optimization and is based on Chapter 11, which covers analysis and optimization.  The exercise uses common subexpression elimination as an example of optimization.  You can check your answers with the Saylor Foundation’s “Answer Key” (PDF).
       
      This assessment should take you approximately 3 hours to complete.
       
      Terms of Use: Please respect the copyright and terms of use displayed on the webpage above.

  • Unit 10: Compiler Verification and Validation  

    The verification and validation (V&V) of a compiler consists of the V&V of its parts: scanner or lexical analyzer, syntax analyzer, semantic analyzer, intermediate code generator, intermediate code optimizer, code generator, and code optimizer.  See the V&V notes for each of these.  V&V is based on a careful plan, a well-defined compiler process (including version control or more extensive configuration control, and efficient and effective documentation), peer reviews, measurements (including defect analysis), component tests, formal proofs, and integration tests.  It can utilize known correct tools, such as parser generators, code generation templates, and other compilers and compiler parts.  In addition, a formal certification of the compiler can be done by an independent organization.

    Remember that verification pertains to correctness, which means satisfaction of specifications; and validation pertains to user needs, which means satisfaction of user operational requirements.  Verification addresses, for example, correctness of results of scanning, semantic analysis, code generation, and code optimization.  Validation includes meeting operational, functional, performance, and physical requirements for processing time and memory space, and also, the “ilities”—reliability, availability, maintainability, and usability. 

    V&V depends on the complexity of the source and target programming languages, the intended use of the compiler, the number of front ends and back ends, available tools, run-time environments and machine architecture, and the systems that the compiler will interface with.

    Finally, certification of a compiler—i.e., compliance to a standard—may be necessary for safety or security of critical programs.  

    Unit 10 Time Advisory
    Unit 10 Learning Outcomes
    • Reading: Wikipedia: “ANSI C”

      Link: Wikipedia: “ANSI C” (PDF)
       
      Instructions: Select the link and read about certification of ANSI C compilers.
       
      Terms of Use: The article above is released under a Creative Commons Attribution-Share-Alike License 3.0 (HTML).  You can find the original Wikipedia version of this article here (HTML).

    • Assessment: The Saylor Foundation’s “Assessment 8”

      Link: The Saylor Foundation’s “Assessment 8” (PDF)

      Instructions: Please review subunits 4.7, 5.7, 6.4, 8.5, and 9.6 of this course.  They discuss verification and validation of lexical analyzers, parsers, semantic analyzers, code generators, and code optimizers, respectively.  In this unit (Unit 10) and in these exercises, we combine the ideas of those sections and look at an integrated approach to the verification and validation of the compiler as a whole.
       
      Exercise #1: Outline an approach for verifying software that translates from one programming language to another.
       
      Exercise #2: Outline an approach for validating software that translates from one programming language to another.
       
      This assessment should take you approximately 1 hour and 30 minutes to complete.
       
      You can check your answers with the Saylor Foundation’s “Answer Key” (PDF)

  • Unit 11: Compiler Summary and Next Steps  

    This unit concludes our course study of compilers.  We summarize the topics we have studied in the course and look ahead to further study to build upon what we have learned in this course.  The topics covered may at first seem limited in their application to the development of compilers.  However, their application is much broader, and helps us write better programs.  The techniques, abstractions, and data structures are applicable to many other applications, including safety, security, high-performance applications, data and control flow analysis, internal or intermediate representation and encoding of structures, and optimization for external dependencies.

    Unit 11 Time Advisory
    Unit 11 Learning Outcomes
  • 11.1 Compiler Summary  
  • 11.2 Next Steps  

    Note: The reading of 11.1 also applies to this subunit.  See Stanford slides 235 to 239 and the last slide of the Berkeley notes.

  • Final Exam  
