Compilation Stages

Compilation Stages [C, GCC, Linux context]

Compilation involves different stages, one of them being called compilation.
The stages of compilation are:

  • preprocessing
  • compilation
  • Assembly
  • linking

To build a program:

gcc: Invoke Compiler
-Wall: Treat warning as error

-o: Tell output file name, default without -o option is a.out.


$ gcc -Wall --save-temps prog.c -o prog
# ls
prog*  prog.c  prog.i  prog.o  prog.s
PreProcessing: Pre-processed file has extension ,i [prog.i].
  • Macro Substitution:
  • Strip off Comments
  • Expand included file
In .i file, all #define MACRO will be replaced, all comments will be removed, and entire content of #include header fill be replaced in-place.

Library functions like printf are marked extern, i.e. defination of printf is yet to be found. It places a placeholder for printf address which shall be filled later by Linking stage.

Compiling: Intermediate compiled output has extension .s [prog.s]

Assembly: Intermediate compiled output is converted into object file [prog.o]. At this stage, the code is machine readable and function name don't resolve anymore.
This file has ELF [Executable Link Format] at start and rest all is jumbled code.

Linking:All unresolved symbols (such as printf etc) must be resolved at this point. By resolution, we mean compiler needs to put deterministic address of printf at its placeholder put in Preprocessing stage.

Useful commands to analyze compiled output files
objdump prog : display information from object files.
nm prog : list symbols from object files
There are different ways to represent stage of compilation, but the core process remains as explained above.
e.g. Preprocessing, Parser, Translation, Assembling and Linking. Here, Parsing and Translation can be thought as Compilation stage.

So, in summary, the stages from .c/.h to .out are
Preprocessor -> [pre-processed code, .i] -> Compiler -> [assembly code, .s] -> Assembler -> [Relocatable machine code, .o] -> Linker -> [Executable Machine Code, .out] -> Loader -> Memory

Linker takes library, other relocatable modules as input.

Compiler vs Interpretor:
Interpreter: Reads a statement from input, converts to intermediate code, executes. Stops if error, else read next statement.
Compiler reads whole source, creates token, check semantics, generates intermediate code, executes whole program

Compiler has two phase:
1. Analysis/Froneend: Read source, divide into parts, check lexical, grammar, syntax error, outputs intermediate representation of src file, symbol table
2. Synthesis/Backend: Generate target program

Phase vs Pass:
Pass: Traversal of compiler through entire program. It can have >=1 phase
Phase: Takes input from previous stage, processes, yields output for next state.

The phases of compiler are as below:
Source -> Lexical -> Syntax -> Semantic -> IntermediataCodeGen->Machine Independent Code Optimizer -> Code Generator -> Machine dependent code optimizer -> target code.

Analyzers:
Lexical: Removes space, comments, breaks syntax to tokens, uses regex, pattern rules.
              int val = 100;   keyword(int) identifier(val) operator(=) constant(100) symbol(;)
              Tokens can be Identifiers, Keywords, Operators, Special Symbols, Constants

Syntax/Parser: Analyse token streams, checks error and outputs parser tree which is group of tokens into syntantical structure called expressions
              Can not validate tokens, things like whether token declared/initialized before use, validate operations on tokens
              The parsers can be top down, bottom up, universal.
              Most common parser is bottom up -> shift reduce -> LR(left-right)
              LR it is most general non backtracking shift reduce parsing method, can detect syntactical error as quick as they can be detected.

Semantic: Does Scope resolution, type checking, array bound checks etc.
              Can detect type mismatch, undeclared variables, reserved identifier misuse, multiple declaration, out of scope variable access, actual vs formal param mismatch etc.

Lexical Analyser takes regex, finite automata as inputs
Syntax Analyser takes ContextFreeGrammer(CFG) as input

Regex can not check balancing e.g. (). Thus CFG use is required.

Example: pos = x + rate*100;
Lexical -> id1=id2+id3*id4  -> Syntax -> (1)parser tree(tokens arranged into tree like object) -> Semantic -> (2) -> Intermediate Code -> (3) -> Code Optimizer -> (4) -> Code Generator -> (5)
           
(1) Syntax Analyser Output  
 (2) Semantic Analyser Output


(3) Intermediate Code
tmp1=int to read 60
tmp2=id3*tmp1
tmp3=id2+tmp2
id1=tmp3

(4) Code Optimizer Output
tmp1=id3*60.0
id1=id2+tmp

(5) Code Generator Output
MOV id3, r2
MULF *60.0, r2
MOVF id2, r2
ADDF r2, r1
MOVF r1, id1

Example of Optimizations:
1) A=B+C+D; P=B+C+R;
Optimization: T=B+C; A=T+D; P=T+R

2) Move computation that produces same output out of the loop
etc.

Some Good Links:
COMPILER, ASSEMBLER, LINKER AND LOADER: A BRIEF STORY


No comments:

Post a Comment