Phases of translation
The C source file is processed by the compiler as if the following phases take place, in this exact order. An actual implementation may combine these actions or process them differently, as long as the behavior is the same.
Phase 1
- The source character set is a multibyte character set which includes the basic source character set as a single-byte subset, consisting of the following 96 characters:
Phase 2
Phase 3
If the input has been parsed into preprocessing tokens up to a given character, the next preprocessing token is generally taken to be the longest sequence of characters that could constitute a preprocessing token, even if that would cause subsequent analysis to fail. This is commonly known as maximal munch.
    int foo = 1;
    int bar = 0xE+foo;   // error: invalid preprocessing number 0xE+foo
    int baz = 0xE + foo; // OK

    int pub = bar+++baz;   // OK: bar++ + baz
    int ham = bar++-++baz; // OK: bar++ - ++baz
    int qux = bar+++++baz; // error: bar++ ++ +baz, not bar++ + ++baz.
The sole exception to the maximal munch rule is:
- Header name preprocessing tokens are only formed within a #include directive and in implementation-defined locations within a #pragma directive.
    #define MACRO_1 1
    #define MACRO_2 2
    #define MACRO_3 3
    #define MACRO_EXPR (MACRO_1 <MACRO_2> MACRO_3) // OK: <MACRO_2> is not a header-name
Phase 4
Phase 5
Note: some implementations let command-line options control the conversion performed at this stage. gcc and clang use -finput-charset to specify the encoding of the source character set, and -fexec-charset and -fwide-exec-charset to specify the encodings of the execution character set used for string literals and character constants that don't have an encoding prefix (since C11).
Phase 6
Adjacent string literals are concatenated.
Phase 7
Compilation takes place: the tokens are syntactically and semantically analyzed and translated as a translation unit.
Phase 8
Linking takes place: translation units and library components needed to satisfy external references are collected into a program image, which contains the information needed for execution in its execution environment (for example, the OS).
References
- C11 standard (ISO/IEC 9899:2011):
- 5.1.1.2 Translation phases (p: 10-11)
- 5.2.1 Character sets (p: 22-24)
- 6.4 Lexical elements (p: 57-75)
- C99 standard (ISO/IEC 9899:1999):
- 5.1.1.2 Translation phases (p: 9-10)
- 5.2.1 Character sets (p: 17-19)
- 6.4 Lexical elements (p: 49-66)
- C89/C90 standard (ISO/IEC 9899:1990):
- 2.1.1.2 Translation phases
- 2.2.1 Character sets
- 3.1 Lexical elements