[컴파일러] Chapter 3. Syntax Analysis(Part 1)

4학년 1학기 전공/컴파일러

[컴파일러] Chapter 3. Syntax Analysis(Part 1)

시데브 2025. 4. 12. 17:51

경희대학교 허선영 교수님의 컴파일러 수업을 기반으로 정리한 글입니다.

1. Front-end Compilation

- Lexical Analysis

문장을 토큰 단위로 분류하는 과정
정규 표현식을 사용하여 문장을 의미 단위로 구분

- Syntax Analysis

Lexical Analysis 과정에서 생성된 토큰을 기반으로 parse tree를 생성

- Semantic Analysis

프로그램의 선언과 진술이 의미론적으로 정확한지 확인하는 작업
syntax analysis와 달리 type checking 진행

2. Syntax Analysis

Abstract Syntax Tree(AST)를 빌드 -> token stream이 valid한지 체크
if the token stream is not valid -> syntax error & recover
ex) c++에서는 if 다음에 () 괄호 구조가 등장하지 않으면 문법 오류 -> syntax error

expression: value를 가지는 표현
- 1 + 2 -> 3
statement: 명령, side-effect
- c = 1 -> 명령 자체가 따로 값을 가지진 않음
ex) print("x")라는 명령어 자체는 expression X, but x라는 string은 expression

-> 토큰의 가장 말단에는 token이 위치

-> IF-THEN-ELSE도 어떤 값을 가지는 건 아니므로 statement

Why do we need a parser in addiotion to a lexer? -> 왜 Lexer에서는 문법을 체킹하지 못하는가
- 몇몇 프로그램은 recursive structures를 가질 수 있음
- Finite automata는 recursive contructs를 잡아내지 못함
- ex) digits = [0-9]+, expr = {digits} | "(" {expr} "+" {expr} ")"
(((1+2)+3)+4)와 같은 구조는 괄호가 얼마나 깊게 중첩되었는지 기억해야 함 -> but N개의 상태를 가지는 Finite Automata는 N보다 더 깊게 괄호가 중첩되면 기억이 불가능함

3. Context-Free Grammar

Regular Expressions: 토큰의 lexical structure 결정
Context-Free Grammars(CFG): 프로그램의 syntactic structure 결정

Grammar: set of productions
- production: 하나의 symbol이 여러 개의 symbol로 확장되는 구조
Symbol에는 2가지 종류가 존재
- Terminal: lexical anlysis에서 정의한 Token -> 확장이 불가능한 symbol
- Nonterminal: production의 left-side에 위치한 symbol -> 확장이 가능한 symbol
  - One of the nonterminals가 grammar의 start symbol이 됨

- example: Syntax for Straight-line program

straight-line program: 하나의 statement 당 세미콜론으로 한 줄씩 존재하는 프로그램

3.1. Derivation

Derivation(Execution of Parsing)
1. start symbol에서 시작
2. nonterminal을 right-hand sides를 이용하여 교체
Derivation을 수행하는 2가지 방식
- Leftmost Derivation: leftmost nonterminal 교체 in each step
- Rightmost Derivation: rightmost nonterminal 교체 in each step

-> leftmost or rightmost derivation

-> Start Symbol에서 시작해서 주어진 program과 동일하게 derivation 성공 시에 well-formed program

-> leftmost derivation, rightmost derivation

3.2. Parse Tree

derivation의 시각적 표현
internal node(leaf nodes를 제외한 nodes): nonterminal
leaf node: terminal

3.3. Ambiguous Grammars

parse tree를 통해서 grammar가 ambiguous한지 판단
같은 프로그램을 2개 이상의 parsing trees로 유도 가능 -> A grammar is ambiguous

-> grammar 재작성 필요!

- Problem

Different parse trees may have different meanings -> resulting in different interpreted results
example) 1 + 2 * 3

-> 둘의 결과가 다름!!

- Solution: 문법의 재정의

operators에 상대적 우선순위 부여
- *는 +보다 높은 우선순위 가짐
같은 우선순위를 가지는 경우에는 associativity로 묶기
- 1 + 2 + 3 -> ( 1 + 2 ) + 3 or 1 + ( 2 + 3 )

3.4. Inefficiency in Parse Tree

Concrete parse tree
- internal node - nonterminal
- leaf node - terminal
Concrete parse tree는 사용 불편
- 불필요한 터미널이 많아짐
Make parse trees simple -> Abstract Syntax Tree(AST)

-> tree 구조가 완성되고 나서는 (, ), ;, := 등의 symbol 정보는 필요없어짐

Abstract Syntax Tree(Abstract Parse Tree): concrete parse tree에서 불필요한 토큰 제거

example)

3.5. End-of-File Marker

parsing의 종료를 위해서는 EOF 존재해야 함
- program이 file 형태로 존재하기 때문
EOF marker: $
S' -> S$

4. Recursive Descent Parsing(Predictive Parsing)

다음 입력 심볼만 보고, 어떤 production을 적용할지 결정하는 기법
LL(k) parsing - token을 보고 미리 예측, k는 token의 개수
각 subexpression의 첫 번째 terminal symbol만 보고, 어떤 production을 사용해야 할지 결정할 수 있을 때 사용

- Example

-> Lookahead symbol이 IF면 if E ...로, BEGIN이면 begin S L로 확장

eat(): 현재 토큰이 expected token과 일치하면, 그 토큰을 소비(consume)하고 다음 토큰으로 넘어가는 함수
- consume? ['number', '+', 'number'] 상태에서, 파서가 처음에는 tokens[0]를 pointer로 가리키고 있다가, consume 처리 시에 해당 토큰을 처리됐다고 보고 포인터를 다음 토큰으로 이동

4.1. Contructing a Parser

- Basic Functions

nullable(X): X로부터 empty string(ε) 유도가 가능하면, X는 nullable
- production에 존재하는 모든 symbol이 nullable -> 해당 Symbol은 nullable

-> Z가 terminal이 되려면 반드시 d로 확장해야 함

FIRST(𝛾): 𝛾로부터 유도된 strings의 시작점에 위치할 수 있는 모든 terminals의 집합

A -> B C D일 때
- B -> ε | b
- C -> c
- D -> d
FIRST(B)는 FIRST(A)에 포함, 추가적으로 B가 nullable이므로 B 자리가 비고, 그 자리에 C가 들어갈 수도 있으므로 FIRST(C) 또한 FIRST(A)에 포함됨

FOLLOW(X): derivation 후에 X 뒤에 위치할 수 있는 모든 terminal symobols의 집합
- Z -> XYZ(Z가 start symbol)
  - X -> Y: FIRST(Y)는 FOLLOW(X)에 포함
  - Y가 nullable이므로, FIRST(Z)는 FOLLOW(Z)에 포함
- X -> 𝛾Y
  - FOLLOW(X)는 FOLLOW(Y)에 포함, Y의 마지막과 X의 마지막이 동일하기 때문
- 𝑋 → 𝛾1𝑌𝛾2
  - FIRST(𝛾2)는 FOLLOW(Y)에 포함
  - 𝛾2가 nullable일 때, FOLLOW(X)는 FOLLOW(Y)에 포함

- Example

1) Initial

2) Interation 1

3) Iteration 2

4) constructing a parser

nullable, first, follow 계산 이후 contructing a parser
nullable, first, follow -> predictive parsing table
S -> 𝛾
- if 𝛾 is nullable -> row: S, col: FOLLOW(𝛾)에 포함되는 모든 요소
- if 𝛾 is not nullable -> row: S, col: FIRST(𝛾)에 포함되는 모든 요소

- Another Example

nullable, first, follow table
- 간단한 것들(소문자 symbols) 위주로 테이블에 삽입
- 이후 대문자 symbols 사이의 관계 rules 정립 -> 테이블 반영
- 정립한 rules가 table 내에서 모두 성립하는지 점검
each production에 대해서 predictive parsing table 빌드