[컴파일러] Chapter 2. Lexical Analysis

Notice

Recent Posts

Recent Comments

Link

« 2025/10 »
일	월	화	수	목	금	토
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

SYDev

[컴파일러] Chapter 2. Lexical Analysis 본문

4학년 1학기 전공/컴파일러

[컴파일러] Chapter 2. Lexical Analysis

시데브 2025. 3. 20. 02:47

경희대학교 허선영 교수님의 컴파일러 수업을 기반으로 정리한 글입니다.

1. Front-end Compilation

source program -> target program 번역 과정에는 IR(Intermediate Representation)이라는 중간 언어 번역 과정 존재
Front-end Compilation: source program -> IR
1. Lexical Analysis
2. Syntax Analysis
3. Semantic Analysis
4. IR Code Generation
Back-end Compilation: IR -> target program

2. Lexical

언어를 이해하는 첫 번째 step
- words를 구분 -> the smallest unit above letters
- ex) This is a sentence -> white space 기준 token 단위로 분리
stream of characters로부터 stream of names, keywords, and punctuation marks 생성

tokens 사이에 존재하는 white space와 comments(주석) 제외

2. 1. Lexical Token

Token: unit으로 취급되는 sequence of characters

some tokens have associated semantic information
- ID, NUM, REAL -> 추가적인 정보 저장 필요
- num1이라는 이름의 NUM 변수 -> num1이 어떤 값을 내포하고 있는지 추가적으로 정보 필요
- 반면 punctuation mark(,)는 그 자체로 의미 파악 가능
reserved words cannot be used as identifiers(e.g., IF, VOID, RETURN)

https://en.wikipedia.org/wiki/C_preprocessor

2.2. Lexical Analyzer

Lexical Analyzer(Lexer): lexical analysis를 수행하는 프로그램
compilation의 first phase
how to implement a Lexer?
- write lexer from scartch
- use lexical analyzer generator

each lexical token을 regular expressions라는 formal한 언어로 정의
generator는 전달받은 regular expressions를 deterministic finite automata로 변환

automata?: input이 들어오면 output을 내는 machine, 해당 표현 방식 중 dfa 존재

Automata
- automata theory: 계산 능력이 있는 추상 기계와 그 기계를 이용해서 풀 수 있는 문제들을 연구하는 컴퓨터과학의 분야
-> 여기서 abstract machine을 automata 또는 automaton이라 부름

Deterministic Finite Automata(DFA)
- 각 symbol에 대해서 유일한 상태 변화를 취하는 특징을 가짐
- Deterministic -> 계산의 유일함

출처:
https://ko.wikipedia.org/wiki/%EA%B2%B0%EC%A0%95%EB%A1%A0%EC%A0%81_%EC%9C%A0%ED%95%9C_%EC%83%81%ED%83%9C_%EA%B8%B0%EA%B3%84
https://ko.wikipedia.org/wiki/%EC%98%A4%ED%86%A0%EB%A7%88%ED%83%80_%EC%9D%B4%EB%A1%A0

오토마타 이론 - 위키백과, 우리 모두의 백과사전

위키백과, 우리 모두의 백과사전. 오토마타 벤 다이어그램 (각 레이어를 클릭하면 해당 글로 이동합니다.) 결정적 유한 오토마타의 예. S1, S2는 상태이고, 1과 0은 기계가 입력으로 받아들이는 문

ko.wikipedia.org

3. Regular Expression

Language: A set of strings
String: 유한한 알파벳에서 선택한 symbols의 유한한 sequence -> regular expression으로 stirng을 단어로 표현 가능
- symbols의 ex) ASCII
Regular Expression: 문자열들의 집합을 표현하는 방식

3.1. Basic Notations

symbol: 하나의 알파벳을 의미
- a -> { a }
- 1 -> { 1 }
Epsilon ϵ: empty string을 의미
- ϵ -> { }
Alternation M | N: M or N, 즉 M과 N의 합집합을 의미
- a | b -> { a, b }
Concatenation MN: M과 N의 concatenated 집합
- ( a | b )( a | c ) -> { aa, ac, ba, bc }
- Th( is | at ) = { This | That }
Repetition M* (kleene closure): M 집합에 속하는 모든 strings를 반복 조합해서 나올 수 있는 모든 집합 -> infinite
- ( a | b )* -> { ϵ, a, b, aa, ab, ba, bb, aaa, aab, ... }

3.2. Additional Notations

[ abcd ] : ( a | b | c | d )
[ b - g ] : [ bcdefg ]
[ b-gM-Qkr ] : [ bcdefgMNOPQkr ]
M? : ( M | ϵ )
- a?b : ( a | ϵ )b = { ab, b }
M+ : MM*
"a.+*" : 문자열을 인용

3.2. Examples

( 0 | 1 )* : Binary numbers -> { ϵ, 0, 01, 10, ... }
( a | b )*aa( a | b )* : { aa, aab, baa, aaa, ... }
- aa가 ( a | b )* 사이에 들어가는 이유? -> ( a | b )*aa의 경우 aa가 뒤에 붙어야 한다는 제한이 있지만, ( a | b )*aa( a | b )*의 경우 aa 위치에 제한이 없음
Lexical tokens

Token Type	Regular Expression
IF	if
ID	- [ a - z ][ a-z0-9 ] - 대문자 허용하는 경우: [ a - zA - Z ][ a - zA - Z0 - 9 ]*
NUM	- [ 0 - 9 ]+ -> { 0, 01, 10, 20, 030, ... } - 앞에 0이 나오는 경우 제외 -> 0 \| [ 1 - 9 ][ 0 - 9 ]* - 단독으로 0인 경우를 제외한 0이 나오는 경우 모두 제외
REAL	*([ 0 - 9 ]+"."[ 0 - 9 ]) \| ([ 0 - 9 ]"."[ 0 - 9 ]+)* - *([ 0 - 9 ]+"."[ 0 - 9 ]): 소수점 뒤 숫자가 존재하지 않을 수도 있음 -> 123. - ([ 0 - 9 ]"."[ 0 - 9 ]+)*: 소수점 앞 숫자가 존재하지 않을 수도 있음 -> .456 - .만 존재하는 경우를 제외한 모든 경우의 수 존재

regular expression으로 표현 불가능한 language 존재
- String of a's and b's where are more a's than b's
- { aab, aaab, abaa, ... } -> RE 표현 불가

4. Finite Automata

Finite Automata: limited memory를 가진 automata
Finite number of states
- A start state
- One or more final states
edge -> labeled with a single symbol

4.1. Deterministic Finite Automata

Deterministic Finite Automata(DFA)
- 각 symbol에 대해서 유일한 상태 변화를 취하는 특징을 가짐
- Deterministic -> 계산의 유일함

4.2. Nondeterministic Finite Automata

NFA
- node를 떠나는 edges 같은 label로 설정 가능
- ϵ label 설정 가능
RE에서 직접적으로 DFA로 변환하는 과정은 어려울 수 있다. -> 따라서 RE에서 NFA로의 변환 과정을 거치고, 이후 DFA로 변환하는 것이 일반적

- Converting RE to NFA - Rules

- Converting RE to NFA - Example

- Converting NFA to DFA

NFA는 모든 가능성을 용인하기 때문에, DFA보다 비효율적임 -> 때문에 보통 DFA로 변환하여 사용

4.3. Converting NFA to DFA

- Basic Functions

edge(s, c): state s에서 symbol c를 따라가서 도달할 수 있는 모든 states의 집합
- edge(1, a) = { 2 }
- edge(2, ϵ) = { 5 }
closure(S): states의 집합 S에 속하는 모든 states로부터, 해당 inputs를 소모하지 않고 도달할 수 있는 모든 NFA의 states의 집합(= ϵ만으로 도달할 수 있는 모든 states의 집합)
- closure({1}) = {1} -> 1번 state로부터 epsilon을 소모하지 않고 이동 가능한 states는 X
- closure({2}) = {2, 3, 5}
- closure({1, 4}) = {1, 3, 4, 5}
DFAedge(d, c): d로부터 시작해서, symbol c와 ϵ만을 소모해서 도달할 수 있는 모든 NFA states의 집합(edge를 사용해서 c로 갈 수 있는 states를 구하고, closure를 통해서 그 edges로부터 갈 수 있는 모든 states의 집합을 얻을 수 있음)
- 𝐷𝐹𝐴𝑒𝑑𝑔𝑒( 𝑑, 𝑐 )= 𝑐𝑙𝑜𝑠𝑢𝑟𝑒(∪𝑠∈𝑑 𝑒𝑑𝑔𝑒 𝑠, 𝑐 )
- DFAedge({1}, a) = {2, 3, 5}

- Algorithm

하나의 state에 대해서 모든 symbols(characters)를 확인하고, 해당 state에 대해서 모든 transition을 확인했기 때문에, 다음 state로 넘어가서 확인

- Example

-> closure({1})을 통해서 starting state를 구함

-> symbol i에 대해서 start state로부터 i, a-z, error로 각각 향할 수 있음

states s1 and s2 are equivalent

-> 최적화를 통해서 DFA로의 변환을 쉽게 만들 수 있음

-> 최적화 적용

- Rule Disambiguation

"if8"은 if와 8로 나눌 것인지, if8로 생각할 것인지 애매한 부분 존재
Longest Match: regular expression과 match되는 longest initial substring of input이 다음 토큰으로 선택됨
- if8 matches ID("if8"), not iF() and NUM(8)
- 가장 최근 매칭된 final state(token)을 저장
- 끝까지 렉싱을 시도하다가 실패하면, 다시 그 마지막 토큰으로 돌아옴
Rule Priority: rules 간에 우선순위 적용
- if 키워드를 ID("if")가 아닌 IF()로 매칭하고 싶을 때, 우선순위 설정 가능

5. How Does Lexer Work?

-> final state에 도달했을 때, 정해진 action을 수행

- IF: if라는 토큰을 생성

- ID: id라는 토큰을 생성

- continue(): 무시하고 계속 수행

- error(): 에러 발생

6. Real-world Lexer Generators

DFA를 만드는 과정 필요 X
그저 Regular Expression을 정의해놓고, 그에 해당하는 Actions만 잘 구현하면 됨

참고자료

Appel, Andrew W. 저자(글), Cambridge University Press · 2004년 07월 08일

728x90

'4학년 1학기 전공 > 컴파일러' 카테고리의 다른 글

[컴파일러] Chapter 3. Syntax Analysis(Part 2) (0)	2025.04.14
[컴파일러] Chapter 3. Syntax Analysis(Part 1) (0)	2025.04.12
[컴파일러] Assignment 1. Lexical Analysis (0)	2025.04.12
[컴파일러] Assignment 0. Your First Functional Programming (1)	2025.03.21
[컴파일러] Chapter 1. Introduction (0)	2025.03.18

'4학년 1학기 전공/컴파일러' Related Articles

SYDev

[컴파일러] Chapter 2. Lexical Analysis 본문

[컴파일러] Chapter 2. Lexical Analysis

1. Front-end Compilation

2. Lexical

2. 1. Lexical Token

2.2. Lexical Analyzer

3. Regular Expression

3.1. Basic Notations

3.2. Additional Notations

3.2. Examples

4. Finite Automata

4.1. Deterministic Finite Automata

4.2. Nondeterministic Finite Automata

- Converting RE to NFA - Rules

- Converting RE to NFA - Example

- Converting NFA to DFA

4.3. Converting NFA to DFA

- Basic Functions

- Algorithm

- Example

- Rule Disambiguation

5. How Does Lexer Work?

6. Real-world Lexer Generators

'4학년 1학기 전공 > 컴파일러' 카테고리의 다른 글

티스토리툴바