kkh1902
Steadily
DataEngineering/Data Quality

What is GE (Great Expectations)?

2026. 2. 1. 21:30
📌 Table of Contents

  1. What is GE (Great Expectations)?
  2. Why use GE? (a DE perspective)
  3. GE core concepts
  4. Basic GE workflow
  5. Patterns for using GE in real DE work
  6. When GE fits and when it doesn't

1️⃣ What is GE (Great Expectations)?

GE (Great Expectations) is
👉 an open-source tool that lets you define data quality (Data Quality) rules as code and validate them automatically.

In one line:

“Define the conditions you expect of your data (Expectations),
then automatically check whether the actual data meets them.”

📌 Examples:

  • Has the row count dropped too much?
  • Are there NULLs in a column that must not contain NULL?
  • Do any values fall outside the allowed range?

These checks run automatically instead of someone eyeballing the data.
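
The three example questions above can be sketched in plain Python. This only illustrates the idea and is not the Great Expectations API: the function names, the sample rows, and the thresholds (such as `max_drop_ratio`) are all made up for this sketch.

```python
# Plain-Python sketch of the three example checks (not the GE API).
# Rows are dicts; every name and threshold here is hypothetical.

def row_count_not_dropped(rows, previous_count, max_drop_ratio=0.5):
    """True if the row count did not fall by more than max_drop_ratio."""
    return len(rows) >= previous_count * (1 - max_drop_ratio)

def no_nulls(rows, column):
    """True if no row has None in the given column."""
    return all(row.get(column) is not None for row in rows)

def values_in_range(rows, column, low, high):
    """True if every non-null value lies in [low, high]."""
    return all(
        low <= row[column] <= high
        for row in rows
        if row.get(column) is not None
    )

rows = [
    {"order_id": 1, "amount": 120},
    {"order_id": 2, "amount": 80},
    {"order_id": 3, "amount": None},
]

checks = {
    "row_count": row_count_not_dropped(rows, previous_count=4),
    "order_id_not_null": no_nulls(rows, "order_id"),
    "amount_not_null": no_nulls(rows, "amount"),
    "amount_range": values_in_range(rows, "amount", 0, 1000),
}
print(checks)
```

GE ships built-in expectations for the same kinds of checks (for example `expect_column_values_to_not_be_null` and `expect_column_values_to_be_between`), so in practice you would declare them rather than hand-write functions like these.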


2️⃣ Why use GE? (a DE perspective)

❌ Doing DQ without GE

  • SQL checks end up scattered all over the place
  • The criteria exist only in documents
  • Nobody knows who runs which DQ check
  • Failures slip by unnoticed

✅ With GE

  • DQ criteria are managed as code
  • Checks run automatically as part of the pipeline
  • Failures are detected immediately
  • Reports and logs are generated automatically

👉 So it is fair to think of GE as
**“a DQ test framework for DE”**.


3️⃣ GE core concepts (this part matters ⭐)

You only need to understand the four things below.

🔹 1. Data Source

  • Where does the data live?
  • DB, DWH, S3, Pandas, Spark, etc.

🔹 2. Expectation

  • A condition the data must satisfy
  • Examples:
    • NOT NULL
    • UNIQUE
    • Range check
    • row count comparison

🔹 3. Expectation Suite

  • A bundle of Expectations
  • Grouped per table / per pipeline

🔹 4. Validation

  • The result of comparing the actual data against the Expectations
  • PASS / FAIL
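
To see how the four concepts relate, here is a small model of them in plain Python. It is a simplified sketch under my own assumptions, not GE's actual classes: `Expectation`, `ExpectationSuite`, `ValidationResult`, and `validate` are hypothetical names, and the data source is just an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical, simplified model of the four concepts (not GE's classes).

@dataclass
class Expectation:
    name: str
    check: Callable[[list], bool]   # takes the rows, returns True on PASS

@dataclass
class ExpectationSuite:
    name: str
    expectations: list = field(default_factory=list)

@dataclass
class ValidationResult:
    suite_name: str
    results: dict                   # expectation name -> True/False

    @property
    def success(self) -> bool:
        return all(self.results.values())

def validate(rows: list, suite: ExpectationSuite) -> ValidationResult:
    """Compare the actual data against every expectation in the suite."""
    outcome = {e.name: e.check(rows) for e in suite.expectations}
    return ValidationResult(suite.name, outcome)

# 1. Data source: an in-memory list standing in for a table.
users = [{"id": 1, "age": 31}, {"id": 2, "age": 27}, {"id": 2, "age": 45}]

# 2-3. Expectations grouped into a suite for the "users" table.
suite = ExpectationSuite("users_suite", [
    Expectation("id_not_null", lambda rs: all(r["id"] is not None for r in rs)),
    Expectation("id_unique", lambda rs: len({r["id"] for r in rs}) == len(rs)),
    Expectation("age_in_range", lambda rs: all(0 <= r["age"] <= 120 for r in rs)),
])

# 4. Validation: PASS / FAIL per expectation and overall.
result = validate(users, suite)
print(result.results, "PASS" if result.success else "FAIL")
```

The sample data contains a duplicated `id`, so the suite as a whole fails even though two of the three expectations pass.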

4️⃣ Basic GE workflow (remember the order and you're set)

🧭 Overall flow

Connect to data
   ↓
Define Expectations
   ↓
Run Validation
   ↓
Store results / handle failures

🔹 Step 1. Connect to data

  • DB (BigQuery, Snowflake, Postgres, etc.)
  • Files (CSV, Parquet)
  • Spark DataFrame

🔹 Step 2. Define Expectations (write the DQ rules)

Typical Expectations:

  • NOT NULL
  • UNIQUE
  • Value range
  • row count
  • Regex format

📌 DQ is defined “the way you write test code”


🔹 Step 3. Run Validation

  • Runs in the middle of the pipeline
  • If the result is FAIL:
    • The Airflow task fails
    • Downstream tasks are blocked
    • A Slack alert is sent
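
A minimal way to get the FAIL behavior above is to raise an exception when any check fails, since an Airflow task fails when its callable raises. The sketch below assumes that approach. `DataQualityError`, `quality_gate`, and the `notify` hook are hypothetical names, and the Slack call is replaced by a list that collects messages.

```python
class DataQualityError(Exception):
    """Raised when at least one check fails, so the calling task fails too."""

def quality_gate(results, notify=print):
    """results: dict of check name -> bool. Raises if any check failed."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        message = f"DQ failed: {', '.join(failed)}"
        notify(message)          # hypothetical hook, e.g. a Slack webhook call
        raise DataQualityError(message)
    return True

sent = []                        # stand-in for a real alert channel
try:
    quality_gate({"id_unique": False, "amount_not_null": True}, notify=sent.append)
    blocked = False
except DataQualityError:
    blocked = True               # downstream steps would not run

print(blocked, sent)
```

Raising instead of returning `False` is a deliberate choice here: a return value is easy to ignore, while an uncaught exception stops the step.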

🔹 Step 4. Result report

  • Which column
  • Under which condition
  • Why it failed

👉 A person can understand it at a glance
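
As a rough picture of a report that answers "which column, which condition, why", here is a plain-text formatter. It is only a sketch: `CheckFailure` and `format_report` are hypothetical, and GE's own report format looks different.

```python
from dataclasses import dataclass

@dataclass
class CheckFailure:
    column: str             # which column
    condition: str          # which condition was violated
    unexpected_count: int   # how many values broke it
    sample_values: list     # a few offending values, to show why

def format_report(table, failures):
    """Render failures as text: which column, which condition, why."""
    if not failures:
        return f"[PASS] {table}: all checks passed"
    lines = [f"[FAIL] {table}: {len(failures)} check(s) failed"]
    for f in failures:
        lines.append(
            f"  - column={f.column} condition={f.condition} "
            f"unexpected={f.unexpected_count} samples={f.sample_values}"
        )
    return "\n".join(lines)

report = format_report("orders", [
    CheckFailure("amount", "between 0 and 1000", 2, [-5, 1200]),
])
print(report)
```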


5️⃣ Patterns for using GE in real DE work ⭐⭐⭐

✅ Most common patterns

🟢 Airflow + GE

  • ETL finishes → GE runs
  • FAIL → the DAG fails
  • SUCCESS → proceed to the next step

🟢 DWH table validation

  • DQ check right after loading
  • A safeguard before analysts use the data

🟢 Batch pipelines

  • Run the same Expectations every day
  • Track quality trends over time

🔍 DQ checks used most in practice

  • row count within ±X% of the previous day
  • No duplicate PKs
  • NULL count = 0 in NOT NULL columns
  • No date format errors
  • Latest data is present
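
The five checks listed above can each be written in a few lines of plain Python. This is a sketch under assumptions, not GE code: the function names, the `YYYY-MM-DD` date format, the one-day freshness window, and the sample rows are all hypothetical.

```python
import re
from collections import Counter
from datetime import date, timedelta

def row_count_within(today_count, yesterday_count, tolerance_pct):
    """Row count changed by at most ±tolerance_pct versus the previous day."""
    if yesterday_count == 0:
        return today_count == 0
    change_pct = abs(today_count - yesterday_count) / yesterday_count * 100
    return change_pct <= tolerance_pct

def duplicate_keys(rows, key):
    """Key values that appear more than once (empty list = PASS)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)

def null_count(rows, column):
    """Number of rows where the column is None (0 = PASS)."""
    return sum(1 for row in rows if row.get(column) is None)

# Checks the shape only (YYYY-MM-DD), not whether the date is a real one.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def bad_date_formats(rows, column):
    """Values that are not in YYYY-MM-DD form (empty list = PASS)."""
    return [row[column] for row in rows if not DATE_RE.fullmatch(str(row[column]))]

def is_fresh(rows, column, today, max_age_days=1):
    """True if the newest well-formed date is at most max_age_days old."""
    valid = [row[column] for row in rows if DATE_RE.fullmatch(str(row[column]))]
    if not valid:
        return False
    newest = max(date.fromisoformat(v) for v in valid)
    return today - newest <= timedelta(days=max_age_days)

rows = [
    {"id": 1, "dt": "2026-01-31", "email": "a@x.com"},
    {"id": 2, "dt": "2026-02-01", "email": None},
    {"id": 2, "dt": "2026/02/01", "email": "c@x.com"},
]

summary = {
    "row_count_ok": row_count_within(950, 1000, tolerance_pct=10),
    "duplicate_ids": duplicate_keys(rows, "id"),
    "email_nulls": null_count(rows, "email"),
    "bad_dates": bad_date_formats(rows, "dt"),
    "fresh": is_fresh(rows, "dt", today=date(2026, 2, 1)),
}
print(summary)
```

Returning the offending values (duplicate keys, badly formatted dates) rather than a bare True/False makes it easier to say why a check failed.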

6️⃣ When does GE fit, and when doesn't it?

✅ Where GE fits well

  • Batch ETL
  • Table-level DQ
  • SQL-based DWH
  • Already using Airflow

❌ Where GE does not fit

  • Ultra-real-time streaming
  • Millisecond-level validation
  • Statistics-based checks on very large Spark datasets (→ Deequ is a better fit)

🎯 Key takeaways

GE is “a tool for managing DQ like test code”

  • Eyeballing by a person ❌
  • Documents only ❌
  • Automated + code-based ✔

 
