kkh1902

Steadily

GE (Great Expectations): Real Code Examples

2026. 2. 1. 21:35

📌 Table of contents

  1. The overall flow of GE code
  2. Basic setup (the simplest approach)
  3. The Expectations used most in practice
  4. Row count vs. the previous day (±X%)
  5. Running Validation & checking the result
  6. The pattern used in Airflow / batch jobs
  7. Practical DE tips

1️⃣ The overall flow of GE code (remember this and you're set)

Load data
   ↓
Define Expectations
   ↓
Run Validation
   ↓
Check the result (PASS / FAIL)

Think of GE (Great Expectations) as writing DQ (data quality) checks the way you write test code.
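
To make the four steps concrete without depending on any particular GE version, here is a minimal plain-Python sketch of the same flow. The names `Rule` and `run_rules` and the sample rows are made up for illustration; they are not GE APIs, just a stand-in for "define rules, run them, read PASS / FAIL".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

Row = dict  # one record, e.g. {"user_id": 1, "age": 30}


@dataclass
class Rule:
    """Hypothetical stand-in for an Expectation: a name plus a per-row check."""
    name: str
    check: Callable[[Row], bool]


def run_rules(rows: list[Row], rules: list[Rule]) -> dict:
    """Run every rule over every row and report success per rule and overall."""
    results = []
    for rule in rules:
        failed = sum(1 for row in rows if not rule.check(row))
        results.append({"rule": rule.name, "success": failed == 0, "failed_rows": failed})
    return {"success": all(r["success"] for r in results), "results": results}


# 1) Load data (hard-coded here instead of reading a file)
rows = [
    {"user_id": 1, "age": 30},
    {"user_id": 2, "age": 150},    # out-of-range age
    {"user_id": None, "age": 41},  # missing user_id
]

# 2) Define the rules
rules = [
    Rule("user_id is not null", lambda r: r["user_id"] is not None),
    Rule("age between 0 and 120", lambda r: 0 <= r["age"] <= 120),
]

# 3) Run validation, 4) check the result
report = run_rules(rows, rules)
print("PASS" if report["success"] else "FAIL")
for r in report["results"]:
    print(r)
```

GE does the same kind of thing, with far more built-in rules and richer result objects.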


2️⃣ Basic setup (Pandas, the easiest option)

The DB and Spark backends follow almost the same structure,
so Pandas is the best place to learn the concepts.

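# Shell: ge.from_pandas below is the legacy Pandas dataset API, which is not
# available in Great Expectations 1.x as far as I know, so pin a pre-1.0 release.
pip install "great-expectations<1.0"

# Python
import great_expectations as ge
import pandas as pd

# Load the data
df = pd.read_csv("users.csv")

# Wrap it in a GE DataFrame so the expect_* methods become available
ge_df = ge.from_pandas(df)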

3️⃣ The Expectations used most in practice ⭐⭐⭐

🔹 1. NOT NULL check

ge_df.expect_column_values_to_not_be_null("user_id")

👉 Meaning

FAIL if the user_id column contains any NULL


🔹 2. PK uniqueness check

ge_df.expect_column_values_to_be_unique("user_id")

👉 Meaning

FAIL if user_id contains duplicates


🔹 3. Value range check (age, amount)

ge_df.expect_column_values_to_be_between(
    "age",
    min_value=0,
    max_value=120
)

🔹 4. Allowed-values check

ge_df.expect_column_values_to_be_in_set(
    "gender",
    ["M", "F"]
)

🔹 5. Date format check

ge_df.expect_column_values_to_match_regex(
    "signup_date",
    # Anchored with ^ and $ so the whole value must look like YYYY-MM-DD
    r"^\d{4}-\d{2}-\d{2}$"
)
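
One thing to watch with regex checks: as far as I know, the legacy Pandas implementation matches the pattern anywhere in the string (search-style), so a pattern without `^` and `$` anchors accepts values with extra characters around the date. A pure shape check also accepts impossible dates such as month 13. The sketch below uses only the standard library to show both effects; `is_real_date` is a hypothetical helper, not part of GE.

```python
import re
from datetime import datetime

UNANCHORED = re.compile(r"\d{4}-\d{2}-\d{2}")
ANCHORED = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_real_date(value: str) -> bool:
    """Hypothetical helper: True only for a real calendar date in YYYY-MM-DD form."""
    if not UNANCHORED.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for value in ["2024-01-15", "2024-01-15T09:00", "2024-13-45"]:
    print(
        value,
        bool(UNANCHORED.search(value)),  # search-style match
        bool(ANCHORED.match(value)),     # whole-string shape match
        is_real_date(value),             # shape + calendar validity
    )
```

If calendar validity matters, the legacy API also has expect_column_values_to_match_strftime_format, which may be a better fit than a regex.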

🔹 6. Minimum row count guarantee

ge_df.expect_table_row_count_to_be_between(
    min_value=1000
)

👉 Meaning

FAIL if there is too little data (fewer than 1,000 rows here)


4️⃣ Row count vs. the previous day, ±X% ⭐ (essential in practice)

GE has no built-in “vs. the previous day” check, so
👉 the usual approach is to pass in yesterday's row count as a variable and compare.

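yesterday_cnt = 100000  # hard-coded here for illustration
today_cnt = ge_df.shape[0]

# Relative change vs. yesterday (assumes yesterday_cnt > 0)
diff_pct = abs(today_cnt - yesterday_cnt) / yesterday_cnt

# Fail when the count moved by more than ±30%
assert diff_pct <= 0.3, "Row count anomaly detected"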

📌 In practice:

  • Yesterday's row count → look it up from a metadata table
  • This assert fails → the Airflow task FAILs
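
The same comparison can be wrapped in a small function, sketched below. Two things here are my own assumptions rather than anything from GE: the policy for a zero baseline (treated as an anomaly unless today is also empty), and raising an exception instead of using assert, since assert statements are skipped when Python runs with -O. The function names are hypothetical.

```python
def row_count_within_tolerance(today_cnt: int, yesterday_cnt: int,
                               tolerance: float = 0.3) -> bool:
    """Return True when today's count is within ±tolerance of yesterday's."""
    if yesterday_cnt <= 0:
        # No usable baseline (first run, or an empty load yesterday).
        # Assumed policy: only OK if today is empty too.
        return today_cnt == 0
    diff_pct = abs(today_cnt - yesterday_cnt) / yesterday_cnt
    return diff_pct <= tolerance


def check_row_count(today_cnt: int, yesterday_cnt: int,
                    tolerance: float = 0.3) -> None:
    """Raise ValueError on an anomaly so the surrounding task fails."""
    if not row_count_within_tolerance(today_cnt, yesterday_cnt, tolerance):
        raise ValueError(
            f"Row count anomaly detected: yesterday={yesterday_cnt}, "
            f"today={today_cnt}, tolerance=±{tolerance:.0%}"
        )


check_row_count(today_cnt=95_000, yesterday_cnt=100_000)  # within ±30%, no error
```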

5️⃣ Running Validation & checking the result

result = ge_df.validate()
print(result["success"])
  • True → DQ passed
  • False → DQ failed (stop the pipeline)

What information do you get on failure?

  • The column that failed
  • The Expectation that failed
  • The count and percentage of failing rows

👉 A person can pinpoint the cause right away
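
Here is a rough sketch of pulling those three pieces of information out of a result. The field names (`results`, `expectation_config`, `expectation_type`, `kwargs`, `unexpected_count`, `unexpected_percent`) reflect my understanding of the legacy result format and should be checked against your GE version. The sample dict is hand-written for illustration, not real GE output, and `summarize_failures` is a hypothetical helper.

```python
from __future__ import annotations


def summarize_failures(validation_result: dict) -> list[dict]:
    """Collect column, expectation type, and unexpected-row stats per failed check."""
    failures = []
    for item in validation_result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        detail = item.get("result", {})
        failures.append({
            "expectation": config.get("expectation_type"),
            "column": config.get("kwargs", {}).get("column"),  # None for table-level checks
            "unexpected_count": detail.get("unexpected_count"),
            "unexpected_percent": detail.get("unexpected_percent"),
        })
    return failures


# Hand-written sample shaped like a validation result (assumed structure)
sample_result = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "user_id"},
            },
            "result": {"element_count": 100, "unexpected_count": 2, "unexpected_percent": 2.0},
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_unique",
                "kwargs": {"column": "user_id"},
            },
            "result": {"element_count": 100, "unexpected_count": 0, "unexpected_percent": 0.0},
        },
    ],
}

for failure in summarize_failures(sample_result):
    print(failure)
```

With a real result object you would likely need to convert it to a plain dict first (for example via its to_json_dict() method) before passing it in.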


6️⃣ The pattern used in Airflow / batch jobs (important)

✔ Practical pattern

result = ge_df.validate()

if not result["success"]:
    raise Exception("Data Quality Check Failed")

👉 With just this check:

  • The Airflow task FAILs
  • Downstream tasks are blocked
  • A Slack alert can be triggered
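
A slightly more descriptive variant of the same pattern is sketched below. `DataQualityError`, `fail_on_dq_error`, and `dq_task` are hypothetical names, and the result dict is a stand-in for what ge_df.validate() would return. The sketch relies only on the fact that an uncaught exception fails an Airflow task; the operator wiring is not shown.

```python
class DataQualityError(Exception):
    """Hypothetical exception type, so DQ failures are easy to spot in logs."""


def fail_on_dq_error(result: dict, table: str) -> None:
    """Raise when validation did not succeed; do nothing otherwise."""
    if result.get("success"):
        return
    failed = [r for r in result.get("results", []) if not r.get("success")]
    raise DataQualityError(
        f"Data quality check failed for {table}: {len(failed)} expectation(s) failed"
    )


def dq_task() -> None:
    """The kind of callable you might hand to a PythonOperator."""
    # Stand-in for ge_df.validate(); one of two expectations failed
    result = {"success": False, "results": [{"success": False}, {"success": True}]}
    fail_on_dq_error(result, table="users")


# Outside Airflow, catch the error just to show what the task log would say
try:
    dq_task()
except DataQualityError as exc:
    print(f"task would fail: {exc}")
```

For the Slack alert, one option is Airflow's on_failure_callback, which runs when the task fails.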

7️⃣ Practical DE tips ⭐⭐⭐

❌ Don't

  • Put an Expectation on every column
  • Add 50 DQ checks in one day
  • Set thresholds that are too strict

✅ Do this instead

  • 5 to 10 core DQ checks per table
  • Prioritize PK / row count / freshness
  • Add more incrementally
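
Freshness is the one priority in that list with no example elsewhere in this post, so here is a minimal sketch. It assumes you can get the newest load timestamp for the table, for example from something like MAX(loaded_at); that column name and the `is_fresh` helper are hypothetical, and this is plain Python rather than a GE feature.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(latest_loaded_at: datetime, max_lag: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the newest record is at most max_lag old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - latest_loaded_at <= max_lag


# Fixed timestamps so the example is reproducible
now = datetime(2026, 2, 1, 21, 0, tzinfo=timezone.utc)
latest = datetime(2026, 2, 1, 3, 0, tzinfo=timezone.utc)  # newest row is 18 hours old

print(is_fresh(latest, timedelta(hours=24), now=now))  # daily table: 18h lag is fine
print(is_fresh(latest, timedelta(hours=6), now=now))   # hourly-ish table: too stale
```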

🎯 Key takeaways

GE is “a tool for writing DQ as test code”

  • expect = the DQ rule
  • validate = run it
  • FAIL = stop the pipeline

 
