kkh1902
Steadily

How to Find Bottlenecks in the Spark UI

2026. 1. 30. 19:40

🔍 The order for finding bottlenecks in the Spark UI (just follow this)

0️⃣ One-line summary

👉 Stages → Tasks → Shuffle → Executors

Checking in this order turns up most bottlenecks (roughly 90%, as a rule of thumb)


1️⃣ Jobs tab – "Where did the time go?"

What to look at first

  • Job execution time
  • Is any one job unusually long?

👉 Pick the single job that took the longest


2️⃣ Stages tab – ⭐ The key step

Almost everything shows up here

Checkpoints

  • Stage execution time
  • Stages with Shuffle Read / Write
  • Number of stages (too many or too few?)

👉 A stage that eats a lot of time = a bottleneck candidate
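
The Stages checklist can be sketched in code once the stage table has been copied out of the UI. This is a minimal illustration, not part of the original post: the `Stage` record, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One row of the Stages table; field names are hypothetical."""
    stage_id: int
    duration_s: float
    shuffle_read_mb: float
    shuffle_write_mb: float


def bottleneck_candidates(stages, top_n=3):
    """Return the top_n longest-running stages, longest first."""
    return sorted(stages, key=lambda s: s.duration_s, reverse=True)[:top_n]


def has_shuffle(stage):
    """True when the stage read or wrote any shuffle data."""
    return stage.shuffle_read_mb > 0 or stage.shuffle_write_mb > 0


# Made-up numbers for illustration only.
stages = [
    Stage(0, 12.0, 0.0, 300.0),
    Stage(1, 95.0, 300.0, 0.0),
    Stage(2, 4.0, 0.0, 0.0),
]
for s in bottleneck_candidates(stages, top_n=2):
    print(s.stage_id, s.duration_s, "shuffle" if has_shuffle(s) else "no shuffle")
```

Sorting by duration first and checking shuffle second mirrors the order of the checklist: time identifies the candidate, shuffle volume hints at the cause.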


⚠️ Common stage bottleneck patterns

  • Exchange, Sort, Aggregate
  • The stage right after a join

👉 Most of these are shuffle problems


3️⃣ Tasks (stage detail page) – "How many tasks are slow?"

Click a stage → look at its tasks 👇

What to look at

  • Distribution of task durations
  • Are only a few tasks unusually long? (a long tail is a problem)

What it means

  • Task times are uneven
  • → Possible data skew

👉 Data is piling up on specific keys
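
The "long tail" check can be expressed as a ratio between the slowest task and the median task. The sketch below assumes the task durations have been read off the stage detail page by hand. The threshold of 3.0 is an assumed rule of thumb, not a Spark setting.

```python
from statistics import median


def skew_ratio(task_durations_s):
    """Ratio of the slowest task to the median task duration."""
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    mid = median(task_durations_s)
    if mid == 0:
        return float("inf") if max(task_durations_s) > 0 else 1.0
    return max(task_durations_s) / mid


def looks_skewed(task_durations_s, threshold=3.0):
    """Flag a stage whose slowest task exceeds threshold x the median.

    The default threshold is an assumption chosen for illustration.
    """
    return skew_ratio(task_durations_s) > threshold


# Made-up durations in seconds.
even = [10, 11, 9, 10, 12]
long_tail = [10, 11, 9, 10, 120]
print(looks_skewed(even), looks_skewed(long_tail))
```

Comparing against the median rather than the mean keeps one extreme task from hiding itself by inflating the baseline.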


4️⃣ Shuffle Read / Write – the real culprit

If Shuffle Read is large?

  • Too many joins / groupBys
  • Too few partitions

If Shuffle Write is large?

  • Overuse of repartition
  • Unnecessary shuffles

👉 Roughly 70% of Spark bottlenecks come down to shuffle (a rule of thumb)
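
When the partition count is too low for the shuffle volume, one way to pick a starting value is to divide the shuffle size by a target partition size. The sketch below assumes a target of about 128 MiB per partition, which is a commonly used starting point rather than a fixed rule. `spark.sql.shuffle.partitions` is the real Spark SQL setting (default 200). The `spark` argument is an existing SparkSession supplied by the caller.

```python
import math

DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # assumed target of ~128 MiB per partition


def suggest_shuffle_partitions(shuffle_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Estimate a shuffle partition count from a stage's shuffle size."""
    if shuffle_bytes < 0:
        raise ValueError("shuffle_bytes must be non-negative")
    return max(1, math.ceil(shuffle_bytes / target_bytes))


def apply_shuffle_partitions(spark, num_partitions):
    """Apply the estimate to an existing SparkSession passed in by the caller."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


fifty_gib = 50 * 1024 ** 3
print(suggest_shuffle_partitions(fifty_gib))
```

Treat the result as a first guess to verify in the Stages tab on the next run, not as a final answer.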


5️⃣ Executors tab – checking for resource bottlenecks

Checklist

  • Executor CPU utilization
  • Memory usage
  • GC Time ratio

Warning signs

  • GC Time takes a large share of task time
  • Executors die frequently
  • Tasks fail repeatedly

👉 Not enough memory / too much cached data
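
The "GC Time ratio" check is just GC time divided by total task time per executor. The sketch below assumes both numbers were copied from the Executors tab into a plain dictionary. The 10% warning level is an assumption chosen for illustration, and the sample values are made up.

```python
def gc_share(task_time_ms, gc_time_ms):
    """Fraction of an executor's task time that was spent in garbage collection."""
    if task_time_ms <= 0:
        return 0.0
    return gc_time_ms / task_time_ms


def executors_under_gc_pressure(executors, threshold=0.10):
    """Return ids of executors whose GC share exceeds the threshold.

    executors maps an executor id to (task_time_ms, gc_time_ms).
    The 10% default is an assumed warning level, not a Spark constant.
    """
    return sorted(
        executor_id
        for executor_id, (task_ms, gc_ms) in executors.items()
        if gc_share(task_ms, gc_ms) > threshold
    )


# Made-up numbers for illustration only.
executors = {"1": (600_000, 30_000), "2": (600_000, 150_000)}
print(executors_under_gc_pressure(executors))
```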


6️⃣ Storage tab – finding cache problems

Check

  • Are any DataFrames cached?
  • Is cached data taking up too much memory?

👉 If you cache a DataFrame but use it only once, the cache becomes a bottleneck instead of a help
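
One way to make sure a cache is always released is to scope it with a context manager. This is a sketch under the assumption that `df` is a PySpark DataFrame. It relies only on the `cache()` and `unpersist()` methods, and the helper name `cached` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache df for the duration of the with-block, then release it."""
    df.cache()
    try:
        yield df
    finally:
        # Runs even if the block raises, so the cache never lingers.
        df.unpersist()


# Usage sketch (orders_df is a hypothetical DataFrame reused by two actions):
# with cached(orders_df) as df:
#     total = df.count()
#     per_day = df.groupBy("day").count().collect()
```

Caching only pays off when the block runs more than one action on the same DataFrame. With a single action the cache is written and never read back.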


🧠 Quick diagnosis table by bottleneck type

| Symptom | Cause | Fix |
|---|---|---|
| Stage takes a long time | Excessive shuffle | Review joins / partitioning |
| Only a few tasks are slow | Data skew | Salting, repartition |
| High GC Time | Not enough memory | Remove caches, increase memory |
| Executors idle | Too few partitions | Increase partitions with repartition |
| Executors overloaded | Too many partitions | coalesce |
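
The salting fix listed for data skew can be illustrated without a cluster by counting records per key before and after a salt is added. This is a plain-Python simulation, not PySpark code. The key names are made up, and the round-robin salt is an assumption that keeps the example deterministic (real jobs usually draw the salt at random).

```python
from collections import Counter


def largest_group(keys):
    """Size of the biggest key group, i.e. the load on the slowest task."""
    return max(Counter(keys).values())


def salt_keys(keys, num_salts):
    """Append a salt suffix so one hot key becomes num_salts smaller keys."""
    return [f"{key}#{i % num_salts}" for i, key in enumerate(keys)]


# 900 records share one hot key; 100 records have unique keys.
keys = ["hot"] * 900 + [f"cold_{i}" for i in range(100)]
print(largest_group(keys))                # 900
print(largest_group(salt_keys(keys, 8)))  # 113
```

After a salted aggregation, a second aggregation on the original key (with the salt stripped) is needed to merge the partial results.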

🔥 Top 3 bottlenecks seen most often in practice

1️⃣ Join + Shuffle

  • Fix: broadcast join, review the join key
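
A broadcast join can be requested with a join hint on the smaller side. The sketch below assumes both arguments are PySpark DataFrames and that the small one fits comfortably in executor memory. It uses `DataFrame.hint("broadcast")`, and the function name is hypothetical.

```python
def broadcast_join(large_df, small_df, key, how="inner"):
    """Join large_df to small_df while hinting Spark to broadcast small_df.

    Broadcasting ships the small table to every executor, so the large
    table does not have to be shuffled for the join.
    """
    return large_df.join(small_df.hint("broadcast"), on=key, how=how)


# Usage sketch (events_df and users_df are hypothetical DataFrames):
# joined = broadcast_join(events_df, users_df, "user_id")
```

If the hinted table is actually large, broadcasting it can cause memory pressure, so check its size before applying the hint.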

2️⃣ Partition count doesn't fit

  • Too few → CPUs sit idle
  • Too many → overhead
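
A partition count can be derived from the number of cores and then applied with `repartition` or `coalesce`, depending on direction. The sketch below assumes `df` is a PySpark DataFrame. The default of 3 tasks per core follows the 2 to 3 tasks per core suggested in Spark's tuning guide, and both function names are hypothetical.

```python
def target_partitions(total_cores, tasks_per_core=3):
    """Partition count that gives every core a few tasks to work through."""
    return max(1, total_cores * tasks_per_core)


def resize_partitions(df, target):
    """Move df toward the target partition count.

    repartition triggers a full shuffle, so it is used only to grow;
    coalesce merges existing partitions without a full shuffle.
    """
    current = df.rdd.getNumPartitions()
    if target > current:
        return df.repartition(target)
    if target < current:
        return df.coalesce(target)
    return df


print(target_partitions(16))
```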

3️⃣ Mindless caching

  • cache → never calling unpersist

30-second interview answer (fine to memorize)

"In the Spark UI, I first check the Stages tab for stages with long execution times and see whether they involve Shuffle Read/Write. Then I look at the task duration distribution to check for data skew, and finally inspect CPU, memory, and GC in the Executors tab to judge whether it is a resource bottleneck."


Final key sentence

👉 A Spark bottleneck is almost always one of three things: shuffle, partitions, or memory.

