๋ฐ˜์‘ํ˜•
kkh1902
Steadily
kkh1902
์ „์ฒด ๋ฐฉ๋ฌธ์ž
์˜ค๋Š˜
์–ด์ œ
  • ๋ถ„๋ฅ˜ ์ „์ฒด๋ณด๊ธฐ (178) N
    • DataEngineering (20) N
      • Spark (7) N
      • Airflow (2) N
      • DBT (2) N
      • Architecture (3) N
      • Data Quality (5) N
      • Infra (1) N
    • ๐Ÿค– AI (12) N
      • ML (7)
      • DL (0)
      • LLM (5) N
    • ๐Ÿ“š Study (74)
      • DataEngineering (0)
      • Spring (9)
      • Java (2)
      • Html, css (10)
      • JS, JQuery (29)
      • DB (5)
      • DevOps (13)
      • roadmap (2)
      • Architecture (1)
      • Flutter (2)
    • ๐Ÿ’ป Computer Science (28)
      • Datastructure (0)
      • Algorithm (2)
      • Design pattern (0)
      • Network (1)
      • DB (13)
      • Operating System (0)
      • Software Engineering (4)
      • CS interview (5)
      • topcit (3)
    • โš’๏ธ Etc (6)
      • Error (3)
      • Trouble_Shooting (2)
      • Dev_environment (1)
    • ๐Ÿ“ฐ News (24)
      • daily (7)
      • think (17)
    • ๐Ÿ“˜ Hobby (13)
      • English (13)

๋ธ”๋กœ๊ทธ ๋ฉ”๋‰ด

  • ๐Ÿ“‹ ์ด๋ ฅ์„œ
  • โšก๏ธ ๊นƒํ—ˆ๋ธŒ
  • ํƒœ๊ทธ
  • ๋ฐฉ๋ช…๋ก

๊ณต์ง€์‚ฌํ•ญ

์ธ๊ธฐ ๊ธ€

ํƒœ๊ทธ

  • React๋ฅผ ๋ฐฐ์›Œ์•ผํ•˜๋Š” ์ด์œ 
  • Wonder # word
  • testcode
  • junit5
  • ์†Œํ”„ํŠธ์›จ์–ด ๊ณตํ•™ # chapter1
  • ์†Œํ”„ํŠธ์›จ์–ด ๊ณตํ•™ #project๋งŒ๋“ค๋•Œ ์ค‘์š”
  • Flutter
  • Linear Regression
  • db
  • git
  • Qr_payment project # CSS ํ•ด์„ # Basic ๋งจ์œ„ ํ•ด์„
  • sourcetreee
  • React JS # ์ž์Šต์„œ # Component์™€ Props
  • React JS # 2 The Basic of React
  • SpringBootTest
  • React # JSX
  • git stash
  • React JS #์ž์Šต์„œ
  • gitaction
  • think #bootstrap์„ ์จ์•ผํ•˜๋Š” ์ด์œ 

์ตœ๊ทผ ๋Œ“๊ธ€

์ตœ๊ทผ ๊ธ€

ํ‹ฐ์Šคํ† ๋ฆฌ

250x250
hELLO ยท Designed By ์ •์ƒ์šฐ.
๊ธ€์“ฐ๊ธฐ / ๊ด€๋ฆฌ์ž
kkh1902

Steadily

DataEngineering/Spark

Suffle์ด๋ž€?

2026. 1. 30. 19:39
728x90
๋ฐ˜์‘ํ˜•

ํ•œ ์ค„ ์ •์˜

๐Ÿ‘‰ Shuffle = ๋ฐ์ดํ„ฐ๊ฐ€ ๋„คํŠธ์›Œํฌ๋ฅผ ํƒ€๊ณ  ๋‹ค๋ฅธ Executor๋กœ “์žฌ๋ถ„๋ฐฐ”๋˜๋Š” ๊ณผ์ •


์™œ shuffle์ด ์ƒ๊ธฐ๋ƒ?

Spark๋Š” ์›๋ž˜ ๋ฐ์ดํ„ฐ๋ฅผ **์—ฌ๋Ÿฌ ์กฐ๊ฐ(partition)**์œผ๋กœ ๋‚˜๋ˆ ์„œ ์ฒ˜๋ฆฌํ•ด.

๊ทธ๋Ÿฐ๋ฐ ์–ด๋–ค ์—ฐ์‚ฐ์€ ๊ฐ™์€ ํ‚ค๋ฅผ ๊ฐ€์ง„ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•œ ๊ณณ์— ๋ชจ์—ฌ์•ผ ํ•ด.

๊ทธ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๊ฒŒ shuffle์•ผ.


shuffle์ด ์ƒ๊ธฐ๋Š” ๋Œ€ํ‘œ ์—ฐ์‚ฐ

์•„๋ž˜ ๋‚˜์˜ค๋ฉด ๋ฌด์กฐ๊ฑด ์˜์‹ฌ ๐Ÿ‘‡

  • groupBy
  • join
  • orderBy
  • distinct
  • repartition

์˜ˆ:

df.groupBy("user_id").count()

๐Ÿ‘‰ user_id๊ฐ€ ๊ฐ™์€ row๋“ค์„ ํ•œ Executor๋กœ ๋ชจ์•„์•ผ ํ•จ

๐Ÿ‘‰ ๊ทธ๋ž˜์„œ ๋ฐ์ดํ„ฐ ์ด๋™ ๋ฐœ์ƒ = shuffle


๊ทธ๋ฆผ ์—†์ด ๋น„์œ ๋กœ ์ดํ•ดํ•˜๊ธฐ ๐ŸŽ’

  • ํ•™์ƒ๋“ค์ด ๊ต์‹ค ์—ฌ๋Ÿฌ ๊ณณ์— ํฉ์–ด์ ธ ์žˆ์Œ
  • “๊ฐ™์€ ๋ฐ˜๋ผ๋ฆฌ ๋ชจ์—ฌ!” ๋ผ๊ณ  ์ง€์‹œํ•จ

๐Ÿ‘‰ ํ•™์ƒ๋“ค์ด ๊ต์‹ค์„ ์˜ฎ๊ฒจ ๋‹ค๋‹˜

๐Ÿ‘‰ ์ด ์ด๋™์ด ๋ฐ”๋กœ shuffle


์™œ shuffle์ด ๋А๋ฆฌ๋ƒ?

shuffle์€ ๋™์‹œ์— 3๊ฐ€์ง€๋ฅผ ์”€ โŒ

1๏ธโƒฃ ๋„คํŠธ์›Œํฌ I/O

2๏ธโƒฃ ๋””์Šคํฌ I/O (spill)

3๏ธโƒฃ serialization / deserialization

๐Ÿ‘‰ CPU๋ณด๋‹ค ํ›จ์”ฌ ๋А๋ฆผ

๊ทธ๋ž˜์„œ:

  • Spark ์ž‘์—… ๋А๋ฆฌ๋‹ค = ๋Œ€๋ถ€๋ถ„ shuffle ๋•Œ๋ฌธ

shuffle ์ „ / ํ›„ ์ฐจ์ด

โŒ shuffle ์žˆ์Œ

Executor A → ๋„คํŠธ์›Œํฌ → Executor B
Executor C → ๋””์Šคํฌ → Executor D

โœ… shuffle ์—†์Œ

๊ฐ Executor๊ฐ€ ์ž๊ธฐ ๋ฐ์ดํ„ฐ๋งŒ ์ฒ˜๋ฆฌ


shuffle์ด ์ƒ๊ฒผ๋Š”์ง€ ์–ด๋–ป๊ฒŒ ์•„๋ƒ?

1๏ธโƒฃ ์ฝ”๋“œ๋กœ ๋ฐ”๋กœ ๋ณด์ž„

groupBy / join / orderBy

2๏ธโƒฃ Spark UI์—์„œ ํ™•์ธ

  • DAG์— shuffle boundary
  • Stage๊ฐ€ ๋‚˜๋‰˜์–ด ์žˆ์Œ

๊ทธ๋ž˜์„œ ์ตœ์ ํ™”์˜ ํ•ต์‹ฌ์€?

๐Ÿ‘‰ shuffle์„ ์ค„์ด๊ฑฐ๋‚˜, ํ”ผํ•˜๋Š” ๊ฒƒ

๋Œ€ํ‘œ ๋ฐฉ๋ฒ•:

  • ์ž‘์€ ํ…Œ์ด๋ธ” broadcast join
  • join ์ „์— filter, select
  • ๋ถˆํ•„์š”ํ•œ groupBy ์ œ๊ฑฐ
  • ์ ์ ˆํ•œ partition ์„ค๊ณ„

ํ•œ ๋ฌธ์žฅ ์š”์•ฝ (๋ฉด์ ‘์šฉ โœจ)

“Shuffle์€ Spark์—์„œ ์—ฐ์‚ฐ์„ ์œ„ํ•ด ๋ฐ์ดํ„ฐ๋ฅผ Executor ๊ฐ„์— ์žฌ๋ถ„๋ฐฐํ•˜๋Š” ๊ณผ์ •์œผ๋กœ, ๋„คํŠธ์›Œํฌ์™€ ๋””์Šคํฌ I/O๊ฐ€ ๋ฐœ์ƒํ•ด ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์˜ ์ฃผ์š” ์›์ธ์ด ๋ฉ๋‹ˆ๋‹ค.”


์ด๊ฒƒ๋งŒ ๊ธฐ์–ตํ•ด๋„ ๋จ

  • shuffle = ๋ฐ์ดํ„ฐ ์ด๋™
  • ๋А๋ฆผ = ๋„คํŠธ์›Œํฌ + ๋””์Šคํฌ
  • Spark ์ตœ์ ํ™” = shuffle ์ตœ์†Œํ™”

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์กฐ์ธ(broadcast join)

  • *๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์กฐ์ธ(broadcast join)**์€ Spark ์ตœ์ ํ™”์—์„œ ํšจ๊ณผ๊ฐ€ ์ œ์ผ ํฐ ๊ธฐ์ˆ  ์ค‘ ํ•˜๋‚˜์•ผ. ์•„์ฃผ ์‰ฝ๊ฒŒ ์„ค๋ช…ํ• ๊ฒŒ.

ํ•œ ์ค„ ์ •์˜

๐Ÿ‘‰ ์ž‘์€ ํ…Œ์ด๋ธ”์„ ๋ชจ๋“  Executor ๋ฉ”๋ชจ๋ฆฌ์— ๋ณต์‚ฌํ•ด์„œ,ํฐ ํ…Œ์ด๋ธ”์ด “์ด๋™ ์—†์ด” ์กฐ์ธํ•˜๋Š” ๋ฐฉ์‹


์™œ ์ด๊ฒŒ ๋น ๋ฅด๋ƒ?

์ผ๋ฐ˜ ์กฐ์ธ์€ ๐Ÿ‘‡

  • ํฐ ํ…Œ์ด๋ธ” ↔ ํฐ ํ…Œ์ด๋ธ”
  • ์„œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋™(shuffle) ํ•ด์•ผ ํ•จ → ๋А๋ฆผ โŒ

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์กฐ์ธ์€ ๐Ÿ‘‡

  • ์ž‘์€ ํ…Œ์ด๋ธ”์„ ๋ฏธ๋ฆฌ ์ „ํŒŒ
  • ํฐ ํ…Œ์ด๋ธ”์€ ์ž๊ธฐ ์ž๋ฆฌ์—์„œ lookup๋งŒ ํ•จ
  • → shuffle ์—†์Œ โœ…

๊ทธ๋ฆผ ์—†์ด ๋น„์œ ๋กœ ์ดํ•ดํ•˜๊ธฐ ๐Ÿ“˜

  • ํฐ ์ฑ…(๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ)์ด ์—ฌ๋Ÿฌ ๊ต์‹ค์— ๋‚˜๋‰˜์–ด ์žˆ์Œ
  • ์ž‘์€ ์ฐธ๊ณ ํ‘œ(์ž‘์€ ํ…Œ์ด๋ธ”)๋ฅผ ๋ชจ๋“  ๊ต์‹ค์— ํ•œ ๋ถ€์”ฉ ๋ณต์‚ฌ

๐Ÿ‘‰ ํ•™์ƒ(ํฐ ํ…Œ์ด๋ธ”)์ด ์ด๋™ํ•  ํ•„์š” ์—†์ด

๐Ÿ‘‰ ์ž๊ธฐ ๊ต์‹ค์—์„œ ๋ฐ”๋กœ ์ฐธ๊ณ ํ‘œ ๋ณด๊ณ  ์กฐ์ธ


์–ธ์ œ “์ž‘๋‹ค”๊ณ  ํ•˜๋ƒ?

์ •ํ™•ํ•œ ๊ธฐ์ค€์€ ์—†์ง€๋งŒ ๋ณดํ†ต ๐Ÿ‘‡

  • ์ˆ˜ MB ~ ์ˆ˜์‹ญ MB
  • ์ˆ˜์ฒœ ~ ์ˆ˜์‹ญ๋งŒ row
  • ๋ฉ”๋ชจ๋ฆฌ์— ์ถฉ๋ถ„ํžˆ ์˜ฌ๋ผ๊ฐ€๋Š” ํฌ๊ธฐ

ํ•ต์‹ฌ ๊ธฐ์ค€: Executor ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋‚˜?


์ฝ”๋“œ๋กœ ๋ณด๋ฉด ๋ฐ”๋กœ ์ดํ•ด๋จ

from pyspark.sql.functions import broadcast

df_big.join(
    broadcast(df_small),
    "user_id"
)

์ด ํ•œ ์ค„์ด ์˜๋ฏธํ•˜๋Š” ๊ฒƒ ๐Ÿ‘‡

๐Ÿ‘‰ df_small์„ ๋ชจ๋“  Executor์— ๋ณต์‚ฌ


broadcast ์•ˆ ์“ฐ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜?

df_big.join(df_small, "user_id")

๐Ÿ‘‰ Spark๊ฐ€ ํŒ๋‹จํ•ด์„œ:

  • ๋‘˜ ๋‹ค ํฌ๋ฉด → shuffle join
  • ์ž‘์œผ๋ฉด → ์ž๋™ broadcast (์•ˆ ๋  ์ˆ˜๋„ ์žˆ์Œ)

๐Ÿ“Œ ๊ทธ๋ž˜์„œ ๋ช…์‹œ์ ์œผ๋กœ broadcast ์“ฐ๋Š” ๊ฒŒ ์•ˆ์ „


์–ธ์ œ ์“ฐ๋ฉด ์•ˆ ๋˜๋‚˜? โŒ

  • ํ…Œ์ด๋ธ”์ด ํผ (์ˆ˜๋ฐฑ MB ์ด์ƒ)
  • Executor ๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ
  • ์ž์ฃผ ๋ฐ”๋€Œ๋Š” ๋Œ€ํ˜• ์ฐจ์› ํ…Œ์ด๋ธ”

๐Ÿ‘‰ ์ด๋•Œ ์“ฐ๋ฉด OOM(๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ)


์‹ค๋ฌด์—์„œ ์ž์ฃผ ์“ฐ๋Š” ํŒจํ„ด

์˜ˆ์‹œ

  • clickstream (์ˆ˜์–ต row)
  • product / user / category ํ…Œ์ด๋ธ” (์ž‘์Œ)
clicks.join(
    broadcast(products),
    "product_id"
)

๐Ÿ‘‰ ์ •์„ ์ค‘ ์ •์„


Spark๋Š” ์ž๋™์œผ๋กœ ํ•ด์ฃผ์ง€ ์•Š๋‚˜?

๐Ÿ‘‰ ๊ฐ€๋”์€ ํ•ด์ฃผ์ง€๋งŒ, ๋ฏฟ์ง€ ์•Š๋Š”๋‹ค

  • ํ†ต๊ณ„ ๋ถ€์ •ํ™•
  • threshold ์ดˆ๊ณผ
  • ๊ณ„ํš ๋ณ€๊ฒฝ

๐Ÿ“Œ ์ค‘์š”ํ•œ join์€ ์ง์ ‘ broadcast ์ง€์ •


๋ฉด์ ‘์šฉ ํ•œ ๋ฌธ์žฅ โœจ

“Broadcast join์€ ์ž‘์€ ํ…Œ์ด๋ธ”์„ Executor ๋ฉ”๋ชจ๋ฆฌ์— ๋ณต์‚ฌํ•ด

shuffle ์—†์ด ํฐ ํ…Œ์ด๋ธ”๊ณผ ์กฐ์ธํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ,

Spark ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.”


์ด๊ฒƒ๋งŒ ๊ธฐ์–ตํ•˜๋ฉด ๋จ

  • broadcast join = shuffle ์ œ๊ฑฐ
  • ์ž‘์€ ํ…Œ์ด๋ธ”์ผ ๋•Œ๋งŒ
  • ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋ถ„ํ•  ๋•Œ๋งŒ
  • join ์„ฑ๋Šฅ ์ฒด๊ฐ ์ตœ๊ณ 

728x90
๋ฐ˜์‘ํ˜•

'DataEngineering > Spark' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

spark ui ์—์„œ ๋ณ‘๋ชฉ์ฐพ๋Š”๋ฒ•  (0) 2026.01.30
Spark ์ตœ์ ํ™” ์ฒดํฌ๋ฆฌ์ŠคํŠธ  (0) 2026.01.30
df.cache() ๋ฅผ ์–ธ์ œ ์จ์•ผํ•˜๋‚˜?  (0) 2026.01.30
Dataframe ์ด๋ž€?  (0) 2026.01.30
Lazy Evaluation ์ด๋ž€?  (0) 2026.01.30
    'DataEngineering/Spark' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€
    • spark ui ์—์„œ ๋ณ‘๋ชฉ์ฐพ๋Š”๋ฒ•
    • Spark ์ตœ์ ํ™” ์ฒดํฌ๋ฆฌ์ŠคํŠธ
    • df.cache() ๋ฅผ ์–ธ์ œ ์จ์•ผํ•˜๋‚˜?
    • Dataframe ์ด๋ž€?
    kkh1902
    kkh1902
    1Day 1 Commit ๋ชฉํ‘œ ๊ณต๋ถ€ํ•œ๊ฒƒ๋“ค ๋งค์ผ ๊ธฐ๋กํ•˜๊ธฐ

    ํ‹ฐ์Šคํ† ๋ฆฌํˆด๋ฐ”