kkh1902
Steadily
When should you use df.cache()?

2026. 1. 30. 19:38

One-line answer

👉 Cache only when you reuse the same DataFrame multiple times.



Why cache is needed

Spark uses lazy evaluation, so

every action on a DataFrame recomputes it from the very beginning.

CSV → filter → join → groupBy

If you run several actions on this, like

  • count()
  • show()
  • write

👉 it reads from the CSV again every single time 😱

That's why cache() keeps the intermediate result in memory. cache() itself is lazy too: the data is stored when the first action after it runs, and later actions reuse it.
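To make the recompute-on-every-action point concrete, here is a tiny plain-Python model. It is not Spark: `LazyFrame`, `read_source`, and `build` are made-up names for illustration, and the only thing it imitates is that an uncached lazy pipeline re-runs from the source on every action.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (a model, not Spark)."""

    def __init__(self, compute):
        self._compute = compute      # rebuilds the rows from the source
        self._cached_rows = None
        self._use_cache = False

    def cache(self):
        self._use_cache = True       # lazy, like Spark: nothing is computed yet
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows
        rows = self._compute()       # full recomputation from the source
        if self._use_cache:
            self._cached_rows = rows
        return rows

    def count(self):                 # "action" 1
        return len(self._rows())

    def collect(self):               # "action" 2
        return list(self._rows())


reads = {"n": 0}                     # how many times the source was read


def read_source():
    reads["n"] += 1
    return [1, 2, 3, 4, 5, 6]


def build():
    # Stands in for the CSV -> filter -> join -> groupBy chain.
    return [x for x in read_source() if x % 2 == 0]


plain = LazyFrame(build)
plain.count()
plain.collect()
uncached_reads = reads["n"]          # one source read per action

reads["n"] = 0
kept = LazyFrame(build).cache()
kept.count()                         # first action fills the cache
kept.collect()                       # second action reuses it
cached_reads = reads["n"]
```

Without `cache()` the source is read once per action. With it, only the first action pays for the read.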


✅ When you should cache

1️⃣ The same df is used in multiple actions

df2 = df.filter(...)

df2.count()
df2.write.parquet(...)

👉 Call df2.cache() before the first action 👍


2️⃣ Intermediate results after expensive operations

  • large joins
  • aggregations
  • after applying a UDF

👉 When you don't want to compute it again


3️⃣ Iterative work / exploratory analysis

  • EDA
  • debugging
  • checking several statistics

❌ When you should not cache

1️⃣ A df that is used only once

df.filter(...).write.parquet(...)

👉 cache ❌ (pointless)


2️⃣ The df is too large

  • not enough memory → spill → actually slower

3️⃣ You only write it out right away

  • the write is a single action anyway

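cache vs persist

  • cache() = persist() with the default storage level (for DataFrames it uses memory first and falls back to disk; MEMORY_ONLY is the default for RDDs)
  • persist() = you choose the storage level
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

👉 For when you want to guard against running out of memory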
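One way to keep the cache-or-persist choice in a single place is a small helper. This is only a sketch: `persist_for_reuse` is a hypothetical name, and it relies on nothing beyond `df.cache()` and `df.persist(level)`.

```python
def persist_for_reuse(df, level=None):
    """Mark `df` for reuse and return it.

    level=None            -> df.cache(), i.e. the default storage level
    level=<StorageLevel>  -> df.persist(level), e.g. StorageLevel.DISK_ONLY
    """
    if level is None:
        return df.cache()
    return df.persist(level)


# Usage with a real SparkSession (assumes pyspark is installed and `df` exists):
#   from pyspark import StorageLevel
#   big = persist_for_reuse(df, StorageLevel.DISK_ONLY)
#   print(big.storageLevel)   # shows which level is actually in effect
```

Printing `storageLevel` afterwards is a cheap way to confirm which level you actually got.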


A pattern often used in practice

df_clean = (
    df_raw
    .filter(...)
    .join(...)
)

df_clean.cache()

df_clean.count()
df_clean.write.parquet(...)

And when you're done 👇

df_clean.unpersist()
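It is easy to forget the unpersist() step, especially when an action in the middle throws. A minimal sketch of one way around that is a context manager. `cached` is a hypothetical helper name, and it assumes only that the object has `cache()` and `unpersist()`, as a PySpark DataFrame does.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache `df` for the duration of a with-block, then release it."""
    df.cache()              # lazy: only marks the DataFrame for caching
    try:
        yield df
    finally:
        df.unpersist()      # runs even if an action inside the block fails


# Usage with a real DataFrame (the output path is a placeholder):
#   with cached(df_clean) as c:
#       n = c.count()                  # first action materializes the cache
#       c.write.parquet("out/path")    # second action reuses it
```

The cache then lives exactly as long as the block that reuses it.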


How to check in the Spark UI

  • Storage tab
  • Cached RDDs / DataFrames
  • check memory usage

👉 If you cached something, always check
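Besides the UI, a quick check from code is possible too. The sketch below assumes the PySpark DataFrame attributes `is_cached` and `storageLevel`; `cache_status` is a hypothetical helper name. As far as I understand, `is_cached` only says the DataFrame was marked for caching, so the Storage tab is still the place to see what was actually materialized and how much memory it takes.

```python
def cache_status(df):
    """Summarize whether `df` is marked as cached and at which storage level."""
    return {
        "is_cached": bool(df.is_cached),
        "storage_level": str(df.storageLevel),
    }


# Usage with a real DataFrame:
#   df_clean.cache()
#   print(cache_status(df_clean))
```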


One sentence for interviews

“In Spark, df.cache() is used to avoid redundant computation when the same DataFrame is reused across multiple actions; when a DataFrame is used only once, it is not used.”


The final key sentence

👉 Use cache only when you reuse.
