kkh1902
Steadily
DataEngineering/Spark

What Is Spark?

2026. 1. 30. 19:36

One-line definition

👉 **Spark is “a distributed data processing engine for processing large volumes of data quickly.”**


Unpacking that a little

  • Work that is slow on a single machine
  • is spread across several machines (their CPUs and memory) at the same time
  • and processed in parallel by one system

👉 That system is Apache Spark


What is Spark good at?

✔️ Good fit

  • Large-scale data processing (GB to TB to PB)
  • Log analysis
  • ETL pipelines
  • Aggregation / statistics
  • Machine learning preprocessing

❌ Poor fit

  • Small datasets (the cluster overhead outweighs the gain)
  • Real-time, ultra-low-latency transactions (OLTP)

Spark vs a single server (getting a feel for the difference)

❌ Plain Python

Processes the data one row at a time
Typically one CPU core
Limited by one machine's memory

✅ Spark

Splits the data into pieces
Spreads them across many CPUs / many servers
Processes them at the same time


Four key features of Spark

1️⃣ Distributed processing

  • Data is split into multiple pieces (partitions)
  • Partitions are processed at the same time → fast
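
The split-then-process idea above can be sketched in plain Python. This is only an analogy, not how Spark is implemented: the helper names (`split_into_partitions`, `count_errors`, `parallel_error_count`) and the log lines are made up for illustration, and a thread pool stands in for the workers. In CPython, threads do not give real CPU parallelism for this kind of work; the point is the shape of the computation (partition, process each piece, combine).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Split rows into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(rows), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions each take one additional row.
        end = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:end])
        start = end
    return partitions


def count_errors(partition):
    """Per-partition work: count the log lines that contain 'ERROR'."""
    return sum(1 for line in partition if "ERROR" in line)


def parallel_error_count(lines, num_partitions=4):
    """Count ERROR lines by handing every partition to a worker."""
    partitions = split_into_partitions(lines, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_errors, partitions))
    # Combine the per-partition results into one final answer.
    return sum(partial_counts)


# Hypothetical log data, invented for this example.
logs = ["INFO start", "ERROR disk full", "INFO retry",
        "ERROR timeout", "INFO done"] * 1000
total_errors = parallel_error_count(logs)
print(total_errors)
```

Each worker only ever sees its own partition, which is why the per-partition function has to be something whose results can be merged afterwards (here, a simple sum).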

2️⃣ In-memory processing

  • Works mainly in memory rather than on disk
  • Often much faster than Hadoop MapReduce, which writes intermediate results to disk between steps

3️⃣ Lazy Evaluation (deferred execution)

  • Writing the code does not run it right away ❌
  • Execution starts at an action such as show(), count(), or a write (for example df.write.parquet(...))

👉 This gives Spark the chance to optimize the whole plan first
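
To make "nothing runs until an action" concrete, here is a minimal plain-Python sketch. `LazyPipeline` is a hypothetical class written for this post, not a Spark API: its `filter` and `map` only record steps, and the recorded plan runs when `count()` or `collect()` is called. Real Spark additionally rewrites and optimizes the plan before running it, which this sketch does not attempt.

```python
class LazyPipeline:
    """Records transformations and runs them only when an action is called."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)  # the recorded plan; nothing has run yet

    def filter(self, predicate):
        # Transformation: extend the plan and return a new pipeline.
        return LazyPipeline(self._rows, self._steps + (("filter", predicate),))

    def map(self, func):
        # Transformation: extend the plan and return a new pipeline.
        return LazyPipeline(self._rows, self._steps + (("map", func),))

    def _execute(self):
        # Runs the whole recorded plan in a single pass over the data.
        for row in self._rows:
            keep = True
            for kind, fn in self._steps:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                yield row

    def count(self):
        # Action: this is the point where work actually happens.
        return sum(1 for _ in self._execute())

    def collect(self):
        # Action: run the plan and return every resulting row.
        return list(self._execute())


calls = []  # records every time the predicate really runs


def is_even(n):
    calls.append(n)
    return n % 2 == 0


pipeline = LazyPipeline(range(10)).filter(is_even).map(lambda n: n * n)
calls_before_action = len(calls)   # still 0: only a plan exists so far
even_squares = pipeline.collect()  # the action triggers execution
print(calls_before_action, even_squares)
```

Because the full chain is known before anything runs, the filter and the map are applied in one pass instead of building an intermediate list after each step.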


4️⃣ A range of APIs

  • DataFrame / SQL
  • Supports Python, Scala, Java, R, and SQL

What we mostly use in Spark

📊 DataFrame

df.select().filter().groupBy()

👉 A structure you work with like a SQL table
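
To show what "like a SQL table" means, the sketch below runs the SQL form of a filter + group + count using Python's built-in sqlite3, so it works without a Spark installation. The `requests` table and its rows are invented for this example. The comment shows a roughly equivalent PySpark chain, assuming a DataFrame `df` with the same two columns; that chain is not executed here.

```python
import sqlite3

# Hypothetical request log, invented for this example.
rows = [
    ("api", 200), ("api", 500), ("web", 500),
    ("api", 500), ("web", 200), ("web", 200),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (service TEXT, status INTEGER)")
conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)

# SQL form of "filter, then group, then aggregate".
# A roughly equivalent PySpark chain (not run here) would be:
#   df.filter(df.status == 500).groupBy("service").count().orderBy("service")
query = """
    SELECT service, COUNT(*) AS errors
    FROM requests
    WHERE status = 500
    GROUP BY service
    ORDER BY service
"""
errors_by_service = conn.execute(query).fetchall()
conn.close()
print(errors_by_service)
```

The mapping is direct: WHERE corresponds to filter(), GROUP BY to groupBy(), and the SELECT list to select() or an aggregation.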


Spark components (very briefly)

Driver   : directs the whole job
Executor : performs the actual computation

  • The Driver makes the plan
  • The Executors do the actual computing
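
The division of roles can be sketched with two small classes. These `Driver` and `Executor` classes are hypothetical toys for illustration and share only their names with Spark's real components: the driver splits the job into tasks, hands them out round-robin, and merges the partial results, while the executors just compute. Threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor


class Executor:
    """Does the actual computation for the tasks it is handed."""

    def __init__(self, name):
        self.name = name

    def run(self, task):
        partition_id, numbers = task
        return partition_id, self.name, sum(numbers)


class Driver:
    """Plans the job, hands tasks to executors, and merges the results."""

    def __init__(self, executors):
        self.executors = executors

    def plan(self, data, num_tasks):
        # One task per slice of the data: (partition_id, slice).
        return [(i, data[i::num_tasks]) for i in range(num_tasks)]

    def run_sum(self, data, num_tasks=4):
        tasks = self.plan(data, num_tasks)
        with ThreadPoolExecutor(max_workers=len(self.executors)) as pool:
            futures = [
                # Round-robin assignment of tasks to executors.
                pool.submit(self.executors[i % len(self.executors)].run, task)
                for i, task in enumerate(tasks)
            ]
            results = [f.result() for f in futures]
        # The driver only merges partial sums; it never touches the raw rows.
        return sum(partial for _, _, partial in results), results


driver = Driver([Executor("executor-1"), Executor("executor-2")])
total, task_results = driver.run_sum(list(range(1, 101)))
print(total)
for partition_id, executor_name, partial in task_results:
    print(partition_id, executor_name, partial)
```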

Where is Spark used?

  • AWS EMR
  • GCP Dataproc
  • Databricks
  • On-premises clusters

👉 An essential tool for data engineers


One-sentence summary (for interviews ✨)

“Spark is a data processing engine that processes large volumes of data quickly in a distributed environment, working primarily in memory.”


The real essentials, once more

  • Spark = distributed processing
  • Why it is fast = parallelism + memory
  • Core API = DataFrame
  • Usual cause when it is slow = Shuffle
