kkh1902
Steadily
What Is a DataFrame?

2026. 1. 30. 19:37



One-line definition

👉 A table-shaped data structure made up of rows and columns

(the same idea as an Excel sheet or a SQL table)


Intuitive analogy

DataFrame = an Excel-style table split across many machines (up to thousands) that compute on it in parallel


Key characteristics of a DataFrame

1️⃣ It has a schema

  • Every column has a name and a type
  • Spark therefore knows the structure of the data
user_id : long
event_time : timestamp
price : double

👉 This is the biggest difference from an RDD
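
To make the value of a schema concrete, here is a minimal pure-Python sketch. It is a toy model and not Spark code: the `SCHEMA` dict, `check_columns`, and `check_row` are hypothetical names invented for illustration, and Python's `int`, `datetime`, and `float` stand in for Spark's long, timestamp, and double. The idea it shows is that once column names and types are known up front, a misspelled column or a wrongly typed value can be rejected before any real work starts. Spark performs a comparable name check when it analyzes a query.

```python
from datetime import datetime

# Toy schema: column name -> Python type (stand-ins for long / timestamp / double).
SCHEMA = {"user_id": int, "event_time": datetime, "price": float}


def check_columns(schema, referenced):
    """Return the referenced column names that the schema does not define."""
    return [name for name in referenced if name not in schema]


def check_row(schema, row):
    """Return True when the row has exactly the schema's columns and types."""
    if set(row) != set(schema):
        return False
    return all(isinstance(row[name], col_type) for name, col_type in schema.items())


good_row = {"user_id": 1, "event_time": datetime(2026, 1, 30, 19, 37), "price": 120.0}
bad_row = {"user_id": "1", "event_time": datetime(2026, 1, 30, 19, 37), "price": 120.0}

# A misspelled column ("prcie") is detected without touching any data.
missing = check_columns(SCHEMA, ["user_id", "prcie"])
good_ok = check_row(SCHEMA, good_row)
bad_ok = check_row(SCHEMA, bad_row)  # user_id is a str, so the row is rejected
print(missing, good_ok, bad_ok)
```

A plain list of untyped objects offers no such hook, which is the contrast with an RDD drawn above.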


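2️⃣ It is queried like SQL

df.filter(df.price > 100).groupBy("user_id").count()

or, once the DataFrame is registered as a temporary view named df (for example with df.createOrReplaceTempView("df")):

SELECT user_id, COUNT(*)
FROM df
WHERE price > 100
GROUP BY user_id

👉 Familiar to both analysts and engineers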


3️⃣ It is optimized automatically

Internally, Spark:

  • Skips columns the query does not need (column pruning)
  • Applies filter conditions as early as possible (predicate pushdown)
  • Rewrites the execution plan

👉 Queries are usually fast even without hand tuning
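
The effect of those rewrites can be sketched with a toy example in plain Python. This is not how Spark implements its optimizer. The sample rows and the functions `naive_plan` and `optimized_plan` are assumptions made up for illustration. Both plans answer the same question (which `user_id` values have `price > 100`), but the second reads only the columns it needs and filters while scanning, so it touches fewer values.

```python
# Toy table: each row is a dict with three columns.
ROWS = [
    {"user_id": 1, "event_time": "2026-01-30 10:00", "price": 150.0},
    {"user_id": 1, "event_time": "2026-01-30 10:05", "price": 80.0},
    {"user_id": 2, "event_time": "2026-01-30 10:07", "price": 300.0},
    {"user_id": 3, "event_time": "2026-01-30 10:09", "price": 20.0},
]


def naive_plan(rows):
    """Read every column of every row, then filter, then keep user_id."""
    cells_read = 0
    loaded = []
    for row in rows:
        loaded.append(dict(row))  # the full row is copied: all columns are read
        cells_read += len(row)
    filtered = [r for r in loaded if r["price"] > 100]
    return [r["user_id"] for r in filtered], cells_read


def optimized_plan(rows):
    """Read only the needed columns and apply the filter during the scan."""
    cells_read = 0
    result = []
    for row in rows:
        price = row["price"]  # needed by the filter
        cells_read += 1
        if price > 100:
            result.append(row["user_id"])  # needed by the output
            cells_read += 1
    return result, cells_read


naive_result, naive_cells = naive_plan(ROWS)
fast_result, fast_cells = optimized_plan(ROWS)
print(naive_result, naive_cells)  # [1, 2] 12
print(fast_result, fast_cells)    # [1, 2] 6
```

In real PySpark code, calling `df.explain()` on a DataFrame prints the plan Spark actually chose, which is the practical way to see whether such rewrites were applied.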


4️⃣ Distributed processing is built in

  • Data is split into partitions automatically
  • Multiple executors process the partitions in parallel
  • Failed work is recovered automatically

👉 Well suited to large datasets
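
The partition-and-merge pattern behind those bullets can be sketched on a single machine. The code below is a simplified stand-in, not Spark itself: threads play the role of executors, and the sample data and the helpers `split_into_partitions` and `count_expensive` are hypothetical. It computes the same result as the filter-and-count query shown earlier (rows with `price > 100`, counted per `user_id`). Fault recovery is not modeled here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# (user_id, price) pairs standing in for a large table.
EVENTS = [(1, 150.0), (2, 90.0), (1, 200.0), (3, 120.0), (2, 300.0), (3, 10.0), (1, 50.0)]


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into `num_partitions` lists."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_expensive(partition):
    """Per-partition work: count rows with price > 100 for each user_id."""
    return Counter(user_id for user_id, price in partition if price > 100)


partitions = split_into_partitions(EVENTS, 3)

# Threads stand in for executors; each one handles a single partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_expensive, partitions))

# Merge step: combine the per-partition counts into the final answer.
total = sum(partial_counts, Counter())
print(dict(total))
```

Because each partition is processed independently, losing one partial result only requires redoing that one partition, which gives a rough intuition for why recovery can be automatic.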


DataFrame vs RDD (at a glance)

Category        DataFrame               RDD

Structure       Table (rows/columns)    List
Schema          Yes                     No
Optimization    Automatic               None
Difficulty      Easy                    Hard
In practice     The standard            Rarely used

👉 In practice, DataFrame is the default


Why DataFrames matter in Spark

  • They plug directly into the Spark SQL engine
  • They work well with BI tools, dbt, and data warehouses
  • Code is short and readable
  • Performance optimization is automatic

👉 So it is fair to think of Spark as a DataFrame-centric engine


When should you not use a DataFrame?

Almost never 😅

Only in very special cases:

  • Unstructured data with no column structure
  • Complex operations on individual objects

15-second interview answer

“A DataFrame is a distributed data structure organized into rows and columns.

Because it carries a schema, Spark can optimize it automatically.

In current Spark practice, the DataFrame API is the standard rather than RDDs.”


Final takeaway

👉 A DataFrame is both ‘a table that is easy for people to read’ and ‘a structure that is easy for Spark to optimize’.

