Data Engineering

Data Format

데먕 2022. 6. 17. 11:02

Parquet

  • Binary Format
  • Machine-Readable
  • Splittable
  • Column-wise
  • Good for Read-Heavy Apps
  • Compressible
  • Mostly used in Apache Spark Apps
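
To illustrate why a column-wise layout suits read-heavy apps and compresses well, here is a toy sketch in plain Python. It is not Parquet's actual on-disk format; the sample `rows` and the helpers `to_columns` and `run_length_encode` are hypothetical names made up for this example.

```python
from itertools import groupby

# Hypothetical sample records, stored row by row.
rows = [
    {"user": "a", "country": "KR", "clicks": 3},
    {"user": "b", "country": "KR", "clicks": 5},
    {"user": "c", "country": "KR", "clicks": 2},
    {"user": "d", "country": "US", "clicks": 7},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one field only touches that column's values.
total_clicks = sum(columns["clicks"])

# Similar values sit next to each other in a column, so simple
# encodings such as run-length encoding shrink them effectively.
encoded_country = run_length_encode(columns["country"])
```

In a row-wise layout the same aggregate would have to scan every full record, which is the trade-off behind the read-heavy versus write-heavy split in these notes.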

Avro

  • Binary Format
  • Machine-Readable
  • Splittable
  • Row-wise
  • Good for Write-Heavy Apps
  • Compressible
  • Supports Schema Evolution
  • Mostly used in Kafka Apps
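
Schema evolution means data written with an older schema can still be read with a newer one. The sketch below mimics the basic idea in plain Python: fields the reader does not know are dropped, and fields the writer never wrote are filled from the reader's defaults. It is a simplified illustration, not Avro's full resolution rules; `resolve`, `NO_DEFAULT`, and the sample records are hypothetical.

```python
NO_DEFAULT = object()  # marks a reader field that has no default value


def resolve(record, reader_fields):
    """Project a written record onto the reader's fields.

    Fields unknown to the reader are dropped. Fields missing from the
    record are filled from the reader's default, or rejected if the
    reader declares no default for them.
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            resolved[name] = default
        else:
            raise ValueError(f"missing field without default: {name}")
    return resolved


# Record written with an older schema (it has a field that was later removed).
old_record = {"id": 1, "name": "kim", "legacy_flag": True}

# Newer reader schema: drops legacy_flag, adds email with a default.
reader_v2 = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": "unknown"}

new_view = resolve(old_record, reader_v2)
```

Adding a field without a default is the case that breaks old data, which is why defaults matter when evolving a schema.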

ORC

  • Binary Format
  • Machine-Readable
  • Splittable
  • Column-wise
  • Good for Read-Heavy Apps
  • Mostly used in Hive Apps

Protocol Buffers (ProtoBuf)

  • Binary Format
  • Machine-Readable
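  • Not splittable on its own (messages are not self-delimiting, so files need length-prefixed framing or a container format)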
  • Row-wise
  • Compressible
  • Supports Schema Evolution
  • Mostly used in gRPC and Kafka Apps
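
A serialized Protobuf message does not record its own length, so several messages stored in one file or stream are usually framed with a length prefix. The sketch below shows the base-128 varint encoding that Protocol Buffers use for integers, applied here as a length prefix. The payloads are opaque bytes rather than real Protobuf messages, and `frame` / `unframe` are hypothetical helper names.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame(messages):
    """Concatenate byte payloads, each prefixed with its varint length."""
    return b"".join(encode_varint(len(m)) + m for m in messages)


def unframe(stream):
    """Split a framed byte stream back into its payloads."""
    pos = 0
    payloads = []
    while pos < len(stream):
        length, pos = decode_varint(stream, pos)
        payloads.append(stream[pos:pos + length])
        pos += length
    return payloads
```

Small numbers take a single byte, which is part of why the binary encoding stays compact.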

JSON

  • Non-Binary
  • Human-Readable
  • Non-splittable as a single document (line-delimited JSON can be split)
  • Used for browser-based Apps
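
The human-readable versus binary trade-off can be seen with the standard library alone. The sketch below serializes one hypothetical record as JSON text and as a fixed binary layout using `struct`. The binary layout is made up for this example and is not any of the formats listed above.

```python
import json
import struct

# Hypothetical record used only for this comparison.
record = {"id": 123456, "temperature": 21.5, "active": True}

# JSON repeats the field names in every record and writes numbers as text.
text_bytes = json.dumps(record).encode("utf-8")

# "<Id?" = little-endian, 4-byte unsigned int, 8-byte float, 1-byte bool.
# The field names live in the schema (the format string), not in the data.
binary_bytes = struct.pack(
    "<Id?", record["id"], record["temperature"], record["active"]
)

# Reading the binary form back requires knowing the same layout.
decoded = struct.unpack("<Id?", binary_bytes)
```

The JSON bytes can be read in any text editor, while the binary bytes are smaller but meaningless without the schema.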

XML

  • Non-Binary
  • Human-Readable
  • Non-splittable
  • Used for browser-based Apps

CSV

  • Non-Binary
  • Human-Readable
  • Splittable only when uncompressed and no field contains a line break
  • Row-wise
  • Used for data science Apps
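
Whether a file can be split comes down to whether a reader that starts at an arbitrary byte offset can find the next record boundary. For line-oriented text where every record sits on one line, the newline is that boundary. The sketch below shows one way to assign lines to byte ranges so that each line is read exactly once; `read_split` and the sample `data` are hypothetical, and the approach assumes no quoted field contains a line break.

```python
def read_split(data, start, end):
    """Return the complete lines owned by the byte range [start, end).

    A line is owned by the split that contains its first byte, so a
    reader skips a partial first line and may read past `end` to
    finish its last line.
    """
    if start == 0:
        pos = 0
    else:
        # Find the first line that begins at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1

    lines = []
    while pos < end and pos < len(data):
        line_end = data.find(b"\n", pos)
        if line_end == -1:
            line_end = len(data)
        lines.append(data[pos:line_end])
        pos = line_end + 1
    return lines


# Hypothetical CSV content, cut into two byte ranges mid-line.
data = b"id,name\n1,kim\n2,lee\n3,park\n"
first_half = read_split(data, 0, 10)
second_half = read_split(data, 10, len(data))
```

A single large JSON or XML document has no such per-record boundary, and a gzip-compressed file cannot be read from the middle at all, which is what makes those cases non-splittable.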

Reference

https://www.youtube.com/watch?v=oipFhroPFVM