...
Apache Parquet is a file-based data format known for its efficient data representation. Parquet is built to support fast and effective compression and encoding schemes for simple columnar, complex nested, and raw binary data.
Columnar Data
Perhaps the most significant design feature of Parquet is its column-oriented storage of data, meaning that values for each column are stored together. Most common formats – like CSV and Avro – are row-oriented.
...
Organizing data by columns has many performance advantages. When a query touches only particular columns of a Parquet file, the desired data can be retrieved with less I/O. Unlike row-oriented storage formats like CSV, where every row must be read in full, only the desired columns need to be loaded into memory, resulting in a lower memory footprint and faster queries.
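To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Parquet on-disk format, and the names `to_columns` and `select` are hypothetical helpers invented for this illustration.

```python
# Toy illustration only: this models the layout idea, not Parquet's file format.

# Row-oriented: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 0.7},
    {"id": 3, "name": "c", "score": 0.9},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def select(columns, wanted):
    """Return only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


# Column-oriented: all values of one column sit together.
columns = to_columns(rows)

# A query that needs only "score" reads one list instead of every record.
scores = select(columns, ["score"])
```

In the row layout, answering the same query means visiting every record and discarding the `id` and `name` fields, which is the extra I/O that a columnar format avoids.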
Flexible, Extensible Encoding
Parquet also allows compression on a per-column basis. Different encoding schemes can be used for different columns, allowing for, say, dictionary-based encodings to be used for columns with enumerated strings or bit packing for columns with small integer values.
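The two encodings mentioned above can be sketched in a few lines of plain Python. This is an illustration of the ideas only: the functions `dictionary_encode`, `bit_pack`, and `bit_unpack` are hypothetical, and their output is not claimed to be byte-compatible with Parquet's own encodings.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> index in `dictionary`
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


def bit_pack(values, bit_width):
    """Pack non-negative integers into bytes using bit_width bits per value."""
    buffer = 0
    for i, value in enumerate(values):
        if value < 0 or value >= (1 << bit_width):
            raise ValueError(f"{value} does not fit in {bit_width} bits")
        buffer |= value << (i * bit_width)
    n_bytes = (len(values) * bit_width + 7) // 8
    return buffer.to_bytes(n_bytes, "little")


def bit_unpack(data, bit_width, count):
    """Inverse of bit_pack: recover `count` integers from packed bytes."""
    buffer = int.from_bytes(data, "little")
    mask = (1 << bit_width) - 1
    return [(buffer >> (i * bit_width)) & mask for i in range(count)]


# A column of enumerated strings becomes a small dictionary plus indices.
colors = ["red", "green", "red", "blue", "green"]
dictionary, indices = dictionary_encode(colors)

# The indices are small integers, so 2 bits each is enough for 3 distinct values.
packed = bit_pack(indices, bit_width=2)
restored = bit_unpack(packed, bit_width=2, count=len(indices))
```

Here five strings reduce to a three-entry dictionary and two bytes of packed indices, which is why these encodings suit low-cardinality and small-integer columns.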
Self-Describing Schemas
Apache Parquet is a file-oriented encoding, and the file includes metadata that specifies the schema for each column, the location of each column within the file, the encodings used, etc. Because this metadata is stored in a footer at the end of the file, a reader needs access to the complete file before it can start processing.
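As a rough sketch of why the end of the file matters: a Parquet file begins and ends with the 4-byte magic `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata. The helper `footer_span` below is hypothetical, and the bytes it is run on are a synthetic stand-in rather than a real Parquet file. Decoding the footer itself (it is Thrift-serialized) is out of scope here.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_span(data: bytes):
    """Return (start, end) byte offsets of the serialized footer metadata."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (magic bytes missing)")
    # The 4 bytes before the trailing magic are the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    end = len(data) - 8
    start = end - footer_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return start, end


# Synthetic stand-in with the same outer layout; the payloads are placeholders.
fake_footer = b"metadata-bytes"
fake_file = (
    MAGIC
    + b"column-chunks"
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)

start, end = footer_span(fake_file)
```

A reader has to seek to the last 8 bytes first, then back to the footer, and only then to the column data, so a stream that has not yet delivered its final bytes cannot be interpreted.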
Example Parquet Schema
Excerpt from: Parquet Profile Configuration Schema(3.0)
Parquet Concepts
The schema in the previous section illustrates Apache Parquet concepts, but it helps to have a good grasp of primitives, nested groups, repetition levels, and logical types. Briefly:
...
For further reading on Parquet, see these documents: Apache Parquet Documentation, Parquet Logical Types Definitions, and Maven Repository Apache Parquet.
Parquet in
Excerpt from: Parquet Examples(3.0)
...