...
Setting | Description |
---|---|
Schema | See below |
Validate | Validates the schema to make sure it is correctly formatted. |
Defining the Parquet Schema
To define a schema, it helps to understand primitives, nested groups, repetition levels, and logical types, as described below:
Primitives in Apache Parquet are the fundamental data types. They include integers (e.g., int32, int64), floating-point numbers (e.g., float, double), Booleans (boolean), and byte arrays (binary).
Nested groups in Apache Parquet are how structured objects (consisting of primitives or lists of groups/primitives) are assembled. In the example below, id is a nested group that contains name (itself a nested group) and employeeNumber (an integer primitive).
Repetition levels are modifiers that specify whether a column is required (exactly one value), optional (zero or one value), or repeated (zero or more values).
Logical types extend the small set of primitive types by annotating how a primitive should be interpreted. For example, a byte array can represent a string or structured JSON as well as raw binary data.
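As a rough illustration of how these four concepts fit together, the following Python sketch builds a small schema from primitives and nested groups and renders it in Parquet's textual schema notation. It uses only the standard library and is not tied to any Parquet library. The helper names (Primitive, Group, render_message) and the fields first, last, and phoneNumbers are hypothetical and chosen only for illustration; id, name, and employeeNumber follow the descriptions above.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class Primitive:
    """A leaf field: repetition + primitive type + name (+ optional logical type)."""
    repetition: str                # "required", "optional", or "repeated"
    type: str                      # e.g. "int32", "double", "boolean", "binary"
    name: str
    logical: Optional[str] = None  # logical type annotation, e.g. "UTF8"

    def render(self, indent: int = 0) -> str:
        annotation = f" ({self.logical})" if self.logical else ""
        return f"{' ' * indent}{self.repetition} {self.type} {self.name}{annotation};"


@dataclass
class Group:
    """A nested group: a named collection of primitives and/or other groups."""
    repetition: str
    name: str
    fields: List[Union["Primitive", "Group"]] = field(default_factory=list)

    def render(self, indent: int = 0) -> str:
        pad = " " * indent
        body = "\n".join(f.render(indent + 2) for f in self.fields)
        return f"{pad}{self.repetition} group {self.name} {{\n{body}\n{pad}}}"


def render_message(name: str, fields: List[Union[Primitive, Group]]) -> str:
    """Wrap top-level fields in a `message` block, the root of a schema."""
    body = "\n".join(f.render(2) for f in fields)
    return f"message {name} {{\n{body}\n}}"


# Hypothetical employee schema: `first`, `last`, and `phoneNumbers` are
# illustrative assumptions, not fields taken from this document.
schema_text = render_message("employee", [
    Group("required", "id", [
        Group("required", "name", [
            Primitive("required", "binary", "first", "UTF8"),  # string via logical type
            Primitive("optional", "binary", "last", "UTF8"),   # may be absent
        ]),
        Primitive("required", "int32", "employeeNumber"),      # integer primitive
    ]),
    Primitive("repeated", "binary", "phoneNumbers", "UTF8"),   # zero or more values
])
print(schema_text)
```

Each rendered line reads as repetition, then type (or group), then field name, with any logical type annotation in parentheses, which may make it easier to see where each concept appears in a schema.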
Example Parquet Schema
Apache Parquet supports a small set of primitives (integer, floating point, Boolean, and byte array). These primitives can be extended using logical type annotations, which are modifiers on primitives. For example, the UTF8 annotation is a modifier on byte arrays that denotes string data. Parquet also supports structured data through groups and repetition (optional, required, or repeated).
The structured text block shows an example Parquet schema for company employees:
|
...