Parquet Profile Configuration Schema
The Schema tab is the primary configuration in the Parquet Profile. This tab allows the author to specify a Parquet schema. The Parquet Encoder and Parquet Decoder agents use this schema for different purposes:
Parquet Encoder Agent - The Parquet Encoder agent generates a Parquet document that conforms to the specified schema. The data conforms to the schema, and the schema itself is also included in the Parquet document.
Parquet Decoder Agent - When the Parquet Decoder agent processes a Parquet document, only the columns named in the specified schema are included in the output. For example:
Consider a document with columns A, B, C, and D.
Assume that the schema in the Parquet Profile only specifies columns A and D.
The generated ParquetDecoderUDRs will include only fields A and D in the payload map.
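The column selection in the example above can be pictured with a minimal Python sketch. This is only an illustration of the filtering behavior, not the agent's implementation: the function name `project_row` and the sample values are hypothetical, and a plain dictionary stands in for the payload map.

```python
def project_row(row, schema_columns):
    """Keep only the columns that the schema names, in schema order."""
    return {name: row[name] for name in schema_columns if name in row}


# One row of a document that holds columns A, B, C, and D (sample values).
document_row = {"A": 1, "B": "not in schema", "C": 3.5, "D": True}

# The schema in the profile names only columns A and D.
schema_columns = ["A", "D"]

payload = project_row(document_row, schema_columns)
print(payload)  # {'A': 1, 'D': True}
```

Columns B and C are present in the document but never reach the payload, because the schema does not name them.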
Note that the Parquet Profile (and hence the schema) is required for Parquet Encoder agents and optional for Parquet Decoder agents.
The Parquet Profile's Schema tab with an example of a defined schema. You will have to write the schema to suit your own requirements.
Setting | Description |
---|---|
Schema | See below. |
Validate | Press the Validate button to check that the schema is correctly formatted. |
Defining the Parquet Schema
To define a schema, it helps to understand primitives, nested groups, repetition levels, and logical types, as described below:
Example Parquet Schema
Apache Parquet supports a small set of primitives (integer, floating point, boolean, and byte array). These primitives can be extended using logical type annotations, which are modifiers on primitives. For example, the UTF8 annotation is a modifier on byte arrays that denotes string data. Parquet also supports structured data through groups and repetition types (that is, required, optional, and repeated).
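To make these building blocks concrete, the following Python sketch models a field as a repetition type, a primitive, an optional logical type annotation, and optional child fields, and renders the result as schema text. It assumes the schema is written in the textual message format used by Apache Parquet tooling. The class `SchemaField`, the helper functions, and the record and field names are all hypothetical and are not part of the Parquet Profile.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaField:
    repetition: str                   # "required", "optional", or "repeated"
    name: str
    primitive: Optional[str] = None   # for example "int64"; None for a group
    logical: Optional[str] = None     # logical type annotation, for example "UTF8"
    children: List["SchemaField"] = field(default_factory=list)


def render_field(f, indent=1):
    """Return the schema text lines for one field, recursing into groups."""
    pad = "  " * indent
    if f.children:
        lines = [f"{pad}{f.repetition} group {f.name} {{"]
        for child in f.children:
            lines.extend(render_field(child, indent + 1))
        lines.append(f"{pad}}}")
        return lines
    annotation = f" ({f.logical})" if f.logical else ""
    return [f"{pad}{f.repetition} {f.primitive} {f.name}{annotation};"]


def render_message(name, fields):
    """Wrap the rendered fields in a top-level message block."""
    lines = [f"message {name} {{"]
    for f in fields:
        lines.extend(render_field(f))
    lines.append("}")
    return "\n".join(lines)


fields = [
    SchemaField("required", "id", "int64"),
    SchemaField("optional", "label", "binary", "UTF8"),
    SchemaField("repeated", "readings", children=[
        SchemaField("required", "value", "double"),
        SchemaField("optional", "valid", "boolean"),
    ]),
]

schema_text = render_message("example_record", fields)
print(schema_text)
```

In the rendered text, `label` is a byte array annotated as a string, and `readings` is a repeated group, so each record can hold any number of nested `value` and `valid` pairs.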
Example - Parquet Schema
This structured text block shows an example Parquet schema for company employees: