This section describes the Apache Parquet agents, which handle data encoded and compressed with Apache Parquet. The Parquet Decoder agent converts rows from Parquet documents into UDRs to be routed into workflows. The Parquet Encoder agent converts UDRs into Parquet format to be delivered to output destinations. The Parquet agents are only available for batch workflow configurations.

Apache Parquet is a file-based data format known for its efficient representation of data. Parquet is built to support fast and effective compression and encoding schemes for simple columnar, complex nested, and raw binary data.

Columnar Data

Perhaps the most significant design feature of Parquet is its column-oriented storage of data, meaning that values for each column are stored together. Most common formats – like CSV and Avro – are row-oriented.

The figure below illustrates row-oriented versus column-oriented storage.


Illustration of differences between row- and column-oriented storage

Organizing data by columns has many performance advantages. When querying Parquet documents for particular columns, the desired data can be retrieved quickly with less I/O. Unlike with row-oriented formats such as CSV, only the requested columns need to be loaded into memory, resulting in a lower memory footprint and faster queries.
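
As a minimal sketch of column projection, assuming the pyarrow library and an illustrative file name and column names (none of which are part of the product), a query for two columns could look like this:

import pyarrow.parquet as pq

# Only the requested columns are read; the remaining columns in the file
# are never loaded into memory. File and column names are illustrative.
table = pq.read_table("employees.parquet", columns=["email", "jobTitle"])
print(table.num_rows, table.column_names)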

Flexible, Extensible Encoding

Parquet also allows compression on a per-column basis. Different encoding schemes can be used for different columns, allowing for, say, dictionary-based encodings to be used for columns with enumerated strings or bit packing for columns with small integer values.
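
As a hedged sketch of such per-column choices, again assuming pyarrow and made-up column names, dictionary encoding and compression codecs can be selected column by column at write time:

import pyarrow as pa
import pyarrow.parquet as pq

# A small table with an enumerated string column and a small-integer column.
table = pa.table({
    "type": ["home", "work", "work", "home"],
    "employeeNumber": [1, 2, 3, 4],
})

# Dictionary-encode only the enumerated string column, and pick a
# different compression codec for each column.
pq.write_table(
    table,
    "example.parquet",
    use_dictionary=["type"],
    compression={"type": "snappy", "employeeNumber": "gzip"},
)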

Self-Describing Schemas

Apache Parquet is a file-oriented encoding: each file includes metadata that specifies the schema of each column, the location of each column within the file, the encodings used, and so on. Note that this structure implies that you must have access to the entire file before processing it.
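
For example, this embedded metadata can be inspected directly. The sketch below assumes pyarrow and an illustrative file name; the footer holding the metadata is stored at the end of the file, which is why the whole file must be available.

import pyarrow.parquet as pq

# Open a Parquet file and read its self-describing metadata. The footer
# containing this metadata sits at the end of the file.
pf = pq.ParquetFile("employees.parquet")
print(pf.schema_arrow)                                  # column names and types
print(pf.metadata.num_rows, pf.metadata.num_row_groups)
print(pf.metadata.row_group(0).column(0).compression)   # per-column-chunk details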

Apache Parquet supports a small set of primitive types (integer, floating point, boolean, and byte array). These primitives can be extended with logical type annotations, which act as modifiers on the primitive types; for example, the UTF8 annotation marks a byte array as string data. Parquet also supports structured data through groups and repetition levels (that is, optional, required, repeated).

Example - Parquet Schema

This structured text block shows an example Parquet schema for company employees:

message employee {
  required group id {
    required group name {
      required binary surname (UTF8);
      required binary firstName (UTF8);
      optional binary preferredFirstName (UTF8);
    }
    required int32 employeeNumber;
  }
  optional group phones (LIST) {
    repeated group list {
      required group element {
        required binary type (ENUM);
        required binary phoneNumber (UTF8);
      }
    }
  }
  required binary email (UTF8);
  optional binary manager (UTF8);
  required binary jobTitle (UTF8);
  required group team {
    required binary country (UTF8);
    required binary businessUnit (UTF8);
    required binary function (UTF8);
    optional binary team (UTF8);
    optional binary department (UTF8);
    required binary legalEntity (UTF8);
  }
  optional int32 birthdate (DATE);
}

Parquet Concepts

The schema in the previous section illustrates several Apache Parquet concepts; to read it, it helps to have a good grasp of primitives, nested groups, repetition levels, and logical types. Briefly (a short code sketch follows this list):


Primitives in Apache Parquet are the fundamental data types. They consist of integers (for example, int32, int64), floating point numbers (for example, float, double), Boolean (boolean), and byte array (binary).

Nested groups in Apache Parquet are the way structured objects (consisting of primitives or lists of groups/primitives) are put together. In the example above, id is a nested group that includes a name (which is itself a nested group) and employeeNumber (an integer primitive).

Repetition levels are modifiers that specify whether a column is optional, required, or repeated multiple times.

Logical types are used to extend the small set of primitive types. For example, the byte array (binary) primitive can be annotated to represent strings and structured JSON as well as raw binary data.
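
As a rough sketch of how these concepts map onto code, the snippet below expresses a subset of the employee schema with pyarrow (not part of the product); the comments describe the approximate Parquet primitives and annotations that pyarrow produces for each Arrow type.

import pyarrow as pa

# A subset of the employee schema expressed as Arrow types, which pyarrow
# maps onto Parquet primitives plus logical type annotations when writing:
# - pa.string()  -> binary primitive with the UTF8 annotation
# - pa.date32()  -> int32 primitive with the DATE annotation
# - pa.struct()  -> a nested group
# - nullable     -> optional; non-nullable -> required
employee = pa.schema([
    pa.field("id", pa.struct([
        pa.field("name", pa.struct([
            pa.field("surname", pa.string(), nullable=False),
            pa.field("firstName", pa.string(), nullable=False),
            pa.field("preferredFirstName", pa.string(), nullable=True),
        ]), nullable=False),
        pa.field("employeeNumber", pa.int32(), nullable=False),
    ]), nullable=False),
    pa.field("email", pa.string(), nullable=False),
    pa.field("birthdate", pa.date32(), nullable=True),
])
print(employee)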

For further reading on Parquet, see these documents: Apache Parquet Documentation, Parquet Logical Types Definitions, and Maven Repository Apache Parquet.

Parquet in

Apache Parquet support is in the form of a pair of agents:

The Parquet Decoder processes data from incoming Parquet documents, and the Parquet Encoder creates outgoing Parquet documents. The Parquet Encoder – and optionally the Decoder – makes use of a Parquet Profile that encapsulates the schema as well as encoding options.

The Parquet Decoder agent receives Parquet data from file collectors in bytearray format, converts the data into ParquetDecoderUDRs (one UDR per record), and routes those UDRs forward into the workflow.

The Parquet Encoder agent receives ParquetEncoderUDRs, converts the data into Parquet, and forwards bytearray data to a forwarder to (eventually) be written to a Parquet document.


Example workflow with Parquet Decoder and Encoder
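
The UDR conversion itself is performed by the agents. Purely as a generic illustration of the underlying transformation between Parquet byte buffers and per-record data, assuming pyarrow and nothing product-specific, the round trip amounts to something like:

import io
import pyarrow as pa
import pyarrow.parquet as pq

def decode(parquet_bytes: bytes) -> list[dict]:
    # Parse a complete Parquet document held in memory and return one
    # dictionary per record (roughly what one UDR per record represents).
    table = pq.read_table(io.BytesIO(parquet_bytes))
    return table.to_pylist()

def encode(records: list[dict]) -> bytes:
    # Build a Parquet document from per-record dictionaries and return it
    # as a byte buffer ready to be forwarded and written out.
    sink = io.BytesIO()
    pq.write_table(pa.Table.from_pylist(records), sink)
    return sink.getvalue()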

Note!

These Parquet agents are batch agents. Given that Parquet is a file-oriented encoding scheme that includes metadata about the entire document, batch agents, which natively support the processing of entire files, are a natural fit for Parquet.


The section contains the following subsections:
