This example illustrates typical use of the Parquet Decoder agent in a batch workflow. In this example, complete records are processed using the embedded document schema. The following configurations will be created:

  • Ultra Format
  • Batch Workflow that makes use of a Parquet Decoder agent that parses Parquet documents.

Define an Ultra Format

Create a simple Ultra Format for the incoming UDRs. For more information about the Ultra Format Editor and the UFDL syntax, refer to the Ultra Format documentation.

Example - Ultra

Create an Ultra Format as defined below:

external BOOK_HEADER : identified_by(strREContains(HEADER, "title,name,organization,copyrightYear")), terminated_by(0xA)
{
  ascii HEADER : terminated_by(0xA);
};

external BookRecord
{
  ascii title                 : terminated_by(",");
  ascii authorName            : terminated_by(",");
  ascii organization          : terminated_by(",");
  ascii copyrightYearString   : terminated_by(",");
  ascii numberOfPages         : terminated_by(0xA);
};

internal BookRecord
{
  string title;
  string authorName;
  string organization;
  string copyrightYearString;
  int numberOfPages;

  //  enriched
  date copyrightYear;
};

//  decoder
in_map BOOK_HEADER_InMap : external(BOOK_HEADER), target_internal(BOOK_HEADER), discard_output { automatic; };
in_map BookRecord_InMap : external(BookRecord), internal(BookRecord) { automatic; };
decoder BOOK_HEADER_Decode : in_map(BOOK_HEADER_InMap);
decoder BookRecord_Decode : in_map(BookRecord_InMap);
decoder DECODER { decoder BOOK_HEADER_Decode; decoder BookRecord_Decode *; };

//  encoder
out_map BookRecord_OutMap : external(BookRecord), internal(BookRecord) { automatic; };
encoder ENCODER : out_map(BookRecord_OutMap);
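
For readers less familiar with UFDL, the record layout described by the format above can be approximated in Python. This is only an illustrative sketch, not how the platform implements Ultra: the `decode` and `encode` helpers are hypothetical names, the header check assumes `strREContains` behaves like a regular-expression search, and the sample data is invented.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterator, Optional

# Same pattern the BOOK_HEADER external uses in identified_by(strREContains(...)).
HEADER_PATTERN = re.compile(r"title,name,organization,copyrightYear")


@dataclass
class BookRecord:
    """Python stand-in for the internal BookRecord format."""
    title: str
    authorName: str
    organization: str
    copyrightYearString: str
    numberOfPages: int
    copyrightYear: Optional[date] = None  # enriched field, never read from the CSV


def decode(text: str) -> Iterator[BookRecord]:
    """Skip one leading header line, then yield one BookRecord per line."""
    lines = [line for line in text.split("\n") if line]
    if lines and HEADER_PATTERN.search(lines[0]):
        lines = lines[1:]  # mirrors discard_output on BOOK_HEADER_InMap
    for line in lines:
        title, author, organization, year, pages = line.split(",")
        yield BookRecord(title, author, organization, year, int(pages))


def encode(record: BookRecord) -> str:
    """Write the five external fields, comma-separated, ending with a line feed."""
    fields = [record.title, record.authorName, record.organization,
              record.copyrightYearString, str(record.numberOfPages)]
    return ",".join(fields) + "\n"
```

Like the `terminated_by(",")` fields in the format, this sketch splits on every comma, so it does not handle quoted values that contain commas.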


Create a Batch Workflow

In this workflow, Parquet files are collected from disk, decoded into UDRs, and written to a CSV file. The workflow is illustrated here:

Example workflow with Parquet Decoder

Walking through the example workflow from left to right, we have:

  • A Disk agent that reads in the source file (which contains a Parquet document) as a byte array.
  • A Parquet Decoder agent that parses the bytes from the file as Parquet, passing ParquetDecoderUDRs to the Analysis agent.
  • An Analysis agent that transforms these incoming ParquetDecoderUDRs into BookRecord UDRs.
  • An Encoder agent that encodes the BookRecord UDRs as CSV bytes.
  • A Disk forwarding agent that receives the bytearray data and writes out a CSV document.
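
The Analysis and Encoder stages in the list above can be sketched outside the platform in plain Python. This is a rough illustration under stated assumptions, not the agents' actual behavior: each decoded Parquet row is represented as a plain dictionary, the column names (`title`, `name`, `organization`, `copyrightYear`, `numberOfPages`) are assumed, and `to_book_record` and `encode_csv` are hypothetical helper names. Deriving the enriched `copyrightYear` date as 1 January of the given year is likewise an assumption.

```python
import csv
import io
from datetime import date
from typing import Dict, Iterable, List

# Field order of the external BookRecord format (the enriched date is not written).
CSV_FIELDS = ["title", "authorName", "organization",
              "copyrightYearString", "numberOfPages"]


def to_book_record(row: Dict[str, object]) -> Dict[str, object]:
    """Analysis step: map one decoded row (assumed column names) to BookRecord fields."""
    year_string = str(row["copyrightYear"])
    return {
        "title": str(row["title"]),
        "authorName": str(row["name"]),
        "organization": str(row["organization"]),
        "copyrightYearString": year_string,
        "numberOfPages": int(row["numberOfPages"]),
        # Enriched field: assumed here to be 1 January of the copyright year.
        "copyrightYear": date(int(year_string), 1, 1),
    }


def encode_csv(records: Iterable[Dict[str, object]]) -> bytes:
    """Encoder step: serialize BookRecord fields as CSV bytes, one line per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        writer.writerow([record[name] for name in CSV_FIELDS])
    return buffer.getvalue().encode("ascii")


def run_pipeline(rows: Iterable[Dict[str, object]]) -> bytes:
    """Chain the two steps, as the workflow routes UDRs from Analysis to Encoder."""
    records: List[Dict[str, object]] = [to_book_record(row) for row in rows]
    return encode_csv(records)
```

The resulting bytes correspond to what the Disk forwarding agent would write to the CSV file; no header line is produced, because the encoder in the format maps only `BookRecord`.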

This section walks through the steps of creating such a batch workflow.

Disk

Disk_Input is a Disk Collection agent that collects data from an input file and forwards it to the Parquet Decoder agent.

Double-click on the Disk_Source agent to display the configuration dialog for the agent:

Example of a Disk agent configuration


...