Overview

This example illustrates typical use of the Parquet Decoder agent in a batch workflow. In this example, complete records are processed using the embedded document schema. The following configurations will be created:

  • An Ultra Format
  • A Batch Workflow that makes use of a Parquet Decoder agent that parses Parquet documents.

Define an Ultra Format

A simple Ultra Format needs to be created for the incoming UDRs. For more information about the Ultra Format Editor and the UFDL syntax, refer to the Ultra Format Management User's Guide.

Example - Ultra

Create an Ultra Format as defined below:

external BOOK_HEADER : identified_by(strREContains(HEADER, "title,name,organization,copyrightYear")), terminated_by(0xA)
{
  ascii HEADER : terminated_by(0xA);
};

external BookRecord
{
  ascii title                 : terminated_by(",");
  ascii authorName            : terminated_by(",");
  ascii organization          : terminated_by(",");
  ascii copyrightYearString   : terminated_by(",");
  ascii numberOfPages         : terminated_by(0xA);
};

internal BookRecord
{
  string title;
  string authorName;
  string organization;
  string copyrightYearString;
  int numberOfPages;

  //  enriched
  date copyrightYear;
};

//  decoder
in_map BOOK_HEADER_InMap : external(BOOK_HEADER), target_internal(BOOK_HEADER), discard_output { automatic; };
in_map BookRecord_InMap : external(BookRecord), internal(BookRecord) { automatic; };
decoder BOOK_HEADER_Decode : in_map(BOOK_HEADER_InMap);
decoder BookRecord_Decode : in_map(BookRecord_InMap);
decoder DECODER { decoder BOOK_HEADER_Decode; decoder BookRecord_Decode *; };

//  encoder
out_map BookRecord_OutMap : external(BookRecord), internal(BookRecord) { automatic; };
encoder ENCODER : out_map(BookRecord_OutMap);
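
To make the external layout above more concrete, the following Python sketch mimics what the format describes: an optional header line recognized by the same pattern used in identified_by, followed by records made of four comma-terminated fields and one final field terminated by a line feed. It is only an illustration under stated assumptions, not how the platform implements Ultra decoding; the function name parse_book_lines and the dictionary representation are hypothetical.

```python
import re

# Same pattern the BOOK_HEADER format uses to identify the header line.
HEADER_PATTERN = re.compile("title,name,organization,copyrightYear")

# Field order taken from the external BookRecord definition.
FIELDS = ["title", "authorName", "organization",
          "copyrightYearString", "numberOfPages"]


def parse_book_lines(text):
    """Parse line-feed-terminated CSV text into a list of dicts (hypothetical helper)."""
    lines = [line for line in text.split("\n") if line]

    # One header first, then any number of records, as in the DECODER definition.
    # The header is dropped, similar in spirit to discard_output.
    if lines and HEADER_PATTERN.search(lines[0]):
        lines = lines[1:]

    records = []
    for line in lines:
        # Plain split: like the Ultra layout, this does not handle quoted commas.
        values = line.split(",")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        record = dict(zip(FIELDS, values))
        # The internal format declares numberOfPages as an int.
        record["numberOfPages"] = int(record["numberOfPages"])
        records.append(record)
    return records
```

Because the layout relies only on terminators, a field value that itself contains a comma would be split incorrectly in this sketch.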


Create a Batch Workflow

In this workflow, Parquet files are collected from disk and decoded into UDRs, which are then encoded and written to a CSV file. The workflow is illustrated here:

Example workflow with Parquet Decoder

Walking through the example workflow from left to right, we have:

  • A Disk agent named Disk_Source that reads in the source file (which contains a Parquet document) as a byte array.
  • A Parquet Decoder agent that parses the bytes from the file as Parquet, passing ParquetDecoderUDRs to the Analysis agent.
  • An Analysis agent named Analysis that transforms these incoming ParquetDecoderUDRs into BookRecord UDRs.
  • An Encoder agent named CSV_Encoder that encodes the BookRecord UDRs as CSV bytes.
  • A Disk forwarding agent named Disk_Destination that receives the byte array data and writes out a CSV document.
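
The data flow through the last three stages above can be sketched in plain Python. This is a conceptual illustration only: it does not use the Analysis agent's own language or any product API, a plain dictionary stands in for the decoded content of a ParquetDecoderUDR, and the assumption that its keys match the BookRecord field names is hypothetical. The names to_book_record and encode_csv are likewise invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class BookRecord:
    # Fields mirror the internal BookRecord format (enriched date field omitted).
    title: str
    authorName: str
    organization: str
    copyrightYearString: str
    numberOfPages: int


def to_book_record(payload: dict) -> BookRecord:
    """Stand-in for the Analysis step: map one decoded row to a BookRecord.

    Assumption: payload keys match the BookRecord field names.
    """
    return BookRecord(
        title=str(payload["title"]),
        authorName=str(payload["authorName"]),
        organization=str(payload["organization"]),
        copyrightYearString=str(payload["copyrightYearString"]),
        numberOfPages=int(payload["numberOfPages"]),
    )


def encode_csv(records) -> bytes:
    """Stand-in for the CSV_Encoder step: one comma-separated line per record."""
    lines = []
    for r in records:
        fields = [r.title, r.authorName, r.organization,
                  r.copyrightYearString, str(r.numberOfPages)]
        lines.append(",".join(fields) + "\n")
    # ASCII matches the external format; non-ASCII text raises UnicodeEncodeError.
    return "".join(lines).encode("ascii")
```

In this sketch, the bytes returned by encode_csv correspond to what the Disk_Destination agent would write to the output file.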

This section walks through the steps of creating such a batch workflow.

Disk

Disk_Source is a Disk Collection agent that collects data from an input file and forwards it to the Parquet Decoder agent.

Double-click on the Disk_Source agent to display the configuration dialog for the agent:

Example of a Disk agent configuration


...

When you run the workflow, it processes the Parquet files in the input directory and writes a corresponding CSV file for each one to the configured output directory.



