Thursday, June 13, 2013

Processing Files with Headers and Footers with Talend Open Studio


Talend Open Studio uses wizards to define schemas for text and Excel files.  However, the wizards aren't sufficient for processing a complex file containing a header or a footer.  Use Java to build a state table that will route the different sections of the file to the correct target.

Wizards like the Delimited File Wizard or the Excel File Wizard define a schema based on a single row/column structure.  Components like tFileInputExcel provide additional control over the file processing, allowing header rows to be skipped.  But if the input header (or footer) needs to be processed, add some Java components to help.

Algorithm

For processing the complex file, use a state table which identifies the sections of the file and describes the section boundaries.  Take the following file as an example

Rows 1-3 contain the header.  Rows 5-9 contain the body.  Row 11 contains the footer.  An empty row separates the header from the body.  An empty row separates the body from the footer.  For processing, this can be written as a state table where working with a section in the file defines a new state.

State    Transition    New State
------------------------------------
Header   Empty row     Body
Body     Empty row     Footer 


The state table drives a process where the header is processed until an empty row is found.  Then, the body is processed until an empty row is found.  Finally, the footer is processed.  In the following Talend Open Studio job, "processed" means routed to the correct target (the tLogRows to the right of the image).

Job Routing Header, Body, and Trailer to Different Targets
 Input

 In the example job, a state table is defined in a tJava (see section "State Table").  The input flows from a loosely-defined tFileInputExcel.  The tFileInputExcel contains a schema with a field for each of the possible input fields, all of which are defined as string.  Usually, the number of fields will be number of fields as the body, but if the header contained more columns, then include that many columns in the generic schema.

Schema Used in tFileInputExcel
Identifying Sections

The tFileInputExcel is connected to a tJavaRow.  The tJavaRow maps each of the input fields to output fields.  The tJavaRow also defines an extra "recordType" field.  This holds the value of the current state and is used in the later tMap to route the record to the correct schema and target.  The tJavaRow also applies the state table to the input, determining if a new section is being processed (see section "Applying the State Table").

tJavaRow schema with recordType Field
Routing

Finally, a tMap is used to route the input to its correct destination.  The tMap defines three output schemas for each of the three sections of the file.  These are NOT loosely defined.  They contain the proper number of columns, column headers, and data types.  Each output schema in the tMap is supported by an expression which will examine the recordType value to determine which schema should be applied.

tMap Defining 3 Output Schemas with recordType Expressions

Output

Each output from the tMap is directed to a different component.  This example uses tLogRows that print out the component name with the fields for debugging.  The output could be three individual tables in an RDBMS in which case a unifying "load number" should be acquired from the DB.  This load number would link together the header, body, and footer records for later re-assembly.

Two tFilterRow components are used to filter out the blank lines that demarcate the file sections but which are included in the processing flow.

State Table

The state table in this job is implemented using a Java Map.  The Map is set up in the tJava at the start of the job.

Here is the code behind the tJava_1 component.  From "Basic settings"

Map, FileState> stateTable = 
     new HashMap, FileState>();

Map transition = 
     new HashMap();  
transition.put(FileState.HEADER, null);
stateTable.put( transition, FileState.BODY );

transition = new HashMap();
transition.put(FileState.BODY, null);
stateTable.put( transition, FileState.TRAILER );

FileState currentState = FileState.HEADER;

globalMap.put("stateTable", stateTable);
globalMap.put("currentState", currentState);

"Advanced settings" has the usual imports.  There is also a cool (my word) Java language trick to apply an enumerated value.  I slipped in an enum definition for use throughout the job.

import java.util.Map;
import java.util.HashMap;

enum FileState { HEADER, BODY, TRAILER };


Applying the State Table

The state table and its transitions are applied in the tJavaRow component between the tMap and the tFileInputExcel.  The code relies on a nested Map which requires a special two-part key.  This is the code from "Basic settings" in the tJavaRow.

FileState currentState = (FileState)globalMap.get("currentState");

output_row.field1 = input_row.field1;
output_row.field2 = input_row.field2;
output_row.field3 = input_row.field3;
output_row.recordType = String.valueOf(currentState);

Map, FileState> stateTable =
  (Map, FileState>)globalMap.get("stateTable");

Map key = new HashMap();
key.put( currentState, input_row.field1 );
 
FileState nextState = stateTable.get( key );

if( nextState != null ) {
     currentState = nextState;
     globalMap.put("currentState", currentState);
} 

While there is a lot of Java code in this job to process the complex file, the job design is simple and scalable.  New file sections can be added in a straight-forward manner and the demarcations (currently empty rows) can be more sophisticated. 

No comments: