Talend Open Studio uses wizards to define schemas for text and Excel files. However, the wizards aren't sufficient for processing a complex file containing a header or a footer. Use Java to build a state table that will route the different sections of the file to the correct target.
Wizards like the Delimited File Wizard or the Excel File Wizard define a schema based on a single row/column structure. Components like tFileInputExcel provide additional control over the file processing, allowing header rows to be skipped. But if the input header (or footer) needs to be processed, add some Java components to help.
Algorithm
For processing the complex file, use a state table which identifies the sections of the file and describes the section boundaries. Take the following file as an example
Rows 1-3 contain the header. Rows 5-9 contain the body. Row 11 contains the footer. An empty row separates the header from the body. An empty row separates the body from the footer. For processing, this can be written as a state table where working with a section in the file defines a new state.
State Transition New State
------------------------------------
Header Empty row Body
Body Empty row Footer
The state table drives a process where the header is processed until an empty row is found. Then, the body is processed until an empty row is found. Finally, the footer is processed. In the following Talend Open Studio job, "processed" means routed to the correct target (the tLogRows to the right of the image).
Input
In the example job, a state table is defined in a tJava (see section "State Table"). The input flows from a loosely-defined tFileInputExcel. The tFileInputExcel contains a schema with a field for each of the possible input fields, all of which are defined as string. Usually, the number of fields will be number of fields as the body, but if the header contained more columns, then include that many columns in the generic schema.
Identifying Sections
The tFileInputExcel is connected to a tJavaRow. The tJavaRow maps each of the input fields to output fields. The tJavaRow also defines an extra "recordType" field. This holds the value of the current state and is used in the later tMap to route the record to the correct schema and target. The tJavaRow also applies the state table to the input, determining if a new section is being processed (see section "Applying the State Table").
Routing
Finally, a tMap is used to route the input to its correct destination. The tMap defines three output schemas for each of the three sections of the file. These are NOT loosely defined. They contain the proper number of columns, column headers, and data types. Each output schema in the tMap is supported by an expression which will examine the recordType value to determine which schema should be applied.
Output
Each output from the tMap is directed to a different component. This example uses tLogRows that print out the component name with the fields for debugging. The output could be three individual tables in an RDBMS in which case a unifying "load number" should be acquired from the DB. This load number would link together the header, body, and footer records for later re-assembly.
Two tFilterRow components are used to filter out the blank lines that demarcate the file sections but which are included in the processing flow.
State Table
The state table in this job is implemented using a Java Map. The Map is set up in the tJava at the start of the job.
Here is the code behind the tJava_1 component. From "Basic settings"
Map
new HashMap
Map transition =
new HashMap();
transition.put(FileState.HEADER, null);
stateTable.put( transition, FileState.BODY );
transition = new HashMap();
transition.put(FileState.BODY, null);
stateTable.put( transition, FileState.TRAILER );
FileState currentState = FileState.HEADER;
globalMap.put("stateTable", stateTable);
globalMap.put("currentState", currentState);
"Advanced settings" has the usual imports. There is also a cool (my word) Java language trick to apply an enumerated value. I slipped in an enum definition for use throughout the job.
import java.util.Map;
import java.util.HashMap;
enum FileState { HEADER, BODY, TRAILER };
Applying the State Table
The state table and its transitions are applied in the tJavaRow component between the tMap and the tFileInputExcel. The code relies on a nested Map which requires a special two-part key. This is the code from "Basic settings" in the tJavaRow.
FileState currentState = (FileState)globalMap.get("currentState");
output_row.field1 = input_row.field1;
output_row.field2 = input_row.field2;
output_row.field3 = input_row.field3;
output_row.recordType = String.valueOf(currentState);
Map
(Map
Map key = new HashMap();
key.put( currentState, input_row.field1 );
FileState nextState = stateTable.get( key );
if( nextState != null ) {
currentState = nextState;
globalMap.put("currentState", currentState);
}
While there is a lot of Java code in this job to process the complex file, the job design is simple and scalable. New file sections can be added in a straight-forward manner and the demarcations (currently empty rows) can be more sophisticated.
Wizards like the Delimited File Wizard or the Excel File Wizard define a schema based on a single row/column structure. Components like tFileInputExcel provide additional control over the file processing, allowing header rows to be skipped. But if the input header (or footer) needs to be processed, add some Java components to help.
Algorithm
For processing the complex file, use a state table which identifies the sections of the file and describes the section boundaries. Take the following file as an example
Rows 1-3 contain the header. Rows 5-9 contain the body. Row 11 contains the footer. An empty row separates the header from the body. An empty row separates the body from the footer. For processing, this can be written as a state table where working with a section in the file defines a new state.
State Transition New State
------------------------------------
Header Empty row Body
Body Empty row Footer
The state table drives a process where the header is processed until an empty row is found. Then, the body is processed until an empty row is found. Finally, the footer is processed. In the following Talend Open Studio job, "processed" means routed to the correct target (the tLogRows to the right of the image).
Job Routing Header, Body, and Trailer to Different Targets |
In the example job, a state table is defined in a tJava (see section "State Table"). The input flows from a loosely-defined tFileInputExcel. The tFileInputExcel contains a schema with a field for each of the possible input fields, all of which are defined as string. Usually, the number of fields will be number of fields as the body, but if the header contained more columns, then include that many columns in the generic schema.
Schema Used in tFileInputExcel |
The tFileInputExcel is connected to a tJavaRow. The tJavaRow maps each of the input fields to output fields. The tJavaRow also defines an extra "recordType" field. This holds the value of the current state and is used in the later tMap to route the record to the correct schema and target. The tJavaRow also applies the state table to the input, determining if a new section is being processed (see section "Applying the State Table").
tJavaRow schema with recordType Field |
Finally, a tMap is used to route the input to its correct destination. The tMap defines three output schemas for each of the three sections of the file. These are NOT loosely defined. They contain the proper number of columns, column headers, and data types. Each output schema in the tMap is supported by an expression which will examine the recordType value to determine which schema should be applied.
tMap Defining 3 Output Schemas with recordType Expressions |
Output
Each output from the tMap is directed to a different component. This example uses tLogRows that print out the component name with the fields for debugging. The output could be three individual tables in an RDBMS in which case a unifying "load number" should be acquired from the DB. This load number would link together the header, body, and footer records for later re-assembly.
Two tFilterRow components are used to filter out the blank lines that demarcate the file sections but which are included in the processing flow.
State Table
The state table in this job is implemented using a Java Map. The Map is set up in the tJava at the start of the job.
Here is the code behind the tJava_1 component. From "Basic settings"
Map
new HashMap
Map
new HashMap
transition.put(FileState.HEADER, null);
stateTable.put( transition, FileState.BODY );
transition = new HashMap
transition.put(FileState.BODY, null);
stateTable.put( transition, FileState.TRAILER );
FileState currentState = FileState.HEADER;
globalMap.put("stateTable", stateTable);
globalMap.put("currentState", currentState);
"Advanced settings" has the usual imports. There is also a cool (my word) Java language trick to apply an enumerated value. I slipped in an enum definition for use throughout the job.
import java.util.Map;
import java.util.HashMap;
enum FileState { HEADER, BODY, TRAILER };
Applying the State Table
The state table and its transitions are applied in the tJavaRow component between the tMap and the tFileInputExcel. The code relies on a nested Map which requires a special two-part key. This is the code from "Basic settings" in the tJavaRow.
FileState currentState = (FileState)globalMap.get("currentState");
output_row.field1 = input_row.field1;
output_row.field2 = input_row.field2;
output_row.field3 = input_row.field3;
output_row.recordType = String.valueOf(currentState);
Map
(Map
Map
key.put( currentState, input_row.field1 );
FileState nextState = stateTable.get( key );
if( nextState != null ) {
currentState = nextState;
globalMap.put("currentState", currentState);
}
While there is a lot of Java code in this job to process the complex file, the job design is simple and scalable. New file sections can be added in a straight-forward manner and the demarcations (currently empty rows) can be more sophisticated.
No comments:
Post a Comment