For loading a set of files into a staging table with Talend Open Studio, use two subjobs: one subjob for clearing the tables for the overall job and one subjob for iterating over the files and loading each one.
In Talend Open Studio, you can apply processing directly to a spreadsheet in preparation for loading into a data mart. However, it's often better to do this type of work in the database itself. This is because of the flexibility and performance of running queries directly in the database process. For example, rather than gathering surrogate keys outside of the database using tMap and lookups from a file processing loop, load the data into the database and use a query containing high-performance joins for additional transformations.
File Processing
Start by defining the main processing loop based off of tFileList. This will feed into a tMap component. The tMap component can provide some basic NULL-checking and data type validation, spooling any rejects off to reject table.
This is a Talend Open Studio job containing 3 subjobs that load a directory of spreadsheets into a MySQL table, leaving the results in place for additional processing.
While the tMysqlOutput components have an option to Clear or Truncate tables, we want to leave the records from each iteration in order to gain an economies of scale. That is, load every data file and then process the entire set in one large batch rather than flushing the contents out with every iteration. I use a single iteration of a tForeach (added a single element "once" to the list) to drive the two delete statements.
Rejecting Records
The tMap component contains logic for rejecting records. An alternate implementation would put this logic into a component like tFilterRow so that the specific failure condition could be tracked.
The expression a the top of the tMap contains several Commons Lang calls to check the presence and data type of the fields. If the criteria is not met, the record is not written out. A second schema -- directed to the reject table -- contains the opposite expression.
I used the Java not equal operator '!' to keep the two expressions, the main criteria and the reject criteria, in sync. I took the REV expression and cut-and-pasted it into the REV_REJECT expression and wrapped the code with a !() block. This is to make sure that there aren't any logical gaps.
tRules, available on Talend Exchange will also handle rejects using a Javascript syntax defining the filtering conditions.
Stage as String
The schemas for components like tFileInputExcel can use any simple data type available in Java: String, Integer, Date, Double. I find it useful to stage everything using very loose schemas where each field is a String. This has the following advantages
Staging is used to load a database table where it is more convenient to analyze and process a record than a set of flat files. Once in the database, performant queries can be run without suffering the network latency of having to move data in and out of the database process. Talend Open Studio components provide the file processing and the data loading steps in this arrangement of subjobs.
In Talend Open Studio, you can apply processing directly to a spreadsheet in preparation for loading into a data mart. However, it's often better to do this type of work in the database itself. This is because of the flexibility and performance of running queries directly in the database process. For example, rather than gathering surrogate keys outside of the database using tMap and lookups from a file processing loop, load the data into the database and use a query containing high-performance joins for additional transformations.
File Processing
Start by defining the main processing loop based off of tFileList. This will feed into a tMap component. The tMap component can provide some basic NULL-checking and data type validation, spooling any rejects off to reject table.
This is a Talend Open Studio job containing 3 subjobs that load a directory of spreadsheets into a MySQL table, leaving the results in place for additional processing.
Talend Open Studio - A Typical ETL Staging Job |
While the tMysqlOutput components have an option to Clear or Truncate tables, we want to leave the records from each iteration in order to gain an economies of scale. That is, load every data file and then process the entire set in one large batch rather than flushing the contents out with every iteration. I use a single iteration of a tForeach (added a single element "once" to the list) to drive the two delete statements.
Rejecting Records
The tMap component contains logic for rejecting records. An alternate implementation would put this logic into a component like tFilterRow so that the specific failure condition could be tracked.
Talend Open Studio - Map Only Valid Fields |
Talend Open Studio - Mapping Reject |
tRules, available on Talend Exchange will also handle rejects using a Javascript syntax defining the filtering conditions.
Stage as String
The schemas for components like tFileInputExcel can use any simple data type available in Java: String, Integer, Date, Double. I find it useful to stage everything using very loose schemas where each field is a String. This has the following advantages
- Bring data into the system where it can be handled, and
- Allow for runtime processing related to Locales or conditional formatting.
Staging is used to load a database table where it is more convenient to analyze and process a record than a set of flat files. Once in the database, performant queries can be run without suffering the network latency of having to move data in and out of the database process. Talend Open Studio components provide the file processing and the data loading steps in this arrangement of subjobs.
No comments:
Post a Comment