Tuesday, June 11, 2013

When Talend Open Studio's Guess Schema Isn't Enough

When Talend Open Studio's Guess Schema Isn't Enough

If you have a text file or an Excel spreadsheet of unknown quality, sometimes the Talend Open Studio for Data Integration feature "Guess Schema" may be too strict.

It may not be possible to define a strict schema for a text file or Excel spreadsheet.  For example, the source file on which you're working may not give the whole domain or range of possible values.

Take the following data file
A Data File with a "Obvious" Schema
Loading this into Talend Open Studio's metadata would produce a schema like this.  "field_c" and "field_d" are Integers.

However, later on in the file, there is some variation.  There is an "a" (not an Integer) in field_c.

A Letter Shows Up in a Number Column
 If this happens once or twice in a schema with a few fields, this can be corrected in the UI.  But if there are many fields -- 100 or 1,000 -- consider creating your own schema XML file for import, then tweak the schema with (hopefully) fewer edits.  Talend Open Studio can generate this file using the following job.


The XML File

Talend Open Studio's metadata can be created field-by-field in the UI, through a wizard, or by importing an XML file.  To import an XML file, create metadata and use the import button.

Import Button on a Metadata Wizard
 The XML file format is a set of elements under a toplevel .
XML Format
This schema was generated with a Talend job, defining each field as a String. Although the Talend job uses a hardcoded "id_String" with each field -- ignoring Talend's interpretation -- the job does consider the field name.


Talend Job to Generate

This job reads the first line of a CSV file to provide a list of file names. The tFileInputDelimited does not use a delimiter (note the "" in the component view) so that it provides a single string to the next component to turn the single line into a list of records.


Job to Produce with tFileInputDelimited
The schema to tFileInputDelimited is a single field "line".  This feeds into a tNormalize component which will turn each field name into a record.


Turn Each Field in CSV List into a Record
The tMap provides mostly hard-coded values that can be adjusted for your preferences.  I'm defining each field as a String.  The line is mapped to label which is the field name.
Finally, configure the XML component.  I'm using the tFileOutputXML because it's the simplest to configure and all that's needed for this row-oriented () output.


XML - Basic Settings
I have one special hitch on the Advanced Settings tab.  Apparently, you can't call a field "default", so I'm calling a field "DEFAULT" (all caps).  I then use the Advanced Settings tab to make sure that the required lower-case "default" is outputted.


XML - Advanced Settings
For schemas with a few fields, you can create a schema by hand through the UI.  I almost always use the schema wizards especially the database which will provide both Talend types (String) and database types (VARCHAR).  This Talend job can generate schemas using your own criteria.  Load the schemas into the metadata panel.

No comments: