Tuesday, June 18, 2013

Talend Open Studio Calling: Clean Up Your Phone Numbers


libphonenumber, Google's library for working with phone numbers, is available at Google Code.  It supports the phone number formats of 228 countries and regions, and bringing it into Talend Open Studio helps you standardize your phone numbers.



libphonenumber

libphonenumber is hosted at Google Code, which has a downloads page. I built version 3.0 from source, but you can download a JAR file if you don't have a JDK and Maven handy.  Note the download location; you will need it when configuring the tLibraryLoad component.

libphonenumber can take input like the following three variations of a Maryland phone number:

(301) 555-5555|3015555555|301.555.5555

It produces three standardized entries:

(301) 555-5555|(301) 555-5555|(301) 555-5555

Talend Job

To demonstrate libphonenumber, the following job uses a tRowGenerator to send a record to a tJavaRow component.  tLogRow components print the output, and a tLibraryLoad component loads libphonenumber-3.0.jar and declares the required imports.

libphonenumber Test Job

tLibraryLoad

Configure the Basic settings tab by browsing to the JAR file you downloaded from Google Code (or built from source).

In the Advanced settings tab, add the following import statements.  The class structure (nested classes and enums) can be a little tricky if you're not used to Java, so double-check this if there's a problem.

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;


tRowGenerator

The tRowGenerator component generates a single three-column record based on this schema.

tRowGenerator Schema
The three-column record uses the following phone number values.

tRowGenerator Generated Values
   
tJavaRow

The tJavaRow component uses the classes of libphonenumber.  It builds several PhoneNumber Java objects using the PhoneNumberUtil.parse() method.  Note the region code ("US") that is passed along with each phone number; it tells the parser which country's rules to apply when the number has no international prefix.  The parse() calls are followed by format() calls that work with the PhoneNumber objects and return Strings to the output_row columns.

Enter this code in the tJavaRow Basic settings tab. "INTERNATIONAL" or "E164" can be substituted for "NATIONAL" to render a different format.

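try {
    // PhoneNumberUtil is a singleton; getInstance() returns the shared instance.
    PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();

    // Parse each raw value, using "US" as the default region.
    PhoneNumber col1_pn = phoneUtil.parse(input_row.usPhoneCol1, "US");
    PhoneNumber col2_pn = phoneUtil.parse(input_row.usPhoneCol2, "US");
    PhoneNumber col3_pn = phoneUtil.parse(input_row.usPhoneCol3, "US");

    // Render each parsed number in national format.
    output_row.usPhoneCol1 = phoneUtil.format(col1_pn, PhoneNumberFormat.NATIONAL);
    output_row.usPhoneCol2 = phoneUtil.format(col2_pn, PhoneNumberFormat.NATIONAL);
    output_row.usPhoneCol3 = phoneUtil.format(col3_pn, PhoneNumberFormat.NATIONAL);
}
catch (com.google.i18n.phonenumbers.NumberParseException exc) {
    // parse() signals bad input with NumberParseException, not NumberFormatException.
    exc.printStackTrace();
}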


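PhoneNumber objects can do a lot more than just be formatted.  They expose parts of the phone number such as the country code (getCountryCode()), the national number (getNationalNumber()), and any extension (getExtension()).  The area code is not stored separately, but PhoneNumberUtil.getLengthOfGeographicalAreaCode() reports how many leading digits of the national number make up the area code.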

228 Countries and Regions

You can work with Java regular expressions in Talend Open Studio, but covering this much functionality would require quite a lot of them.  This example could use better handling of parsing errors, such as rejecting the record or supplying a default value.  As a starting point, though, this is a great way to get phone numbers in shape.
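For comparison, here is a rough sketch of the regular-expression route for US numbers only. It does not use libphonenumber; the class name, the pattern, and the normalizeUs() helper are all made up for illustration. It also shows one way to supply a default value when a number cannot be recognized. Covering even one more country this way means another pattern, which is the point of using the library instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of libphonenumber or Talend.
final class UsPhoneRegexSketch {

    // Assumed pattern: optional "+1" or "1" prefix, then 3-3-4 digits with
    // optional parentheses and optional space, dot, or hyphen separators.
    private static final Pattern US_PHONE = Pattern.compile(
        "^\\s*(?:\\+?1[\\s.-]?)?\\(?(\\d{3})\\)?[\\s.-]?(\\d{3})[\\s.-]?(\\d{4})\\s*$");

    // Returns the number as "(AAA) EEE-NNNN", or defaultValue when the
    // input is null or does not match the pattern.
    static String normalizeUs(String raw, String defaultValue) {
        if (raw == null) {
            return defaultValue;
        }
        Matcher m = US_PHONE.matcher(raw);
        if (!m.matches()) {
            return defaultValue;
        }
        return "(" + m.group(1) + ") " + m.group(2) + "-" + m.group(3);
    }

    public static void main(String[] args) {
        String[] samples = {"(301) 555-5555", "3015555555", "301.555.5555", "not a number"};
        for (String s : samples) {
            System.out.println(s + " -> " + normalizeUs(s, "UNKNOWN"));
        }
    }
}
```

In a tJavaRow, a helper like this could be called per column, with the default value standing in for records you would otherwise reject.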
