Tuesday, June 18, 2013

A School Revenue Data Mart Project


To try out Talend Open Studio in a Data Mart scenario, I put together a small project based on converting a spreadsheet of revenues for the Frederick Country (Maryland, US) school system into a dimensionally-modeled data mart.

I first formed a spreadsheet based on PDF reports posted on the school system's web site.  Each year, the PDFs reported actual revenues and their sources.  I left the expenses off for now.  I copied the values from each year into a column of the spreadsheet.  I didn't attempt to normalize any of the revenue sources since they are all from the same school system.  Here is a copy of the spreadsheet: FCPS Revenues.

The target data model is dimensional.  This means that each focal point of analysis -- year, school system, and revenue source -- will have a row in a dimension (a table).  A fact table contains the actual revenue value (a U.S. dollar amount).  The fact table will refer to the dimensions. The data model is presented below.

Dimensional Data Model for FCPS Revenues
I'll next create a Talend Open Studio Job for loading the spreadsheet into the data mart.  My focus will be on the rules and data transformations rather than the file processing (for now).

Here is a data dictionary describing the tables.
  • SR_DIM_SYSTEM.  A governmental school system.  A REVENUE must be reported by just one other SYSTEM.  A SYSTEM may report one or more REVENUES.
  • SR_DIM_YEAR.  A fiscal year.  A YEAR may record one or more REVENUES.  A REVENUE must be recorded by just one other YEAR.
  • SR_DIM_REVENUE_SOURCE.  A line item in the revenue summary.  A REVENUE_SOURCE may classify one or more REVENUES.  A REVENUE must be classified by just one other REVENUE_SOURCE.
  • SR_FACT_REVENUE. A value for the REVENUE.
The dimensions are Kimball Type 1 ("upsert") dimensions.

While useful for personal productivity, previous tutorials showed simple cases of loading an Excel file into a MySQL table or converting a text file to JSON.  Integration tools become far more beneficial with more substantive data work.  This spreadsheet and E-R diagram are the basis for a more significant example.

No comments: