There is a Spoon: Integrating ETL into your application

Posted by Tejus Parikh on August 30, 2007

A not-too-uncommon occurrence when building a data-driven application is the realization that you need an honest-to-goodness data warehouse. Having one-off statistical tables scattered throughout your schema just doesn't cut it after a while. Thankfully, there's plenty of information on the web about how to build a data warehouse. Once you create your denormalized, summarized, and dimensioned schema, you face the challenge of getting the data out of your transaction-processing database and into the warehouse. That's the domain of ETL (Extract, Transform, and Load) tools. This post is about how to get Pentaho Data Integration (formerly known as Kettle) embedded within your application. Since I like 'Kettle' much better than 'PDI', I'm going to use Kettle throughout the post.

Step 0: Create your transformation

The transformation will be very specific to the tables you are trying to transform. However, a few general guidelines will make the process easier. Unless your production environment is your workstation, it's good practice to use Kettle variables for all the database fields. To make the Spoon GUI work, you will need to add a file to your user home's .kettle/ directory. I'll show you how to set the properties within your application a little further down.

Step 1: Getting your project environment configured

The first step is to identify which jars you'll need to add to your project. Kettle depends on a lot of the usual Java libraries, so the new stuff you need to add is probably minimal. From the lib directory of the Kettle distribution, you'll need kettle.jar. Given that I work on a server application, I was somewhat chagrined that I had to import common.jar from the libswt directory. You'll probably need to grab a few other things from libext before the Kettle code will compile. What you need specifically will depend on what you already have; you won't need everything, since you are only going to embed a small portion of the total Kettle toolkit.
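As an aside on Step 0: the file Spoon reads from the .kettle/ directory is kettle.properties. A sketch of what it might contain for this post's setup; the variable names here are my assumption, mirroring the jdbc.* names used in the code later in the post:

```properties
# ~/.kettle/kettle.properties -- variable names are illustrative
jdbc.servername=localhost
jdbc.dbname=oltp
jdbc.olap.dbname=warehouse
jdbc.username=etl_user
jdbc.password=secret
```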
Step 2: Create the Skeleton

There are three steps to executing a Kettle transformation: initializing the environment, loading the transformation, and executing it. Therefore, I created a method that calls stubs to handle each of these steps and repackages any exceptions that are thrown.

    private void executeKettleTransform(String transformLocation) throws WarehouseDataEtlException {
        try {
            initializeEnvironment();
            TransMeta meta = loadTransform(transformLocation);
            executeTransform(meta);
        } catch (Throwable t) {
            throw new WarehouseDataEtlException("Unable to run transform at " + transformLocation, t);
        }
    }
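The post never defines WarehouseDataEtlException; presumably it is a simple checked exception that carries a message and the repackaged cause, along these lines (a sketch, not the original class):

```java
// A minimal guess at the WarehouseDataEtlException used above: a checked
// exception that wraps the underlying cause of an ETL failure.
public class WarehouseDataEtlException extends Exception {
    public WarehouseDataEtlException(String message, Throwable cause) {
        super(message, cause);
    }
}
```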



Step 3: Implement the initialization functions

The initialization method needs to do a few things: initialize the transformation's environment, load the step definitions, and set the variables for the JDBC properties.

    private void initializeEnvironment() {
        StepLoader steploader = StepLoader.getInstance();
        if (!steploader.read()) {
            throw new IllegalStateException("Spoon broke for some reason");
        }

        KettleVariables kettleVariables = KettleVariables.getInstance();
        kettleVariables.setVariable("jdbc.servername", jdbcServerName);
        kettleVariables.setVariable("jdbc.olap.servername", jdbcOlapServerName);
        kettleVariables.setVariable("jdbc.dbname", jdbcDBName);
        kettleVariables.setVariable("jdbc.olap.dbname", jdbcOlapDBName);
        kettleVariables.setVariable("jdbc.username", jdbcUsername);
        kettleVariables.setVariable("jdbc.password", jdbcPassword);
    }
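The jdbcServerName, jdbcDBName, and related fields have to come from somewhere, and the post does not show that part. One stdlib-only possibility (entirely an assumption on my part, including the class and property names) is reading them from a java.util.Properties source:

```java
import java.io.StringReader;
import java.util.Properties;

public class JdbcSettings {
    // Hypothetical loader for the jdbc* fields used in initializeEnvironment();
    // the original post never shows how those values are populated.
    static Properties parse(String text) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    public static void main(String[] args) throws Exception {
        Properties props = parse("jdbc.servername=db01\njdbc.dbname=oltp");
        System.out.println(props.getProperty("jdbc.servername")); // prints "db01"
    }
}
```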


One of the more annoying things about working with the Kettle API is its use of false return values for failures instead of exceptions. Hence the boolean check around the StepLoader call above. Also note that this is where we set the variables for our application runtime. This approach lets the developer run the transformation in the application as well as from the tools provided with the Kettle project, which can be extremely helpful while debugging.

Step 4: Load the transformation

A Kettle transformation is just an XML file, which provides some flexibility in how you load it within your server app. I chose to put the transformation on the classpath and then use the class loader to get access to it. Since the 2.5.1 release of Kettle does not support loading a transformation directly from a stream, you have to parse the XML and then feed it to the transformation metadata object.

    private TransMeta loadTransform(String transformName) throws Exception {

        String transformLocation = getTransformPath(transformName);

        TransMeta transMeta = null;

        InputStream transformStream = this.getClass().getResourceAsStream(transformLocation);


        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

        DocumentBuilder builder = factory.newDocumentBuilder();

        Document doc = builder.parse(transformStream);

        transMeta = new TransMeta(doc.getFirstChild());

        return transMeta;
    }


A crucial realization is that you need to pass in the first child of your XML document, not the root of the document itself. Otherwise, Kettle will throw exceptions about being unable to find anything to load. The call to getTransformPath is a call to an internal method that maps a string to a classpath location.

Step 5: Execute the Spoon

Executing the transformation is just a matter of getting an instance of the transformation from the transformation metadata object and executing it. If you are logging transformation executions to a table, you need to wait for the transformation to finish and then manually call end processing. Skipping these steps means you will never see a row with the status 'end' in the log table, which is especially bad if you are using transformation times in the WHERE clauses of your data input steps.

    private void executeTransform(TransMeta transMeta) throws Exception {
        Trans trans = new Trans(LogWriter.getInstance(), transMeta);
        if (!trans.execute(transMeta.getArguments())) {
            throw new RuntimeException("Transformation failed");
        }

        // Wait for the run to complete, then write the 'end' status row to the log table
        trans.waitUntilFinished();
        trans.endProcessing("end");
    }
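As a quick aside, the first-child subtlety from Step 4 is easy to verify without any Kettle classes. This stdlib-only sketch (the class name, the tiny XML string, and the getTransformPath mapping are all mine, not from the original code) shows which node Document.getFirstChild() returns:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class FirstChildDemo {

    // Hypothetical stand-in for the post's getTransformPath: the original only
    // says it "maps a string to a classpath location", so this mapping is a guess.
    static String getTransformPath(String transformName) {
        return "/" + transformName + ".ktr";
    }

    // Parse an XML string and report the node name of the Document's first child.
    static String firstChildName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        // The Document node itself is NOT the <transformation> element;
        // its first child is, and that is the node TransMeta's constructor expects.
        Node firstChild = doc.getFirstChild();
        return firstChild.getNodeName();
    }

    public static void main(String[] args) throws Exception {
        String ktr = "<transformation><info><name>demo</name></info></transformation>";
        System.out.println(getTransformPath("demo")); // prints "/demo.ktr"
        System.out.println(firstChildName(ktr));      // prints "transformation"
    }
}
```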






That's all there is to it, and you can easily test it by wrapping a JUnit test around the execute method call. In the next few weeks, I'll post something about doing near-real-time warehousing using Quartz, Spring, and the code that I've shown here.


Tejus Parikh

Tejus is a software developer, now working at large companies. Find out when I write new posts on Twitter, via RSS, or subscribe to the newsletter: