A few posts ago, I talked about how to use Kettle to transform data from your normalized structures into something useful for data warehousing. In order to have a near real-time data warehouse, we need to run our jobs frequently. This carries an unintended risk: a small hiccup can snowball into an ever-growing slowdown as overlapping runs pile up. Ensuring that two jobs never run concurrently is therefore a major concern. You could write a shell script that launches pan, records the pid, and checks for that pid if cron starts the job again before the previous instance has had the opportunity to complete. Or you could just use Quartz.
If you’re familiar with Quartz, setting up a job for a Kettle transform is trivial. The key is to implement StatefulJob
. Persisting the state of the job across invocations has the side-effect of forcing each execution to complete before the next one can start, which means the downward spiral of resource consumption is not possible.
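The no-overlap guarantee can be illustrated without Quartz itself: a single-threaded ScheduledExecutorService with scheduleWithFixedDelay gives the same property, since the next run is only scheduled after the current one finishes. A minimal, self-contained sketch of that behavior, with the Kettle invocation stubbed out as a hypothetical runTransform():

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class NoOverlapScheduler {
    static final AtomicInteger concurrent = new AtomicInteger();
    static final AtomicInteger maxConcurrent = new AtomicInteger();

    // Stand-in for launching the Kettle transform (e.g. via pan).
    static void runTransform() throws InterruptedException {
        int now = concurrent.incrementAndGet();
        maxConcurrent.accumulateAndGet(now, Math::max);
        Thread.sleep(50); // pretend the transform takes longer than the schedule interval
        concurrent.decrementAndGet();
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(5);
        // With scheduleWithFixedDelay, the next run starts only after the
        // previous one completes -- the same guarantee StatefulJob provides.
        ses.scheduleWithFixedDelay(() -> {
            try {
                runTransform();
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            done.countDown();
        }, 0, 10, TimeUnit.MILLISECONDS);
        done.await();
        ses.shutdownNow();
        System.out.println("max concurrent runs: " + maxConcurrent.get());
    }
}
```

Even though each run outlasts the 10 ms delay, the runs serialize instead of stacking up. (In Quartz 2.x the same effect comes from annotating the job class with @DisallowConcurrentExecution.)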
Another safeguard is logging when the time to execute a transformation exceeds the time allocated to it. This can be calculated from properties of the JobExecutionContext (context).
```java
long start = System.currentTimeMillis();
runJob();
long runtime = System.currentTimeMillis() - start;

long startTime = context.getFireTime().getTime();
long nextFireTime = context.getNextFireTime().getTime();
if (runtime > (nextFireTime - startTime)) {
    // log error message
}
```
The call
context.getFireTime()
returns the time the job started (when the trigger was fired), and getNextFireTime() returns the next time the job is scheduled to start. If the interval between these two points is shorter than the time it took to run, well, then you’ve got a problem. Thankfully, I haven’t run into this yet, so I have no ideas on how to fix it!
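The overrun check itself is simple arithmetic, so it can be sketched in a self-contained form with the fire times passed in as plain millisecond values (in the real job they would come from context.getFireTime().getTime() and context.getNextFireTime().getTime()):

```java
public class OverrunCheck {
    // True when the transformation ran past its next scheduled fire time.
    static boolean overran(long runtimeMillis, long fireTime, long nextFireTime) {
        return runtimeMillis > (nextFireTime - fireTime);
    }

    public static void main(String[] args) {
        long fireTime = 0;
        long nextFireTime = 60_000; // scheduled to fire again in one minute

        // Finished with 15 seconds to spare -> no overrun.
        System.out.println(overran(45_000, fireTime, nextFireTime));
        // Ran 15 seconds past the next trigger -> overrun, worth logging.
        System.out.println(overran(75_000, fireTime, nextFireTime));
    }
}
```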