Wednesday 31 March 2010

Expressing intent in scala

Scala gets some bad press because it's possible to get carried away with some of its shiny new toys, especially the ability to create methods with non-alpha names. Some of this I agree with - methods like >:> really don't aid readability for me.

Don't throw out the baby with the bathwater though. I like scala because it makes it much easier for me to express my intent.

I've been working on some code that builds a solr index of a large database. To do this efficiently, it needs to divide the data set into smaller blocks for processing in parallel.

The ids in the table are autogenerated, and for historical reasons are far from monotonically incrementing. For the data retrieval queries to be efficient we need to find the primary keys for each block. The (oracle) query looks a bit like this:

SELECT id FROM 
   (SELECT id, ROW_NUMBER() OVER (ORDER BY id) rownumber 
      FROM [the_large_table]) 
WHERE MOD(rownumber, 1000) = 0
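To make the result of that query concrete, here is a minimal scala sketch that picks the same kind of boundary ids from an in-memory list. The boundaryIds name and the sample ids are made up for illustration; the real code gets these values from the database.

```scala
// Hypothetical in-memory stand-in for the query: sort the ids and keep every
// batchSize-th one (rows are counted from 1, like the query's rownumber column).
def boundaryIds(ids: Seq[Int], batchSize: Int): List[Int] =
  ids.sorted.zipWithIndex.collect {
    case (id, index) if (index + 1) % batchSize == 0 => id
  }.toList

// Made-up, gappy ids; a batch size of 3 keeps the example short.
val sampleIds = Seq(40, 2, 17, 5, 90, 33, 61, 8)
val boundaries = boundaryIds(sampleIds, 3)
// boundaries == List(8, 40)
```

Each returned id is the last row of a full block of batchSize rows, which is why the batch-building code treats it as the start of the next batch.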

Then we had a bit of java code to build batch objects from this:

public List<Batch> getBatches
    (int batchSize) {
  List<Integer> idList = 
    // result of the query above
  int maxId =  
    // find the max id from the table

  int currentId = 0;

  List<Batch> batches = 
    new ArrayList<Batch>();

  int num = 1;

  for (Integer nextId : idList) {
    batches.add(new Batch(num++, currentId, nextId-1));
    currentId = nextId;
  }

  batches.add(new Batch(num, currentId, maxId));

  return batches;
}

I converted this to the following scala yesterday:

// A recursive method needs an explicit result type to compile.
private def buildBatches(
      num: Int, 
      idList: List[Int]): List[Batch] = {
  idList match {
    case startOfThis :: startOfNext :: tail =>
      new Batch(num, startOfThis, startOfNext - 1) :: 
        buildBatches(num + 1, startOfNext :: tail)
    case _ => Nil
  }
}
  
def getBatches(batchSize: Int) = {
  val idList = // ...
  val maxId = // ...

  // maxId + 1 so that the last batch ends at maxId, as in the java version
  buildBatches(1, 0 :: idList ::: (maxId + 1) :: Nil)
}
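For anyone who wants to run this, here is a self-contained sketch of the same idea. The Batch case class and the sample ids are assumptions made for the example, since the real Batch class and the database calls aren't shown in this post. The sketch also assumes the batch ranges are inclusive, so it puts maxId + 1 on the end of the list to make the last batch finish at maxId, as the java loop does.

```scala
// Hypothetical stand-in for the real Batch class, which this post doesn't show.
case class Batch(num: Int, firstId: Int, lastId: Int)

def buildBatches(num: Int, idList: List[Int]): List[Batch] =
  idList match {
    case startOfThis :: startOfNext :: tail =>
      Batch(num, startOfThis, startOfNext - 1) ::
        buildBatches(num + 1, startOfNext :: tail)
    case _ => Nil
  }

// Made-up values standing in for the query result and the max id.
val idList = List(1000, 2500, 7200)
val maxId = 9100

// maxId + 1 is an assumption of this sketch: with inclusive ranges it makes
// the last batch end at maxId.
val batches = buildBatches(1, 0 :: idList ::: (maxId + 1) :: Nil)
batches.foreach(println)
// Batch(1,0,999)
// Batch(2,1000,2499)
// Batch(3,2500,7199)
// Batch(4,7200,9100)
```

The recursion is not tail-recursive, which should be fine for a few thousand boundary ids but is worth knowing about for very long lists.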


Importantly, when I discussed how to build up the batches with a colleague, we talked about the operation as "take the first two entries in the list, build a batch from that, then go on to the next two entries in the list. Oh, and we need to pretend that the list from the database has a zero on the beginning and the max on the end." That's exactly what the code does.

"::" is the scala list prepend ("cons") operator: it adds an element to the front of an existing list. So 1 :: 2 :: 3 :: Nil is equivalent to List(1, 2, 3). ":::" joins two lists. Nil represents an empty list. (You need the Nil on the end because :: is right-associative and is a method of the list on its right, so the chain has to end in a list.)

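A small sketch of how :: and ::: group, using made-up value names:

```scala
val viaCons = 1 :: 2 :: 3 :: Nil     // parsed as 1 :: (2 :: (3 :: Nil))
val viaApply = List(1, 2, 3)         // the same list, built the usual way

val prepended = 0 :: viaApply        // List(0, 1, 2, 3)
val joined = viaApply ::: 4 :: Nil   // parsed as viaApply ::: (4 :: Nil), giving List(1, 2, 3, 4)
```

Operators ending in a colon bind to the right, which is why an expression like 0 :: idList ::: maxId :: Nil needs no brackets.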
The magic here is the match in the buildBatches method (it looks better when it doesn't have quite so many line breaks!). The match works like this:

List(1, 2, 3, 4) match { 
  case a :: b :: c => 
    println(a)
    println(b) 
    println(c) 
  case _ => 
}
// output:
// 1
// 2
// List(3, 4)
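The same pattern is also what stops the recursion: a list with fewer than two entries doesn't match the first case and falls through to the default. A minimal sketch, with a made-up describe helper:

```scala
// Hypothetical helper: reports which case a list falls into.
def describe(list: List[Int]): String =
  list match {
    case a :: b :: rest => s"first=$a, second=$b, rest=$rest"
    case _              => "fewer than two entries"
  }

val four = describe(List(1, 2, 3, 4)) // "first=1, second=2, rest=List(3, 4)"
val two  = describe(List(1, 2))       // "first=1, second=2, rest=List()"
val one  = describe(List(1))          // "fewer than two entries"
val none = describe(Nil)              // "fewer than two entries"
```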

This is why I'm liking scala: it enables me to express my intent in a way that less malleable languages like java do not. As always, the most important thing is to focus on readability, not on the shiny toys.

A future blog post will (hopefully) talk about the 20 lines of scala code that hide the complexity of jdbc from the rest of the code.

2 comments:

Daithi said...

You've got me thinking about this and the zip method.

It's possible to create the id intervals with a zip:

val startIndices = 0 :: idList
val endIndices = idList.map(_ - 1) :+ maxId
val intervals = startIndices.zip(endIndices)

Then zipWithIndex to get a zero-based batch number and construct the object:

intervals.zipWithIndex.map {
   case ((start, end), batchNum) => new Batch(batchNum + 1, start, end)
}

Scala is beautiful.

Daithi said...

And now that I think about it, that piece of code expresses the intent quite clearly. Which I think is precisely the point you were making...