It’s Too Dang Slow

As I mentioned in my previous post, I’ve been working on my mosaic program again, and in trying to implement a feature that I was really excited about, I’ve discovered it to be lacking in one primary element – speed. I’m going to describe the issue here, in the hopes that I may be able to tease out a new mechanism that might be at least fast enough.

The Problem

In creating an image mosaic, my goal is to find a combination of images from the image library that best matches the goal image – thus minimizing the distortion introduced by creating the mosaic. If we consider the “cost” of placing a member of the image library into a specific region of the goal image, the main idea is then to find a minimal-cost matching.

That’s the big picture, but if we drill into the details, one thing drops out – we have to evaluate that cost function a lot. Consider the following very small example:

An image that is 1,820×1,320 pixels (at 300 DPI, roughly 6.1×4.4 inches) has 2,402,400 pixels. If I want to tile this image with 400 images from an image library, each tile would then be 91×66 pixels, or 6,006 pixels in total. Comparing each image in the image library to each section of the goal image would then take:

400 × 400 × 6006 = 960,960,000 executions of the cost function.

My Old Solution

In my old code, doing this computation was pretty straightforward – I simply took the absolute value of the difference of each of the three pixel channels (red, green, and blue) and returned the total. In all, that’s 8 integer operations (three subtractions, three absolute values, and two additions). Since we have to add a whole bunch of these together, we’re looking at one additional addition on average, for a total of 9. Putting that in the context of the small example above, on a perfectly optimized gigahertz machine, I would expect this to take approximately:

960,960,000 × 9 / 1,000,000,000 = 8.648 seconds

This is pretty snappy, and worked great for my old code.
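For reference, here’s a minimal sketch of what that per-pixel cost looks like, assuming 16-bit channels; the struct and function names are just for illustration.

```cpp
#include <cstdint>
#include <cstdlib>

// Illustrative pixel type: three 16-bit channels (0-65535).
struct Pixel {
    uint16_t r, g, b;
};

// Sum of absolute channel differences: three subtractions, three absolute
// values, and two additions -- the 8 integer operations counted above.
inline int pixelCost(const Pixel& a, const Pixel& b) {
    return std::abs(int(a.r) - int(b.r))
         + std::abs(int(a.g) - int(b.g))
         + std::abs(int(a.b) - int(b.b));
}
```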

My New Solution

In my new code, I abstracted this computation behind a strategy, so that I could swap in different ways of determining the cost. Specifically, I wanted to use the CIE-L*ab color space, as it is a better approximation of how humans perceive differences in color. In my initial implementation I did this very naively and simply converted each pixel from RGB to L*ab every time it needed to be measured. Converting a pixel from RGB to CIE-L*ab in the easiest way takes approximately 97 floating point operations (a pretty generous estimate). Since I have to do this to two pixels for every comparison, and then perform the same 9 operations I did above, this turns out to be approximately (with floating point assumed to run at about 200 million operations per second):

960,960,000 × 9 / 1,000,000,000 + 960,960,000 × 2 × 97 / 200,000,000 ≈ 940.780 seconds

Which, at over 15 minutes, is starting to feel pretty sluggish – especially since I haven’t even started solving the mosaic yet!
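For context, the conversion itself looks roughly like the standard textbook RGB → XYZ → L*ab pipeline sketched below. It assumes linear RGB with sRGB primaries and a D65 white point; my actual implementation differs in the details, but the amount of floating point work is similar.

```cpp
#include <cmath>

struct Lab { double L, a, b; };

// CIE nonlinearity used by the XYZ -> L*ab step.
static double f(double t) {
    return (t > 0.008856) ? std::cbrt(t)
                          : (7.787 * t + 16.0 / 116.0);
}

// r, g, b are linear RGB values in [0, 1].
Lab rgbToLab(double r, double g, double b) {
    // Linear RGB -> XYZ (sRGB primaries, D65 white point).
    double X = 0.4124 * r + 0.3576 * g + 0.1805 * b;
    double Y = 0.2126 * r + 0.7152 * g + 0.0722 * b;
    double Z = 0.0193 * r + 0.1192 * g + 0.9505 * b;

    // Normalize by the D65 reference white, then apply the nonlinearity.
    double fx = f(X / 0.95047);
    double fy = f(Y / 1.00000);
    double fz = f(Z / 1.08883);

    return { 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz) };
}
```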

Possible Options

As soon as I found my current implementation unusable, I started looking into more efficient ways of accomplishing this task. I knew that any efficiency gain would help, because the main cost is simply that I have to run the same function a great many times.

My first thought was to rewrite the computation as a lookup – instead of doing a bunch of floating point operations at run time, I would do those operations ahead of time and just do a table lookup at runtime (in effect, trading space for time). The main problem with doing this is creating the table. Since a pixel is three dimensional, and the conversion to CIE-L*ab depends on all three channels simultaneously, my lookup must also be three dimensional. With each channel ranging from 0 to 65535, it would cost approximately:

65536 × 65536 × 65536 = 281,474,976,710,656 memory locations

and, at 6 bytes per location,

281,474,976,710,656 × 6 = 1,536 Tbytes of space

far more than I could hope to hold in memory. I may return to this idea using fewer colors (at the cost of having a less accurate cost function), but I wanted to try some other possibilities.
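Just to make the sizing concrete, the full-precision table would have to be indexed by all three 16-bit channels at once, something like the purely hypothetical sketch below (which, of course, never gets allocated).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical entry for the full-precision table: L*, a*, b* stored in
// 16 bits each, i.e. the 6 bytes per location used in the estimate above.
struct Lab16 {
    int16_t L, a, b;
};

// Index by all three 16-bit channels simultaneously. With 65,536 values
// per channel this needs 65536^3 = 281,474,976,710,656 entries -- about
// 1,536 terabytes at 6 bytes each -- so the table only exists on paper.
inline size_t labTableIndex(uint16_t r, uint16_t g, uint16_t b) {
    const size_t n = 65536;
    return (size_t(r) * n + size_t(g)) * n + size_t(b);
}
```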

My next idea was to utilize caching – ideally, I would only want to convert any given pixel once, and then just look it up instead of having to do the computation every time. The potential savings here is that instead of having to do two conversions every time the cost function is called, I may be able to replace a significant number of those with lookups. In our small example from above, I could in theory replace the term:

(400 × 400 × 6006 × 2) × 97 flops

with

(2,402,400 + 400 × 6006) × (97 flops + cache store) + 1,917,115,200 × (lookup cost)

Which, if the lookup cost is less than 97 flops, is a pretty good savings. The memory footprint for our example is also quite a bit smaller than the full lookup-table version, at:

4,804,800 memory locations and 4,804,800 × 6 = ~27.49 Mbytes.

In doing some testing, this translated into about a 10% savings using std::map for the cache. Bummer – this is still pretty hefty.
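For what it’s worth, the cache itself is simple; the sketch below packs the three 16-bit channels into a single key and only converts a pixel the first time it’s seen. The packing scheme and the use of std::unordered_map here are illustrative (my first pass used std::map keyed the same way), and rgbToLab stands in for the conversion sketched earlier, with the channel scaling folded in.

```cpp
#include <cstdint>
#include <unordered_map>

struct Lab { double L, a, b; };

// The expensive conversion (see the sketch above); declared here for brevity.
Lab rgbToLab(uint16_t r, uint16_t g, uint16_t b);

// Pack three 16-bit channels into one 64-bit key.
inline uint64_t packPixel(uint16_t r, uint16_t g, uint16_t b) {
    return (uint64_t(r) << 32) | (uint64_t(g) << 16) | uint64_t(b);
}

// Convert a pixel at most once; afterwards every request is just a lookup.
Lab cachedRgbToLab(uint16_t r, uint16_t g, uint16_t b,
                   std::unordered_map<uint64_t, Lab>& cache) {
    const uint64_t key = packPixel(r, g, b);
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;              // hit: skip the 97 flops
    Lab lab = rgbToLab(r, g, b);        // miss: convert and remember
    cache.emplace(key, lab);
    return lab;
}
```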

Other Possibilities

There are some other things that I haven’t had time to play with yet, but that I feel are worth looking into. First, I can do a low-fidelity comparison – using a 0-255 range per channel rather than 0-65535, I could use the first option with far fewer memory locations: 256 × 256 × 256 = 16,777,216 entries, or about 96 Mbytes at 6 bytes each. Alternately, I may be able to get away with sampling the image for the comparison – rather than comparing every pixel, I could compare, say, every 8th pixel. This has the potential to be a great savings, but since I could do the same thing with the RGB version (to make that even faster), it is slightly less satisfying.
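As a rough illustration of the low-fidelity option, the sketch below quantizes each channel down to 8 bits and precomputes one table entry per 8-bit RGB triple. The names are again just for illustration, and rgbToLab8 stands in for the earlier conversion applied to the quantized values.

```cpp
#include <cstdint>
#include <vector>

struct Lab16 { int16_t L, a, b; };                 // 6 bytes per entry

// Conversion for an 8-bit RGB triple (the earlier sketch, rescaled);
// declared here for brevity.
Lab16 rgbToLab8(uint8_t r, uint8_t g, uint8_t b);

// 256^3 = 16,777,216 entries; at 6 bytes each that's roughly 96 Mbytes.
std::vector<Lab16> buildLabTable() {
    std::vector<Lab16> table(256u * 256u * 256u);
    for (unsigned r = 0; r < 256; ++r)
        for (unsigned g = 0; g < 256; ++g)
            for (unsigned b = 0; b < 256; ++b)
                table[(r << 16) | (g << 8) | b] =
                    rgbToLab8(uint8_t(r), uint8_t(g), uint8_t(b));
    return table;
}

// At comparison time, drop the low byte of each 16-bit channel and index.
inline const Lab16& lookupLab(const std::vector<Lab16>& table,
                              uint16_t r, uint16_t g, uint16_t b) {
    return table[((r >> 8) << 16) | ((g >> 8) << 8) | (b >> 8)];
}
```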

I’m still playing around with ideas – hopefully one of them will show some promise…

Update – actual testing

So, I’ve had the chance to implement a few of the options I talked about above, and I think I have a clear winner as well as a reasonable runner-up. These tests were run on my laptop against the small example I described above (400 91×66 library images and a 1,820×1,320 goal image). As a comparison, the plain RGB setup took 10 seconds.

  • Naive computation: 547 seconds
  • Standard Map: 345 seconds (36.9% speedup)
  • Hash Map: 137 seconds (75.0% speedup)
  • 256×256×256 Lookup (including table creation): 34 seconds (93.8% speedup)

So, the lookup table is clearly the most efficient, with the hash map next in line. I was surprised that the standard map was so inefficient – I guess that’s the difference between O(log n) lookups and constant-time lookups. The catch is that the lookup table is a low-fidelity approximation, so I’m giving up some accuracy for the speed.