Sign in to follow this  
  • entries
    24
  • comments
    21
  • views
    17468

Like a Hot Knif through Butter

Sign in to follow this  
Enigma

303 views

I've finished (the first pass of) my Jpeg2000 loader. I think I'm going to opt to call it Jackknif. That's Jackknife, the only English word I could find which contains a 'J' followed by two 'k's (J2K, get it?), with the e knocked off to indicate that it's not a complete implementation. Yes, there is method to my madness (or should that be madness to my method?).

It turns out I did manage to get a finished version of the code down to less than a thousand lines, which shows that it really isn't that complicated an algorithm. Speed wise I was competing against two open source reference implementation - JasPer (written in C) and JJ2000 (written in Java). My reference image (2048x2048 rgb) took approximately nine seconds to load under Jasper and approximately six seconds to load under JJ2000.

The first complete version of Jackknif was taking around fourteen seconds. I thought this was pretty reasonable and whipped out a profiler only to be rather confused by the results. The hotspot was showing up as 120 million calls to fill_n, but I only used fill_n in a couple of places. One place which should only have amounted to a few thousand calls and another, in static initialisation, which should only have involved about twenty or so calls. I took a careful look through the source and spotted a minor bug in my static initialisation code. It looked something like:
static int array[size];
static bool initialised = false;
if (!initialised)
{
function_which_initialises_array(array);
}
// code which uses array
I'd forgotten to set the boolean flag to true, so my array was being repeatedly initialised, to the tune of ~6 million times. Fixing that minor bug, along with a couple of very minor optimisations (changing arrays of ints to arrays of shorts) brought Jackknif down to just under six seconds.

I was very pleased with this. My fairly naieve implementation was outperforming even the "optimised" JJ2000 implementation. Next bottleneck was the filtering. The way it was implemented wasn't very cache friendly. I looped through every component and for each component looped through every row and then every column. To demonstrate, a 4 pixel square image would have been processed something like:
Image:
r11 g11 b11 r12 g12 b12 r13 g13 b13 r14 g14 b14
r21 g21 b21 r22 g22 b22 r23 g23 b23 r24 g24 b24
r31 g31 b31 r32 g32 b32 r33 g33 b33 r34 g34 b34
r41 g41 b41 r42 g42 b42 r43 g43 b43 r44 g44 b44

Visitation order (components):
r11 r12 r12 r14 r21 r22 r23 r24 r31 r32 r33 r34 r41 r42 r43 r44
r11 r21 r31 r41 r12 r22 r32 r42 r13 r23 r33 r43 r14 r24 r34 r44
g11 g12 g12 g14 g21 g22 g23 g24 g31 g32 g33 g34 g41 g42 g43 g44
g11 g21 g31 g41 g12 g22 g32 g42 g13 g23 g33 g43 g14 g24 g34 g44
b11 b12 b12 b14 b21 b22 b23 b24 b31 b32 b33 b34 b41 b42 b43 b44
b11 b21 b31 b41 b12 b22 b32 b42 b13 b23 b33 b43 b14 b24 b34 b44

Visitation order (array indices):
0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45
0 12 24 36 3 15 27 39 6 18 30 42 9 21 33 45
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46
1 13 25 37 4 16 28 40 7 19 31 43 10 22 34 46
2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47
2 14 26 38 5 17 29 41 8 20 32 44 11 23 35 47
I switched the order to loop though components of each pixel one after another and processed the first pixel of each column in order before processing the next column:
Visitation order (components):
r11 g11 b11 r12 g12 b12 r13 g13 b13 r14 g14 b14
r21 g21 b21 r22 g22 b22 r23 g23 b23 r24 g24 b24
r31 g31 b31 r32 g32 b32 r33 g33 b33 r34 g34 b34
r41 g41 b41 r42 g42 b42 r43 g43 b43 r44 g44 b44
r11 g11 b11 r12 g12 b12 r13 g13 b13 r14 g14 b14
r21 g21 b21 r22 g22 b22 r23 g23 b23 r24 g24 b24
r31 g31 b31 r32 g32 b32 r33 g33 b33 r34 g34 b34
r41 g41 b41 r42 g42 b42 r43 g43 b43 r44 g44 b44

Visitation order (array indices):
0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47
0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47
I expected that might bring the execution time down to around four and a half seconds, maybe four if I was lucky. I underestimated. With that simple optimisation the execution time plummeted to around 2.2 seconds.

I still have a few more optimisations to apply. I'm not sure when that will happen since I shall probably be working on something else this next week for a bit of a break. My target though is to bring that execution time down to no more than 1.5 seconds for my reference image.

I intend to release the final source code, both a cleaned up unoptimised version so people can see how the algorithm works, plus the final optimised version, under a permissive open source license when I'm done. The only thing I intend to disallow is patenting of techniques used in derivative works. I'm sure there exists an open source license with this kind of restriction. If anyone knows of a license with this restriction, please let me know - it'll save me a few minutes searching.

Finally, the obligatory screenshot, actually four screenshots in one:


The top left shows the fully decoded image, minus horizontal and vertical filtering, resized from 2048x2048 to 256x256. The top right shows the same image, but only the top left corner of it, at normal size. The bottom left shows the fully decoded image with horizontal and vertical filtering, again, resized from 2048x2048 to 256x256. The bottom right shows the same image, but again only the top left corner of it, at normal size.

?nigma
Sign in to follow this  


2 Comments


Recommended Comments

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now