I have plenty of respect for the people who work in Image Processing.
However, you wouldn’t guess that from my latest experiment. I didn’t do any reading or studying. I didn’t try to stand on the shoulders of giants. I just plowed in and tried something. Unsurprisingly, I had mixed results.
Let me build up to the problem…
An Easier Problem: Using hashes to detect duplicates
Suppose you have a large collection of files, and you suspect there are duplicates. How do you find them?
The standard solution would be to create a hash of each file and add references to a hash table, looking for duplicates as you go.
What hash would you use for random files?
The first thing to remember is you need a better hash than you might think because of the Birthday Paradox – there are a lot of pairwise comparisons here –
O(n2) – and there are likely to be more clashes than you might first expect.
SHA (or some other cryptographic hash) would probably be the gold-standard, but it is computationally expensive and unnecessarily secure against forgery.
CRC is cheaper, but is intended for checking that files aren’t corrupted. The Python CRC library explicitly warns against using it for hashes. I haven’t completely understood why.
I wonder if XOR could be used as a cheap hash. Divide the file into, say, 64 bit chunks, and XOR them together. It would be hopeless at doing MD5′s job of being unforgeable, and it would be hopeless at doing CRC’s job of detecting missing packets, but it sounds like a cheap and reasonable hash for unknown data. I haven’t investigated further than that.
Using customised hashes to detect approximate matches
So, the next trick is to use special hashes (based on knowledge of the type of data) to detect approximate matches.
I last did this in anger about two years ago, when I used the Soundex hash to look over our customer database. Soundex is a hash that attempts to make like-sounding English strings end up with the same hash value, so you can fix misspellings and the like. I successfully found several cases where customers appeared twice in our database with slight variations to the spellings of their names.
The Real Problem: Detecting Duplicate Photos
So, I have lots of photos, and some are duplicated on several web-sites, with no reference to the original source. I want to detect matches, so I can move them (and their associated meta-data) all to a single yet-to-be-determined destination.
But here’s the snag. Some of the photos have been made into different sizes and qualities for web-viewing. It would save me time if I could automatically detect this. I need an equivalent to SoundEx that works on images, so images have the same hash even if they are resized.
How would you do that? If you answered “Google it, and find what the standard solution is”, you might be smarter than me. This article is about what I did instead.
Remember what the cost is of making a mistake.
If the algorithm makes a false positive (i.e. declares two photos are the same when they are not), I will notice when I sift through the results and I will correct it. Total cost: 5 seconds effort.
If it makes a false negative (i.e. fails to realise two images are the same), it won’t be as easy to find. I will end up with a duplicate photo in the final database, or I will fail to merge meta-data. Total cost: practically nothing; slightly lower quality photo site.
So, with the stakes so low, I just waded in with my eyes shut and I created my own hash function.
Hash Function, Version 1
My first version of the hash function had two parts.
The first part was the height:width ratio. I assumed that any two photos whose height:width ratio differed by more than 2% couldn’t be rescalings of the same photo.
The second part was a 5×5 array of tuples. To produce the array, I first divided each photo into 5×5 equal-sized rectangles, and summed all the Red, Blue and Green values for each pixel in each rectangle.
In the 5×5 array, I stored the Red:Blue ratio and Green:Blue ratio. I figure that was better storing colour ratios rather than simple brightness, because I feared that there might sometimes be some adjustments to contrast and brightness. I hoped (without evidence) that such adjustments wouldn’t affect the colour ratios. I assumed that any two photos whose colour ratios were within 5% in all 25 boxes were probably the same photo.
If you are noticing a lot of wishful thinking here, you aren’t alone.
I wrote my own “equivalence” method, so two image hashes could be compared. If the hashes revealed compatible height:width ratios and compatible red:blue and green:blue ratios, then the hashes matched.
I tested the result with three images. The first was a photo of a person. The second was a photo of the same person on the same day at a different time. The third was a thumbnail version of the first image.
It correctly detected that Images 1 and 2 were different and Images 1 and 3 were the same!
Not a very thorough unit test, but it passed! Yay!
However, there was a fatal flaw with this hash function. Two equivalent hashes wouldn’t have bitwise equality. These hashes could not be used in a standard hash-table, and
O(n2) hash comparisons would be required. With thousands of photos, this would be too slow. Time for a rethink.
Hash Function Version 2
The second version of the hash function was built on top of the first one.
The height:width ratio was discarded.
Each of the 5×5 colour-ratios tuples was numbered 0 to 24, and numbers were sorted based on the size of each of the corresponding ratios. (I made sure the Python sort algorithm was stable.)
The hash was now two tuples (representing the red:blue and green:blue ratios) with 25 elements – each element was in the range 0 to 24.
So, if box 3 had the highest red:blue ratio, and box 5 had the second-highest red:blue ratio, the first of the tuples would be (3, 5, … ).
Those 50 integers characterised the relative colouring of the sections of image (or so I hoped!), and were ready for bitwise comparison to other hashes.
I ran my trivial unit tests again. They failed!
Within Image 1, some boxes had colour ratios very close to other boxes. Within Image 3, during the resizing, the relative ordering wasn’t preserved – a slight increase in one ratio made it larger than another. Of the 50 integers, 48 matched, but two adjacent ones were swapped. A false negative was the result.
I pondered how to fix this; it was tricky, so I started with an initial, quick hack. I dropped the number of boxes per image from 5×5 to 4×4. A smaller number of boxes, with more samples in each box, reduced (but didn’t eliminate!) the chance of the order being changed during rescaling. It should reduce false negatives, at the cost of more false positives.
I re-ran the unit-test and it passed! Woohoo!
Time for more unit-tests? Nah! Time for the system test. I used the Image Hash to run the basic duplicates search algorithm over hundreds of photos.
I did get some false positives, but many fewer than I expected. All of the false positives were black-and-white photos. Black-and-white pictures should have the exact same colour ratio in each of squares, making the hash meaningless.
I also had false negatives, although that was harder to immediately detect, without manually wading through the photos by hand. I detected them by further reducing the hash to a 3×3 array, and that revealed a lot of the images that had been missed (false negatives) by the 4×4 version.
I tried to create an effective hash that detected photos that were approximately the same apart from rescaling. I hoped it would also be agnostic to simple brightness and contrast adjustments.
The result was completely ineffective for black-and-white photos, but apart from that had surprisingly few false positives. However, it did suffer from a moderate number of false negatives.
I suppose I should go read a book on Image Processing and find out how the pros do it. How tedious, compared to reinventing the wheel…