{"id":599,"date":"2008-08-25T01:38:45","date_gmt":"2008-08-24T15:38:45","guid":{"rendered":"http:\/\/www.somethinkodd.com\/oddthinking\/?p=599"},"modified":"2008-08-25T01:38:45","modified_gmt":"2008-08-24T15:38:45","slug":"hashes-to-detect-resized-images","status":"publish","type":"post","link":"https:\/\/www.somethinkodd.com\/oddthinking\/2008\/08\/25\/hashes-to-detect-resized-images\/","title":{"rendered":"Hashes To Detect Resized Images"},"content":{"rendered":"<p>I have plenty of respect for the people who work in Image Processing. <\/p>\n<p>However, you wouldn&#8217;t guess that from my latest experiment. I didn&#8217;t do any reading or studying. I didn&#8217;t try to stand on the shoulders of giants. I just plowed in and tried something. Unsurprisingly, I had mixed results.<\/p>\n<p>Let me build up to the problem&#8230;<\/p>\n<h4>An Easier Problem: Using hashes to detect duplicates<\/h4>\n<p>Suppose you have a large collection of files, and you suspect there are duplicates. How do you find them?<\/p>\n<p>The standard solution would be to create a hash of each file and add references to a hash table, looking for duplicates as you go.<\/p>\n<p>What hash would you use for random files?<\/p>\n<p>The first thing to remember is you need a better hash than you might think because of the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Birthday_paradox\" title=\"Wikipedia definition of Birthday_paradox\" class=\"wikipedia\">Birthday Paradox<\/a> &#8211; there are a lot of pairwise comparisons here &#8211; <code>O(n<sup>2<\/sup>)<\/code> &#8211; and there are likely to be more clashes than you might first expect.<\/p>\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/SHA_hash_functions\" title=\"Wikipedia definition of SHA_hash_functions\" class=\"wikipedia\">SHA<\/a> (or some other cryptographic hash) would probably be the gold-standard, but it is computationally expensive and unnecessarily secure against forgery.<\/p>\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Cyclic_redundancy_check\" title=\"Wikipedia definition of Cyclic_redundancy_check\" class=\"wikipedia\">CRC<\/a> is cheaper, but is intended for checking that files aren&#8217;t corrupted. The Python CRC library explicitly warns against using it for hashes. I haven&#8217;t completely understood why.<\/p>\n<p>I wonder if XOR could be used as a cheap hash. Divide the file into, say, 64 bit chunks, and XOR them together. It would be hopeless at doing MD5&#8217;s job of being unforgeable, and it would be hopeless at doing CRC&#8217;s job of detecting missing packets, but it sounds like a cheap and reasonable hash for unknown data. I haven&#8217;t investigated further than that.<\/p>\n<h4>Using customised hashes to detect approximate matches<\/h4>\n<p>So, the next trick is to use special hashes (based on knowledge of the type of data) to detect approximate matches.<\/p>\n<p>I last did this in anger about two years ago, when I used the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Soundex\" title=\"Wikipedia definition of Soundex\" class=\"wikipedia\">Soundex<\/a> hash to look over our customer database. Soundex is a hash that attempts to make like-sounding English strings end up with the same hash value, so you can fix misspellings and the like. I successfully found several cases where customers appeared twice in our database with slight variations to the spellings of their names.<\/p>\n<h4>The Real Problem: Detecting Duplicate Photos<\/h4>\n<p>So, I have lots of photos, and some are duplicated on several web-sites, with no reference to the original source. I want to detect matches, so I can move them (and their associated meta-data) all to a single <a href=\"http:\/\/www.somethinkodd.com\/oddthinking\/2008\/06\/12\/online-photo-db-stage-3-identify-possible-solutions\/\">yet-to-be-determined<\/a> destination.<\/p>\n<p>But here&#8217;s the snag. Some of the photos have been made into different sizes and qualities for web-viewing. It would save me time if I could automatically detect this. I need an equivalent to SoundEx that works on images, so images have the same hash even if they are resized.<\/p>\n<p>How would you do that?  If you answered &#8220;Google it, and find what the standard solution is&#8221;, you might be smarter than me. This article is about what I did instead.<\/p>\n<p>Remember what the cost is of making a mistake. <\/p>\n<p>If the algorithm makes a false positive (i.e. declares two photos are the same when they are not), I will notice when I sift through the results and I will correct it. Total cost: 5 seconds effort.<\/p>\n<p>If it makes a false negative (i.e. fails to realise two images are the same), it won&#8217;t be as easy to find. I will end up with a duplicate photo in the final database, or I will fail to merge meta-data. Total cost: practically nothing; slightly lower quality photo site.<\/p>\n<p>So, with the stakes so low, I just waded in with my eyes shut and I created my own hash function. <\/p>\n<h4>Hash Function, Version 1<\/h4>\n<p>My first version of the hash function had two parts.<\/p>\n<p>The first part was the height:width ratio. I assumed that any two photos whose height:width ratio differed by more than 2% couldn&#8217;t be rescalings of the same photo. <\/p>\n<p>The second part was a 5&#215;5 array of tuples. To produce the array, I first divided each photo into 5&#215;5 equal-sized rectangles, and summed all the Red, Blue and Green values for each pixel in each rectangle. <\/p>\n<div class=\"aside\"><a href=\"http:\/\/www.pythonware.com\/products\/pil\/\">PIL<\/a> has a histogram function, which meant the bulk of the work was written in optimised C, so the performance was fine.<\/div>\n<p>In the 5&#215;5 array, I stored the Red:Blue ratio and Green:Blue ratio. I figure that was better storing colour ratios rather than simple brightness, because I feared that there might sometimes be some adjustments to contrast and brightness.  I hoped (without evidence) that such adjustments wouldn&#8217;t affect the colour ratios. I assumed that any two photos whose colour ratios were within 5% in all 25 boxes were probably the same photo.<\/p>\n<p>If you are noticing a lot of wishful thinking here, you aren&#8217;t alone.<\/p>\n<p>I wrote my own &#8220;equivalence&#8221; method, so two image hashes could be compared. If the hashes revealed compatible height:width ratios and compatible red:blue and green:blue ratios, then the hashes matched.<\/p>\n<p>I tested the result with three images. The first was a photo of a person. The second was a photo of the same person on the same day at a different time. The third was a thumbnail version of the first image.<\/p>\n<p>It correctly detected that Images 1 and 2 were different and Images 1 and 3 were the same!<\/p>\n<p>Not a very thorough unit test, but it passed! Yay!<\/p>\n<p>However, there was a fatal flaw with this hash function. Two equivalent hashes wouldn&#8217;t have bitwise equality. These hashes could not be used in a standard hash-table, and <code>O(n<sup>2<\/sup>)<\/code> hash comparisons would be required. With thousands of photos, this would be too slow. Time for a rethink.<\/p>\n<h4>Hash Function Version 2<\/h4>\n<p>The second version of the hash function was built on top of the first one.<\/p>\n<p>The height:width ratio was discarded. <\/p>\n<p>Each of the 5&#215;5 colour-ratios tuples was numbered 0 to 24, and numbers were sorted based on the size of each of the corresponding ratios. (I made sure the Python sort algorithm was stable.)<\/p>\n<p>The hash was now two tuples (representing the red:blue and green:blue ratios) with 25 elements &#8211; each element was in the range 0 to 24. <\/p>\n<p>So, if box 3 had the highest red:blue ratio, and box 5 had the second-highest red:blue ratio, the first of the tuples would be (3, 5, &#8230; ).<\/p>\n<p>Those 50 integers characterised the relative colouring of the sections of image (or so I hoped!), and were ready for bitwise comparison to other hashes.<\/p>\n<p>I ran my trivial unit tests again. They failed!<\/p>\n<p>Within Image 1, some boxes had colour ratios very close to other boxes. Within Image 3, during the resizing, the relative ordering wasn&#8217;t preserved &#8211; a slight increase in one ratio made it larger than another. Of the 50 integers, 48 matched, but two adjacent ones were swapped. A false negative was the result.<\/p>\n<p>I pondered how to fix this; it was tricky, so I started with an initial, quick hack. I dropped the number of boxes per image from 5&#215;5 to 4&#215;4. A smaller number of boxes, with more samples in each box, reduced (but didn&#8217;t eliminate!) the chance of the order being changed during rescaling. It should reduce false negatives, at the cost of more false positives.<\/p>\n<p>I re-ran the unit-test and it passed! Woohoo!<\/p>\n<p>Time for more unit-tests? Nah! Time for the system test. I used the Image Hash to run the basic duplicates search algorithm over hundreds of photos.<\/p>\n<p>I did get some false positives, but many fewer than I expected. All of the false positives were black-and-white photos. Black-and-white pictures should have the exact same colour ratio in each of squares, making the hash meaningless.<\/p>\n<p>I also had false negatives, although that was harder to immediately detect, without manually wading through the photos by hand. I detected them by further reducing the hash to a 3&#215;3 array, and that revealed a lot of the images that had been missed (false negatives) by the 4&#215;4 version.<\/p>\n<h4>Conclusion<\/h4>\n<p>I tried to create an effective hash that detected photos that were approximately the same apart from rescaling. I hoped it would also be agnostic to simple brightness and contrast adjustments.<\/p>\n<p>The result was completely ineffective for black-and-white photos, but apart from that had surprisingly few false positives. However, it did suffer from a moderate number of false negatives.<\/p>\n<p>I suppose I should go read a book on Image Processing and find out how the pros do it. How tedious, compared to reinventing the wheel&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>So, I have lots of photos, and some are duplicated on several web-sites, with no reference to the original source. I want to detect matches, so I can move them (and their associated meta-data) all to a single <a href=\"http:\/\/www.somethinkodd.com\/oddthinking\/2008\/06\/12\/online-photo-db-stage-3-identify-possible-solutions\/\">yet-to-be-determined<\/a> destination.<\/p>\n<p>But here&#8217;s the snag. Some of the photos have been made into different sizes and qualities for web-viewing. It would save me time if I could automatically detect this. I need an equivalent to SoundEx that works on images, so images have the same hash even if they are resized.<\/p>\n<p>How would you do that?  If you answered &#8220;Google it, and find what the standard solution is&#8221;, you might be smarter than me. This article is about what I did instead.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_s2mail":"","footnotes":""},"categories":[285,33,34],"tags":[213,299,284,81],"class_list":["post-599","post","type-post","status-publish","format-standard","hentry","category-photography-geek","category-puzzle-solving","category-software-development","tag-algorithm","tag-image-processing","tag-online-photo-database","tag-python"],"_links":{"self":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts\/599","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/comments?post=599"}],"version-history":[{"count":3,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts\/599\/revisions"}],"predecessor-version":[{"id":602,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/posts\/599\/revisions\/602"}],"wp:attachment":[{"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/media?parent=599"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/categories?post=599"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.somethinkodd.com\/oddthinking\/wp-json\/wp\/v2\/tags?post=599"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}