Octopart Under the Hood: Image Deduplication
Pulse
Octopart Staff
Nov 11, 2018

dedupewide

From the beginning, one of the missions of Octopart has been to provide open and useful access to electronic part data by receiving and consolidating data that previously existed only in scattered silos.

A key step in doing this is to identify records which are separate instances of the same entity (“matching”) and combine them into a single record (“merging”). This process is known as entity resolution.

For example, performing matching and merging over manufacturer part numbers allows us to combine the distributor offers and pricing for both “Vishay 1N4148-TAP” and “Vishay 1N4148TAP” into a single Octopart search result. (While being careful not to merge “3M 85-12” with “3M 8512”, since they are separate parts.)

Another example of matching and merging that has been particularly tricky to get right has been deduplication of part images.

We often receive dozens of images for a part from a variety of sources, but many are duplicates derived from a small number of original photos. Eliminating these duplicates is not always straightforward due to resizing, resampling, re-encoding, and watermarking.

We recently undertook a project to identify and hide these duplicate images, which produced some satisfying results. The part pages look much cleaner with fewer extraneous images, and I hope everyone out there using Octopart finds this to be a pleasant improvement.

Here you can see a few before-and-after examples. On the left are the original images, and on the right are the deduplicated images.

dedupewide

dedupe1

dedupe2

dedupe3

If you're curious about how we did it, keep reading!

Image Deduplication Approach

There are two steps to the approach we took.

  1. Deduplicate byte-for-byte identical images from the same source. This finds exact duplicates.

  2. Deduplicate similar images by applying hierarchical clustering over perceptual hashes of the images. This catches the near-duplicates which look the same to a human viewer, but may have been resized, resampled, re-encoded, or watermarked.

The rest of this blog post will go into more depth on perceptual hashing and hierarchical clustering.

Perceptual hash

We need our image deduplication pipeline to run quickly, so instead of working with the image data directly, we first compute and store perceptual hashes of each image. We use the ImageHash library for Python to compute the hashes.

Perceptual hashing maps image data to a fixed size output value, e.g., 64 bits. In this way it is similar to a cryptographic hash function like SHA-256.

But perceptual hashing differs in that visually similar images map to similar hash values. In this way, perceptual hashing is related to locality-sensitive hash algorithms like SimHash. However, perceptual hashing is intended specifically for use with image data.

The similarity of two images can be measured by the Hamming distance between their perceptual hashes—i.e., the number of bit-flips needed to transform the first hash into the second.

Here we see the images for a part...

dedupe4

… and the perceptual hashes for each image, in the same order.

dedupe5

We see images 1, 3, 5, 8, and 9 (0-indexed) are identical and also have identical perceptual hashes. We also see that images 11, 12 and 13 are very similar, with differences due to watermarking, and they have nearly identical hashes.

Hierarchical clustering

A clustering algorithm is a machine learning technique for finding groupings of similar items within a data set. Hierarchical clustering is useful here because it does not require as input a fixed number of expected clusters, as many clustering algorithms do.

Hierarchical clustering builds a tree of all the clusters that could be constructed from the items in the data set, based on a distance measure between them. This tree can be visualized to understand the operation of the algorithm. By choosing a distance threshold, clusters of a desired level of similarity can be constructed by walking the tree.

Here we see a dendrogram visualization of the hierarchical clustering tree produced from the same images as above:

dedupe6

Each index along the x-axis represents an image. The y-axis shows Hamming distance, which is a measure of similarity, with a distance of 0.0 indicating identical images. The black horizontal line marks a distance threshold we have chosen to use for defining clusters. By following the blue line for an image upward from the bottom of the diagram to the root of the tree, we see what distance threshold would be needed to join the image into a cluster with other images in the data set.

For example, we see that image 1 can be joined into a cluster with images 3, 5, 8, and 9 at a threshold of 0.0, which means these images are identical. This is expected because we saw above that these images all have the same perceptual hash.

We see that images 12 and 13 can be joined at a threshold of 0.0156, and can be joined with image 11 if the threshold is increased to 0.0703. This is expected because the only difference between these images appears to be watermarking and re-encoding, and the perceptual hashes are close but not exactly the same.

We also see that the cluster of images 11, 12, and 13 can be joined with the cluster of images 1, 3, 5, 8, and 9 into a single larger cluster if the threshold is increased to 0.229. But are these duplicate images? No, they are are actually distinct images. So we should be sure to choose a threshold lower than 0.229.

Finally, at the root of the tree, we see that if the distance threshold is set to 0.537 it will produce a single large cluster containing all the images.

As this example shows, the distance threshold must be chosen carefully to identify duplicates while minimizing false positives.

Conclusion

By performing hierarchical clustering over pre-computed perceptual hashes, we’re able to quickly and effectively identify the duplicates among the images for a part.

We’ve been pleased with the improvement and hope you find it helpful. If you notice any remaining duplicate images, please send them our way via the “Talk to Us!” box.

As always, we’re going to continue improving the quality of data on Octopart. We’ve already applied similar techniques in deduplicating part description text. Keep an eye out for more...

Read More Articles