Near-Realtime Search for Duplicates

m.Th. · Post by **m.Th.** » Thu Jul 18, 2013 8:13 am

There are a bunch of programs to search duplicate files and all of them are slow.
One of the main reasons for this is that these programs need to do over and over again the same job, not knowing anything about the files. But XnViewMP "knows". It has the Catalog.

In our discussion, there are 3 cases for Search of Duplicates:

1. On-Disk (the one which XnViewMP do now)
2. Disk-Database (the user has some folders somewhere on the disk and wants to search if there are duplicates of it in the DB)
3. In Database (the user wants to see if there are duplicates in the database)

The problem with the first search is that it is sloooooow.

And this slowness comes from several factors, including the speed of HDD (which is slow) and, in the case of comparing content, the slowness of generating hashes and the cartesian compare between them.

However, we have already a DB which speeds up this process several orders of magnitude.

AS (perhaps) you know, in non-hostile environments (like we are with our photos), first one checks for size and time/date of the last write. And (perhaps) checks for filename. And only in the final phase when all these are the same, the algorithm tries to compare the content of the file by generating hashes.

While, searching and reading file properties (size, last modified datetime, filename) and comparing them is very slow when one reads 40770 files from disk, while a DB query on the same files (I have them in the DB) - query which returns the duplicates and is written like this...

Code: Select all

select size, modifieddate, filename, count(*) from images group by size, modifieddate, filename having count(*)>1

returns 2515 duplicates from 40770 files in 0,221 secs(!). There is no disk search on Earth which can beat this!

(ok, I did some preparations: I indexed before the size and modifieddate )

...but what about comparing content (byte-by-byte ie. hashes)?

Also here a DB can help. Ok, Lightroom (and others) calculates the hashes from the beginning (at import) from all files and stores them in the DB. But I don't say that. I'd say that we can be more flexible:

If a file which is in the DB needs a hash, let's calculate it. But after we got it, do not loose it. Store it in a table, along with the ImageID which is the PK. Hence, next time when we do a comparison and we'll reach to the same file, we'll look in the DB first and get the hash ready from there. If is missing, then calculate it.

But can we improve the process more?

Sure we can.

Jesse Kornblum which was a Special Agent with the United States Air Force Office of Special Investigations

show how the US Goverment did/do it.

Since we have read the file sequentially (in a stream), we can read the first 8-16k and calculate a hash for these - usually this is calculated with a very quick hash like CRC32C (also known as iSCSI CRC) which is implemented directly in modern CPUs. And after that, we can read the remainder and calculate a hash for the big chunk/entire file with a more robust hash like murmur or similar. Because we store these two hashes in the DB, next time when we reach the same file for the comparison there is a very high change (approx 95-97 %) that the other file which we must read, if it is different, to show this difference from the hash which we got from first 8-16k.

Hence, in conclusion, we can do very important speed improvements if the user chose to search duplicates in the Database:

1.) We can dramatically reduce the number of 'suspect' files in just 0,2 seconds!
2.) We can store hashes when we calculate them (anyway the time to store a hash is near to 0 compared to time to calculate it)
3.) As an further optimization we can calculate two hashes: one for the beginning and one for the entire file (or remainder). In order to reduce further the need to read the content of the new file next time when we must do a comparison.

Beware! I don't say to add a new field in the Images table called 'Hash' or whatever. I say to create a new table called 'Hashes' with two fields: ImageID' and 'Hash something like this:

Code: Select all

Create table Hashes(ImageID INTEGER  PRIMARY KEY REFERENCES Images(ImageID) ON DELETE CASCADE, Hash TEXT);
Create Index Hashes_Hash_Idx on Hashes(Hash);

Also, is needed a trigger in Images saying if old.ModifiedDate<>new.ModifiedDate then delete from Hashes h where h.ImageID=New.ImageID;

Comments? Ideas? Opinions?

oops66 · Post by **oops66** » Thu Jul 18, 2013 9:24 am

+1, I support . I have already asked for this kind of thing to quickly find similar files (via a md5 checksum)

m.Th. · Post by **m.Th.** » Thu Jul 18, 2013 9:37 am

oops66 wrote:+1, I support . I have already asked for this kind of thing to quickly find similar files (via a md5 checksum)

Yep, MD5 is good. But there are better alternatives: http://www.strchr.com/hash_functions

Post by **xnview** » Thu Jul 18, 2013 10:05 am

Ok, but if we search on subfolder that are not in the DB??

m.Th. · Post by **m.Th.** » Thu Jul 18, 2013 11:09 am

xnview wrote:Ok, but if we search on subfolder that are not in the DB??

A quick answer now (perhaps i'll have more): Fallback on disk to take the filenames, sizes, and dates and put them in a Temporary (in-memory) Table (sqlite has this) index them and let the sqlite to do it's job. (perhaps via JOINs or something similar)

JohnFredC · Post by **JohnFredC** » Thu Jul 18, 2013 12:23 pm

I like these ideas.

Perhaps a hash could be calculated from the thumbnail image when it is retrieved or created?

m.Th. · Post by **m.Th.** » Thu Jul 18, 2013 5:41 pm

JohnFredC wrote:I like these ideas.

Perhaps a hash could be calculated from the thumbnail image when it is retrieved or created?

...hehehe... smart!

Nice idea. I was thinking at this some time ago, especially because the CRC32C (iSCSI CRC) is implemented in CPUs from several years. But because now almost all graphic files are compressed, hence the size + last write datetime stamp is a very strong "hash" I think that is not worth it to store all the hashes for a mere 1% from files which have the same size and datetime stamp. (Usually these come from batch converts which output uncompressed TIFF of the same size. )

If someone has another opinion please chime in.

m.Th. · Post by **m.Th.** » Sat Jul 20, 2013 9:25 am

Ok, let's go to the next step:

We have two things:

1. The gain which is given by the data already entered in the DB
2. The cost which is paid when we enter the data in the DB (even temporarily)

The first must offset the second and that's why the user must tell which folders are (mostly) in the db and which not. (...or perhaps we can find this out by ourselves? - I think yes, by querying the Folders table). I think that the GUI should be updated somewhat like this (scroll down to see all the pages of the wizard):

Duplicates.png

Hence we have three situations WRT the files which we want to compare:

Nothing is in the Database
Here we'll use the current (On-Disk) engine

Everything is already in the Database
Here we'll use the super-fast sqlite's grouping engine to reduce the number of comparisons at very few ( ~ 1% margin of error) from files.

Some (Most) of the files are already in the Database
Here we'll enter the files which aren't already in the DB in a temporary table (we'll set the pragma synchronous to none in order to be the fastest possible), index them and after that we'll do a JOIN on size,datetime and (possible) filename (COLLATE NOCASE) in order to get the possible duplicates.

After all this soup, for the smallish remainder of files -> Hashes.

Of course, all the above are for exact matches which can be greatly enhanced. For the 'Same picture content' option (ie. fuzzy search) we're bound to have a sequential scan / comparison, except, of course, if the user chooses to check the "Compare file name".

What do you think?

m.Th. · Post by **m.Th.** » Mon Aug 12, 2013 6:02 am

Add two check-boxes ("Search on Disk" and "Search in Catalog") on 1st page and hide the lists of folders. The lists will became visible when the corresponding checkbox will be checked:

Duplicates_2.png

Besides of having a much more clean landing page (which will help in user's experience) we can hide the second tab (called "What to compare") and show it only when both checkboxes will be checked.

In this way the workflow will be more streamlined.

m.Th. · Post by **m.Th.** » Wed Aug 14, 2013 6:39 am

Another missing bit - which is important also for the main search engine: Search only in database. This will speedup the things a lot because the program will not need to search in un-cataloged folders anymore. This complies with the common practice among eg. photographers (and not only) which have a static archive which doesn't change. In this case, any disk fallback in searching of potential new/changed items is a waste of time.

Of course this must be optional because there are cases in which one does not search in a static file structure.

Here is how ACDSee implements it: (scroll to see the white popup menu)

ACDSee Search.jpg

JohnFredC · Post by **JohnFredC** » Wed Aug 14, 2013 4:22 pm

I really enjoy your deep, well presented posts, and generally agree with most of what you say.

But something you said in your previous post has me thinking:

We have two things:

1. The gain which is given by the data already entered in the DB
2. The cost which is paid when we enter the data in the DB (even temporarily)

The first must offset the second

Would you clarify further what you meant by "The first must offset the second"?

m.Th. · Post by **m.Th.** » Wed Aug 14, 2013 5:53 pm

JohnFredC wrote:I really enjoy your deep, well presented posts, and generally agree with most of what you say.

But something you said in your previous post has me thinking:

We have two things:

1. The gain which is given by the data already entered in the DB
2. The cost which is paid when we enter the data in the DB (even temporarily)

The first must offset the second
Would you clarify further what you meant by "The first must offset the second"?

Let's take a typical black-white use case

Reminder:
Black-white is a very common scenario in DAM (Digital Asset Management) in which a (team of) user(s) has a black area (ok, one or more folders) which is uncategorized, unsorted (etc.), highly changing, not (yet) browsed by the program, with a bunch of files with zero, low or temporary importance together with valuable files - in a word: a fuzzy, twilight zone. Examples: the last gazillion of downloaded media from Internet, my last downloaded photos from my dSLR etc.

In contrast, the white area is my Collection. My Archive. My (more or less) organized folder(s) tree with my valuable files. In time it grows much much bigger than the black area, It has a very low changing rate (if any) but one wants to have a very high search quality here. Some DAM programs are white-only - eg. Lightroom - which, in order to process any digital media, this media must be imported in the DB (in the "Catalog" in Lr's terminology).

As you figure out already, usually the black area maps to the folders which are outside of DB and the white area maps to the files which are already in DB.

Well, now let's go back to the text which you quoted:

1. The gain which is given by the data already entered in the DB

This is obvious for the white area since - as you know -

A. the DB searches are thousand times faster than on-disk searches and
B. the GUI can be much more powerful here - eg. it can provide the possible values to search - see in the right pane the 'Auto Categories' section in this screenshot from ACDSee:

http://newsgroup.xnview.com/viewtopic.p ... 7f#p113521

Now compare that simple list of possible apertures in which one can simply click with the big problem of an empty text box in which the user will enter... what? f/1.8? F1.8? 1.8? 1,8? ...and what are the possible values for apertures? (as you know, and as you see in the above screen shot the Aperture scale isn't linear)

2. The cost which is paid when we enter the data in the DB (even temporarily)

With all the big gains explained at the point 1. there appears the idea of "quickly" entering the data in the DB and use the DB facilities to do the search (in our case search for the duplicates).

But there is a cost. A big one. The time to enter all the data which we want to process in DB. However, normally, this time is spread over days and days in building our archive, hence this time is dissolved (it "disappears") in the time of generating thumbnails, file copy etc. In the case of the white zone there is no time overhead.

...and this time, to "quickly enter" the data in DB in NORMAL(1) way can be bigger for large amounts of data than the time which we gain by using in-database search (SQL queries etc.).

The good news are that usually the users know where their Collection is (in fact, Lr is based exactly on this fact: that you know which is your Archive / Collection).

Hence, we can easily solve this problem by having a GUI in which we can say 'Search in the entire Catalog / DB / Archive' or, for more advanced users / advanced search an optional: "Enter the Catalog folders in which you want to search".

This is one of the main reasons for which we need to have two search engines adapted to the logical location of file info (name, size, metadata etc.): on-disk (which we already have, even if we can improve it(1)) and in-database (described shortly above).

It is clearer now?

----------
(1): An obvious improvement for the on-disk search is to have a flat(simple) temporary table in which the data will be entered in Sync=off mode (the fastest) and after that the indexes will be rebuilt. Then we can do fast searches on this and, also, compare / aggregate the results with the in-database search.

jadO · Post by **jadO** » Sun May 13, 2018 5:16 am

This suggestion is now 5 years old. Anything like that implemented into XnviewMP already?

XnView Software

Near-Realtime Search for Duplicates

Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates

Re: Near-Realtime Search for Duplicates