One of the main reasons for this is that these programs have to do the same job over and over again, knowing nothing about the files. But XnViewMP "knows". It has the Catalog.
In our discussion, there are three cases for searching for duplicates:
1. On-Disk (the one which XnViewMP does now)
2. Disk-Database (the user has some folders somewhere on disk and wants to check whether there are duplicates of them in the DB)
3. In-Database (the user wants to see if there are duplicates within the database itself)
The problem with the first search is that it is sloooooow.
And this slowness comes from several factors, including the speed of the HDD (which is slow) and, in the case of content comparison, the cost of generating hashes and the Cartesian comparison between them.
However, we already have a DB which can speed up this process by several orders of magnitude.
As you (perhaps) know, in non-hostile environments (like ours, with our photos), one first checks the size and the date/time of the last write, and (perhaps) the filename. Only in the final phase, when all of these match, does the algorithm compare the content of the files by generating hashes.
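To make that order of checks concrete, here is a minimal sketch in Python (my own illustration, not XnViewMP code; SHA-1 is only a placeholder for whatever content hash one would actually use):
Code: Select all
import os, hashlib

def probably_duplicates(path_a, path_b):
    """Staged comparison: cheap metadata checks first, content hash last."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    if st_a.st_size != st_b.st_size:                  # different size -> not duplicates
        return False
    if int(st_a.st_mtime) != int(st_b.st_mtime):      # different last-write time
        return False
    if os.path.basename(path_a) != os.path.basename(path_b):  # optional filename check
        return False

    # Only now do we pay the price of reading both files.
    def content_hash(path):
        h = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return h.hexdigest()

    return content_hash(path_a) == content_hash(path_b)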
Searching for and reading file properties (size, last modified datetime, filename) and comparing them is very slow when one has to read 40770 files from disk, whereas a DB query on the same files (I have them in the DB) returns the duplicates in a fraction of a second. The query is written like this...
Code: Select all
select size, modifieddate, filename, count(*) from images group by size, modifieddate, filename having count(*)>1
(OK, I did some preparation beforehand: I indexed size and modifieddate.)
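For completeness, that preparation was nothing more than creating indexes. A small sketch of it through Python/sqlite3, assuming the catalog is a SQLite file (opened here as 'XnView.db', adjust the path as needed) with the table and column names from the query above; the index names, and the choice of two separate indexes rather than one composite, are my own:
Code: Select all
import sqlite3

# Assumed: the catalog is a SQLite file and the table is called 'images',
# as in the query above. Index names are arbitrary.
con = sqlite3.connect('XnView.db')
con.execute("CREATE INDEX IF NOT EXISTS images_size_idx ON images(size)")
con.execute("CREATE INDEX IF NOT EXISTS images_modifieddate_idx ON images(modifieddate)")
con.commit()
con.close()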
But what about comparing content (byte by byte, i.e. hashes)?
A DB can help here too. OK, Lightroom (and others) calculate the hashes from the beginning (at import) for all files and store them in the DB. But that is not what I am proposing. I'd say that we can be more flexible:
If a file which is in the DB needs a hash, let's calculate it. But after we get it, let's not lose it: store it in a table, along with the ImageID, which is the PK. Then, the next time we do a comparison and reach the same file, we look in the DB first and get the hash ready-made from there. Only if it is missing do we calculate it.
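A minimal sketch of this "check the DB first, compute only when missing" idea, assuming the Hashes(ImageID, Hash) table proposed at the end of this post (SHA-1 again only as a placeholder for the actual hash):
Code: Select all
import sqlite3, hashlib

def get_or_compute_hash(con, image_id, path):
    """Return the stored hash for image_id, computing and caching it if absent."""
    row = con.execute("SELECT Hash FROM Hashes WHERE ImageID = ?", (image_id,)).fetchone()
    if row is not None:
        return row[0]                          # cheap path: no file I/O at all
    h = hashlib.sha1()
    with open(path, 'rb') as f:                # expensive path, taken only once per file
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    digest = h.hexdigest()
    con.execute("INSERT INTO Hashes(ImageID, Hash) VALUES (?, ?)", (image_id, digest))
    con.commit()
    return digest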
But can we improve the process even further?
Sure we can.
Jesse Kornblum, who was a Special Agent with the United States Air Force Office of Special Investigations, showed how the US Government did/does it.
Since we read the file sequentially (in a stream), we can read the first 8-16 KB and calculate a hash for just that part - usually with a very quick hash like CRC32C (also known as the iSCSI CRC), which is implemented directly in modern CPUs. After that, we can read the remainder and calculate a hash for the big chunk / the entire file with a more robust hash like MurmurHash or similar. Because we store these two hashes in the DB, the next time we reach the same file in a comparison there is a very high chance (approx. 95-97 %) that, if the other file we must read is different, the difference will already show up in the hash computed from its first 8-16 KB.
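A rough sketch of this two-hash idea in Python. Neither CRC32C nor MurmurHash is in the standard library, so zlib.crc32 and blake2b stand in for them here; the point is the structure, not the exact algorithms:
Code: Select all
import zlib, hashlib

PREFIX_SIZE = 16 * 1024   # "first 8-16 KB"; 16 KB chosen arbitrarily here

def quick_and_full_hash(path):
    """Return (quick_hash, full_hash) for one file.

    Stand-ins: zlib.crc32 instead of hardware CRC32C, blake2b instead of
    MurmurHash. The full hash here covers the whole file, prefix included."""
    full = hashlib.blake2b()
    with open(path, 'rb') as f:
        prefix = f.read(PREFIX_SIZE)
        quick = zlib.crc32(prefix)                 # cheap hash of the first chunk
        full.update(prefix)
        for chunk in iter(lambda: f.read(1 << 20), b''):
            full.update(chunk)
    return quick, full.hexdigest()
With both values cached in the DB, two files can only be duplicates if their quick hashes match, so most non-duplicates are rejected after reading only the first 16 KB of the new file.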
Hence, in conclusion, we can get very significant speed improvements if the user chooses to search for duplicates in the Database:
1.) We can dramatically reduce the number of 'suspect' files in just 0.2 seconds!
2.) We can store hashes when we calculate them (anyway, the time to store a hash is close to zero compared to the time to calculate it).
3.) As a further optimization we can calculate two hashes, one for the beginning and one for the entire file (or the remainder), in order to further reduce the need to read the content of the new file the next time we must do a comparison.
Beware! I am not saying we should add a new field to the Images table called 'Hash' or whatever. I am saying we should create a new table called 'Hashes' with two fields, 'ImageID' and 'Hash', something like this:
Code: Select all
Create table Hashes(ImageID INTEGER PRIMARY KEY REFERENCES Images(ImageID) ON DELETE CASCADE, Hash TEXT);
Create Index Hashes_Hash_Idx on Hashes(Hash);
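And if point 3 above (two hashes) were adopted, the same table could simply get a second column. This is only one possible variant, my assumption rather than part of the proposal; sketched again via Python/sqlite3:
Code: Select all
import sqlite3

# Possible variant (my assumption): keep both the quick prefix hash and the
# robust full-content hash per image, as suggested in point 3.
con = sqlite3.connect('XnView.db')
con.executescript("""
    CREATE TABLE IF NOT EXISTS Hashes(
        ImageID   INTEGER PRIMARY KEY REFERENCES Images(ImageID) ON DELETE CASCADE,
        QuickHash INTEGER,   -- e.g. CRC32C of the first 8-16 KB
        Hash      TEXT       -- robust hash of the entire file
    );
    CREATE INDEX IF NOT EXISTS Hashes_Hash_Idx ON Hashes(Hash);
""")
con.close()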
Comments? Ideas? Opinions?