Cleaning up file recognition

Ideas for improvements and requests for new features in XnView Classic

Xyzzy
Posts: 652
Joined: Tue Nov 23, 2004 10:17 pm
Location: Poland

Post by Xyzzy »

Generally now the whole 'filetype recognition' thing is definitely too much hassle. The fact that a lot of people have problems with it also indicates that there is a problem to solve.

So here is my solution.

A bit of theory

The file type (image, audio, movie, etc.) tells XnView, for example, whether a file should be displayed in the Browser's file list and how it should be opened (e.g. by XnView or by an associated application).

The only reason for any advanced file type recognition is missing or improper extensions (if the extension is OK, simply reading the file name tells its type). You must admit that this is quite an uncommon problem now, especially in the Windows world. But there can still be situations when someone mistakenly renames his 10,000-item multimedia collection to the .aaa or .jpg extension; such a collection should still be browsable in some way.

Implementation

EDIT: The implementation may seem complicated, but the idea is really simple. The options proposed below may fit in a group box named 'Bad extensions handling'. They simply force recognition by header for the file list, preview and view. Even if header recognition is forced, the file list, preview and view use up-to-date cache information, because the file header is always read anyway when a file is put into the cache (for thumbnail creation and details extraction). Default values for the options below are OFF for the 'Scan headers' options and ON for 'Identify file' (btw, could there be any performance issues here?).

Scanning in this implementation means 'scanning headers of all files'. I am aware that this also delays displaying the file list, because a file can be displayed in the file list only once its file type has been determined (except in All mode). If the file information is already in the cache (i.e. the filename and modification date match the cache), this information is used. The cache can be refreshed with the usual Ctrl+R.
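
A minimal sketch of the cache check described above, written in Python and not taken from XnView's code. The names `CacheEntry` and `cached_type` are hypothetical; the only rule borrowed from the post is that a cached type is trusted when both the filename and the modification date match.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CacheEntry:
    mtime: float      # modification date recorded when the header was read
    file_type: str    # e.g. "image", "audio", "other"


def cached_type(cache: Dict[str, CacheEntry], name: str, mtime: float) -> Optional[str]:
    """Return the cached file type, or None if the header must be read again."""
    entry = cache.get(name)
    if entry is not None and entry.mtime == mtime:
        return entry.file_type
    return None


# Hypothetical cache content: an image that was renamed to a bogus extension.
cache = {"holiday.aaa": CacheEntry(mtime=1137747540.0, file_type="image")}
```

A change that keeps both the name and the date is invisible to this check, which is why a manual refresh (Ctrl+R) is still needed for such cases.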

Options for Browser.

Scan headers for Thumbnails and Details modes: Never, Always, Only on fixed disks, Not on network disks, Not on Floppy/CD/DVD
Scan headers for Icons and List modes: options as above

Two distinct options are needed, because Icons and List modes are supposed to be speedy, while Thumbnails and Details modes are supposed to be detailed. One must remember that XnView decides what to display in the file list based on file type, so scanning headers could significantly slow down display in the 'speedy' modes.

Options for Preview and View.

Identify file on display: Always, Never, On preview, On Open
If the file type is not recognized ('Other') and the Scan header... option for the current mode is OFF, XnView uses header scanning to determine the file type when displaying a preview or viewing the file.
This option can be split into two 'Identify file on display' options placed separately in the Preview and View settings, with values 'Yes' and 'No'.

The current 'Recognize only by extension' and 'Scan file headers for folders' options are removed.
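
The proposed options can be pictured as a small settings model. This is only an illustrative Python sketch: the class and function names are made up, and splitting drives into just three kinds is an assumption (the post does not say how, for example, USB drives would be classified).

```python
from dataclasses import dataclass
from enum import Enum


class DriveKind(Enum):
    FIXED = "fixed"
    NETWORK = "network"
    REMOVABLE = "floppy/cd/dvd"


class ScanPolicy(Enum):
    NEVER = "Never"
    ALWAYS = "Always"
    ONLY_FIXED = "Only on fixed disks"
    NOT_NETWORK = "Not on network disks"
    NOT_REMOVABLE = "Not on Floppy/CD/DVD"


class IdentifyOnDisplay(Enum):
    ALWAYS = "Always"
    NEVER = "Never"
    ON_PREVIEW = "On preview"
    ON_OPEN = "On Open"


@dataclass
class BadExtensionOptions:
    # Defaults as proposed in the post: header scanning OFF, identification ON.
    scan_thumbnails_details: ScanPolicy = ScanPolicy.NEVER
    scan_icons_list: ScanPolicy = ScanPolicy.NEVER
    identify_on_display: IdentifyOnDisplay = IdentifyOnDisplay.ALWAYS


def should_scan_headers(policy: ScanPolicy, drive: DriveKind) -> bool:
    """Decide whether headers are scanned for a folder on the given drive."""
    if policy is ScanPolicy.NEVER:
        return False
    if policy is ScanPolicy.ALWAYS:
        return True
    if policy is ScanPolicy.ONLY_FIXED:
        return drive is DriveKind.FIXED
    if policy is ScanPolicy.NOT_NETWORK:
        return drive is not DriveKind.NETWORK
    return drive is not DriveKind.REMOVABLE  # ScanPolicy.NOT_REMOVABLE
```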

Potential problems
The following situations may require special handling:
- files whose format doesn't use a header and that have no proper extension: the file is always handled as 'Other'; generally a proper extension is required.
- extension collisions (the same extension is used by many file types) AND header scanning is OFF: the file is displayed in every file list view that matches any of the file types indicated by the extension (e.g. if the extension .abc is used for both image and audio files, the file is displayed in both 'image only' and 'audio only' modes). As for the custom view ('Items displayed'), the usual 'most restrictive' rule is used. If at the same time Identify file on display is OFF, the file is handled as 'Other' on display (questionable; maybe force identification on preview/open for such files?).
- extension unknown/not present: if headers are not scanned, these are simply 'Other' files. If at the same time Identify file on display is OFF, the file is handled as 'Other' on display.
- mixed extension (the extension belongs to a file type other than the actual one) AND header scanning is OFF: in the file list the file is displayed according to its extension. If at the same time Identify file on display is OFF, the Open action for the file type indicated by the extension is used; if XnView opens such a file, it should report "Bad file type".
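
For the cases above where header scanning is OFF, the file list decision depends on the extension alone. A rough Python sketch follows; the extension table is invented for illustration, with `.abc` standing in for the colliding extension from the example.

```python
import os
from typing import FrozenSet

# Hypothetical extension table; ".abc" is claimed by two file types.
EXTENSION_TYPES = {
    ".jpg": frozenset({"image"}),
    ".wav": frozenset({"audio"}),
    ".abc": frozenset({"image", "audio"}),
}
OTHER: FrozenSet[str] = frozenset({"other"})


def list_types(filename: str) -> FrozenSet[str]:
    """File types suggested by the extension; unknown or missing means 'Other'."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TYPES.get(ext, OTHER)


def shown_in_filter(filename: str, active_filter: str) -> bool:
    """A colliding extension is shown in every filter it could belong to."""
    return active_filter in list_types(filename)
```

A JPEG renamed to `.wav` is listed as audio under this rule, which is the 'mixed extension' case: only identification on display (or a header scan) can correct it.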

Other issues
- Displaying View when no Browser is present: only Identify file on display applies.
- Handling size limits set in 'Items displayed': size limits do not affect the Open action.
- If the view mode is changed from a non-scanning mode to a scanning mode, the file list is rescanned with the current settings. If the change is the other way round, identification information is retained until the directory changes.
- If the cache contains information on file types, it is used regardless of settings. Ctrl+R is required to refresh the cache after undetectable changes (i.e. changes that retain the filename and don't change the Modified date).
- The current possibility of opening HexaView for Other files opened in XnView also fits nicely into the whole concept.

X.
Last edited by Xyzzy on Fri Jan 20, 2006 8:59 am, edited 1 time in total.
Olivier_G
XnThusiast
Posts: 1423
Joined: Thu Dec 23, 2004 7:17 pm
Location: Paris, France

Post by Olivier_G »

Xyzzy: I have to admit that I had a hard time following you...
...but after thinking more and more about it, I came up with another solution on the same subject.

The idea is to make several passes to check files:
1. Use only extensions that match chosen filetypes -> instant display
2. Scan Headers of matching extensions (file if no header exists?) -> update/remove in background
3. Scan Headers of non-matching extensions (file if no header?) -> update/add in background

You would get instant display (1). Files confirmed would get a darker color while scanning (2) to show progress (wrong files would be removed). Good files with wrong extension would then be added to the list (3). Of course the Cache would be used for fast confirmation (ie: matching filename +modified date +filesize). For Popup/Preview/View/Open, the file would immediately be scanned for confirmation, if it hasn't been scanned yet.
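
The three passes can be sketched roughly as follows, in Python. This is an illustration of the idea only: the event names and the two lookup callbacks are hypothetical, and keeping a file whose format has no header is an assumption, since that case is still an open question above.

```python
import os
from typing import Callable, Iterable, Iterator, Optional, Tuple

Event = Tuple[str, str]  # (action, filename)


def three_pass_listing(
    names: Iterable[str],
    wanted: str,
    ext_type: Callable[[str], Optional[str]],
    header_type: Callable[[str], Optional[str]],
) -> Iterator[Event]:
    names = list(names)
    matching = [n for n in names if ext_type(n) == wanted]
    # Pass 1: instant display based on extension only.
    for n in matching:
        yield ("show", n)
    # Pass 2: verify the displayed files by header; remove the wrong ones.
    for n in matching:
        actual = header_type(n)
        if actual == wanted:
            yield ("confirm", n)
        elif actual is not None:
            yield ("remove", n)
        # actual is None: no header found, keep the file (assumption).
    # Pass 3: scan everything else and add good files with wrong extensions.
    shown = set(matching)
    for n in names:
        if n not in shown and header_type(n) == wanted:
            yield ("add", n)


# Made-up directory: b.jpg is really audio, c.wav is really an image.
headers = {"a.jpg": "image", "b.jpg": "audio", "c.wav": "image", "d.wav": "audio"}
by_ext = {".jpg": "image", ".wav": "audio"}
events = list(three_pass_listing(
    headers, "image",
    lambda n: by_ext.get(os.path.splitext(n)[1].lower()),
    headers.get,
))
```

With this data the list first shows a.jpg and b.jpg, then confirms a.jpg and removes b.jpg, and finally adds c.wav.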

I may be wrong somewhere... but I don't see any potential problem with this system and I think it provides all advantages (speed, verification) with little - if any - drawbacks. Moreover, no option at all would be needed.

Your opinion?

Olivier
Xyzzy

Post by Xyzzy »

One thing comes to mind: do we really REALLY need any header scanning?
If one messes up tons of files, it is much better to scan them for header text and rename them to proper extensions with some file manager.
And XnView should always use header scanning on preview/display for non-recognized files.
If one gets some unrecognized files, he should turn on Others display, look up the files and rename them anyway.

Such multipass scanning that cannot be turned off is unacceptable in the case of network drives. Also, files appearing in or disappearing from the file list out of nowhere, because header scanning determined that they should or should not be displayed, would be imo very confusing.

X.
xnview
Author of XnView
Posts: 45963
Joined: Mon Oct 13, 2003 7:31 am
Location: France

Post by xnview »

Xyzzy wrote: One thing comes to mind- do we really REALLY need any header scanning?
Yes, many users like that; they use non-standard extensions for their pictures...
Pierre.
Xyzzy

Post by Xyzzy »

The PROBLEM is that the current options somehow work.
But no one seems to be able to say what they affect and how, and there is no documentation for them.

I believe my proposal defines clear and understandable rules for file list, preview and view display. IMO the current handling contains serious inconsistencies, and it should go "back to the drawing board".

Example: Recognize only by extension OFF, Scan headers ALWAYS, JPEG file renamed to WAV, Open audio in associated editor, Cache cleared and disabled:

- It is not displayed with the Image filter. :bug: It should be recognized by header scanning. If you relied on the extension here, that's a bug; these options are supposed to handle exactly such cases, so the correct file type should be determined.
- The Open action uses the action for audio files. :bug: The file should be recognized as an image.

And so on, and so on, and so on.
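
To make the expected behaviour in this example concrete, here is a small Python sketch of recognition by header only. It is not how XnView does it; the function names are hypothetical and only a few well-known signatures are checked (JPEG, PNG, GIF and RIFF/WAVE).

```python
def sniff_type(header: bytes) -> str:
    """Classify a file from its first bytes, ignoring the extension entirely."""
    if header.startswith(b"\xff\xd8\xff"):
        return "image"   # JPEG
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image"   # PNG
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "image"   # GIF
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "audio"   # WAV
    return "other"


def identify_file(path: str) -> str:
    """Read just enough of the file to check the signatures above."""
    with open(path, "rb") as f:
        return sniff_type(f.read(12))


# First bytes of a JPEG that was renamed to .wav, and of a real WAV file.
jpeg_named_wav = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
real_wav = b"RIFF\x24\x00\x00\x00WAVEfmt "
```

Because the name never enters `sniff_type`, the renamed JPEG is classified as an image, so it would appear under the Image filter and be opened with the image action.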

X.
Olivier_G

Post by Olivier_G »

What about having just:

Extensive file checking [ ]
...which would add my suggested 2 passes when set ON (with a progress bar)?
(default=OFF would mean using file extension only)

You would get the standard & instant behaviour.
And if you really want to verify which files can be seen, you would check that option to get the extensive search (without having to wait for the whole scan => instant response on extension, update would be as fast as scanning all headers).

Olivier
Xyzzy

Post by Xyzzy »

First, I think that the header scanning function should scan ONLY headers and IGNORE extensions. That's what it is for: identifying misnamed files; how can we use extensions when they are supposedly bad? *) If you rely on extensions here, as in step 1 of your scanning, you miss the point. Also, I do not see any point in dividing the scan into two passes depending on extensions. We presume that extensions are bad, so why use any extension info?
Using a single option is simply like merging my 'Scan headers for <mode>' options together. What about my reasoning for splitting them? Is it bad?

You do not talk about preview/View display for still unidentified files (e.g. when header scanning is not used). It is an important usability feature: it enables browsing files with bad extensions without the need for header scanning (because of, e.g., a large number of big files on a network drive).

BTW, thorough header scanning is not supposed to be speedy, it is supposed to be accurate- again, it is what it's for.

*) Using cache for already identified files and filetype-with-no-header rules apply.

X.
Olivier_G

Post by Olivier_G »

Xyzzy wrote:If you rely on extensions here, like in 1. step of your scanning, you miss the point. Also, I do not see any point in dividing scan into two passes depending on extensions. We presume that extensions are bad, so why use any extension info?
The point is to implement a single method that is well designed enough to handle all situations in the best way possible.

As you said, file extension alone should be very accurate. So why wait for scanning all headers when 95% of the job is immediately available with file extension?
About considering extensions for scanning headers: First, I think it's important to remove problematic files as soon as possible... including the ones that have the right extension but for a different filetype. Second, I believe removal of files is more bothersome than addition of files, hence the need to handle them first. But this is just a small opinion...
Xyzzy wrote:Using single option is simply like merging my 'Scan headers for <mode>' options together. What about my reasoning for splitting them? Is it bad?
BTW, thorough header scanning is not supposed to be speedy, it is supposed to be accurate- again, it is what it's for.
Let's imagine we access a slow network drive (scan/extensive set ON)

                                                  Xyzzy                    Olivier_G
0. enters remote drive                              X                         X
1.(fast) can see first files based on extension                               X
2. wrong files are removed                                                    X
3. complete verification (final display)            X                         X

Olivier_G can see, navigate, and maybe load the first files (95% of the job?) much more rapidly than Xyzzy. Of course, Olivier_G can also wait for the end of the progress bar if he wants to, in order to get that final display, which is as accurate as Xyzzy's (but he doesn't have to change an option or even decide beforehand: he does it the way he wants).
That is also true when accessing a large number of files on USB drives, very large directories on local drives, or CD/DVD/Floppies (I believe Windows caches the name/extension -> 95% is done before the disc even spins).

For usual use on a local drive, Xyzzy gets the final display after a very small delay, whereas Olivier_G gets a quick 'flickering' for bad files (which may be bothersome... but is also an indication that some files are problematic and may require further action).

So after explaining that, I don't see the need to implement a different setting for Thumbnails/Details vs List/Icons, be it on fixed disks, network, CD, etc...
Xyzzy wrote:You do not talk about preview/View display for still unidentified files
...because I already agreed on that:
Olivier_G wrote:For Popup/Preview/View/Open, the file would immediately be scanned for confirmation, if it hasn't been scanned yet.
Olivier
Xyzzy

Post by Xyzzy »

Olivier_G wrote:The point is to implement a single method that is well designed enough to handle all situations in the best way possible.
I think you are wrong here. The point is to provide a method of handling files with wrong extensions, while maintaining maximum speed in everyday work. Header scanning is not an everyday tool. It is for handling special situations -> so it is inactive in a normal environment.

You cannot expect one solution to be the best for all situations, because situations with file layouts differ too much. The best you can come up with proposing one-size-fits-all solution is something in the middle, that satisfies very few users- the rest will complain about either speed or compatibility. (Or flickering on display.) And the worst thing is that it cannot be changed, neither by those who want speed nor by those who want compatibility.

My 'header scanning options' are rather like troubleshooting options, not something you use every day. As I wrote, they can be put into 'Bad extension handling' group, that a priori means turning on for some special situations and off in everyday use.

As for your file list preview: I compare this to the current 'Delay high quality display'. Nobody likes it, and it's here just because HQ images cannot be displayed faster; it is not a real solution.

Also I see that you do not handle 'mixed extensions' (e.g. jpg renamed to wav) if you want to rely on extensions in the first step. Header scanning is supposed to be accurate, and by using any extension information you defeat the purpose of this option.

In my design cache information for files already identified is used, and that speeds up recognition (you can always use Ctrl+R if cache is unreliable). You could say that 'header scanning' task is to put appropriate info into cache for files with wrong extensions.

Also, in header scan mode, file display does not need to wait until the whole directory has been checked; it can start right after a file's type is determined, so the first items are displayed as they are identified.
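
A minimal Python sketch of this linear approach, assuming a hypothetical `read_header_type` callback and a cache keyed by filename only (the proposal also compares the modification date, which is left out here for brevity). Each item is produced as soon as its type is known, and cached types avoid repeated header reads.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def linear_listing(
    names: Iterable[str],
    wanted: str,
    read_header_type: Callable[[str], Optional[str]],
    cache: Dict[str, str],
) -> Iterator[str]:
    """Yield matching files one by one, in directory order."""
    for name in names:
        file_type = cache.get(name)
        if file_type is None:
            # Not identified yet: read the header once and remember the result.
            file_type = read_header_type(name) or "other"
            cache[name] = file_type
        if file_type == wanted:
            yield name


# Made-up directory content and a callback that records every header read.
headers = {"a.jpg": "image", "b.jpg": "audio", "c.wav": "image", "d.wav": "audio"}
reads: List[str] = []


def read_header_type(name: str) -> Optional[str]:
    reads.append(name)
    return headers.get(name)


cache: Dict[str, str] = {}
first = list(linear_listing(headers, "image", read_header_type, cache))
second = list(linear_listing(headers, "image", read_header_type, cache))
```

The second listing is served entirely from the cache, so no header is read twice.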

As an extension to my method, I could propose something along the lines of "use header scanning for unknown extensions", but as it can help only in cases of unknown extensions, I don't find it really useful, and it would confuse the user even further.

EDIT: There is a saying here- "If something is designed for everything, it is useful for nothing".

X.
Olivier_G

Post by Olivier_G »

First: specific points, to clarify things a bit...
Xyzzy wrote:And the worst thing is that it cannot be changed, neither by those who want speed nor by those who want compatibility.
It is proposed as an option, as I considered your comment (=> extension only OR scan all headers).
Xyzzy wrote:As for your file list preview - I compare this to current 'Delay high quality display'
I would not compare this update of problematic files only (ie: you would get no change at all in a normal situation) to a complete change of display. If user doesn't want the slightest chance of update, he can simply turn the option 'Extensive file checking' off.
By the way: your own suggestion to display items as they are identified while scanning implies even more updates.
Xyzzy wrote:I see that you do not handle 'mixed extensions' (jpg renamed to wav) - if you want to rely on extensions in first step. Header scanning is supposed to be accurate, and by using any extension information you deny purpose of this option.
Huh??? If my option is ON, a JPEG file renamed to .wav will be scanned and shown correctly in step 3, as explained. It is accurate...

=> I really wonder whether my suggestion has been correctly understood (in particular: 1/2/3 are not options... they are the 3 steps when option 'Extensive file checking' is ON). :?:


More general:
Xyzzy wrote:I think you are wrong here. The point is to provide a method of handling files with wrong extensions, while maintaining maximum speed in everyday work. Header scanning is not every day tool. It is to handle special situations -> so inactive in normal environment.
You cannot expect one solution to be the best for all situations, because situations with file layouts differ too much. The best you can come up with proposing one-size-fits-all solution is something in the middle, that satisfies very few users- the rest will complain about either speed or compatibility.
There is a saying here- "If something is designed for everything, it is useful for nothing".
My suggestion is to keep speed and provide accuracy at the same time, at the expense of small updates when waiting for accuracy. I believe the drawbacks are so low that it should be the default behaviour, to get speed AND accuracy.
About sayings: they are useful rhetoric tools... but I would never consider them to limit my own mind. You know where I stand about them, now... :mrgreen:

Olivier
Xyzzy

Post by Xyzzy »

Olivier_G wrote:
Xyzzy wrote:And the worst thing is that it cannot be changed, neither by those who want speed nor by those who want compatibility.
It is proposed as an option, as I considered your comment (=> extension only OR scan all headers).
OK, that's better. In your post dated Wed Jan 18, 2006 9:32 it looked to me as the only behaviour.
Olivier_G wrote:I would not compare this update of problematic files only (ie: you would get no change at all in a normal situation) to a complete change of display. If user doesn't want the slightest chance of update, he can simply turn the option 'Extensive file checking' off.
By the way: your own suggestion to display items as they are identified while scanning imply even more updates.
But the user wants header scanning AND no strange updates. Linear appearance of items (one after another) is quite different from displaying a window full of items and then adding and deleting some of them. You may call displaying every item an update if you want, but then there will always be as many updates of the file list as there are items. BTW, if in a normal situation there is no change at all, as you write, why do you want to make your option, which requires additional operations, the default?
Olivier_G wrote:Huh??? If my option is ON, a JPEG file renamed to .wav will be scanned and showed correctly in step 3, as explained. It is accurate...
Oh yes, you are right, it will pop up out of nowhere after the header has been scanned. That can be called confusing behaviour. The application first decides to hide it, then to show it...
Olivier_G wrote: My suggestion is to keep speed and provide accuracy at the same time, at the expense of small updates when waiting for accuracy. I believe the drawbacks are so low that it should be the default behaviour, to get speed AND accuracy.
This is not a problem (keeping speed and accuracy). The problem is handling incorrect extensions.

Anyway, I think that a solution such as yours is not needed. There is no need for a header scanning solution that is ON all the time. Files very rarely get incorrect extensions. For these cases we need something simple yet effective.

Why work on some elaborate option if it is supposed to be used rarely, as a troubleshooting tool? Why settle for less speed (by making it the default option), when users generally do not need this added accuracy but do require speed? BTW, in this beta cycle there was AT LEAST one report of file list flickering -> just hiding items that were displayed before identifying them. Why turn on a "compatibility" option by default? It would be like permanently setting "Run in Windows 98 compatibility mode" on XP: more potential problems than gains.
Why not give the user what he wants when he wants it, all the speed or all the precision, rather than something in between?
Finally, what usage scenario would benefit from such multipass scanning?


I personally wouldn't use the option because:
- long file list scanning (reading directory 3 times)
- operation is in fact completed multiple times, and there can be 2 updates to already presented filelist- confusing for me- I do not know when the file list is in final version.
- headers are scanned always, even if it is not needed
- already cached info is not used
- it is complicated; harder to understand- harder to use. Better use something simpler that gives easy predictable results.

X.
Olivier_G

Post by Olivier_G »

Xyzzy wrote:- Why settle for less speed?
- Long file list scanning (reading directory 3 times)
- Why not give the user what he wants when he wants it, all the speed or all the precision, rather than something in between?
But where is my system slower than 'use extension only'??? (my "reading 3 times" is as intensive as a single scan, it just presents things quicker...)
Why do you say that it is less accurate than 'scan all headers'? It is not.
Xyzzy wrote:- I do not know when the file list is in final version.
- headers are scanned always, even if it is not needed
- already cached info is not used
-> progress bar
-> Huh??? headers are used to check files. I don't get it... :?
-> Of course Cache is used, as I said.


So basically, we don't agree on ONE thing with headers scans:
- You think that 'progressive but slow' display is better
- I think that 'fast but jumping' display is better
I tried to be as objective and cynical for both... :D)

It would be interesting to get others' comments on this.

This being said... I feel satisfied with your suggestion. I just think that there are more advantages in my own suggestion. If more people oppose that removing/adding behaviour and favor no instant display, I won't support it any longer.

Olivier
Xyzzy

Post by Xyzzy »

OK, so, without nitpicking, there are two approaches to header scanning (I omit things that are the same):

mine- linear reading and display of files (simpler to implement, faster FINAL display- accurate, somewhat more natural)
yours- 3 pass reading and display of files (more complicated and error prone, faster PREVIEW display- may be inaccurate, more fancy- reading in background)

Still, I strongly oppose making any header scanning the default option, because the need for speed is orders of magnitude greater than the problem of bad extensions.

BTW, reading in 3 passes is not the same as one read and slower, even if in every pass other files are scanned.

X.