Experimental Unicode/UTF8 Support on Windows
This release introduces some experimental build-time optional support for UNICODE pathnames on Windows. Essentially this works by following the model that Linux (and MacOS) have followed for years.
If this code is enabled, then the way ghostscript handles command lines, registry settings, file accesses and other api calls with top bit set characters in (i.e. codes >= 128) will change. The net benefit of this change is that ghostscript will now be able to cope with accessing files with unicode characters (i.e. codes >= 256) in their pathnames.
This behavior is all completely transparent to users, with the exception of those calling the gsapi functions with strings including 'extended ascii' (i.e. characters with codes >= 128 and <= 255). These characters include accented latin characters, such as u + umlaut, a + grave etc.
The changes required for code that is affected by this are relatively minor, but as this is a change to the current API, we are announcing it in advance, and inviting comments.
As of the 9.04 release, the code is disabled. For those who wish to experiment you will need to build Ghostscript from source, and either pass USEUNICODE=1 when you invoke nmake or edit psi/msvc.mak to remove the /DWINDOWS_NO_UNICODE option from CFLAGS.
WARNING: Our intention, subject to feedback, is to enable this by default in near-future releases (hopefully, the next major release). If you make use of the affected APIs you should be prepared for the change to occur - be aware, however, that the current code is experimental and, depending on the feedback we get, maybe subject to change.
NOTE: this whole change refers to file paths, command line parameters and so on - it does not imply that we have unilaterally extended Postscript to understand UNICODE.
To give an example, suppose we have a file 'EXAMPLE' we'd like to invoke ghostscript on, where 'EXAMPLE' is actually a string that contains some characters with codes >= 128.
On Linux (or MacOS X), when ghostscript is called from a shell, e.g.
the command is UTF8 encoded; this means that characters with codes < 128 are left unchanged, and characters >= 128 are encoded into multiple bytes. This encoded string is then passed to the standard 'main' entrypoint in the gs executable.
Ghostscript proceeds internally without any special handling of these multibyte characters at all. When it comes to access files it therefore passes out the UTF8 encoded strings to the standard OS file handling routines. These routines are designed to take pathnames in UTF8 format, and thus the files are accessed as normal.
If the Ghostscript executable outputs these (or other) strings to its stdout, the shell again converts the output from UTF8 back to unicode in order to display it.
The net effect is that the caller can seamlessly pass in unicode filenames, has his fileaccesses work out and gets unicode output without the core of ghostscript ever having to worry about it.
The code change discussed here endeavours to make Windows follow the same pattern as closely as possible.
When Windows executables are invoked, they can either be called through an 'ascii' entrypoint (main), or through a unicode ('wide') entrypoint (wmain). The difference is invisible to the caller, except that unicode executables can accept characters >= 256 in their invocations.
The new code changes ghostscript from being an ascii executable to being a unicode one. The Windows specific outer layer takes the unicode command string and UTF8 encodes it before passing it to the ghostscript core.
Similarly, the Windows specific filing system calls are updated to accept utf8 encoded strings from the core, and to convert them to unicode before operating on them.
The Windows gui app (gswin32.exe, NOT gswin32c.exe) is also updated to convert stdin/stdout between unicode and utf8 as appropriate, allowing unicode strings to be copied/pasted to/from other apps.
All of this should be completely transparent to the user, and no code changes should be required. The one area where changes may be required are where ghostscript is invoked through the gsapi functions.
Currently, on Linux (and MacOS X) any strings sent over the gsapi are assumed to be utf8 encoded (and thus can represent any Unicode character). On Windows, they are assumed simply to be in extended ASCII (and can therefore represent any character < 256 in the current codepage). With the proposed change, Windows will move to be in step with Linux. No differences will be caused to anyone who only uses chars <= 128, but those people using character codes between 128 and 256 (or indeed wanting to use higher codes) will need to utf8 encode the strings before calling gsapi functions.
Such encoding/decoding is a very simple process, and code for both directions can be found in psi/dwmain.c, psi/dwmainc.c and psi/dwtext.c.
Again, we welcome feedback on this feature, in this case problems or suggestions about the implementation can be submitted via Bugzilla but for detailed discussion about the approach for which we opted, it would be more beneficial discuss it (preferably) on our IRC channel #ghostscript on freenode.net, or on the gs-devel mailing list.