UTF-8 (page 2)

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:bshas9$1kgh$1@digitaldaemon.com... > Walter wrote: > > D currently can handle UTF-8 (unicode) strings, but it does not handle shift-JIS strings. To make shift-JIS work in your programs, you'll need to > > write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to > > shift-JIS on output. I intend to do this for all code pages, but have not > > written those filters yet. > > I don't think you have to. The C runtime library provides functions to convert from wide char to the local code page and vice versa. We can use those for conversions of this kind. > > I know I'm repeating the same stuff over and over, but maybe this real world example has shifted your position somewhat. I REALLY think UTF-8 strings should not use the "char" type. The CRT expects chars to be encoded in the local code page, so this will lead to all kinds of confusion when you mix C functions with D functions. The latter expects UTF-8, the former the local code page, but both use the same type. Actually, if you get right down to the definition, they use different types with the same name and none of the type-safety features you expect from a typed programming language! > > It would be a lot easier if the types had different names. I think you have some mileage in this idea. Walter, internationalisation is an issue so fraught with confusion and misunderstanding that I would suggest it is a +ve step to have new, and ugly, types with which to deal with the different coding schemes. No-one who does not understand it should go near such things. Naturally that leads us to the position where all these issues must be handled by the language for us, so people (which I think includes just about all of us) who do not understand the issues do not need to care and yet can still write correct programs.

> I don't have any criticisms to make of your internationalisation postulates,
> but the above code is *not* the right way to write A/W flexible functions in
> Win32.

Ok, it's not the right way. I neglected the OS-version check.

    OSVERSIONINFO osVersion;
    // note: dwOSVersionInfoSize must be set to the structure size before this call
    GetVersionEx(&osVersion);
    if(osVersion.dwPlatformId==VER_PLATFORM_WIN32_NT){
        //try W-version API
    }else{
        //try A-version API
    }

But my code works because Windows9x has entries for some W-version APIs (they return FALSE).
(My real intention is to use only the W-version API for Phobos.)

> For example, what do I get when I retrieve the win32 error code from the FileException?

The FileException of the present Phobos has "errno".
(I do not know well whether this usage is right...)

Thanks.
YT

> The C runtime library provides functions to convert from wide char to the
> local code page and vice versa. We can use those for conversions of this kind.

Oh! I did not know. I don't have a good knowledge of C. Thank you.
We can use setlocale, mbstowcs, wcstombs, etc.

By the way, excuse me. Are these different?
Only in the second case, I can get the right result.

    setlocale(LC_ALL, null); (returns "C")
    setlocale(LC_ALL, "");   (returns "Japanese_Japan.932")

> I know I'm repeating the same stuff over and over, but maybe this real
> world example has shifted your position somewhat. I REALLY think UTF-8
> strings should not use the "char" type. The CRT expects chars to be encoded
> in the local code page, so this will lead to all kinds of confusion when you
> mix C functions with D functions. The latter expects UTF-8, the former the
> local code page, but both use the same type. Actually, if you get right down
> to the definition, they use different types with the same name and none of
> the type-safety features you expect from a typed programming language!
>
> It would be a lot easier if the types had different names.

I agree.
YT
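For readers who want to see what "different names" could buy, here is a minimal sketch in C. It is only an illustration of the idea discussed above, not anything D or Phobos provides: the type names `utf8_text` and `codepage_text` and all four functions are hypothetical. Wrapping the same `const char *` in two distinct struct types lets the compiler reject code that hands a code-page string to a function expecting UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Two hypothetical wrapper types around the same raw pointer.
   Because they are distinct struct types, they cannot be mixed up silently. */
typedef struct { const char *bytes; } utf8_text;      /* bytes are UTF-8 */
typedef struct { const char *bytes; } codepage_text;  /* bytes are in the local code page */

/* Explicit "I promise this is UTF-8" constructor. */
utf8_text as_utf8(const char *p) {
    utf8_text t = { p };
    return t;
}

/* Explicit "I promise this is local code page text" constructor. */
codepage_text as_codepage(const char *p) {
    codepage_text t = { p };
    return t;
}

/* Accepts only UTF-8 text: counts code points by skipping
   continuation bytes, which have the bit pattern 10xxxxxx. */
size_t utf8_count_chars(utf8_text t) {
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)t.bytes; *p; p++) {
        if ((*p & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Accepts only code-page text: without knowing the code page,
   the only safe measurement is the byte count. */
size_t codepage_count_bytes(codepage_text t) {
    return strlen(t.bytes);
}
```

With these definitions, a call such as `utf8_count_chars(as_codepage(s))` fails to compile, which is the kind of type safety the quoted post asks for.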

Y.Tomino wrote:
> > The C runtime library provides functions to convert from wide char to the
> > local code page and vice versa. We can use those for conversions of this kind.
>
> Oh! I did not know. I don't have a good knowledge of C. Thank you.
> We can use setlocale, mbstowcs, wcstombs, etc.

Actually, I was only referring to converting from/to the current code page, not changing the code page. I'm not sure whether all possible code pages are supported on all systems.

> By the way, excuse me. Are these different?
> Only in the second case, I can get the right result.
>
> setlocale(LC_ALL, null); (returns "C")
> setlocale(LC_ALL, ""); (returns "Japanese_Japan.932")

The first one does not change anything. It simply returns the current locale settings. From the CRT docs:

"""
The null pointer is a special directive that tells setlocale to query rather than set the international environment.
"""

Hauke
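The difference between the two setlocale calls, and the conversions mentioned earlier in the thread, can be sketched in C roughly as follows. The wrapper names (`current_locale`, `use_environment_locale`, `local_to_wide`, `wide_to_local`) are made up for this illustration; only `setlocale`, `mbstowcs` and `wcstombs` are real CRT functions. Which locale names the second call returns depends on the system.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Query only: a NULL locale argument asks setlocale to report the
   current setting without changing it. At program start this is "C". */
const char *current_locale(void) {
    return setlocale(LC_ALL, NULL);
}

/* Set: an empty string asks setlocale to adopt the user's environment
   locale. Returns the new locale name, or NULL if it cannot be honoured. */
const char *use_environment_locale(void) {
    return setlocale(LC_ALL, "");
}

/* Convert text in the current locale's multibyte encoding to wide chars.
   Returns the number of wide chars written (terminator excluded),
   or (size_t)-1 if the input is not valid in the current locale. */
size_t local_to_wide(const char *src, wchar_t *dst, size_t dst_len) {
    return mbstowcs(dst, src, dst_len);
}

/* Convert wide chars back to the current locale's multibyte encoding.
   Returns the number of bytes written (terminator excluded),
   or (size_t)-1 if a character cannot be represented. */
size_t wide_to_local(const wchar_t *src, char *dst, size_t dst_len) {
    return wcstombs(dst, src, dst_len);
}
```

Both conversion functions work relative to whatever locale is currently set, so a program that wants the user's code page would call `use_environment_locale()` once before converting. Neither function writes a terminator if the destination buffer fills up, so the buffer should be sized with room to spare.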

December 29, 2003

Re: UTF-8

Posted by Hauke Duden
in reply to Matthew


Matthew wrote:
> Walter, internationalisation is an issue so fraught with confusion and
> misunderstanding that I would suggest it is a +ve step to have new, and
> ugly, types with which to deal with the different coding schemes. No-one who
> does not understand it should go near such things.
> 
> Naturally that leads us to the position where all these issues must be
> handled by the language for us, so people (which I think includes just about
> all of us) who do not understand the issues do not need to care and yet can
> still write correct programs.

I agree. To get us near this goal, I have begun work on a set of string interfaces and classes over the holidays. They have the properties I expect of good string handling (I have mentioned most of these in other threads):

- people should only have to think "string", not "UTF-8/16/32/ASCII/Shift-JIS..."
- hide the encoding most of the time
- allow character-based indexing and iteration
- provide basic string operations: (case-insensitive) comparison, concatenation, ...
- prevent unnecessary copying of the data
- allow access to the raw encoded data, if necessary (for interacting with C functions)
- automatically make strings null-terminated if needed, but do not treat the terminator as part of the string (e.g. do not include it in the length)
- enable implementations using other encodings. There is a lot of non-ASCII legacy code out there, so UTF-8 alone just doesn't cut it.

My hope is that when I'm done with this, Walter will declare those interfaces (or a similar solution) the default way to deal with strings. The goal is to never see those raw data strings anywhere in a normal D program, except under special circumstances or when interacting with C code.

And if that doesn't happen then at the very least I hope that this will inspire some more work to be done in this area. Bad string support can take all the fun out of a language ;) .

Hauke
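A few of the properties listed above (character-based indexing, access to the raw encoded data, a null terminator that is not counted in the length) can be sketched in C as below. This is not Hauke's design, which the post does not show; the `xstring` type and every function name here are hypothetical, UTF-8 storage is an assumption of the sketch, and the input is assumed to be well-formed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical string handle. Callers think "string"; the UTF-8
   storage is an implementation detail of this sketch. */
typedef struct {
    const char *data;   /* raw encoded bytes, always null-terminated */
    size_t      nbytes; /* byte count, terminator NOT included */
} xstring;

/* Wrap an existing null-terminated UTF-8 buffer without copying it. */
xstring xstring_from_utf8(const char *utf8) {
    xstring s = { utf8, strlen(utf8) };
    return s;
}

/* Length in bytes of the sequence introduced by lead byte b.
   Stray continuation or invalid bytes are treated as length 1. */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Length in characters (code points), not bytes. */
size_t xstring_length(xstring s) {
    size_t n = 0, i = 0;
    while (i < s.nbytes) {
        i += seq_len((unsigned char)s.data[i]);
        n++;
    }
    return n;
}

/* Code point at character index idx, or -1 if idx is out of range
   or the final sequence is truncated. */
long xstring_char_at(xstring s, size_t idx) {
    static const unsigned char lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t i = 0;
    while (i < s.nbytes) {
        unsigned char b = (unsigned char)s.data[i];
        size_t len = seq_len(b);
        if (i + len > s.nbytes) return -1;
        if (idx == 0) {
            long cp = b & lead_mask[len];
            for (size_t k = 1; k < len; k++) {
                cp = (cp << 6) | ((unsigned char)s.data[i + k] & 0x3F);
            }
            return cp;
        }
        idx--;
        i += len;
    }
    return -1;
}

/* Raw encoded bytes, for passing to C functions that expect UTF-8. */
const char *xstring_raw(xstring s) {
    return s.data;
}
```

Indexing here walks the string from the start, so it costs time proportional to the index. A real implementation would have to weigh that against the copying and memory cost of a fixed-width encoding, which is the trade-off between convenience and speed raised later in the thread.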

"Matthew" <matthew.hat@stlsoft.dot.org> wrote in message news:bsnpkn$30qk$1@digitaldaemon.com... > Naturally that leads us to the position where all these issues must be handled by the language for us, so people (which I think includes just about > all of us) who do not understand the issues do not need to care and yet can > still write correct programs. There is no way you can "not care" and still write correct programs unless you're also willing to abandon all hope of writing competitively fast applications. D will provide the capability to write correct programs, but it will still be up to the programmer to use that capability. What I'll probably wind up doing is writing a tutorial page on it.

"Walter" <walter@digitalmars.com> wrote in message news:bt7pts$28dc$2@digitaldaemon.com... > > "Matthew" <matthew.hat@stlsoft.dot.org> wrote in message news:bsnpkn$30qk$1@digitaldaemon.com... > > Naturally that leads us to the position where all these issues must be handled by the language for us, so people (which I think includes just > about > > all of us) who do not understand the issues do not need to care and yet > can > > still write correct programs. > > There is no way you can "not care" and still write correct programs unless you're also willing to abandon all hope of writing competitively fast applications. > > D will provide the capability to write correct programs, but it will still be up to the programmer to use that capability. What I'll probably wind up doing is writing a tutorial page on it. Good answer. :)
