December 16, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson

Elias Martenson wrote:
> char c = str[9000];
^^^^ 8999 of course
Regards
Elias Mårtenson
December 16, 2003 Re: Unicode discussion
Posted in reply to Ben Hinkle

It does sound insane, I like it. I vote for this.

C

"Ben Hinkle" <bhinkle4@juno.com> wrote in message news:brn1eq$ppk$1@digitaldaemon.com...
> I think Walter once said char had been called 'ascii'. That doesn't sound all that bad to me. Perhaps we should have the primitive types 'ascii', 'utf8', 'utf16' and 'utf32' and remove char, wchar and dchar. Insane, I know, but at least then you never will mistake an ascii[] for a utf32[] (or a utf8[], for that matter).
>
> -Ben
>
> "Elias Martenson" <elias-m@algonet.se> wrote in message news:brml3p$7hp$1@digitaldaemon.com...
> > Walter wrote:
> > > "Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.15.23.07.24.569047@spam.spam...
> > > > Actually, byte or ubyte doesn't really matter. One is not supposed to look at the individual elements in a UTF-8 or a UTF-16 string anyway.
> > > In a higher level language, yes. But in doing systems work, one always seems to be looking at the lower level elements anyway. I wrestled with this for a while, and eventually decided that char[], wchar[], and dchar[] would be low level representations. One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
> > All right. I can accept this, of course. The problem I still have with this is the syntax though. We've got to remember here that most english-only speaking people have little or no understanding of Unicode and are quite happy using someCharString[n] to access individual characters.
> > > I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.
> > Yes, they have gotten used to it in C, and it's still a horrible hack. At least in C. It is possible to get the multiple encoding support to work in D, but it needs wrappers. More on that later.
> > > > And here is also the core of the problem: having an array of "char" implies to the unwary programmer that the elements in the sequence are in fact "characters", and that you should be allowed to do stuff like isspace() on them. The fact that the libraries provide such functions doesn't help either.
> > > I think the library functions should be improved to handle unicode chars. But I'm not much of an expert on how to do it right, so it is the way it is for the moment.
> > As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.
> > > > I'd love to help out and do these things. But two things are needed first:
> > > >  - At least one other person needs to volunteer. I've had bad experiences when one person does this by himself,
> > > You're not by yourself. There's a whole D community here!
> > Indeed, but no one else volunteered yet. :-)
> > > >  - The core concepts need to be decided upon. Things seem to be somewhat in flux right now, with three different string types and all. At the very least it needs to be decided what a "string" really is: is it a UTF-8 byte sequence or a UTF-32 character sequence? I haven't hid the fact that I would prefer the latter.
> > > A string in D can be char[], wchar[], or dchar[], corresponding to UTF-8, UTF-16, or UTF-32 representations.
> > OK, if that is your decision then you will not see me argue against it. :-)
> >
> > However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options:
> >
> >     void log_to_file(char[] str);
> >     void log_to_file(wchar[] str);
> >     void log_to_file(dchar[] str);
> >
> > Which one of these should I use? Should I use all of them? Today, people seem to use the first option, but UTF-8 is horribly inefficient performance-wise.
> >
> > Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).
> >
> > Obviously the three different string types need to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings.
> >
> > Would it be possible to use something like this?
> >
> >     dchar get_first_char(string str)
> >     {
> >         return str[0];
> >     }
> >
> >     string str1 = (dchar[])"A UTF-32 string";
> >     string str2 = (char[])"A UTF-8 string";
> >
> >     // call the function to demonstrate that the "string"
> >     // type can be used in declarations
> >     dchar x = get_first_char(str1);
> >     dchar y = get_first_char(str2);
> >
> > I.e. the "string" data type would be a wrapper or supertype for the three different string types.
> > > char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
> > The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.
> >
> > In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.
> > > If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
> > Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)
> > > The UTF-8 to UTF-16 conversion is defined and platform independent. The D runtime library includes routines to convert back and forth between them. They could probably be optimized better, but that's another issue. I feel that by designing D around UTF-8, UTF-16 and UTF-32 the problems with locale dependent character sets are pushed off to the side as merely an input or output translation nuisance. The core routines all expect UTF strings, and so are platform and language independent. I personally think the future is UTF, and locale dependent encodings will fall by the wayside.
> > Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?
> > > After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)
> > Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character". The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.
> >
> > However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally.
> >
> > So, in the end, C (if you want to be portable between Unix and Windows) and Java both no longer allow you to work with individual characters, unless you know what you are doing (i.e. you are prepared to deal with surrogate pairs manually). My suggestion for the "string" data type will hide all the nitty gritty details of various encodings and allow you to extract the n'th dchar from a string, regardless of the internal encoding.
> > > UTF-8 is really quite brilliant. With just some minor extra care over writing ordinary ascii code, you can write portable code that is fully capable of handling the complete unicode character set.
> > Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
> > > > In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.
> > > It does that now, except they take a char[].
> > Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.
> >
> > Regards
> >
> > Elias Mårtenson
December 16, 2003 Re: Unicode discussion
Posted in reply to Sean L. Palmer

"Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brmeos$2v9c$1@digitaldaemon.com...
> "Walter" <walter@digitalmars.com> wrote in message
> > One could design a wrapper class for them that overloads [] to provide automatic decoding if desired.
>
> The problem is that [] would be a horribly inefficient way to index UTF-8 characters. foreach would be ok.

You're right.
December 16, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson

"Elias Martenson" <elias-m@algonet.se> wrote in message news:brml3p$7hp$1@digitaldaemon.com...
| for example). Why not use the same names as are used in C? mbstowcs()
| and wcstombs()?

Sorry to ask, but what do those do? What do they stand for?

-------------------------
Carlos Santander
December 16, 2003 Re: Unicode discussion
Posted in reply to Hauke Duden

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:brnas5$1940$1@digitaldaemon.com...
> Walter wrote:
> > I see your point, but I just can't see making utf8byte into a keyword <g>. The world has already gotten used to multibyte 'char' in C and the funky 'wchar_t' for UTF16 (for win32, UTF32 for linux) in C, that I don't see much of an issue here.
> This is simply not true, Walter. The world has not gotten used to multibyte chars in C at all.

Multibyte char programming in C has been common on the IBM PC for 20 years now (my C compiler has supported it for that long, since it was distributed to an international community), and it was standardized into C in 1989. I agree that many ignore it, but that's because it's badly designed. Dealing with locale-dependent encodings is a real chore in C.

> A lot of english-speaking programmers simply treat chars as ASCII characters, even if there's some comment somewhere stating that the data should be UTF-8.

True, but code doesn't have to be changed much to allow for UTF-8. For example, D source text is UTF-8, and supporting that required little change in the D front end, and none in the back end. Trying to use UTF-32 internally to support this would have been a disaster.

> I agree with Elias that the "char" type should be 32 bit, so that people who simply use a char array as a string, as they have done for years in other languages, will actually get the behaviour they expect, without losing the Unicode support.

Other problems are introduced with that for the naive programmer who expects it to work just like ascii. For example, many people don't bother multiplying by sizeof(char) when allocating storage for char arrays. chars and 'bytes' in C are used willy-nilly interchangeably. Direct manipulation of chars (without going through ctype.h) is common for converting lower case to upper case. Etc. The nice thing about UTF-8 is it does work just like ascii when you're dealing with ascii data.

> Btw: this could also be used to solve the "oops, I forgot to make the string null-terminated" problem when interacting with C functions. If the D char is a different type than the old C char (which could be called char_c or charz instead) then people will automatically be reminded that they need to convert them.
>
> So how about the following proposal:
>
> - char is a 32 bit Unicode character

Already have that, it's 'dchar' <g>. There is nothing in D that prevents a programmer from using dchar's for his character handling chores.

> - wcharz (or wchar_c? c_wchar?) is a C wide char character of either 16 or 32 bits (depending on the system), provided for interoperability with C functions

I've dealt with porting large projects between win32 and linux and the change in wchar_t size from 16 to 32. I've come to believe that method is a mistake, hence wchar and dchar in D. (One of the wretched problems is one cannot intermingle printf and wprintf to stdout in C.)

> - charz (or char_c? c_char?) is a normal 8 bit C character, also provided for interoperability with C functions

I agree that the 0 termination is an issue when calling C functions. I think this issue will fade, however, as the D libraries get more comprehensive. Another problem with 'normal' C chars is the confusion about whether they are signed or unsigned. The D char type is unsigned, period <g>.

> UTF-8 and UTF-16 strings could simply use ubyte and ushort types. This would at the same time remind users that the elements are NOT characters but simply a bunch of binary data. I don't see the need to define a new type for these - there are a lot of encodings out there, so why treat UTF-8 and UTF-16 specially?

Treating UTF-8 and UTF-16 specially in D has great advantages in making the internal workings of the compiler and runtime library consistent. (No more problems mixing printf and wprintf!) I'm convinced that UTF is becoming the lingua franca of computing, and the other encodings will be relegated to sideshow status.

> With this system it would be instantly obvious that D strings are Unicode. Interacting with legacy C code is still possible, and accidentally passing a wrong (e.g. UTF-8) string to a C function that expects ASCII or Latin-1 is impossible.

Windows NT, 2000, XP, and onwards are internally all UTF-16. Any win32 API functions that accept 8 bit chars will immediately convert them to UTF-16. wchar_t's under win32 are UTF-16 encodings (including the 2 word encodings of UTF-16). Linux is internally UTF-8, if I'm not mistaken. This means D code will feel right at home with linux. Under win32, I plan on fixing all the runtime library functions to convert UTF-8 to UTF-16 internally and use the win32 API UTF-16 functions. Hence, UTF is where the operating systems are going, and D is looking forward to mapping cleanly onto that. I believe that following the C approach of code pages, signed/unsigned char confusion, varying wchar_t sizes, etc., is rapidly becoming obsolete.

> Also, pure D code will automatically be UTF-32, which is exactly what you need if you want to make the lives of newbies easier. Otherwise people WILL end up using ASCII strings when they start out.

Over the last 10 years, I wrote two major internationalized apps. One used UTF-8 internally, and converted other encodings to/from it on input/output. The other used wchar_t throughout, and was ported to win32 and linux which mapped wchar_t to UTF-16 and UTF-32, respectively. The former project ran much faster, consumed far less memory, and (aside from the lack of support from C for UTF-8) simply had far fewer problems. The latter was big and slow. Especially on linux, with the wchar_t's being UTF-32, it really hogged the memory.
December 16, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson

"Elias Martenson" <elias-m@algonet.se> wrote in message news:brml3p$7hp$1@digitaldaemon.com...
> As for the functions that handle individual characters, the first thing that absolutely has to be done is to change them to accept dchar instead of char.

Yes.

> However, suppose you are going to write a function that accepts a string. Let's call it log_to_file(). How do you declare it? Today, you have three different options:
>
>     void log_to_file(char[] str);
>     void log_to_file(wchar[] str);
>     void log_to_file(dchar[] str);
>
> Which one of these should I use? Should I use all of them? Today, people seem to use the first option, but UTF-8 is horribly inefficient performance-wise.

Do it as char[]. Have the internal implementation convert it to whatever format the underlying operating system API uses. I don't agree that UTF-8 is horribly inefficient (this is from experience, UTF-32 is much, much worse).

> Also, in the case of char and wchar strings, how do I access an individual character? Unless I missed something, the only way today is to use decode(). This is a fairly common operation which needs a better syntax, or people will keep accessing individual elements using the array notation (str[n]).

It's fairly easy to write a wrapper class for it that decodes it automatically with foreach and [] overloads.

> Obviously the three different string types need to be wrapped somehow. Either through a class (named "String" perhaps?) or through a keyword ("string"?) that is able to encapsulate the different behaviour of the three different kinds of strings.
>
> Would it be possible to use something like this?
>
>     dchar get_first_char(string str)
>     {
>         return str[0];
>     }
>
>     string str1 = (dchar[])"A UTF-32 string";
>     string str2 = (char[])"A UTF-8 string";
>
>     // call the function to demonstrate that the "string"
>     // type can be used in declarations
>     dchar x = get_first_char(str1);
>     dchar y = get_first_char(str2);
>
> I.e. the "string" data type would be a wrapper or supertype for the three different string types.

The best thing is to stick with one scheme for a program.

> > char[] strings are UTF-8, and as such I don't know what you mean by 'native decoding'. There is only one possible conversion of UTF-8 to UTF-16.
> The native encoding is what the operating system uses. In Windows this is typically UTF-16, although it really depends. It's really a mess, since most applications actually use various locale-specific encodings, such as ISO-8859-1 or KOI8-R.

For char types, yes. But not for UTF-16, and win32 internally is all UTF-16. There are no locale-specific encodings in UTF-16.

> In Unix the platform specific encoding is determined by the environment variable LC_CTYPE, although the trend is to be moving towards UTF-8 for all locales. We're not quite there yet though. Check out http://www.utf-8.org/ for some information about this.

Since we're moving to UTF-8 for all locales, D will be there with UTF-8 <g>. Let's look forward instead of to those backward locale dependent encodings.

> > If you're talking about win32 code pages, I'm going to draw a line in the sand and assert that D char[] strings are NOT locale or code page dependent. They are UTF-8 strings. If you are reading code page or locale dependent strings, to put them into a char[] will require running it through a conversion.
> Right. So what you are saying is basically that there is a difference between reading to a ubyte[] and a char[] in that native decoding is performed in the latter case but not the former? (in other words, when reading to a char[] the data is passed through mbstowcs() internally?)

No, I think D will provide an optional filter for I/O which will translate to/from locale dependent encodings. Wherever possible, the UTF-16 API's will be used to avoid any need for locale dependent encodings.

> Internally, yes. But there needs to be a clear layer where the platform encoding is converted to the internal UTF-8, UTF-16 or UTF-32 encoding. Obviously this layer seems to be located in the streams. But we need a separate function to do this for byte arrays as well (since there are other ways of communicating with the outside world, memory mapped files for example). Why not use the same names as are used in C? mbstowcs() and wcstombs()?

'cuz I can never remember how they're spelled <g>.

> > After wrestling with this issue for some time, I finally realized that supporting locale dependent character sets in the core of the language and runtime library is a bad idea. The core will support UTF, and locale dependent representations will only be supported by translating to/from UTF. This should wind up making D a far more portable language for internationalization than C/C++ are (ever wrestle with tchar.h? How about wchar_t's being 32 bits wide on linux vs 16 bits on win32? How about having #ifdef _UNICODE all over the place? I've done that too much already. No thanks!)
>
> Indeed. The wchar_t being UTF-16 on Windows is horrible. This actually stems from the fact that according to the C standard wchar_t is not Unicode. It's simply a "wide character".

Frankly, I think the C standard is out to lunch on this. wchar_t should be unicode, and there really isn't a problem with using it as unicode. The C standard is also not helpful in the undefined size of wchar_t, or the sign of 'char'.

> The Unix standard goes a step further and defines wchar_t to be a unicode character. Obviously D goes the Unix route here (for dchar), and that is very good.
>
> However, Windows defined wchar_t to be a 16-bit Unicode character back in the days Unicode fit inside 16 bits. This is the same mistake Java did, and we have now ended up with having UTF-16 strings internally.

Windows made the right decision given what was known at the time, it was the unicode folks who goofed by not defining unicode right in the first place.

> Indeed. And using UTF-8 internally is not a bad idea. The problem is that we're also allowed to use UTF-16 and UTF-32 as internal encoding, and if this is to remain, it needs to be abstracted away somehow.
> > > In general, you should be able to open a file, by specifying the file name as a dchar[], and then the libraries should handle the rest.
> > It does that now, except they take a char[].
> Right. But wouldn't it be nicer if they accepted a "string"? The compiler could add automatic conversion to and from the "string" type as needed.

It already does that for string literals. I've thought about implicit conversions for runtime strings, but sometimes trouble results from too many implicit conversions, so I'm hanging back a bit on this to see how things evolve.
December 16, 2003 Re: Unicode discussion
Posted in reply to Carlos Santander B.

On Tue, 16 Dec 2003 13:38:59 -0500, Carlos Santander B. wrote:
> "Elias Martenson" <elias-m@algonet.se> wrote in message
> news:brml3p$7hp$1@digitaldaemon.com...
> | for example). Why not use the same names as are used in C? mbstowcs()
> | and wcstombs()?
> |
>
> Sorry to ask, but what do those do? What do they stand for?
mbstowcs() = multi byte string to wide character string
wcstombs() = wide character string to multi byte string
A multi byte string is a (char *), i.e. the platform encoding. This means that if you are running Unix in a UTF-8 locale (standard these days) then it contains a UTF-8 string. If you are running Unix or Windows with an ISO-8859-1 locale, then it contains ISO-8859-1 data.
A wide character string is a (wchar_t *) which is a UTF-32 string on Unix, and a UTF-16 string on Windows.
As you can see, the windows way of using UTF-16 causes the exact same problems as you would suffer when using UTF-8, so working with wchar_t on Windows would be of doubtful use if not for the fact that all Unicode functions in Windows deal with wchar_t. On Unix it's easier, since you know that the full Unicode range fits in a wchar_t.
This is the reason why I have been advocating against the UTF-16 representation in D. It makes little sense compared to UTF-8 and UTF-32.
Regards
Elias Mårtenson
December 16, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson
> char c = str[8999];
> // now play happily(?) with the char "c" that probably isn't the
> // 9000'th character and maybe was a part of a UTF-8 multi byte
> // character
which was why I suggested doing away with the generic "char" type entirely.
If str was declared as an ascii array then it would be
ascii c = str[8999];
Which is completely safe and reasonable. If it was declared as utf8[] then
when the user writes
ubyte c = str[8999];
and they don't have any a priori knowledge about str, they should feel very
nervous since I completely agree indexing into an arbitrary utf-8 encoded
array is pretty meaningless. Plus in my experience using individual
characters isn't that common - I'd say easily 90% of the time a variable is
declared as char* or char[] rather than just char.
By the way, I also think any utf8, utf16 and utf32 types should be aliased
to ubyte, ushort, and uint. Should ascii be aliased to ubyte as well? I
dunno.
About Java and D: when I program in Java I never worry about the size of a char because Java is very different than C and you have to jump through hoops to call C. But when I program in D I feel like it is an extension of C like C++. Imagine if C++ decided that char should be 32 bits. That would have been very painful.
-Ben
December 16, 2003 RE: Unicode discussion
Posted in reply to Carlos Santander B.

mbstowcs - Multi Byte String to Wide Character String
wcstombs - Wide Character String to Multi Byte String

Carlos Santander B. <carlos8294@msn.com> wrote in message news:brnpe0$206i$3@digitaldaemon.com...
> "Elias Martenson" <elias-m@algonet.se> wrote in message news:brml3p$7hp$1@digitaldaemon.com...
> | for example). Why not use the same names as are used in C? mbstowcs()
> | and wcstombs()?
>
> Sorry to ask, but what do those do? What do they stand for?
December 16, 2003 Re: Unicode discussion
Posted in reply to Elias Martenson

Thank you both.

"Elias Martenson" <no@spam.spam> wrote in message news:pan.2003.12.16.22.27.42.233945@spam.spam...
| On Tue, 16 Dec 2003 13:38:59 -0500, Carlos Santander B. wrote:
|
| > "Elias Martenson" <elias-m@algonet.se> wrote in message
| > news:brml3p$7hp$1@digitaldaemon.com...
| > | for example). Why not use the same names as are used in C? mbstowcs()
| > | and wcstombs()?
| > |
| >
| > Sorry to ask, but what do those do? What do they stand for?
|
| mbstowcs() = multi byte string to wide character string
| wcstombs() = wide character string to multi byte string
|
| A multi byte string is a (char *), i.e. the platform encoding. This means
| that if you are running Unix in a UTF-8 locale (standard these days) then
| it contains a UTF-8 string. If you are running Unix or Windows with an
| ISO-8859-1 locale, then it contains ISO-8859-1 data.
|
| A wide character string is a (wchar_t *) which is a UTF-32 string on
| Unix, and a UTF-16 string on Windows.
|
| As you can see, the windows way of using UTF-16 causes the exact same
| problems as you would suffer when using UTF-8, so working with wchar_t on
| Windows would be of doubtful use if not for the fact that all Unicode
| functions in Windows deal with wchar_t. On Unix it's easier, since you
| know that the full Unicode range fits in a wchar_t.
|
| This is the reason why I have been advocating against the UTF-16
| representation in D. It makes little sense compared to UTF-8 and UTF-32.
|
| Regards
|
| Elias Mårtenson
Copyright © 1999-2021 by the D Language Foundation