August 23, 2004
"Juanjo Álvarez" <juanjuxNO@SPAMyahoo.es> wrote in message news:cgdu4b$2ed$1@digitaldaemon.com...
> Regan Heath wrote:
>
> >> Ok, but suppose we ditch char[]. Then, we find some great library we
> >> want to bring into D, or build a D interface to, that is in char[].
>
> Excuse me if I'm saying something stupid, but wouldn't byte[] do the job
> of interfacing with C's char[]?

That could work, but it just wouldn't look right.


August 23, 2004
On Mon, 23 Aug 2004 16:41:10 -0700, Walter <newshound@digitalmars.com> wrote:
> "Juanjo Álvarez" <juanjuxNO@SPAMyahoo.es> wrote in message
> news:cgdu4b$2ed$1@digitaldaemon.com...
>> Regan Heath wrote:
>>
>> >> Ok, but suppose we ditch char[]. Then, we find some great library we
>> >> want to bring into D, or build a D interface to, that is in char[].
>>
>> Excuse me if I'm saying something stupid, but wouldn't byte[] do the job
>> of interfacing with C's char[]?
>
> That could work, but it just wouldn't look right.

http://www.digitalmars.com/d/htomodule.html

That page specifically states that C's 'char' should be represented by a 'byte' in D.
So when building an interface to a C lib that uses char[], you'd use byte[].

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 24, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc7lqdnp5a2sq9@digitalmars.com...
> http://www.digitalmars.com/d/htomodule.html
>
> Specifically states that C's 'char' should be represented by a 'byte' in
> D. So when building an interface to the C lib that uses char[] you'd use
> byte[].

I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
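That distinction (the same byte read as a small signed or unsigned integer rather than as text) can be sketched in Python, used here purely as an illustration of the C semantics, not as anything from the thread:

```python
import struct

raw = b"\xff"  # one C 'char' worth of data

# Like C 'signed char' (D byte): 8-bit two's complement, -128..127
(signed,) = struct.unpack("b", raw)

# Like C 'unsigned char' (D ubyte): 0..255
(unsigned,) = struct.unpack("B", raw)

# The same bit pattern 0xFF is -1 or 255 depending on signedness;
# it is "text" only by convention.
assert signed == -1
assert unsigned == 255
```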


August 24, 2004
In article <cgdmj9$2v1a$1@digitaldaemon.com>, Walter says...

>My experience with all-wchar is that its performance is not the best. It'll also become a nuisance interfacing with C. I'd rather explore perhaps making implicit conversions between the 3 utf types more seamless.

Implicit conversions are absolutely fine by me. In fact, that was the first suggestion (then the discussion wandered, as things do, along the lines of "if they're interchangeable, why not have just the one type").

But sure, I'd be more than happy with implicit conversions between the three UTF types. Since such conversions lose no information in any direction, they are always guaranteed to be harmless.
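The losslessness claim is easy to demonstrate in any Unicode-aware language; here is a quick Python sketch (illustrative only, since Python's codecs implement the same three UTF encoding forms) showing that a round trip through each form is exact:

```python
# ASCII, Latin, CJK, and a supplementary-plane character
text = "A\u00e9\u4e2d\U0001d11e"

# Converting to any UTF form and back recovers the identical string,
# so no information is lost in any direction.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    assert text.encode(codec).decode(codec) == text
```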

Jill


August 24, 2004
In article <cgdmj9$2v1a$1@digitaldaemon.com>, Walter says...

>My experience with all-wchar is that its performance is not the best.

It's usually regarded as the best by most other sources, however. Converting between wchar[] and dchar[] is /almost/ as fast as doing a memcpy(), because UTF-16 encoding is very, very simple. UTF-16 is as efficient for the codepoint range U+0000 to U+FFFF as UTF-8 is for the codepoint range U+0000 to U+007F. Outside of these ranges, UTF-16 is still very fast, since all remaining characters consist of /precisely/ two wchars, whereas UTF-8 conversion outside of the ASCII range is always going to be inefficient, what with its variable number of required bytes, variable-width bitmasks, and the additional requirements of validation and rejection of non-shortest sequences. If you're arguing on the basis of performance, UTF-8 loses hands down.
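The code-unit counts behind this argument can be checked with a short Python sketch (illustrative; Python's codecs follow the same encoding forms):

```python
# UTF-16: everything up to U+FFFF is exactly one code unit (2 bytes)...
assert len("\uffff".encode("utf-16-le")) == 2
# ...and everything above U+FFFF is exactly two (a surrogate pair, 4 bytes).
assert len("\U00010000".encode("utf-16-le")) == 4
assert len("\U0010ffff".encode("utf-16-le")) == 4

# UTF-8 varies from 1 to 4 bytes per character.
for ch, n in [("A", 1), ("\u00e9", 2), ("\u4e2d", 3), ("\U0001d11e", 4)]:
    assert len(ch.encode("utf-8")) == n
```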


>It'll
>also become a nuisance interfacing with C.

Actually, it's D's char (which C doesn't have) which is a nuisance interfacing with C. As others have pointed out, C has no type which enforces UTF-8 encoding, and in fact on the Windows PC on which I am typing right now, every C char is going to be storing characters from Windows code page 1252 unless I take special action to do otherwise. That is /not/ interchangeable with D's chars. Beyond U+007F, one C char corresponds to two (or more) D chars. You don't regard that as a nuisance?

In fact, I believe that comment of yours which I just quoted above actually adds further weight to my argument. I argue that the existence of the char type *causes confusion*. People /think/ (erroneously) that it does the same job as C's char, and is interchangeable therewith. Now if you, the architect of D, can fall prey to that confusion, I would take that as clear evidence that such confusion exists.




>I'd rather explore perhaps making
>implicit conversions between the 3 utf types more seamless.

Yes. If all three D string types were implicitly convertible, then there would be nothing for me to complain about. (The char confusion would still exist, but that's just education.)

Arcane Jill


August 24, 2004
Arcane Jill wrote:
> (...)
> adds further weight to my argument. I argue that the existence of the char type
> *causes confusion*. People /think/ (erroneously) that it does the same job as
> C's char, and is interchangeable therewith. Now if you, the architect of D, can
> fall prey to that confusion, I would take that as clear evidence that such
> confusion exists.
> (...)

I'm sorry to interrupt. From your reply I would say that you are arguing more about the name of the type being used as the *representation of a character* (which it is not).

Maybe we should ask Walter to change the names of the types to utf8, utf16 and utf32. I read somewhere that those were the original names in earlier DMD implementations.

If that's what you are asking, you have my vote.

--
Julio César Carrascal Urquijo
August 24, 2004
In article <cgfmdm$v3t$1@digitaldaemon.com>,

>Maybe we should ask Walter to change the names of the types to utf8, utf16 and utf32. I read somewhere that those were the original names in earlier DMD implementations.
>
>If that's what you are asking, you have my vote.

No, I wasn't asking that, and I don't really care what things are called. I do /try/ to stay on the topic of the thread title, and in this thread the discussion is about how/whether to make use of ICU for our internationalization and Unicode needs. I've been looking more closely at ICU, and I keep being (pleasantly) surprised to discover that it already has zillions of other goodies not previously mentioned in this thread, but which we've been talking about in this forum in the past. For example - Locales, ResourceBundles, everything you need for text internationalization/localization.

It's relevant to D's character types because ICU has only two character types - a type equivalent to wchar that is used to make UTF-16 strings, and a type equivalent to dchar that is used to access character properties. The important detail here is that ICU strings are wchar[]s, but D's basic "string" concept is char[]. So, calling lots of ICU routines would result in lots of explicit toUTF8() and toUTF16() calls all over your code, /unless/ either:

(1) D adopted wchar[] as the basic string type, or
(2) D implicitly auto-converted between its various string types as required

I'm trying to suggest that (1) is the best option. That's all. Walter prefers (2), but that's acceptable too.

As a corollary to (1), if we start using wchar[]s as the default native string used by Phobos and the compiler, it would then follow that the char type would be superfluous and could be dropped. Or at least, it seems that way to me. Opinions differ. However, the "should we ditch the char or not?" discussion is over on another thread.

Renaming the character types is kind of irrelevant to this, although it is pertinent to Walter's reply. No - I'm not asking that they be renamed (except insofar as, if "char" is ditched, then "wchar" could be renamed "char", but that again is for the other thread).

Arcane Jill


August 24, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgf8d1$os0$1@digitaldaemon.com...
> In article <cgdmj9$2v1a$1@digitaldaemon.com>, Walter says...
>
> >My experience with all-wchar is that its performance is not the best.
>
> It's usually regarded as the best by most other sources, however.
> Converting between wchar[] and dchar[] is /almost/ as fast as doing a
> memcpy(), because UTF-16 encoding is very, very simple. UTF-16 is as
> efficient for the codepoint range U+0000 to U+FFFF as UTF-8 is for the
> codepoint range U+0000 to U+007F. Outside of these ranges, UTF-16 is
> still very fast, since all remaining characters consist of /precisely/
> two wchars, whereas UTF-8 conversion outside of the ASCII range is
> always going to be inefficient, what with its variable number of
> required bytes, variable-width bitmasks, and the additional requirements
> of validation and rejection of non-shortest sequences. If you're arguing
> on the basis of performance, UTF-8 loses hands down.

Converting would be faster, sure, but if the bulk of your app is char[], there is little conversion happening.

> >It'll also become a nuisance interfacing with C.
>
> Actually, it's D's char (which C doesn't have) which is a nuisance
> interfacing with C. As others have pointed out, C has no type which
> enforces UTF-8 encoding, and in fact on the Windows PC on which I am
> typing right now, every C char is going to be storing characters from
> Windows code page 1252 unless I take special action to do otherwise.
> That is /not/ interchangeable with D's chars. Beyond U+007F, one C char
> corresponds to two (or more) D chars. You don't regard that as a
> nuisance?

I've been dealing with multibyte charsets in C for decades - it's not just UTF-8 that's multibyte, there are also the Shift-JIS, Korean, and Taiwan code pages. You can also set up your windows machine so UTF-8 *is* the charset used by the "A" APIs. I've written UTF-8 apps in C, and D would map onto them directly.
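That UTF-8 is just one multibyte charset among several can be seen in any language that ships legacy codecs; a Python sketch (illustrative only, not from the thread):

```python
# Shift-JIS, like UTF-8, mixes single-byte and multi-byte sequences,
# so C code handling it faces the same variable-width issues.
assert "A".encode("shift_jis") == b"A"            # ASCII: one byte
assert len("\u65e5".encode("shift_jis")) == 2     # '日': two bytes in Shift-JIS
assert len("\u65e5".encode("utf-8")) == 3         # the same char: three in UTF-8
```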

There is no way to avoid, when interfacing with C, dealing with whatever charset it might be in. It can't happen automatically. And that is the source of the nuisance.

> In fact, I believe that comment of yours which I just quoted above
> actually adds further weight to my argument. I argue that the existence
> of the char type *causes confusion*. People /think/ (erroneously) that
> it does the same job as C's char, and is interchangeable therewith. Now
> if you, the architect of D, can fall prey to that confusion, I would
> take that as clear evidence that such confusion exists.
>
> >I'd rather explore perhaps making
> >implicit conversions between the 3 utf types more seamless.
>
> Yes. If all three D string types were implicitly convertible, then there
> would be nothing for me to complain about. (The char confusion would
> still exist, but that's just education.)



August 25, 2004
On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound@digitalmars.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsc7lqdnp5a2sq9@digitalmars.com...
>> http://www.digitalmars.com/d/htomodule.html
>>
>> Specifically states that C's 'char' should be represented by a 'byte' in
>> D. So when building an interface to the C lib that uses char[] you'd use
>> byte[].
>
> I'm sorry that wasn't clear, but I meant that when 'unsigned char' and
> 'signed char' in C are used not as text, but as very small integers, the
> corresponding D types should be ubyte and byte.

However, an old C lib might return latin-1 (or any other encoding) encoded data, in which case you also have to use ubyte, then transcode to UTF-8 and store it in a char[] (if that is the desired result).

Right?
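The transcoding step just described, sketched in Python purely as an illustration of the byte-level transformation:

```python
raw = b"caf\xe9"  # Latin-1 bytes as returned by a hypothetical old C lib

text = raw.decode("latin-1")  # first interpret the bytes as Latin-1
utf8 = text.encode("utf-8")   # then transcode for storage in a char[]

assert utf8 == b"caf\xc3\xa9"
# Copying the raw bytes straight into a char[] would be invalid:
# a lone 0xE9 byte is not a well-formed UTF-8 sequence.
```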

Regan

August 25, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc9o90n65a2sq9@digitalmars.com...
> On Mon, 23 Aug 2004 23:35:07 -0700, Walter <newshound@digitalmars.com> wrote:
> > "Regan Heath" <regan@netwin.co.nz> wrote in message news:opsc7lqdnp5a2sq9@digitalmars.com...
> >> http://www.digitalmars.com/d/htomodule.html
> >>
> >> Specifically states that C's 'char' should be represented by a 'byte' in
> >> D. So when building an interface to the C lib that uses char[] you'd use
> >> byte[].
> >
> > I'm sorry that wasn't clear, but I meant that when 'unsigned char' and 'signed char' in C are used not as text, but as very small integers, the corresponding D types should be ubyte and byte.
>
> However, an old C lib might return latin-1 (or any other encoding) encoded
> data, in which case you also have to use ubyte then transcode to utf-8 and
> store in char[] (if that is the desired result).
>
> Right?

Yup. You'll have to understand what the C code is using the char type for, in order to select the best equivalent D type.