Thread overview
ICU (International Components for Unicode)
Aug 23, 2004
Arcane Jill
Aug 23, 2004
Arcane Jill
Aug 23, 2004
Jörg Rüppel
Aug 23, 2004
Arcane Jill
Aug 23, 2004
antiAlias
Aug 23, 2004
Walter
Aug 23, 2004
Regan Heath
Aug 23, 2004
Juanjo Álvarez
Aug 23, 2004
Regan Heath
Aug 23, 2004
Walter
Aug 23, 2004
Regan Heath
Aug 24, 2004
Walter
Aug 25, 2004
Regan Heath
Aug 25, 2004
Walter
Aug 24, 2004
Arcane Jill
Aug 24, 2004
Arcane Jill
Aug 24, 2004
Arcane Jill
Aug 24, 2004
Walter
Aug 25, 2004
Walter
Aug 25, 2004
Arcane Jill
Aug 25, 2004
Roald Ribe
Aug 25, 2004
Arcane Jill
Aug 25, 2004
Sean Kelly
Aug 25, 2004
pragma
Aug 25, 2004
Arcane Jill
Aug 25, 2004
Walter
August 23, 2004
Following Ben's mention of ICU, I've been checking out what it does and doesn't do. Basically, it does EVERYTHING. There is the work of years there. Not just Unicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It's also free, open source, and with a license which basically says there's no problem with our using it.

So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU.

It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).

But the outermost (D) wrapper can, at least, be composed of D classes.
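
To make the layering concrete, here is a very rough sketch of the D end of it. The shim names (du_open, du_normalize, du_close) are invented for illustration - they stand for whatever flat functions the hand-written C wrapper would export, and are not part of ICU itself:

#    import std.string;   // toStringz
#
#    // C-linkage declarations for the hand-written shim over ICU's C++ classes
#    extern (C)
#    {
#        void* du_open(char* name);           // returns an opaque handle
#        int   du_normalize(void* handle, wchar* src, int srcLen,
#                           wchar* dst, int dstLen);
#        void  du_close(void* handle);
#    }
#
#    // Outermost D wrapper: puts the class-based interface back
#    class Normalizer
#    {
#        private void* handle;
#
#        this(char[] name) { handle = du_open(toStringz(name)); }
#        ~this()           { du_close(handle); }
#
#        wchar[] normalize(wchar[] src)
#        {
#            wchar[] dst = new wchar[src.length * 2];    // sketch only: no error handling
#            int n = du_normalize(handle, src.ptr, cast(int) src.length,
#                                 dst.ptr, cast(int) dst.length);
#            return dst[0 .. n];
#        }
#    }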

If we want D to be the language of choice for Unicode, we would need all this functionality. So, if we went the ICU route, we'd need to bundle ICU (currently a ten megabyte zip file) with DMD, along with whatever wrapper I come up with. (etc.unicode is not likely to be smaller).

I'd like to see some discussion on this. Read this page to inform yourself: http://oss.software.ibm.com/icu/userguide/index.html


Finally, back to strings. Ben was right. The ICU user guide says:

"In order to take advantage of Unicode with its large character repertoire and its well-defined properties, there must be types with consistent definitions and semantics. The Unicode standard defines a default encoding based on 16-bit code units. This is supported in ICU by the definition of the UChar to be an unsigned 16-bit integer type. This is the base type for character arrays for strings in ICU."

Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special string class - a string is just an array of wchars. So obviously, if we go the ICU route, I withdraw my suggestion to ditch the wchar.

What I now recommend is:

(1) Ditch "etc.unicode" in favor of - let's call it "etc.icu" (a D wrapper
around ICU). Eventually I hope for this to change into "std.icu" (as I
originally hoped that "etc.unicode" would turn into "std.unicode").

(2) Ditch the char. 8 bits is really too small for a character these days, honestly, and all previous arguments still apply. The existence of char only encourages ASCII and discourages Unicode anyway.

(3) Native D strings shall be arrays of wchars. This means that Object.toString() must return a wchar[], and string literals in D source must compile to wchar[]s. ICU's type UChar would map directly to wchar. To reinforce this, there should probably be an

#    alias wchar[] string;

in object.d (a rough sketch of how this would look in practice follows the list below).

(4) We retain dchar (so that we can get character properties), but all string code is based on wchar[]s, not dchar[]s. ICU's type UChar32 would map directly to dchar.

(5) Transcoding/streams/etc. go ahead as planned, but based around wchar[]s
instead of dchar[]s.

(6) That's pretty much it, although once "char" is gone, we could rename "wchar"
as "char" (a la Java).


Discussion please? And I really do want this talked through because it affects D work I'm currently involved in.

Input is also requested from Walter - in particular the request that Object.toString() be re-jigged to return wchar[] instead of char[].

Okay, let's chew this one over.

Jill


August 23, 2004
In article <cgcv4n$2fsf$1@digitaldaemon.com>, Arcane Jill says...

>It's not completely straightforward. ICU is written in C++ (not C), and so we can't link against it directly. It uses classes, not raw functions. So, I'd have to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have to write a D wrapper to call the C wrapper - at which point we could get the classes back again (and our own choice of architecture, so plugging into std or mango streams won't suffer).

Or I could port it to D!

Or...

In article <cgd0h9$2ggf$1@digitaldaemon.com>, Juanjo Álvarez says...
>
>AJ, I don't know a shit^D^D^D^D too much about Unicode but your excitement about ICU is really contagious. Only one question: are the C wrappers at the same level as the C++/Java ones? If so it seems that with a little easy and boring (compared to writing etc.unicode) wrapping we're going to have a first-class Unicode lib :) => (i18n version of <g>)

I don't know, as I've only just started looking into it. But either way (port or wrap C) we move the time spent on development from a year or so down to only a few months. It really is worth thinking about, but it /does/ mean that D really should standardize on wchar[] strings, and this has consequences for (a) parsing string literals, and (b) Object.toString() - and probably a few other things too, not to mention all the code it would break, and the future (or not) of the char type. It's all this that I'd be concerned about, and it should really be discussed by all of us in the D community, and by Walter as its architect - not just by those of us interested in Unicode.

Arcane Jill


August 23, 2004
Arcane Jill wrote:
> 
> So I'm thinking seriously about ditching the whole etc.unicode project and
> replacing it with a wrapper around ICU.
> 
> It's not completely straightforward. ICU is written in C++ (not C), and so we
> can't link against it directly. It uses classes, not raw functions. So, I'd have
> to write a C wrapper around ICU which gave me a C interface, and /then/ I'd have
> to write a D wrapper to call the C wrapper - at which point we could get the
> classes back again (and our own choice of architecture, so plugging into std or
> mango streams won't suffer).

According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html there is a C API. Didn't you see that or is there a reason why that can't be used?

Regards,
Jörg
August 23, 2004
In article <cgd5go$2j1k$1@digitaldaemon.com>, Jörg Rüppel says...

>According to the API docs at http://oss.software.ibm.com/icu/apiref/index.html there is a C API. Didn't you see that or is there a reason why that can't be used?

I just didn't see it, that's all.

I think now that the best approach would be something which is part-port and part-wrapper around the C API. I would want the D interface to maintain the classes and so forth which are present in the C++ and Java APIs, so the D wrappers over the C API would have to be part-port anyway, even if only to recreate the class hierarchy and put things back into member functions.

Jill


August 23, 2004
Most of ICU is written in C, with C++ wrappers.



August 23, 2004
This would be a great thing for D to adopt. Just a few things to note:

1) This would be best served as a DLL (given its size). In fact, the team apparently likes to compile the string-resource files into DLLs (which makes a lot of sense IMO). If D treated DLLs as first-class citizens, this would be a no-brainer. Right now, that's not the case.

2) There's a rather nice String class (C++). That's a perfect candidate for
porting directly to D.

3) From what I've seen, the lib is mostly C. Even better, it eschews the traditional morass of header files. Building an ICU.d import will be much easier because of this. The project on dsource.org might handle that part without issue? Wrapping those library functions with D shells would be nice, if only to take advantage of D arrays (see the sketch just after this list).

4) The transcoders deal with arrays of buffered data, so they're efficient. ICU has transcoders and code-page tables up-the-wazoo.
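
To give a feel for point 3, here is a minimal hand-done sketch. u_strlen and u_toupper are lifted from the ICU C API docs, but the exact signatures - and the version-suffixed symbol names the library actually exports - would need checking against the real headers before trusting any of this:

#    alias wchar UChar;      // ICU's 16-bit code unit
#    alias dchar UChar32;    // ICU's 32-bit code point
#
#    extern (C)
#    {
#        int     u_strlen(UChar* s);       // length in UTF-16 code units
#        UChar32 u_toupper(UChar32 c);     // simple (non-locale) uppercase mapping
#    }
#
#    // A thin D shell so callers work with D arrays instead of pointer+length
#    wchar[] toUpper(wchar[] s)
#    {
#        wchar[] result = s.dup;
#        for (size_t i = 0; i < result.length; i++)
#        {
#            // good enough for a sketch: skip surrogate halves, map the rest
#            if (result[i] < 0xD800 || result[i] > 0xDFFF)
#                result[i] = cast(wchar) u_toupper(result[i]);
#        }
#        return result;
#    }

A real shell would lean on ICU's whole-string functions (something like u_strToUpper) rather than a per-code-unit loop, but the shape is the same: pointer plus length in, D array out.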

For those who haven't looked through the lib, it's far more than just Unicode transcoders (as Jill notes): you get sophisticated and flexible date & number parsing/formatting; I18N message ordering; BiDi support; collation/sorting support; text-break algorithms; a text layout engine; Unicode regex; and much more.

It's a first-class suite of libraries, and an awesome resource to leverage.



"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgcv4n$2fsf$1@digitaldaemon.com...
>
> Following Ben's mention of ICU, I've been checking out what it does and
doesn't
> do. Basically, it does EVERYTHING. There is the work of years there. Not
just
> Unicode stuff, but a whole swathe of classes for internationalization and transcoding. It would take me a very long time to duplicate that. It's
also
> free, open source, and with a license which basically says there's no
problem
> with our using it.
>
> So I'm thinking seriously about ditching the whole etc.unicode project and replacing it with a wrapper around ICU.
>
> It's not completely straightforward. ICU is written in C++ (not C), and so
we
> can't link against it directly. It uses classes, not raw functions. So,
I'd have
> to write a C wrapper around ICU which gave me a C interface, and /then/
I'd have
> to write a D wrapper to call the C wrapper - at which point we could get
the
> classes back again (and our own choice of architecture, so plugging into
std or
> mango streams won't suffer).
>
> But the outermost (D) wrapper can, at least, be composed of D classes.
>
> If we want D to be the language of choice for Unicode, we would need all
this
> functionality. So, if we went the ICU route, we'd need to bundle ICU
(currently
> a ten megabyte zip file) with DMD, along with whatever wrapper I come up
with.
> (etc.unicode is not likely to be smaller).
>
> I'd like to see some discussion on this. Read this page to inform
yourself:
> http://oss.software.ibm.com/icu/userguide/index.html
>
>
> Finally, back to strings. Ben was right. The ICU says:
>
> "In order to take advantage of Unicode with its large character repertoire
and
> its well-defined properties, there must be types with consistent
definitions and
> semantics. The Unicode standard defines a default encoding based on 16-bit
code
> units. This is supported in ICU by the definition of the UChar to be an
unsigned
> 16-bit integer type. This is the base type for character arrays for
strings in
> ICU."
>
> Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no
special
> string class - a string is just an array of wchars. So obviously, if we go
the
> ICU route, I withdraw my suggestion to ditch the wchar.
>
> What I now recommend is:
>
> (1) Ditch "etc.unicode" in favor of - let's call it "etc.icu" (a D wrapper
> around ICU). Eventually I hope for this to change into "std.icu" (as I
> originally hoped that "etc.unicode" would turn into "std.unicode").
>
> (2) Ditch the char. 8-bits is really too small for a character these days, honestly, and all previous arguments still apply. The existence of char
only
> encourages ASCII and discourages Unicode anyway.
>
> (3) Native D strings shall be arrays of wchars. This means that
> Object.toString() must return a wchar[], and string literals in D source
must
> compile to wchar[]s. UCI's type UChar would map directly to wchar. To
reinforce
> this, there should probably be an
>
> #    alias wchar[] string;
>
> in object.d.
>
> (4) We retain dchar (so that we can get character properties), but all
string
> code is based on wchar[]s, not dchar[]s. UCI's type UChar32 would map
directly
> to dchar.
>
> (5) Transcoding/streams/etc. go ahead as planned, but based around
wchar[]s
> instead of dchar[]s.
>
> (6) That's pretty much it, although once "char" is gone, we could rename
"wchar"
> as "char" (a la Java).
>
>
> Discussion please? And I really do want this talked through because it
affects D
> work I'm currently involved in.
>
> Input is also requested from Walter - in particular the request that Object.toString() be re-jigged to return wchar[] instead of char[].
>
> Okay, let's chew this one over.
>
> Jill
>
>


August 23, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgcv4n$2fsf$1@digitaldaemon.com...
> Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special
> string class - a string is just an array of wchars. So obviously, if we go the
> ICU route, I withdraw my suggestion to ditch the wchar.

Ok, but suppose we ditch char[]. Then, we find some great library we want to bring into D, or build a D interface to, that is in char[].

> Input is also requested from Walter - in particular the request that Object.toString() be re-jigged to return wchar[] instead of char[].

My experience with all-wchar is that its performance is not the best. It'll also become a nuisance interfacing with C. I'd rather explore perhaps making implicit conversions between the 3 UTF types more seamless.
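
For what it's worth, the difference at a call site would be something like this (a sketch assuming std.utf's toUTF16 - check Phobos for the exact names):

#    import std.utf;
#
#    void takesWide(wchar[] s) { /* ... */ }
#
#    void caller(char[] utf8)
#    {
#        takesWide(toUTF16(utf8));   // today: the transcoding step is written out
#        // takesWide(utf8);         // "seamless": the compiler would insert the
#        //                          // equivalent of toUTF16() automatically
#    }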


August 23, 2004
On Mon, 23 Aug 2004 14:06:57 -0700, Walter <newshound@digitalmars.com> wrote:
> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cgcv4n$2fsf$1@digitaldaemon.com...
>> Get that? A *16-BIT* type is the basis for ICU strings. ICU defines no special
>> string class - a string is just an array of wchars. So obviously, if we go the
>> ICU route, I withdraw my suggestion to ditch the wchar.
>
> Ok, but suppose we ditch char[]. Then, we find some great library we want to
> bring into D, or build a D interface to, that is in char[].
>
>> Input is also requested from Walter - in particular the request that
>> Object.toString() be re-jigged to return wchar[] instead of char[].
>
> My experience with all-wchar is that its performance is not the best. It'll
> also become a nuisance interfacing with C. I'd rather explore perhaps making
> implicit conversions between the 3 UTF types more seamless.

YAY! .. Sorry I can't help myself, I think _this_ is the way to go, see my post here:
http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/9494

for my arguments.

I would add one additional argument. I can't imagine the suggested change breaking existing code.. more likely it will fix existing bugs.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 23, 2004
Regan Heath wrote:

>> Ok, but suppose we ditch char[]. Then, we find some great library we want to
>> bring into D, or build a D interface to, that is in char[].

Excuse me if I'm saying something stupid, but wouldn't byte[] do the job of interfacing with C char[]?


August 23, 2004
On Tue, 24 Aug 2004 01:18:09 +0200, Juanjo Álvarez <juanjuxNO@SPAMyahoo.es> wrote:
> Regan Heath wrote:

I didn't write this! :)

>>> Ok, but suppose we ditch char[]. Then, we find some great library we want to
>>> bring into D, or build a D interface to, that is in char[].
>
> Excuse me if I'm saying something stupid, but wouldn't byte[] do the job of
> interfacing with C char[]?

I almost made that same comment in reply to 'Walter' (who made the comment above).

I think you're right: you could use byte[]. In fact it'd be more correct to use byte[], as the C 'char' type is a byte with no specified encoding (whereas D's char[] is UTF-8 encoded).

If we had no char[] you'd have to transcode the byte[] to dchar[]. This is why I disagree with removing char[]: I think char[] has a place in D; I just want to see implicit transcoding between char[], wchar[] and dchar[].
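
To illustrate the point about C's char being "just bytes" (puts here is only a stand-in for any C function taking a char*; a sketch, not a recommendation):

#    import std.string;   // toStringz
#
#    // The binding could be declared either way - same symbol, different promise:
#    extern (C) int puts(char* s);      // reads as "expects UTF-8 (or plain ASCII)"
#    //extern (C) int puts(byte* s);    // reads as "just bytes, encoding unspecified"
#
#    void greet()
#    {
#        char[] msg = "hello, world";   // D guarantees this is valid UTF-8
#        puts(toStringz(msg));          // C only cares about the pointer and the \0
#    }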

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/