April 07, 2003
Hello.

While char[] is a good, native fit for working with the console, simple text files, and such, it is not a solution for applications processing any data subject to internationalisation. And that is almost every piece of text out there today.

However, i keep thinking that one dedicated string class is not nearly enough. I propose - not one, not two - but *at least* three of them.

First, the basic one -
--- a String type - should be an array of 4-byte characters. It is used inside functions for processing strings. With modern processors, handling 4-byte values may be cheaper than handling 2-byte ones and not much costlier than 1-byte ones. As to space considerations - forget them, this type is for local chewing only. If you want to keep a string in memory or in a database, consider the second one -
--- a CompactString type - should consist of 2 arrays, the first one for raw characters, the second one for mode changes. The second one is the key (a rough sketch follows below). It should store a list of events like "at character x, change to codepage y, encoding z" or "at character x, insert an exceptional 4-byte value", each of which could be packed into a few bytes. It should also be quite fast to handle, since unlike UTF-7/8/16 the raw string need not be scanned to determine its length; that can be done by scanning the mode changes, which are an order of magnitude or two shorter. And it can adapt itself to whatever takes the least space - 8-bit with an explicit codepage for e.g. European and Russian text, 16-bit for Japanese kanji and the like, or even 32-bit in the rare case that you mix all languages evenly. But this type would not be directly standards-compliant. There should obviously also be -
--- another type, which corresponds to the underlying system's preferred encoding.
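
Roughly, i imagine the first two types looking something like this (all names and field layouts are placeholders, sketched in D-ish code, not a worked-out design):

    // placeholders only; assume a 32-bit character type called dchar
    alias String = dchar[];      // the plain 4-byte type, for local processing

    struct ModeChange
    {
        size_t byteOffset;       // where in the raw data this mode starts
        size_t charIndex;        // which character that offset corresponds to
        ubyte  width;            // 1, 2 or 4 bytes per raw character from here on
        ushort codepage;         // codepage id, meaningful only in 1-byte mode
    }

    struct CompactString
    {
        ubyte[]      raw;        // raw character data, interpreted via modes
        ModeChange[] modes;      // usually just one or two entries per string

        // character count: only the short mode list is scanned, never raw
        size_t length()
        {
            if (modes.length == 0)
                return 0;
            ModeChange last = modes[modes.length - 1];
            return last.charIndex + (raw.length - last.byteOffset) / last.width;
        }
    }

The point is that length (and random indexing within a single mode) never has to decode the raw bytes the way UTF-8 does; only the tiny mode list is consulted.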

A set of functions also has to be provided to convert any of these types to and from any of the other standard Unicode types. As to templates - i don't think much of them for this purpose. There is only a limited number of character types - you don't want to create a string of floats, do you? And besides, their handling differs in some ways. But making the string types into classes could give further flexibility, at the price of an (8-byte, IIRC) space overhead per instance.
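
Coming back to the conversions - they could look something like this (hypothetical names and signatures only, just to show the shape of it):

    // hypothetical declarations; nothing here is meant to be final
    String        toString32(CompactString s);   // unpack to 4-byte characters
    CompactString toCompact(String s);           // re-pack as tightly as possible
    char[]        toUTF8(String s);              // for files, the network, the OS
    wchar[]       toUTF16(String s);
    String        fromUTF8(char[] data);
    String        fromUTF16(wchar[] data);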

-i.

PS. i've been away for a short while... 300 new messages! i wonder how Walter manages to read them AND maintain two complex compilers!


Jonathan Andrew wrote:
> Hello,
> 
> I think that an array of chars was probably appropriate, but now that Unicode is
> being considered for the language, I think a primitive string type might be
> necessary, because an array of unevenly sized chars would be very awkward when
> talking about indexing (i.e., is mystr[6] talking about the 6th byte, or the 6th
> character?) and declaring new strings ("char [40] mystr;", is this 40 bytes, or
> 40 characters long?). Basically, stuff that has already been talked about in
> here.
> A dedicated string type might resolve some of this ambiguity by providing, for
> example, both .length (characters) and .size properties (byte-size). Stuff that
> is important for strings, but not really appropriate for other array types. I
> don't really care too much either way, and if we are stuck with good old ASCII,
> it really doesn't matter either way. But if Unicode is put in, then some
> mechanism should be put in place to take care of these issues, whether it's a
> string type or not. And yes, this is probably going to be the start of a long,
> painful thread. =)
> 
> -Jon

April 15, 2003
"Ilya Minkov" <midiclub@8ung.at> wrote in message news:b6t1i8$nt4$1@digitaldaemon.com...
> [...]
> However, i keep thinking that one dedicated string class is not nearly enough. I propose - not one, not two - but *at least* three of them.
> [...]
>

Why complicate our lives? D should use 16-bit Unicode and provide implicit conversions in any I/O, according to the environment. These conversions should be transparent to the user. 65536 characters are enough to represent most Earth languages...



April 16, 2003
Achilleas Margaritis wrote:
> 
> Why complicate our lives? D should use 16-bit Unicode and provide implicit
> conversions in any I/O, according to the environment. These conversions should
> be transparent to the user. 65536 characters are enough to represent most
> Earth languages...
> 

It doesn't make life that easy, either. IIRC you have less than 1/4 of that set actually available. Besides, how do you treat separate accents if they cannot be combined into precomposed letters? That case is really rare, though.

32-bit is better for speed. And besides, you have already let everyone down for years with that endless "we don't care, since *our* language fits in seven bits. Now the rest of the world may share the leftovers, if they like." So aren't we letting someone down again with 16 bits?

Besides, the second type i proposed would usually use 1 byte per character for European, Cyrillic, Arabic, Hebrew, Greek and similar scripts, unlike UTF-8 and UTF-16, which *both* require 2, so it is better for storage!
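
To put a rough number on it, take the 6-letter Russian word "Привет" (the byte counts are by hand; the little D-style snippet below just echoes them):

    import std.stdio;

    void main()
    {
        // 6 Cyrillic letters, each 2 bytes in both UTF-8 and UTF-16
        writefln("UTF-8 : %s bytes", "Привет".length);                  // 12
        writefln("UTF-16: %s bytes", "Привет"w.length * wchar.sizeof);  // 12
        // a single-codepage encoding (KOI8-R or CP1251) needs just 6 bytes,
        // plus one small mode-change record in the CompactString scheme i proposed
    }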

-i.

April 16, 2003
As Unicode 3.0 shows, 16 bits is not enough. The code space runs up to U+10FFFF, so 21 bits are needed, a width which of course isn't very practical on today's computers.

   Dan

"Achilleas Margaritis" <axilmar@in.gr> wrote in message news:b7h8tm$t5j$1@digitaldaemon.com...
> [...]
> be transparent to the user. 65536 characters are enough to represent most
> Earth languages...
>

