character indexing [was Re: strings in D] (page 3)

>> String as an entity is a sequence of "code points" - ascii, ucs-2 (basic
>> multilang plane)
>> and ucs-4 so operator[] always returns character in full (for the given
>> supported plane).
>> The same should apply to foreach().
>
>You can "foreach dchar", over all three string types.
>If you want to index by code point, you will need to
>convert the two smaller code units to UTF-32 first...

A while ago I posted some tiny helper functions to do on-the-fly character indexing, but I can't find them so I'll just post them again in case the OP finds them useful:

# import std.utf;
#
# // return the index of the nth character in s
# size_t character(char[] s, size_t n) {
#   size_t i=0;
#   while (n--) decode(s,i);
#   return i;
# }
#
# // return the index of the nth character in s
# size_t character(wchar[] s, size_t n) {
#   size_t i=0;
#   while (n--) decode(s,i);
#   return i;
# }

-Ben
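For anyone wanting to try the helper above, here is a minimal usage sketch. It assumes a current D2 compiler and Phobos' `std.utf.decode`, so the parameter is `const(char)[]` rather than the `char[]` in the post; `charAt` is a hypothetical wrapper added only for illustration.

```d
import std.utf : decode;

// Same idea as the helper in the post: code unit offset of the nth
// code point (0-based). Costs O(n) decodes per call.
size_t character(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (n--)
        decode(s, i); // advances i past one whole code point
    return i;
}

// Hypothetical wrapper (not in the post): the nth code point itself.
dchar charAt(const(char)[] s, size_t n)
{
    size_t i = character(s, n);
    return decode(s, i);
}

void main()
{
    string s = "Bj\u00F6rklund";          // \u00F6 is two UTF-8 code units
    assert(character(s, 2) == 2);         // 'B' and 'j' take one unit each
    assert(character(s, 3) == 4);         // skipping \u00F6 costs two units
    assert(charAt(s, 2) == '\u00F6');
    assert(charAt(s, 3) == 'r');
}
```

Since every call walks the string from the start, this suits occasional lookups; for heavy random access, converting to UTF-32 once is probably the better trade.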

February 20, 2005

Re: strings in D

Posted by Anders F Björklund
in reply to Thomas Kühne

Thomas Kühne wrote:

> | Note that "char" only holds ASCII in
> | D, wchar must be used for Latin-1.
> 
> clarification
> 
> char:
> can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment

Yes, that's what I said :-) (not my fault char[] sounds a lot like char)
And that should probably be 0x00-0x7F, or 0x00..0x80 in exclusive style?

We mean the same thing, the 7-bit ASCII subset of ISO-8859-1 and UTF-8.
(as in the table: http://www.algonet.se/~afb/d/latin1/iso-8859-1.html)

>    TYPE        ALIAS     // RANGE
>    char        utf8_t    // \x00-\x7F (ASCII)
>   wchar       utf16_t    // \u0000-\uD7FF, \uE000-\uFFFF
>   dchar       utf32_t    // \U00000000-\U0010FFFF (Unicode)

66 codepoints are reserved "noncharacters", but that's beside the point.
(\uFDD0-\uFDEF, plus the last two codepoints of each plane such as \uFFFE-\uFFFF; see http://www.unicode.org/faq/utf_bom.html#40)
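As a rough illustration of the ranges in the quoted table, the sketch below checks whether a code point fits in a single code unit of each type. The `fitsIn*` names are hypothetical; `isValidDchar` is from Phobos' `std.utf` and, unlike the table's dchar row, also rejects the surrogate range.

```d
import std.utf : isValidDchar;

// Hypothetical helpers: can code point c be stored in ONE code unit?
bool fitsInChar(dchar c)  { return c <= 0x7F; }                 // ASCII only
bool fitsInWchar(dchar c) { return c <= 0xD7FF
                                || (c >= 0xE000 && c <= 0xFFFF); } // BMP minus surrogates
bool fitsInDchar(dchar c) { return isValidDchar(c); }           // up to U+10FFFF, no surrogates

void main()
{
    assert( fitsInChar('A'));
    assert(!fitsInChar(0xF6));                  // Latin-1 o-umlaut needs wchar or wider
    assert( fitsInWchar(0xF6));
    assert(!fitsInWchar(0xD800));               // a lone surrogate is not a character
    assert(!fitsInWchar(cast(dchar) 0x1D11E));  // outside the BMP
    assert( fitsInDchar(cast(dchar) 0x1D11E));
    assert(!fitsInDchar(cast(dchar) 0x110000)); // past the end of Unicode
}
```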

The code unit arrays, char[]/wchar[]/dchar[], can all hold any UTF string.
But only "dchar" is fully standalone for all different codepoint values.
This does not stop "char" and "wchar" from being useful for loops and other special uses, just as long as the limitations are accounted for.
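To make that concrete, here is a small sketch (assuming a current D2 compiler and Phobos' `std.utf`) that stores the same text in all three array types. The code unit counts differ, but a "foreach dchar" loop sees the same four code points each time. `countCodePoints` is a hypothetical helper name.

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: count code points by letting foreach decode to dchar.
size_t countCodePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+0061, U+00F6, U+20AC, U+1D11E: 1, 2, 3 and 4 bytes in UTF-8
    string  s8  = "a\u00F6\u20AC\U0001D11E";
    wstring s16 = toUTF16(s8);
    dstring s32 = toUTF32(s8);

    // .length counts code units, not characters
    assert(s8.length  == 10);
    assert(s16.length == 5);   // the last code point is a surrogate pair
    assert(s32.length == 4);

    // decoding to dchar gives the same answer for all three
    assert(countCodePoints(s8)  == 4);
    assert(countCodePoints(s16) == 4);
    assert(countCodePoints(s32) == 4);
}
```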

--anders
