February 20, 2005 character indexing [was Re: strings in D]
Posted in reply to Anders F Björklund
>> String as an entity is a sequence of "code points" - ascii, ucs-2(basic
>> multilang plane)
>> and ucs-4 so operator[] always returns character in full (for the given
>> supported plane).
>> The same should apply to foreach().
>
>You can "foreach dchar", over all three string types.
>If you want to index by code point, you will need to
>convert the two smaller code units to UTF-32 first...
A while ago I posted some tiny helper functions to do on-the-fly character indexing, but I can't find them so I'll just post them again in case the OP finds them useful:
# import std.utf;
#
# // return the index in s (in UTF-8 code units) where the nth character starts
# size_t character(char[] s, size_t n) {
#     size_t i=0;
#     while (n--) decode(s,i);   // decode advances i past one character
#     return i;
# }
#
# // same for UTF-16: index in wchar code units where the nth character starts
# size_t character(wchar[] s, size_t n) {
#     size_t i=0;
#     while (n--) decode(s,i);
#     return i;
# }
-Ben
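
For readers trying the helper above, here is a minimal usage sketch. It assumes a current D compiler, where string literals are immutable, so the parameter is written as `const(char)[]` rather than the `char[]` of the 2005 post; the sample string and the `main` function are illustrative only. It also assumes the input is valid UTF-8 and contains at least n characters.

```d
import std.stdio : writeln;
import std.utf : decode;

// Code-unit index at which the nth code point of s starts (n counts from 0).
// Assumption: s is valid UTF-8 and holds at least n code points.
size_t character(const(char)[] s, size_t n)
{
    size_t i = 0;
    while (n--)
        decode(s, i); // decode advances i past one whole code point
    return i;
}

void main()
{
    string s = "Björklund";           // 'ö' occupies two UTF-8 code units
    size_t start = character(s, 2);   // where the third character begins
    size_t end   = character(s, 3);   // where the fourth character begins
    writeln(start, " ", end);         // prints "2 4"
    writeln(s[start .. end]);         // prints "ö"
}
```

Each call walks the string from the start, so indexing by character this way costs time proportional to n; that is the price of not converting to UTF-32 first.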
February 20, 2005 Re: strings in D
Posted in reply to Thomas Kühne
Thomas Kühne wrote:

> | Note that "char" only holds ASCII in
> | D, wchar must be used for Latin-1.
>
> clarification
>
> char:
> can only hold 0x00 -> 0x80, otherwise it's an illegal UTF-8 fragment

Yes, that's what I said :-) (not my fault char[] sounds a lot like char)

And that should probably be 0x00-0x7F, or 0x00..0x80 in exclusive style?
We mean the same thing, the 7-bit ASCII subset of ISO-8859-1 and UTF-8.
(as in the table: http://www.algonet.se/~afb/d/latin1/iso-8859-1.html)

> TYPE   ALIAS     // RANGE
> char   utf8_t    // \x00-\x7F (ASCII)
> wchar  utf16_t   // \u0000-\uD7FF, \uE000-\uFFFF
> dchar  utf32_t   // \U00000000-\U0010FFFF (Unicode)

66 codepoints are invalid "noncharacters", but that's beside the point.
(\uFDD0-\uFDEF,\uFFFE-\uFFFF http://www.unicode.org/faq/utf_bom.html#40)

The code unit arrays, char[]/wchar[]/dchar[] can all hold any UTF string
But only "dchar" is fully standalone for all different codepoint values.

This does not stop "char" and "wchar" from being useful for loops and
other special uses, just as the limitations are being accounted for ?

--anders
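
The distinction between code units and code points discussed above can be seen in a small sketch. It assumes a current D compiler; the helper name `countCodePoints` and the sample text are made up for illustration. The same text is stored in all three array types, and `foreach` with a `dchar` loop variable decodes each of them to whole code points.

```d
import std.stdio : writeln;

// Count code points by letting foreach decode the string to dchar.
// Hypothetical helper; works for char[], wchar[] and dchar[] strings alike.
size_t countCodePoints(S)(S s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    // U+1D11E lies outside the BMP: 4 UTF-8 units, 2 UTF-16 units, 1 UTF-32 unit.
    string  s8  = "Kühne \U0001D11E";
    wstring s16 = "Kühne \U0001D11E"w;
    dstring s32 = "Kühne \U0001D11E"d;

    // .length counts code units, so it differs per encoding.
    writeln(s8.length, " ", s16.length, " ", s32.length);   // prints "11 8 7"

    // Decoding to dchar gives the same number of code points for all three.
    writeln(countCodePoints(s8), " ", countCodePoints(s16), " ",
            countCodePoints(s32));                           // prints "7 7 7"
}
```

Only in the `dchar` array does the length equal the number of code points, which matches the point that a single `dchar` is the only element type able to stand alone for any code point value.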