October 13, 2013 Inconsitency

Why does <string>.length return the number of bytes and not the number of UTF-8 characters, whereas <wstring>.length and <dstring>.length return the number of UTF-16 and UTF-32 characters?

Wouldn't it be more consistent to have <string>.length return the number of UTF-8 characters as well (instead of having to use std.utf.count(<string>))?

October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>
> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

Because `length` must be an O(1) operation for built-in arrays. For UTF-8 strings, returning a character count in O(1) would require storing an additional length field, which would make strings binary-incompatible with other array types.
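
A minimal sketch of that layout argument, assuming a current D compiler with Phobos; `asBytes` is a hypothetical helper introduced only for this illustration and is not part of the thread or of the standard library:

```d
import std.stdio : writeln;

// Hypothetical helper (not from the thread): views a string's storage as raw bytes.
immutable(ubyte)[] asBytes(string s)
{
    return cast(immutable(ubyte)[]) s;
}

void main()
{
    // Every slice type, string included, is the same pointer-plus-length pair.
    static assert(string.sizeof == (int[]).sizeof);
    static assert(string.sizeof == 2 * size_t.sizeof);

    string s = "säд";
    // .length only reads the stored field; nothing is decoded or counted.
    writeln(s.length);          // number of UTF-8 code units
    writeln(asBytes(s).length); // the same number, seen as bytes
}
```

Because a `string` carries no extra field, it can be reinterpreted as a byte array without copying, which is what the cast in `asBytes` relies on.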
October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On 13-Oct-2013 16:36, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>

???
This is simply wrong. All strings return the number of code units. And it's only in UTF-32 that a code point (~ character) happens to fit into one code unit.

> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

It's consistent as is.

--
Dmitry Olshansky

October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>
> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

Technically, UTF-16 can use two ushorts (a surrogate pair) for one character, so <wstring>.length returns the number of ushorts, not the number of UTF-16 characters.
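
The surrogate-pair case can be sketched as follows, assuming Phobos' `std.utf.count`; the choice of U+1D11E as the test character is mine, not the poster's:

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
    // Plane, so UTF-16 has to encode it as a surrogate pair.
    wstring clef = "\U0001D11E"w;
    writeln(clef.length); // 2 (code units)
    writeln(count(clef)); // 1 (code point)

    // All characters of "säд" are inside the BMP, so both numbers agree.
    wstring bmp = "säд"w;
    writeln(bmp.length); // 3
    writeln(count(bmp)); // 3
}
```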
October 13, 2013 Re: Inconsitency

Posted in reply to Dmitry Olshansky
> This is simply wrong. All strings return number of codeunits. And it's only UTF-32 where codepoint (~ character) happens to fit into one codeunit.

I do not agree:

    writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
    writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
    writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
    writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.

October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
> I do not agree:
>
> writeln("säд".length); => 5 chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
> writeln(std.utf.count("säд")) => 3 chars: 5 (ibidem)
> writeln("säд"w.length); => 3 chars: 6 (2 x 3, UTF-16)
> writeln("säд"d.length); => 3 chars: 12 (4 x 3, UTF-32)
>
> This is not consistent - from my point of view.

Because you have the wrong understanding of what "length" means.

October 13, 2013 Re: Inconsitency

Posted in reply to Dicebot

Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?

October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?

Depends on how you define the "length" of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if you ignore the question of encoding (UTF-8, UTF-16, …).
David
October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On 13.10.2013 15:25, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?
> Do you always use count(), or is there an alternative?

The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast, and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend reading up on the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>).

arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]), and its value is consistent for all string and array types for this purpose.
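
To illustrate the indexing and slicing point, here is a short sketch assuming Phobos' `std.utf.stride`, which reports how many code units the sequence starting at a given index occupies. The sample string is the one used earlier in the thread:

```d
import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string s = "säд";

    // Slice bounds are code unit offsets, the same unit that .length uses.
    writeln(s[0 .. 1]);        // s
    writeln(s[1 .. 3]);        // ä
    writeln(s[3 .. s.length]); // д

    // Walk the string one code point at a time using code unit offsets.
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = stride(s, i); // code units in the sequence at offset i
        writeln(i, ": ", s[i .. i + n]);
        i += n;
    }
}
```

Slicing at an offset that falls inside a multi-byte sequence (for example `s[1 .. 2]`) compiles but yields invalid UTF-8, which is why the offsets above come from `stride`.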
October 13, 2013 Re: Inconsitency

Posted in reply to nickles
On 13-Oct-2013 17:25, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?
> Do you always use count(), or is there an alternative?

It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters, but I don't mind it.

Measuring the number of visible characters isn't trivial, but it can be done by counting the number of graphemes. For simple alphabets, counting code points will do the trick as well (which is what count does).

--
Dmitry Olshansky
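
The three measures can be compared side by side in a small sketch. It assumes `std.uni.byGrapheme` and `std.range.walkLength` from a Phobos release newer than this thread; the combining-accent string is an example chosen here, not one from the posts:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): displayed as one character.
    string s = "e\u0301";

    writeln(s.length);                // 3 code units (1 + 2 bytes in UTF-8)
    writeln(count(s));                // 2 code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```

Note that the grapheme count is O(n) in the length of the string, unlike `.length`.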
Copyright © 1999-2021 by the D Language Foundation