Inconsistency
October 13, 2013
Why does <string>.length return the number of bytes and not the
number of UTF-8 characters, whereas <wstring>.length and
<dstring>.length return the number of UTF-16 and UTF-32
characters?

Wouldn't it be more consistent to have <string>.length return the
number of UTF-8 characters as well (instead of having to use
std.utf.count(<string>))?
October 13, 2013
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>
> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

Because `length` must be an O(1) operation for built-in arrays. For UTF-8 strings to report a character count in O(1), they would have to store an additional length field, which would make them binary incompatible with other array types.
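
A minimal sketch of this point, assuming a recent D compiler with Phobos (the variable name `s` is only for illustration). A slice is a length plus a pointer, so `.length` is a plain field read, while `std.utf.count` has to walk the data:

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    string s = "säд";

    // Every slice, string or not, has the same two-field layout:
    // a length and a pointer.
    static assert(string.sizeof == (int[]).sizeof);
    static assert(string.sizeof == size_t.sizeof + (void*).sizeof);

    writeln(s.length); // stored field, O(1): 5 UTF-8 code units
    writeln(s.count);  // walks the string, O(n): 3 code points
}
```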
October 13, 2013
On 13-Oct-2013 16:36, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>

???
This is simply wrong. For every string type, .length returns the number of code units. It's only in UTF-32 that a code point (~ character) always fits into one code unit.
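
To make that concrete, here is a small sketch, assuming a recent D compiler (the variable names are made up for the example). It uses U+1D11E, a code point outside the Basic Multilingual Plane, so that UTF-16 also needs more than one code unit:

```d
import std.stdio : writeln;

void main()
{
    // U+1D11E (musical symbol G clef) in the three encodings.
    string  s8  = "\U0001D11E";
    wstring s16 = "\U0001D11E"w;
    dstring s32 = "\U0001D11E"d;

    writeln(s8.length);  // 4 UTF-8 code units
    writeln(s16.length); // 2 UTF-16 code units (a surrogate pair)
    writeln(s32.length); // 1 UTF-32 code unit

    // The two UTF-16 code units are the high and low surrogate.
    assert(s16[0] == 0xD834 && s16[1] == 0xDD1E);
}
```

In all three cases .length counts code units, so wstring is not special; it only looks like a character count as long as the text stays inside the BMP.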

> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

It's consistent as is.

-- 
Dmitry Olshansky
October 13, 2013
On Sunday, 13 October 2013 at 12:36:20 UTC, nickles wrote:
> Why does <string>.length return the number of bytes and not the
> number of UTF-8 characters, whereas <wstring>.length and
> <dstring>.length return the number of UTF-16 and UTF-32
> characters?
>
> Wouldn't it be more consistent to have <string>.length return the
> number of UTF-8 characters as well (instead of having to use
> std.utf.count(<string>))?

Technically, UTF-16 can need 2 ushorts (a surrogate pair) for 1 character, so <wstring>.length returns the number of ushorts, not the number of UTF-16 characters.
October 13, 2013
> This is simply wrong. All strings return number of codeunits. And it's only UTF-32 where codepoint (~ character) happens to fit into one codeunit.

I do not agree:

   writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
   writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
   writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
   writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)

This is not consistent - from my point of view.
October 13, 2013
On Sunday, 13 October 2013 at 13:14:59 UTC, nickles wrote:
> I do not agree:
>
>    writeln("säд".length);        => 5  chars: 5 (1 + 2 [C3A4] + 2 [D094], UTF-8)
>    writeln(std.utf.count("säд")) => 3  chars: 5 (ibidem)
>    writeln("säд"w.length);       => 3  chars: 6 (2 x 3, UTF-16)
>    writeln("säд"d.length);       => 3  chars: 12 (4 x 3, UTF-32)
>
> This is not consistent - from my point of view.

Because you have a wrong understanding of what "length" means.
October 13, 2013
Ok, if my understanding is wrong, how do YOU measure the length of a string?
Do you always use count(), or is there an alternative?


October 13, 2013
On Sunday, 13 October 2013 at 13:25:08 UTC, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?

Depends on how you define the "length" of a string. Doing that is surprisingly difficult once the full variety of Unicode code points comes into play, even if you ignore the question of encoding (UTF-8, UTF-16, …).

David
October 13, 2013
On 13.10.2013 15:25, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?
> Do you always use count(), or is there an alternative?
>
>

The thing is that even count(), which gives you the number of *code points*, doesn't necessarily give you what is desired, namely the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend reading up on the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>).

arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]), and for this purpose its value is consistent across all string and array types: it is always the number of elements, which for strings means code units.
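
A short sketch of how .length pairs with indexing and slicing, assuming a recent D compiler with Phobos (the loop and variable names are only illustrative). `std.utf.stride` reports how many code units the code point at a given index occupies, which keeps the slices on valid boundaries:

```d
import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string s = "säд";

    // Slice bounds are code-unit offsets; s.length is one past the end.
    assert(s[0 .. s.length] == s);

    // Step through the string one code point at a time.
    size_t i = 0;
    while (i < s.length)
    {
        immutable n = stride(s, i); // code units used by this code point
        writeln(s[i .. i + n]);     // prints s, ä, д on separate lines
        i += n;
    }
}
```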
October 13, 2013
On 13-Oct-2013 17:25, nickles wrote:
> Ok, if my understanding is wrong, how do YOU measure the length of a string?
> Do you always use count(), or is there an alternative?
>
>
It's all there:
http://www.unicode.org/glossary/
http://www.unicode.org/versions/Unicode6.3.0/

I measure string length in code units (as defined in the above standard). This bears no easy relation to the number of visible characters but I don't mind it.

Measuring the number of visible characters isn't trivial, but it can be done by counting graphemes. For simple alphabets, counting code points (which is what count does) will do the trick as well.
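
A rough sketch of the three measures side by side, assuming a Phobos version that provides `std.uni.byGrapheme` (the sample string is just an illustration):

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    // "e" followed by U+0301 (combining acute accent): displayed as one character.
    string s = "e\u0301";

    writeln(s.length);                // 3 code units (1 + 2 bytes)
    writeln(s.count);                 // 2 code points
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}
```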

-- 
Dmitry Olshansky