Inconsistency (page 3)

October 13, 2013

Re: Inconsistency

Posted by monarch_dodra
in reply to nickles


On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
> Ok, I understand, that "length" is - obviously - used in analogy to any array's length value.
>
> Still, this seems to be inconsistent. D elaborates on implementing "char"s as UTF-8 which means that a "char" in D can be of any length between 1 and 4 bytes for an arbitrary Unicode code point. Shouldn't then this (i.e. the character's length) be the "unit of measurement" for "char"s - like e.g. the size of the underlying struct in an array of "struct"s? The story continues with indexing "string"s: In a consistent implementation, shouldn't
>
>    writeln("säд"[2])
>
> return "д" instead of the trailing surrogate of this cyrillic letter?

I think the root misunderstanding is that you think a string is random access.

A string *isn't* random access. It is stored *inside* an array, but unless you know *exactly* what you are doing, you shouldn't index, slice or take the length of a string.

A string should be handled like a bidirectional range.

Once you've understood that, it becomes much simpler.
You want the first character? front.
You want to skip the first character? popFront.

You want an arbitrary character in O(N) time?
myString.dropExactly(N).front;
You want an arbitrary character in O(1) time?
You can't.
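
A minimal sketch of the range-based access described above, assuming a Phobos version in which `std.range` exposes `front`, `popFront`, `dropExactly` and `walkLength` for strings. The helper `nthCodePoint` is a hypothetical name introduced only for illustration; the string uses escapes for "säд" so that the byte layout does not depend on the source file's encoding.

```d
import std.range : dropExactly, front, popFront, walkLength;

/// Hypothetical helper (not part of Phobos): the n-th code point of `s`.
/// O(n), because every preceding code point must be decoded and skipped.
/// `s` must contain at least n + 1 code points.
dchar nthCodePoint(string s, size_t n)
{
    return s.dropExactly(n).front;
}

void main()
{
    string s = "s\u00E4\u0434";   // "säд"

    assert(s.length == 5);        // UTF-8 code units: 1 + 2 + 2
    assert(s.walkLength == 3);    // code points, found by walking the range

    assert(s.front == 's');       // first code point, decoded to dchar

    auto rest = s;                // copies the slice only; the data is shared
    rest.popFront();              // skips one whole code point
    assert(rest.front == '\u00E4');

    assert(nthCodePoint(s, 2) == '\u0434');

    // Indexing is O(1) but yields one code unit: here the second byte of 'ä'.
    assert(s[2] == 0xA4);
}
```

This is why `"säд"[2]` does not print "д": the index counts code units, while `front` and `popFront` step over code points.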

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
> Ok, I understand, that "length" is - obviously - used in analogy to any array's length value.

That isn't an analogy. It is usually a good idea to try to understand things before judging whether they are consistent.

I've found another inconsistency.

    void foo(const char *);
    void foo(const wchar *);
    void foo(const dchar *);

    void main() {
        foo(`123`);
        foo(`123`w);
        foo(`123`d);
    }

Error: function hello.foo (const(char*)) is not callable using argument types (immutable(wchar)[])
Error: function hello.foo (const(char*)) is not callable using argument types (immutable(dchar)[])

And typeof(`123`).stringof == `string`. Why can `123` be stored as a null-terminated UTF-8 string in the rdata segment, while `123`w and `123`d cannot? For example, wide strings (UTF-16) are usable with the Windows *W functions.

On Sunday, 13 October 2013 at 22:34:00 UTC, Temtaime wrote:
> I've found another one inconsitency problem.
>
> void foo(const char *);
> void foo(const wchar *);
> void foo(const dchar *);
>
> void main() {
>     foo(`123`);
>     foo(`123`w);
>     foo(`123`d);
> }
>
> Error: function hello.foo (const(char*)) is not callable using argument types (immutable(wchar)[])
> Error: function hello.foo (const(char*)) is not callable using argument types (immutable(dchar)[])
>
> And typeof(`123`).stringof == `string`. Why `123` can be stored as null terminated utf8 string in rdata segment and `123`w nor `123`d are not? For example wide strings(utf16) are usable with windows *W functions.

The first one is made to interface with C. It is a special case.
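
Where the implicit conversion is not available, one possible workaround is to request a zero-terminated copy explicitly. The sketch below assumes `std.utf.toUTF16z` and `std.utf.toUTFz` from Phobos; `lengthZ` is a hypothetical helper written only to check the terminator, and none of this is taken from the posts above.

```d
import std.utf : toUTF16z, toUTFz;

/// Hypothetical helper: counts the elements before the terminating zero.
size_t lengthZ(T)(const(T)* p)
{
    size_t n;
    while (p[n] != 0)
        ++n;
    return n;
}

void main()
{
    // The special case: a char string literal converts implicitly.
    const(char)* c = "123";

    // Explicit conversions for the wider encodings.
    const(wchar)* w = toUTF16z("123");
    const(dchar)* d = toUTFz!(const(dchar)*)("123");

    assert(lengthZ(c) == 3);
    assert(lengthZ(w) == 3);
    assert(lengthZ(d) == 3);
    assert(w[0] == '1' && d[2] == '3');
}
```

A pointer obtained this way could then be handed to a function such as the `foo(const wchar *)` overload in the example.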

On 10/14/13, Temtaime <temtaime@gmail.com> wrote:
> And typeof(`123`).stringof == `string`. Why `123` can be stored as null terminated utf8 string in rdata segment and `123`w nor `123`d are not? For example wide strings(utf16) are usable with windows *W functions.

http://d.puremagic.com/issues/show_bug.cgi?id=6032

It's easy to state this, but - please - don't get sarcastic! I'm obviously (as I've learned) speaking about UTF-8 "char"s as they are NOT implemented right now in D; so I'm criticizing that D, as a language which emphasizes "UTF-8 characters", isn't taking "the last step", as e.g. Python does (and no, I'm not a Python fan, nor do I consider D a bad language).

On Sunday, 13 October 2013 at 13:40:21 UTC, Sönke Ludwig wrote:
> On 13.10.2013 15:25, nickles wrote:
>> Ok, if my understandig is wrong, how do YOU measure the length of a string?
>> Do you always use count(), or is there an alternative?
>
> The thing is that even count(), which gives you the number of *code points*, isn't necessarily what is desired - that is, the number of actual display characters. UTF is quite a complex beast and doing any operations on it _correctly_ generally requires a lot of care. If you need to do these kinds of operations, I would highly recommend to read up the basics of UTF and Unicode first (quick overview on Wikipedia: <http://en.wikipedia.org/wiki/Unicode#Mapping_and_encodings>).
>
> arr.length is meant to be used in conjunction with array indexing and slicing (arr[...]) and its value is consistent for all string and array types for this purpose.

I recently discovered a bug in my program. If you take the letter "é" for example (Linux, Ubuntu 12.04), std.utf.count() returns 1 and .length returns 2. I needed the length to slice the string at a given point. Using .length instead of std.utf.count() fixed the bug.
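
The kind of bug described here can be reproduced with a small sketch. The helper `afterPrefix` and the sample text are hypothetical, chosen only to show that slice bounds are code-unit offsets; "é" is written as the escape `\u00E9` so the example does not depend on how the source file encodes it.

```d
import std.utf : count;

/// Hypothetical helper: drops a known prefix from `s`.
/// Slice bounds are code-unit offsets, so the prefix's .length is used.
string afterPrefix(string s, string prefix)
{
    return s[prefix.length .. $];
}

void main()
{
    string e = "\u00E9";                 // "é" as one precomposed code point
    assert(e.length == 2);               // two UTF-8 code units
    assert(count(e) == 1);               // one code point

    string s = "\u00E9t\u00E9 chaud";    // "été chaud"
    string prefix = "\u00E9t\u00E9";     // "été": 5 code units, 3 code points

    assert(afterPrefix(s, prefix) == " chaud");

    // Using the code point count as a slice bound cuts too early.
    assert(s[count(prefix) .. $] == "\u00E9 chaud");
}
```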

On 10/14/13 1:09 AM, nickles wrote:
> It's easy to state this, but - please - don't get sarcastical!

Thanks for making this point. String handling in D follows two simple principles:

1. The support is a slice of code units (which often are immutable, seeing as string is an alias for immutable(char)[]). Slice primitives are readily accessible.

2. The standard library (and the foreach language construct) recognize that arrays of code units are special and define bidirectional range primitives on top of them. These are empty, save, front, back, popFront, and popBack.

So for a string you may use the range primitives and related algorithms to manipulate code points, or the slice primitives to manipulate code units.

This duality has been discussed in the past, and alternatives have been proposed (mainly gravitating around making one of the aspects explicit rather than implicit). It is my opinion that a better solution exists (in the form of making representation accessible only through a property .rep). But the current design has "won" not only because it's the existing one, but also because it has good simplicity and flexibility advantages. At this point there is no question about changing the semantics of existing constructs.

Andrei
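
The two views can be put side by side in a short sketch. It assumes `std.string.representation` as one way to reach the raw code units (this is an existing Phobos function, not the `.rep` property proposed above); `codeUnits` and `codePoints` are hypothetical helper names.

```d
import std.range : back, walkLength;
import std.string : representation;

/// Hypothetical helper: counts code units by iterating as char.
size_t codeUnits(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

/// Hypothetical helper: counts code points by iterating as dchar,
/// which makes foreach decode the UTF-8 sequence.
size_t codePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "s\u00E4\u0434";               // "säд"

    // Slice view: code units.
    assert(codeUnits(s) == 5 && s.length == 5);
    assert(s.representation[$ - 1] == 0xB4);  // last byte of 'д'

    // Range view: code points.
    assert(codePoints(s) == 3 && s.walkLength == 3);
    assert(s.back == '\u0434');               // last code point, decoded
}
```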

On Sunday, 13 October 2013 at 17:01:15 UTC, Dicebot wrote:
> If single element access is needed, str.front yields decoded `dchar`. Or simple `foreach (dchar d; str)` - it won't hide the fact it is O(n) operation at least. As `str.front` yields dchar, most `std.algorithm` and `std.range` utilities will also work correctly on default UTF-8 strings.

No, he needs graphemes, so `std.algorithm` won't work correctly for him, as Peter has shown: a grapheme doesn't fit in a dchar.
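
To illustrate the difference between code points and graphemes, here is a sketch that assumes a Phobos version providing `std.uni.byGrapheme`. The helper `graphemeCount` is a hypothetical name, and the input is "e" followed by a combining acute accent, which displays as a single character.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper: number of graphemes (user-perceived characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    string s = "e\u0301";            // 'e' + COMBINING ACUTE ACCENT

    assert(s.length == 3);           // code units: 1 + 2
    assert(s.walkLength == 2);       // code points: each fits in a dchar
    assert(graphemeCount(s) == 1);   // one grapheme, spanning two code points
}
```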

On Sunday, 13 October 2013 at 14:14:14 UTC, nickles wrote:
> Also, I understand, that there is the std.utf.count() function which returns the length that I was searching for. However, why - if D is so UTF-8-centric - isn't this function implemented in the core like ".length"?

Most code doesn't need to count graphemes and lives happily with just strings, that's why it's not in the core.
