Why is string.front dchar?

Jan 13, 2014

TheFlyingFiddle

TheFlyingFiddle:

> I'm curious, why is the .front property of narrow strings of type dchar?
> And not the underlying character type for the string.

There was a long discussion on this. It was chosen this way to allow most range-based algorithms to work correctly on UTF-8 and UTF-16 strings.

In some cases you can use the std.string.representation function to avoid paying for the UTF decoding, and/or to use algorithms such as sort().

But for backwards compatibility reasons, in this code:

    foreach (c; "somestring")

c is a char, not a dchar. You have to type it explicitly to handle the UTF safely:

    foreach (dchar c; "somestring")

Bye,
bearophile
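As a rough illustration of the points above, here is a minimal sketch. The helper names countCodeUnits and countCodePoints are made up for this example and are not part of Phobos; the static asserts only restate what the compiler and library report for these types.

```d
import std.range.primitives : ElementType, front;
import std.string : representation;
import std.traits : Unqual;

// .front on a narrow string decodes a whole code point, so its type is dchar.
static assert(is(typeof("abc".front) == dchar));
static assert(is(ElementType!string == dchar));

// representation() exposes the raw code units as immutable(ubyte)[],
// so range algorithms applied to it see bytes and skip UTF decoding.
static assert(is(typeof("abc".representation) == immutable(ubyte)[]));

/// Hypothetical helper: iterates with the inferred loop type (a code unit).
size_t countCodeUnits(string s)
{
    size_t n;
    foreach (c; s)
    {
        // The inferred type is the array element type, i.e. (immutable) char.
        static assert(is(Unqual!(typeof(c)) == char));
        ++n;
    }
    return n;
}

/// Hypothetical helper: asks foreach to decode by typing the variable as dchar.
size_t countCodePoints(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}
```

With the string "caf\u00E9" (the last character takes two bytes in UTF-8), countCodeUnits gives 5 and countCodePoints gives 4, which is the difference between the two foreach forms.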

On Monday, January 13, 2014 23:10:03 TheFlyingFiddle wrote:
> I'm curious, why is the .front property of narrow strings of type dchar?
> And not the underlying character type for the string.

It's to promote the correct handling of Unicode. A couple of related questions and answers:

http://stackoverflow.com/questions/12288465/std-algorithm-joinerstring-string-why-result-elements-are-dchar-and-not-ch

http://stackoverflow.com/questions/16590650/how-to-read-a-string-character-by-character-as-a-range-in-d

- Jonathan M Davis

On Tuesday, 14 January 2014 at 03:01:53 UTC, Jonathan M Davis wrote:
> On Monday, January 13, 2014 23:10:03 TheFlyingFiddle wrote:
>> I'm curious, why is the .front property of narrow strings of type dchar?
>> And not the underlying character type for the string.
>
> It's to promote the correct handling of Unicode. A couple of related questions and answers:
>
> http://stackoverflow.com/questions/12288465/std-algorithm-joinerstring-string-why-result-elements-are-dchar-and-not-ch
>
> http://stackoverflow.com/questions/16590650/how-to-read-a-string-character-by-character-as-a-range-in-d
>
> - Jonathan M Davis

Also somewhat related:

http://stackoverflow.com/questions/13368728/why-isnt-dchar-the-standard-character-type-in-d

On Monday, 13 January 2014 at 23:10:04 UTC, TheFlyingFiddle wrote:
> I'm curious, why is the .front property of narrow strings of type dchar?
> And not the underlying character type for the string.

The root of the issue is that string literals containing characters which do not fit into a single byte are still converted to a char[] array. This is, strictly speaking, not type safe, because it allows a 2- or 4-byte code unit to be reinterpreted as a sequence of 1-byte characters. The string type is in some sense problematic in D. That is why .front returning dchar is a way to correct the problem; it is not an attempt to introduce confusion.

On Tuesday, 14 January 2014 at 01:12:40 UTC, bearophile wrote:
> But for backwards compatibility reasons in this code:
>
> foreach (c; "somestring")
>
> c is a char, not a dchar. You have to type it explicitly to handle the UTF safely:
>
> foreach (dchar c; "somestring")

This is really why I was confused: since the normal foreach variable is a char, it's weird that string.front is not a char. But if foreach yielding char is only that way for legacy reasons, it all makes sense.

On Tuesday, 14 January 2014 at 11:42:34 UTC, Maxim Fomin wrote:
> The root of the issue is that string literals containing characters which do not fit into a single byte are still converted to a char[] array. This is, strictly speaking, not type safe, because it allows a 2- or 4-byte code unit to be reinterpreted as a sequence of 1-byte characters. The string type is in some sense problematic in D. That is why .front returning dchar is a way to correct the problem; it is not an attempt to introduce confusion.

This assertion makes all the wrong assumptions. `char` is a UTF-8 code unit[1], and `string` is an array of immutable UTF-8 code units. The whole point of UTF-8 is the ability to encode code points that need multiple bytes (UTF-8 code units), so the string literal behaviour is perfectly regular.

Operations on code units are rare, which is why the standard library instead treats strings as ranges of code points, for correctness by default. However, we must not prevent the user from being able to work on arrays of code units, as many string algorithms can be optimized by not doing full UTF decoding. The standard library does this on many occasions, and there are more to come.

Note that the Unicode definition of an unqualified "character" is the translation of a code *point*, which is very different from a *glyph*, which is what people generally associate the word "character" with. Thus, `string` is not an array of characters (i.e. an array where each element is a character), but `dstring` can be said to be.

[1] http://dlang.org/type
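The three levels mentioned above (code unit, code point, glyph) can be told apart with a small sketch like the following. The function names are invented for illustration. It assumes a Phobos version that provides `std.uni.byGrapheme`, and it uses grapheme clusters as an approximation of what the post calls a glyph.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// string is an array of immutable UTF-8 code units, each one byte wide.
static assert(is(string == immutable(char)[]));
static assert(char.sizeof == 1);

/// Hypothetical helper: number of UTF-8 code units, which is what .length reports.
size_t codeUnits(string s)
{
    return s.length;
}

/// Hypothetical helper: number of code points, found by decoding the string as a range.
size_t codePoints(string s)
{
    return s.walkLength;
}

/// Hypothetical helper: number of grapheme clusters (roughly, user-perceived characters).
size_t graphemes(string s)
{
    return s.byGrapheme.walkLength;
}
```

For "e\u0301" (the letter e followed by a combining acute accent) this gives 3 code units, 2 code points and 1 grapheme, so even a `dstring` element is not always what a reader would call a character.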

On Wednesday, 15 January 2014 at 20:05:32 UTC, TheFlyingFiddle wrote:
> This is why i was confused really since the normal foreach is char it's weird that string.front is not a char. But if foreach being a char is only the way it is for legacy reasons it all makes sense.

Unfortunately, it's not that simple. D arrays/slices have two distinct interfaces - the slice interface and the range interface. The latter is a library convention built on top of the former - thus the existence of the slice interface is necessary.

A generic algorithm can choose to work on arrays (array algorithm) or ranges (range algorithm) among other kinds of type federations:

    auto algo(E)(E[] t); // array algorithm
    auto algo(R)(R r) if (isInputRange!R); // range algorithm

The array algorithm can assume that:

    foreach(e; t)
        static assert(is(typeof(e) == E));

While the range algorithm *cannot* assume that:

    foreach(e; r)
        static assert(is(typeof(e) == ElementType!R));

Because this fails when R is a narrow string (slice of UTF-8 or UTF-16 code units). Thus, the correct way to use foreach over a range in a generic range algorithm is:

    foreach(ElementType!R e; r) {}

Swapping the default just swaps which kind of algorithm can make the assumption. The added cost of breaking existing algorithms is a big deal, but as demonstrated, it's not a panacea.
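A compilable version of the two kinds of algorithm might look like the sketch below. The names countArray and countRange are placeholders chosen for this example (the post calls both `algo`); they only count iterations so that the difference in element type shows up in the result.

```d
import std.range.primitives : ElementType, isInputRange;

// For a narrow string the range element type differs from the array element type.
static assert(is(ElementType!string == dchar));
static assert(!is(ElementType!string == immutable(char)));

/// Array algorithm (placeholder name): uses the slice interface, elements are E.
size_t countArray(E)(E[] t)
{
    size_t n;
    foreach (e; t)
    {
        // Holds for every slice, including narrow strings.
        static assert(is(typeof(e) == E));
        ++n;
    }
    return n;
}

/// Range algorithm (placeholder name): types the loop variable explicitly
/// so that narrow strings are decoded to ElementType!R (dchar).
size_t countRange(R)(R r)
if (isInputRange!R)
{
    size_t n;
    foreach (ElementType!R e; r)
        ++n;
    return n;
}
```

Called on "caf\u00E9", countArray sees 5 code units while countRange sees 4 code points; on an int[] such as [1, 2, 3] both return 3, since the two interfaces agree for everything except narrow strings.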