Using decodeFront with a generalised input range
November 09
According to the std.utf documentation,

> decode will only work with strings and random access ranges of code units with length and slicing, whereas decodeFront will work with any input range of code units.

However, I can't seem to get such a usage to compile: the following code

import std.range;
import std.utf;

void somefn(InputRange!(ubyte) r) {
    r.decodeFront!(No.useReplacementDchar, dchar)();
}

gives a compilation error:

onlineapp.d(5): Error: template std.utf.decodeFront cannot deduce function from argument types !(cast(Flag)false, dchar)(InputRange!ubyte), candidates are:
/dlang/dmd/linux/bin64/../../src/phobos/std/utf.d(1176):        std.utf.decodeFront(Flag useReplacementDchar = No.useReplacementDchar, S)(ref S str, out size_t numCodeUnits) if (!isSomeString!S && isInputRange!S && isSomeChar!(ElementType!S))
/dlang/dmd/linux/bin64/../../src/phobos/std/utf.d(1214):        std.utf.decodeFront(Flag useReplacementDchar = No.useReplacementDchar, S)(ref S str, out size_t numCodeUnits) if (isSomeString!S)
/dlang/dmd/linux/bin64/../../src/phobos/std/utf.d(1243):        std.utf.decodeFront(Flag useReplacementDchar = No.useReplacementDchar, S)(ref S str) if (isInputRange!S && isSomeChar!(ElementType!S))

I'm not sure what's wrong here. Also, the error message seems to indicate that decodeFront isn't expecting to work on a generic InputRange interface, but rather on a subset which comes from an actual string. Have I read that wrong? I get the same error when leaving out the useReplacementDchar flag, i.e. `r.decodeFront!(dchar)()`
November 09
On Friday, 9 November 2018 at 09:47:32 UTC, Vinay Sajip wrote:
> std.utf.decodeFront(Flag useReplacementDchar = No.useReplacementDchar, S)(ref S str) if (isInputRange!S && isSomeChar!(ElementType!S))

This is the overload you want, let's check if it matches:
ref S str - your InputRange can be passed by reference, but you specified S = dchar. S here is the type of the input range, and that type is not dchar. It's best not to specify S and let the compiler infer it, since range types can be very complicated. Once we fix that, let's look at the rest:

isInputRange!S - S is an inputRange
isSomeChar!(ElementType!S) - ElementType!S is ubyte, but isSomeChar!ubyte is not true.

The function wants characters, but you give bytes. A quick fix would be to do:
```
import std.algorithm: map;
auto mapped = r.map!(x => cast(char) x);
mapped.decodeFront!(No.useReplacementDchar)();
```

But it may be better for somefn to accept an InputRange!(char) instead.

Note that if you directly do:
```
r.map!(x => cast(char) x).decodeFront!(No.useReplacementDchar)();
```
It still won't work, since it wants `ref S str` and r.map!(...) is a temporary that can't be passed by reference.

As you can see, ensuring template constraints can be really difficult. The error messages give little help here, so you have to manually check whether the conditions of the overload you want hold.
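
A minimal, self-contained sketch of the fix described above, assuming a reasonably recent Phobos. The `dchar` return type on `somefn` and the `main` driver are not part of the original question; they are added here only so the result can be checked. The `static assert` lines spell out the constraint checks done by hand above.

```d
import std.algorithm : map;
import std.range : ElementType, InputRange, inputRangeObject, isInputRange;
import std.traits : isSomeChar;
import std.typecons : No;
import std.utf : decodeFront;

// The interface is an input range, but its element type is not a character type,
// which is why no decodeFront overload matched the original call.
static assert(isInputRange!(InputRange!ubyte));
static assert(!isSomeChar!(ElementType!(InputRange!ubyte)));

// Illustrative variant of somefn that returns the decoded code point.
dchar somefn(InputRange!ubyte r)
{
    // Present each byte as a UTF-8 code unit. The mapped range is stored in a
    // variable because decodeFront takes its argument by ref.
    auto mapped = r.map!(x => cast(char) x);
    return mapped.decodeFront!(No.useReplacementDchar)();
}

void main()
{
    ubyte[] bytes = [0x41, 0x42]; // "AB" in UTF-8
    assert(somefn(inputRangeObject(bytes)) == 'A');
}
```

Note that S is left for the compiler to infer; only the flag is passed explicitly.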
November 09
On Friday, 9 November 2018 at 10:26:46 UTC, Dennis wrote:

Thanks, that's helpful. My confusion seems due to my thinking that a decoding operation converts (unsigned) bytes to chars, which is not how the writers of std.utf seem to have thought of it. As I see it, a ubyte 0x20 could be decoded to an ASCII char ' ', and likewise to wchar or dchar. It doesn't (to me) make sense to decode a char to a wchar or dchar. Anyway, you've shown me how decodeFront can be used, so great!

Supplementary question: is an operation like r.map!(x => cast(char) x) effectively a run-time no-op and just to keep the compiler happy, or does it actually result in code being executed? I came across a similar issue with ranges recently where the answer was to map immutable(byte) to byte in the same way.

November 09
On Friday, 9 November 2018 at 10:45:49 UTC, Vinay Sajip wrote:
> As I see it, a ubyte 0x20 could be decoded to an ASCII char ' ', and likewise to wchar or dchar. It doesn't (to me) make sense to decode a char to a wchar or dchar. Anyway, you've shown me how decodeFront can be used, so great!

The character ' ' simply is the number 0x20 in char, wchar and dchar. The difficulty arises when you use non-ascii characters:

if ("€"[0] == '€')

The character code of € is U+20AC, but a char only goes to 0xFF. To work around that, UTF-8 gives higher code points multiple bytes (or code units). The € sign will be represented as [0xE2, 0x82, 0xAC]. So the code above actually checks 0xE2 == 0x20AC, which will return false. If you decodeFront on [0xE2, 0x82, 0xAC], it will actually output 0x20AC and modify the range to be [] since it consumed all three code units. That way you can handle code points properly.
See: https://en.wikipedia.org/wiki/UTF-8#Examples
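
The € example can be checked with a short sketch like the following. The helper name `firstCodePoint` is made up for illustration; it only wraps the `decodeFront` overload that reports the number of code units consumed. The source file is assumed to be saved as UTF-8.

```d
import std.utf : decodeFront;

// Hypothetical helper: decodes one code point from the front of s and
// reports how many UTF-8 code units it occupied. s is advanced past them.
dchar firstCodePoint(ref string s, out size_t numCodeUnits)
{
    return s.decodeFront(numCodeUnits);
}

void main()
{
    string s = "€";

    // The string holds three code units, not one character.
    assert(s.length == 3);
    assert(s[0] == 0xE2 && s[1] == 0x82 && s[2] == 0xAC);

    // Comparing the first code unit with the code point compares 0xE2 with 0x20AC.
    assert(s[0] != '€');

    size_t n;
    dchar c = firstCodePoint(s, n);
    assert(c == 0x20AC);   // the decoded code point
    assert(n == 3);        // all three code units were consumed
    assert(s.length == 0); // the range is now empty
}
```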

On Friday, 9 November 2018 at 10:45:49 UTC, Vinay Sajip wrote:
> Supplementary question: is an operation like r.map!(x => cast(char) x) effectively a run-time no-op and just to keep the compiler happy, or does it actually result in code being executed? I came across a similar issue with ranges recently where the answer was to map immutable(byte) to byte in the same way.

On dmd without optimization, the function passed to map will compile to:
	push	RBP          //
	mov	RBP,RSP      //
	sub	RSP,010h     // build stack frame
	mov	-8[RBP],EDI  // put argument0 on the stack
	mov	AL,-8[RBP]   // put the stack value in the lower 8 bits of the return register
	leave                // delete stack frame
	ret                  // return

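So without optimization the cast is a real function, but one that only moves a byte into the return register, which makes it essentially a run-time no-op. If you pass -O -inline to dmd, I'm pretty sure it will optimize it away entirely. GDC and LDC with -O1 or higher will certainly eliminate all run-time cost.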
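
As an aside that is not from the discussion above: when the bytes already live in an array rather than behind a generic range, a slice cast reinterprets the same memory as code units without any per-element work. A minimal sketch, assuming the data is a `ubyte[]` and with `firstCodePoint` as a made-up helper name:

```d
import std.utf : decodeFront;

// Hypothetical helper: decodes the first code point of a UTF-8 byte buffer.
dchar firstCodePoint(const(ubyte)[] bytes)
{
    // Same pointer and length, viewed as UTF-8 code units; nothing is copied.
    auto text = cast(const(char)[]) bytes;
    // text is a local variable, so it can be passed by ref.
    return text.decodeFront();
}

void main()
{
    ubyte[] bytes = [0xE2, 0x82, 0xAC, 0x21]; // "€!" in UTF-8
    assert(firstCodePoint(bytes) == 0x20AC);
}
```

This only applies to arrays; for a general input range such as `InputRange!ubyte`, the `map` approach is still needed.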
November 09
On Friday, November 9, 2018 3:45:49 AM MST Vinay Sajip via Digitalmars-d-learn wrote:
> Thanks, that's helpful. My confusion seems due to my thinking that a decoding operation converts (unsigned) bytes to chars, which is not how the writers of std.utf seem to have thought of it. As I see it, a ubyte 0x20 could be decoded to an ASCII char ' ', and likewise to wchar or dchar. It doesn't (to me) make sense to decode a char to a wchar or dchar. Anyway, you've shown me how decodeFront can be used, so great!

decode and decodeFront are for converting UTF code units to a Unicode code point. So, you're taking UTF-8 code units (char), UTF-16 code units (wchar), or a UTF-32 code unit (dchar) and decoding them. In the case of UTF-32, that's a no-op, since UTF-32 code units are already code points, but for UTF-8 and UTF-16, they're not the same at all.

For UTF-8, a code point is encoded as 1 to 4 code units which are 8 bits in size (char). For UTF-16, a code point is encoded as 1 or 2 code units which are 16 bits in size (wchar), and for UTF-32, a code point is encoded as a single code unit which is 32 bits in size (dchar). The decoding is doing that conversion. None of this has anything to do with ASCII or any other encoding except insofar as ASCII happens to line up with Unicode.
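
The code unit counts can be illustrated with a small sketch using string literals and `std.utf.codeLength`. The sample code points (U+20AC and U+1F600) are chosen here for illustration only.

```d
import std.utf : codeLength;

void main()
{
    // The same code point, U+20AC, stored in each of the three encodings.
    string  s8  = "€";
    wstring s16 = "€"w;
    dstring s32 = "€"d;
    assert(s8.length  == 3); // three UTF-8 code units (char)
    assert(s16.length == 1); // one UTF-16 code unit (wchar)
    assert(s32.length == 1); // one UTF-32 code unit (dchar)

    // U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
    // needs a surrogate pair and UTF-8 needs four code units.
    assert(codeLength!char('\U0001F600')  == 4);
    assert(codeLength!wchar('\U0001F600') == 2);
    assert(codeLength!dchar('\U0001F600') == 1);

    // ASCII is a single code unit in every encoding.
    assert(codeLength!char('a') == 1);
}
```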

Code points are then 32-bit integer values (which D represents as dchar). They are often called Unicode characters, and can be represented graphically, but many of them represent bits of what you would actually consider to be a character (e.g. an accent could be a code point on its own), so in many cases, code points have to be combined to create what's called a grapheme or grapheme cluster (which, unfortunately, means that you can have to worry about normalizing code points). std.uni provides code for worrying about that sort of thing. Ultimately, what gets rendered to the screen with a font is a grapheme. In the simplest case, with an ASCII character, a single character is a single code unit, a single code point, and a single grapheme in all representations, but with more complex characters (e.g. a Hebrew character or a character with a couple of accents on it), it could be several code units, one or more code points, and a single grapheme.
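
A small sketch of the three levels, using `std.uni.byGrapheme` and `std.uni.normalize`. The sample text, an 'e' followed by U+0301 (combining acute accent), is picked here for illustration.

```d
import std.range : walkLength;
import std.uni;

void main()
{
    // 'e' followed by a combining acute accent: rendered as one character.
    string s = "e\u0301";

    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (strings iterate as dchar)
    assert(s.byGrapheme.walkLength == 1); // graphemes

    // Normalizing to NFC composes the pair into the single code point U+00E9.
    assert(normalize!NFC(s) == "\u00E9");
}
```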

I would advise against doing much with decode or decodeFront without having a decent understanding of the basics of Unicode.

> Supplementary question: is an operation like r.map!(x => cast(char) x) effectively a run-time no-op and just to keep the compiler happy, or does it actually result in code being executed? I came across a similar issue with ranges recently where the answer was to map immutable(byte) to byte in the same way.

That would depend on the optimization flags chosen and the exact code in question. In general, ldc is more likely to do a good job at optimizing such code than dmd, though dmd doesn't necessarily do a bad job. I don't know how good a job dmd does in this particular case. It depends on the code. In general, dmd compiles very quickly and as such is great for development, whereas ldc does a better job at generating fast executables. I would expect ldc to optimize such code properly.

- Jonathan M Davis



November 09
On Friday, 9 November 2018 at 11:24:42 UTC, Jonathan M Davis wrote:
> decode and decodeFront are for converting a UTF code unit to a Unicode code point. So, you're taking UTF-8 code unit (char), UTF-16 code unit (wchar), or a UTF-32 code unit (dchar) and decoding it. In the case of UTF-32, that's a no-op, since UTF-32 code units are already code points, but for UTF-8 and UTF-16, they're not the same at all.

> I would advise against doing much with decode or decodeFront without having a decent understanding of the basics of Unicode.
>

I think I understand enough of the basics of Unicode, at least for my application; my unfamiliarity is with the D language and standard library, to which I am very new.

There are applications where one needs to decode a stream of bytes into Unicode text: perhaps it's just semantic quibbling distinguishing between "a ubyte" and "a UTF-8 code unit", as they're the same at the level of bits and bytes (as I understand it - please tell me if you think otherwise). If I open a file using mode "rb", I get a sequence of bytes, which may contain structured binary data, parts of which are to be interpreted as text encoded in UTF-8. Is there something in the D standard library which enables incremental decoding of such (parts of) a byte stream? Or does one have to resort to the `map!(x => cast(char) x)` method for this?
November 09
On Friday, November 9, 2018 5:22:27 AM MST Vinay Sajip via Digitalmars-d-learn wrote:
> There are applications where one needs to decode a stream of bytes into Unicode text: perhaps it's just semantic quibbling distinguishing between "a ubyte" and "a UTF-8 code unit", as they're the same at the level of bits and bytes (as I understand it - please tell me if you think otherwise). If I open a file using mode "rb", I get a sequence of bytes, which may contain structured binary data, parts of which are to be interpreted as text encoded in UTF-8. Is there something in the D standard library which enables incremental decoding of such (parts of) a byte stream? Or does one have to resort to the `map!(x => cast(char) x)` method for this?

In principle, a char is assumed to be a UTF-8 code unit, though it's certainly possible for code to end up with a char that's not a valid UTF-8 code unit. So, char is specifically a character type, whereas byte and ubyte are 8-bit integer types which can contain arbitrary data. D purposefully has char, wchar, and dchar as separate types from byte, ubyte, short, ushort, etc. in order to distinguish between character types and integer types, and in general, the D standard library does not treat byte or ubyte as having anything to do with characters.

decode and decodeFront operate on ranges of characters, not ranges of arbitrary integer types. So, if you have a range of byte or ubyte which contains UTF-8 code units, and you want to use decode or decodeFront, then you will need to convert that range to a range of char. map would likely be the most straightforward way to do that.
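
As a rough sketch of what that could look like for incremental decoding: the helper below maps a range of bytes to char and calls decodeFront until the range is empty. The name `decodeAll` and the choice to collect the result into an array are assumptions made for illustration; a real application would more likely consume the code points one at a time as they are decoded.

```d
import std.algorithm : map;
import std.range : InputRange, inputRangeObject;
import std.utf : decodeFront;

// Hypothetical helper: decodes every code point from a range of UTF-8 bytes.
dchar[] decodeAll(InputRange!ubyte bytes)
{
    // Present the bytes as UTF-8 code units; kept in a variable so that
    // decodeFront can take it by ref and advance it.
    auto units = bytes.map!(b => cast(char) b);

    dchar[] result;
    while (!units.empty)
    {
        // Consumes one code point's worth of code units per call.
        // With the default flag, invalid UTF-8 raises a UTFException.
        result ~= units.decodeFront();
    }
    return result;
}

void main()
{
    ubyte[] data = [0x41, 0xE2, 0x82, 0xAC, 0x21]; // "A€!" in UTF-8
    assert(decodeAll(inputRangeObject(data)) == "A€!"d);
}
```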

- Jonathan M Davis