Finding chars in strings
September 05, 2017
If a character literal has type char and its value is always below 128, can we always search for its first byte offset in a string without decoding the string to a range of dchars?
September 05, 2017
On Tuesday, 5 September 2017 at 15:43:02 UTC, Per Nordlöw wrote:
> If a character literal has type char and its value is always below 128, can we always search for its first byte offset in a string without decoding the string to a range of dchars?

Follow-up question: If a character literal has type char, can we always assume it's an ASCII character?
September 05, 2017
On 09/05/2017 05:43 PM, Per Nordlöw wrote:
> If a character literal has type char and its value is always below 128, can we always search for its first byte offset in a string without decoding the string to a range of dchars?

Yes. You can search for ASCII characters (< 128) without decoding. The values in multibyte sequences are always above 127.
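
To make that property concrete, here is a minimal sketch. The helper name `allCodeUnitsAbove127` is made up for this illustration, and the source file is assumed to be saved as UTF-8:

```d
import std.algorithm.searching : all;
import std.string : representation;

/// Hypothetical helper: true if every UTF-8 code unit of `s` is above 127.
bool allCodeUnitsAbove127(string s)
{
    // representation reinterprets the string as immutable(ubyte)[],
    // so nothing is decoded here.
    return s.representation.all!(b => b > 127);
}

void main()
{
    // "ö" is U+00F6; UTF-8 encodes it as the two code units 0xC3 0xB6.
    immutable(ubyte)[] bytes = "ö".representation;
    assert(bytes.length == 2 && bytes[0] == 0xC3 && bytes[1] == 0xB6);

    // Strings made only of multibyte characters contain no byte below 128 ...
    assert(allCodeUnitsAbove127("öäü€"));

    // ... so a byte below 128 can only be a complete ASCII character.
    assert(!allCodeUnitsAbove127("ö=1"));
}
```

Because no code unit of a multibyte sequence can equal an ASCII value, a byte-wise match on an ASCII needle cannot land in the middle of another character.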
September 05, 2017
On 09/05/2017 05:54 PM, Per Nordlöw wrote:
> Follow-up question: If a character literal has type char, can we always assume it's an ASCII character?

Strictly speaking, this is a character literal of type char: '\xC3'. It's clearly above 0x7F, and not an ASCII character. So, no.

But if it's an actual character, not an escape sequence, then yes (I think). A wrong encoding setting in your text editor could mess with that, though.
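
A small sketch of both cases, using compile-time checks. The name `highByte` is made up for this illustration, the file is assumed to be saved as UTF-8, and the last check reflects how the compilers I know of behave:

```d
// The types of the literals are checked at compile time.
static assert(is(typeof('a') == char));     // plain ASCII character: char
static assert(is(typeof('\xC3') == char));  // \x escape: still char ...
static assert('\xC3' > 0x7F);               // ... but its value is not ASCII

// A non-ASCII character typed directly does not get type char
// (it should get a wider character type instead).
static assert(!is(typeof('ö') == char));

// Hypothetical constant, only here so there is something to inspect at run time.
enum char highByte = '\xC3';

void main()
{
    assert(highByte == 0xC3);
    assert(highByte > 0x7F);
}
```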
September 05, 2017
On Tuesday, September 05, 2017 17:55:20 ag0aep6g via Digitalmars-d-learn wrote:
> On 09/05/2017 05:43 PM, Per Nordlöw wrote:
> > If a character literal has type char and its value is always below 128, can we always search for its first byte offset in a string without decoding the string to a range of dchars?
>
> Yes. You can search for ASCII characters (< 128) without decoding. The values in multibyte sequences are always above 127.

Unfortunately, you'll have to use something like std.utf.byCodeUnit or std.string.representation to do it; otherwise, you get hit with the autodecoding. But yeah, UTF-8 is designed to be compatible with ASCII, so all ASCII characters are valid UTF-8 code units and don't require decoding. The decoding is just required if you're dealing with non-ASCII characters, which is another reason why the autodecoding is annoying.
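
As a rough sketch of both approaches (the function names `asciiOffset` and `asciiOffsetBytes` are made up for this example, and the needle is assumed to be ASCII):

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.utf : byCodeUnit;

/// Hypothetical helper: byte offset of the first `needle` in `haystack`,
/// or -1 if it is absent. Iterates code units, so nothing is decoded.
ptrdiff_t asciiOffset(string haystack, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    return haystack.byCodeUnit.countUntil(needle);
}

/// Same search, but over the raw immutable(ubyte)[] representation.
ptrdiff_t asciiOffsetBytes(string haystack, char needle)
{
    assert(needle < 0x80, "needle must be ASCII");
    return haystack.representation.countUntil(cast(ubyte) needle);
}

void main()
{
    // 'ö' occupies two code units, so '=' sits at byte offset 8, not 7.
    string s = "Nordlöw=1";
    immutable off = asciiOffset(s, '=');
    assert(off == 8);
    assert(s[off] == '=');              // the offset is usable for slicing
    assert(asciiOffsetBytes(s, '=') == off);
    assert(asciiOffset(s, '#') == -1);  // not found
}
```

The result is an index in code units, which is what you want for indexing or slicing the original string.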

- Jonathan M Davis


September 05, 2017
On Tuesday, September 05, 2017 18:04:16 ag0aep6g via Digitalmars-d-learn wrote:
> On 09/05/2017 05:54 PM, Per Nordlöw wrote:
> > Follow-up question: If a character literal has type char, can we always assume it's an ASCII character?
>
> Strictly speaking, this is a character literal of type char: '\xC3'. It's clearly above 0x7F, and not an ASCII character. So, no.
>
> But if it's an actual character, not an escape sequence, then yes (I think). A wrong encoding setting in your text editor could mess with that, though.

Aside from escape sequences, a literal should not result in a non-ASCII value for a char. In general, though, it's a bad idea to assume that a char is an ASCII character unless you've already verified it or you know, based on where the input came from, that the chars you're dealing with are all ASCII. You also have to remember that VRP (value range propagation) is in play: it lets an integer value that fits in a char convert to one implicitly, so you could end up with a char that's not an ASCII character. And IIRC, character literals are almost always treated as dchar unless a cast or VRP gets involved. So, I wouldn't be in a hurry to assume that using character literals guarantees that you're dealing with only ASCII. Ultimately, std.ascii.isASCII is your friend if there's any risk of something not being ASCII when you need it to be ASCII.
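
A minimal sketch of that kind of check (the helper name `isAllAscii` is made up for this example):

```d
import std.algorithm.searching : all;
import std.ascii : isASCII;
import std.utf : byCodeUnit;

/// Hypothetical helper: true if every code unit of `s` is ASCII.
bool isAllAscii(string s)
{
    // byCodeUnit avoids autodecoding; isASCII is simply `c <= 0x7F`.
    return s.byCodeUnit.all!isASCII;
}

void main()
{
    // VRP lets this integer literal initialize a char because 0xC3 fits,
    // so having type char says nothing about being ASCII.
    char c = 0xC3;
    assert(!isASCII(c));
    assert(isASCII('='));

    assert(isAllAscii("key=value"));
    assert(!isAllAscii("Nordlöw"));
}
```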

- Jonathan M Davis