isAsciiString in Phobos?

Oct 07, 2013

Andrej Mitrovic

Oct 07, 2013

Adam D. Ruppe

I want to pass a string to a C function that expects an ASCII-only string. What can I use to verify that a D string contains no non-ASCII characters? I haven't seen anything in Phobos. I was thinking of using:

    bool isAscii = mystring.all!(a => a <= 0xFF);

Is this safe? I'm wondering whether a code point can be encoded as two code units such as [C1][C2], where C2 may be in the range 0 - 0xFF. I don't know if that's possible (not a Unicode pro here).
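To make the question concrete, here is a minimal sketch of how the proposed check behaves on a string containing U+00E9. The function names `naiveCheck` and `byteCheck` are made up for illustration, and the sketch assumes a Phobos version in which `all` auto-decodes narrow strings into code points.

```d
import std.algorithm : all;
import std.string : representation;

// Hypothetical name. The check from the question: `all` decodes the
// string, so the lambda sees code points (dchar), not bytes.
bool naiveCheck(string s)
{
    return s.all!(a => a <= 0xFF);
}

// Hypothetical name. The same idea applied to the raw UTF-8 code units,
// using the strict ASCII bound.
bool byteCheck(string s)
{
    return s.representation.all!(b => b <= 0x7F);
}

unittest
{
    string s = "caf\u00E9"; // ends in U+00E9, which is not ASCII

    // U+00E9 is stored as the two code units 0xC3 0xA9, both above 0x7F.
    assert(s.length == 5);
    assert(s.representation[3] == 0xC3);
    assert(s.representation[4] == 0xA9);

    assert(naiveCheck(s));  // passes, because 0xE9 <= 0xFF
    assert(!byteCheck(s));  // correctly rejected
}
```

In UTF-8, every code unit of a multi-byte sequence has its high bit set, so a trailing unit can never fall in the ASCII range.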

On Monday, 7 October 2013 at 15:18:06 UTC, Andrej Mitrovic wrote:
> bool isAscii = mystring.all!(a => a <= 0xFF);

If you want strict ASCII, it should be <= 127 rather than 255, because values with the high bit set mean different things in different encodings (the first 256 Unicode code points, I think, match Latin-1 numerically, but that's different from Windows-1252 or the various non-English extended ASCIIs).

You could also convert UTF-8 to ASCII... sort of... by just stripping out any byte > 127, since bytes higher than that are always part of multibyte sequences in UTF-8.
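The stripping idea could look roughly like the sketch below. `stripNonAscii` is a hypothetical name, not a Phobos function, and the approach is lossy: a non-ASCII character disappears entirely rather than being transliterated.

```d
import std.algorithm : filter;
import std.array : array;
import std.string : representation;

// Hypothetical helper: drops every UTF-8 code unit above 0x7F.
string stripNonAscii(string input)
{
    // Work on the raw bytes so that nothing is decoded.
    auto kept = input.representation.filter!(b => b <= 0x7F).array;

    // Every remaining byte is ASCII, so the result is still valid UTF-8.
    return cast(string) kept;
}

unittest
{
    assert(stripNonAscii("caf\u00E9") == "caf"); // both bytes of U+00E9 are dropped
    assert(stripNonAscii("plain") == "plain");   // pure ASCII is unchanged
}
```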

On 10/7/13, Adam D. Ruppe <destructionator@gmail.com> wrote:
> If you want strict ASCII, it should be <= 127 rather than 255 because the high bit can be all kinds of different encodings (the first 255 of unicode codepoints I think match latin-1 numerically, but that's different than windows-1252 or various non-English extended asciis.)
>
> You could also convert utf-8 to ascii.... sort of... by just stripping out any byte > 127 since bytes higher than that are multibyte sequences in utf8.

Thanks. I got some useful info from Jakob from IRC, and ended up with this:

    bool isAsciiString(string input)
    {
        auto data = cast(const(ubyte)[])input;
        return data.all!(a => a <= 0x7F);
    }

The cast is needed to avoid decoding by the "all" function. Also there's isASCII that works on a dchar in std.ascii, but I was looking for something that works on entire strings at once. So the above function does the work for me.

Should we put something like this in Phobos?
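For the original use case, the check can guard the call into C. This is only a sketch: `callAsciiOnlyC` is a hypothetical wrapper, and `strlen` stands in for whatever ASCII-only C function is actually being called.

```d
import core.stdc.string : strlen;
import std.algorithm : all;
import std.exception : enforce;
import std.string : toStringz;

bool isAsciiString(string input)
{
    // Reinterpret as bytes so that `all` does not decode the string.
    auto data = cast(const(ubyte)[]) input;
    return data.all!(a => a <= 0x7F);
}

// Hypothetical wrapper: validate first, then hand a zero-terminated
// copy to the C side. strlen is only a placeholder C function.
size_t callAsciiOnlyC(string input)
{
    enforce(isAsciiString(input), "input contains non-ASCII bytes");
    return strlen(input.toStringz);
}

unittest
{
    assert(isAsciiString("hello"));
    assert(!isAsciiString("h\u00E9llo"));
    assert(callAsciiOnlyC("hello") == 5);
}
```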

On Monday, 7 October 2013 at 15:57:15 UTC, Andrej Mitrovic wrote:
> On 10/7/13, Adam D. Ruppe <destructionator@gmail.com> wrote:
>> If you want strict ASCII, it should be <= 127 rather than 255
>> because the high bit can be all kinds of different encodings (the
>> first 255 of unicode codepoints I think match latin-1
>> numerically, but that's different than windows-1252 or various
>> non-English extended asciis.)
>>
>> You could also convert utf-8 to ascii.... sort of... by just
>> stripping out any byte > 127 since bytes higher than that are
>> multibyte sequences in utf8.
>
> Thanks. I got some useful info from Jakob from IRC, and ended up with this:
>
> bool isAsciiString(string input)
> {
>     auto data = cast(const(ubyte)[])input;
>     return data.all!(a => a <= 0x7F);
> }
>
> The cast is needed to avoid decoding by the "all" function. Also
> there's isASCII that works on a dchar in std.ascii, but I was looking
> for something that works on entire strings at once. So the above
> function does the work for me.

You can use std.string.representation to do the cast for you, and you might as well just use isASCII anyway:

    return input.representation.all!isASCII;

If we want even more efficiency, we could iterate over the string, interpreting it as a size_t[] (with any unaligned leading bytes and leftover trailing bytes checked one at a time). We mask each element with 0x80808080 (32-bit) or 0x80808080_80808080 (64-bit), and if any masked element is nonzero, then the string isn't ASCII.
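A rough sketch of the word-at-a-time idea follows. `isAsciiFast` is a hypothetical name, not a Phobos function, and the head and tail loops are an assumption about how alignment and leftover bytes would be handled, since the post does not spell that out.

```d
import std.string : representation;

// Hypothetical helper, not part of Phobos.
bool isAsciiFast(const(char)[] input)
{
    // A word with the high bit of every byte set. The cast truncates the
    // constant to 0x80808080 when size_t is 32 bits wide.
    enum size_t highBits = cast(size_t) 0x8080_8080_8080_8080UL;

    auto bytes = input.representation;
    size_t i = 0;

    // Head: check single bytes until the address is word-aligned.
    while (i < bytes.length
        && (cast(size_t) (bytes.ptr + i)) % size_t.sizeof != 0)
    {
        if (bytes[i] & 0x80)
            return false;
        ++i;
    }

    // Body: test one machine word (4 or 8 bytes) per iteration.
    for (; i + size_t.sizeof <= bytes.length; i += size_t.sizeof)
    {
        immutable word = *cast(const(size_t)*) (bytes.ptr + i);
        if (word & highBits)
            return false;
    }

    // Tail: fewer than size_t.sizeof bytes remain.
    for (; i < bytes.length; ++i)
    {
        if (bytes[i] & 0x80)
            return false;
    }

    return true;
}

unittest
{
    assert(isAsciiFast("a fairly long, purely ASCII string"));
    assert(!isAsciiFast("a fairly long string with one \u00E9 in it"));
    assert(isAsciiFast(""));
}
```

Whether this actually beats the byte-wise version would need measuring; the sketch only shows the masking logic.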

On 10/7/13, monarch_dodra <monarchdodra@gmail.com> wrote:
> If we want even more efficiency, we could iterate on the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080/0x80808080_80808080, and if one of the resulting masked elements is not null, then the string isn't ASCII.

Clever! So I think we should definitely try and push it to the library.