October 07, 2013 isAsciiString in Phobos?
I want to pass a string to a C function that expects an ASCII-only string. What can I use to verify there are no non-ASCII characters in a D string? I haven't seen anything in Phobos. I was thinking of using:
    bool isAscii = mystring.all!(a => a <= 0xFF);
Is this safe? I'm wondering whether a code point can consist of two code units such as [C1][C2], where C2 may be in the range 0 - 0xFF. I don't know if that's possible (not a Unicode pro here).
October 07, 2013 Re: isAsciiString in Phobos?
Posted in reply to Andrej Mitrovic | On Monday, 7 October 2013 at 15:18:06 UTC, Andrej Mitrovic wrote:
> bool isAscii = mystring.all!(a => a <= 0xFF);
If you want strict ASCII, it should be <= 127 rather than 255 because the high bit can be all kinds of different encodings (the first 256 Unicode code points, I think, match Latin-1 numerically, but that's different from Windows-1252 or the various non-English extended ASCIIs).
You could also convert UTF-8 to ASCII, sort of, by just stripping out any byte > 127, since bytes higher than that are always part of multibyte sequences in UTF-8.
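As an illustration of the two points above, here is a minimal sketch. The helper names `isStrictAscii` and `stripNonAscii` are made up for this example and are not Phobos functions. It checks raw UTF-8 code units against 0x7F and shows that a decoded `<= 0xFF` test would wrongly accept an accented character.

```d
import std.algorithm : all, filter;
import std.array : array;
import std.string : representation;

/// Hypothetical helper: true if every UTF-8 code unit (byte) is 7-bit ASCII.
bool isStrictAscii(string s)
{
    // representation yields the raw bytes, so no decoding takes place.
    return s.representation.all!(b => b <= 0x7F);
}

/// Hypothetical helper: drops every byte above 0x7F.
/// Every byte of a multibyte UTF-8 sequence has the high bit set,
/// so this removes whole non-ASCII characters, never half of one.
string stripNonAscii(string s)
{
    return cast(string) s.representation.filter!(b => b <= 0x7F).array;
}

unittest
{
    string accented = "h\u00E9llo"; // U+00E9 is encoded as the bytes 0xC3 0xA9

    assert(accented.length == 6);           // 6 code units for 5 characters
    assert(accented.all!(c => c <= 0xFF));  // decoded test accepts U+00E9
    assert(!isStrictAscii(accented));       // byte-wise test rejects it
    assert(stripNonAscii(accented) == "hllo");
}
```

Saved to a file, this should run with `dmd -unittest -main -run file.d`.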
October 07, 2013 Re: isAsciiString in Phobos?
Posted in reply to Adam D. Ruppe | On 10/7/13, Adam D. Ruppe <destructionator@gmail.com> wrote:
> If you want strict ASCII, it should be <= 127 rather than 255 because the high bit can be all kinds of different encodings (the first 256 Unicode code points, I think, match Latin-1 numerically, but that's different from Windows-1252 or the various non-English extended ASCIIs).
>
> You could also convert UTF-8 to ASCII, sort of, by just stripping out any byte > 127, since bytes higher than that are always part of multibyte sequences in UTF-8.
Thanks. I got some useful info from Jakob from IRC, and ended up with this:
bool isAsciiString(string input)
{
    auto data = cast(const(ubyte)[])input;
    return data.all!(a => a <= 0x7F);
}
The cast is needed to avoid decoding by the "all" function. Also there's isASCII that works on a dchar in std.ascii, but I was looking for something that works on entire strings at once. So the above function does the work for me.
Should we put something like this in Phobos?
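For context, a self-contained sketch of how such a check might guard a call into C. The wrapper `asciiLength` is hypothetical, and `strlen` merely stands in for "a C function that expects an ASCII-only string"; neither choice comes from the thread.

```d
import core.stdc.string : strlen;
import std.algorithm : all;
import std.exception : enforce;
import std.string : toStringz;

bool isAsciiString(const(char)[] input)
{
    // Viewing the string as bytes keeps `all` from decoding code points.
    auto data = cast(const(ubyte)[]) input;
    return data.all!(a => a <= 0x7F);
}

/// Hypothetical wrapper: validates first, then hands a
/// zero-terminated copy to the C side.
size_t asciiLength(string s)
{
    enforce(isAsciiString(s), "non-ASCII input");
    return strlen(s.toStringz);
}

unittest
{
    import std.exception : assertThrown;

    assert(asciiLength("hello") == 5);
    assertThrown(asciiLength("h\u00E9llo"));
}
```

Taking `const(char)[]` instead of `string` is a small generalization so that mutable buffers can be checked too.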
October 07, 2013 Re: isAsciiString in Phobos?
Posted in reply to Andrej Mitrovic | On Monday, 7 October 2013 at 15:57:15 UTC, Andrej Mitrovic wrote:
> On 10/7/13, Adam D. Ruppe <destructionator@gmail.com> wrote:
>> If you want strict ASCII, it should be <= 127 rather than 255 because the high bit can be all kinds of different encodings (the first 256 Unicode code points, I think, match Latin-1 numerically, but that's different from Windows-1252 or the various non-English extended ASCIIs).
>>
>> You could also convert UTF-8 to ASCII, sort of, by just stripping out any byte > 127, since bytes higher than that are always part of multibyte sequences in UTF-8.
>
> Thanks. I got some useful info from Jakob from IRC, and ended up with this:
>
> bool isAsciiString(string input)
> {
>     auto data = cast(const(ubyte)[])input;
>     return data.all!(a => a <= 0x7F);
> }
>
> The cast is needed to avoid decoding by the "all" function. Also there's isASCII that works on a dchar in std.ascii, but I was looking for something that works on entire strings at once. So the above function does the work for me.
You can use std.string.representation to do the cast for you, and you might as well just use isASCII anyways.
    return input.representation().all!isASCII();
If we want even more efficiency, we could iterate on the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080/0x80808080_80808080, and if one of the resulting masked elements is not null, then the string isn't ASCII.
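Spelled out as a complete function, the `representation` suggestion might look like the sketch below. The name `isAsciiViaRepresentation` is invented here to avoid clashing with `std.ascii.isASCII`.

```d
import std.algorithm : all;
import std.ascii : isASCII;
import std.string : representation;

/// Hypothetical name; a sketch of the one-liner suggested above.
bool isAsciiViaRepresentation(const(char)[] input)
{
    // representation reinterprets the chars as ubytes without a manual
    // cast; each byte converts implicitly to the dchar isASCII expects.
    return input.representation.all!isASCII;
}

unittest
{
    assert(isAsciiViaRepresentation("hello"));
    assert(isAsciiViaRepresentation(""));
    assert(!isAsciiViaRepresentation("h\u00E9llo"));
}
```

Because `all` sees a `ubyte[]` rather than a `string`, no decoding happens and the check stays a plain byte scan.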
October 07, 2013 Re: isAsciiString in Phobos?
Posted in reply to monarch_dodra | On 10/7/13, monarch_dodra <monarchdodra@gmail.com> wrote:
> If we want even more efficiency, we could iterate on the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080/0x80808080_80808080, and if one of the resulting masked elements is not null, then the string isn't ASCII.
Clever! So I think we should definitely try and push it to the library.
October 07, 2013 Re: isAsciiString in Phobos?
Posted in reply to Andrej Mitrovic | On Monday, 7 October 2013 at 16:23:12 UTC, Andrej Mitrovic wrote:
> On 10/7/13, monarch_dodra <monarchdodra@gmail.com> wrote:
>> If we want even more efficiency, we could iterate on the string, interpreting it as a size_t[]. We mask each of its elements with 0x80808080/0x80808080_80808080, and if one of the resulting masked elements is not null, then the string isn't ASCII.
>
> Clever! So I think we should definitely try and push it to the library.
I wrote this. Only lightly tested.
//--------
bool isASCII(const(char[]) str)
{
    static if (size_t.sizeof == 8)
    {
        enum size = 8;
        enum size_t mask = 0x80808080_80808080;
        enum size_t alignMask = ~cast(size_t)0b111;
    }
    else
    {
        enum size = 4;
        enum size_t mask = 0x80808080;
        enum size_t alignMask = ~cast(size_t)0b11;
    }
    // Too short to contain a full aligned word: check byte by byte.
    if (str.length < size)
    {
        foreach (c; str)
            if (c & 0x80)
                return false;
        return true;
    }
    // First aligned address strictly after str.ptr, and last aligned
    // address at or before the end of the string.
    immutable start = (cast(size_t)str.ptr & alignMask) + size;
    immutable end = cast(size_t)(str.ptr + str.length) & alignMask;
    // We start with the aligned block, because it is faster and chances
    // are the start is aligned anyway (so we check it later).
    for ( auto p = cast(size_t*)start ; p != cast(size_t*)end ; ++p )
        if (*p & mask)
            return false;
    // Then the trailing chars.
    for ( auto p = cast(char*)end ; p != str.ptr + str.length ; ++p )
        if (*p & 0x80)
            return false;
    // Finally, the first chars.
    for ( auto p = str.ptr ; p != cast(char*)start ; ++p )
        if (*p & 0x80)
            return false;
    return true;
}
//--------
unittest
{
    assert( "hello".isASCII());
    assert( "heellohelloellohelloellohelloellohellollohello".isASCII());
    assert( "hellellohelloellohelloo"[3 .. $].isASCII());
    assert(!"heéppellohelloellohelloellohelloellohelloellohellollo".isASCII());
    assert(!"heppellohelloellohelloellohéelloellohelloellohellollo".isASCII());
    assert(!"heppellohelloellohelloellohelloellohelloellohellolléo".isASCII());
}
//--------
What do you think? I have some doubts though:
1. Does x64 require qword alignment for size_t, or is dword enough?
2. Isn't there some built-in that'll give me the wanted alignment, instead of doing it by hand?
3. Are those casts 100% correct?
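Without answering the three questions, one way to sidestep them is sketched below. It is a different technique from the code above: each word-sized chunk is copied into a local `size_t` with `memcpy`, so there are no pointer-to-integer casts and no alignment arithmetic. The name `isAsciiByWords` is hypothetical, and whether this is as fast as the aligned-load version has not been measured here.

```d
import core.stdc.string : memcpy;

/// Hypothetical alternative: word-at-a-time high-bit test that loads each
/// word through memcpy instead of dereferencing an aligned size_t*.
bool isAsciiByWords(const(char)[] str)
{
    enum size_t wordSize = size_t.sizeof;
    static if (wordSize == 8)
        enum size_t mask = 0x80808080_80808080;
    else
        enum size_t mask = 0x80808080;

    size_t i = 0;
    // Full words: any set high bit means a non-ASCII byte is present.
    for (; i + wordSize <= str.length; i += wordSize)
    {
        size_t word;
        memcpy(&word, str.ptr + i, wordSize);
        if (word & mask)
            return false;
    }
    // Remaining tail of fewer than wordSize bytes.
    for (; i < str.length; ++i)
        if (str[i] & 0x80)
            return false;
    return true;
}

unittest
{
    string s = "hello hello hello hello hello";
    // Slicing at different offsets exercises differently aligned starts.
    foreach (off; 0 .. 8)
        assert(isAsciiByWords(s[off .. $]));
    assert(isAsciiByWords(""));
    assert(!isAsciiByWords("hello hello h\u00E9llo hello"));
    assert(!isAsciiByWords("\u00E9"));
}
```

The byte order inside `word` does not matter, since the mask tests the high bit of every byte position.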