July 26, 2013 Should I Use std.ascii.isWhite or std.uni.isWhite?
I'm confused about which isWhite function I should use. Aren't all chars in D (char, wchar, dchar) Unicode characters? Should I always use std.uni.isWhite, unless I'm working with bytes and byte arrays? The documentation doesn't give me much to go on, besides "All of the functions in std.ascii accept unicode characters but effectively ignore them. All isX functions return false for unicode characters, and all toX functions do nothing to unicode characters."
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Meta | On Friday, July 26, 2013 06:09:39 Meta wrote:
> I'm confused about which isWhite function I should use. Aren't all chars in D (char, wchar, dchar) Unicode characters? Should I always use std.uni.isWhite, unless I'm working with bytes and byte arrays? The documentation doesn't give me much to go on, besides "All of the functions in std.ascii accept unicode characters but effectively ignore them. All isX functions return false for unicode characters, and all toX functions do nothing to unicode characters."
Unicode contains ASCII, but very few Unicode characters are ASCII, because there just aren't very many ASCII characters and there are a _ton_ of Unicode characters. The std.ascii functions return true for certain sets of ASCII characters and false for everything else. The std.uni functions return true for many Unicode characters as well. You wouldn't normally use std.ascii if you're operating on non-ASCII Unicode characters, but it ignores them if it does run into them.
std.ascii.isWhite only cares about ASCII whitespace, which the documentation explicitly lists as the space, tab, vertical tab, form feed, carriage return, and linefeed characters. Those characters will return true. All other characters will return false.
std.uni.isWhite returns true for all of the characters that std.ascii.isWhite does plus a whole bunch of other non-ASCII characters that the Unicode standard considers to be whitespace.
Which function you use depends on what you're trying to do.
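To make the difference concrete, here is a minimal sketch rather than anything authoritative. The sample characters (U+00A0, U+2003, U+2028) are picked purely for illustration, and `static import` is used only so that the two same-named functions stay distinguishable:

```d
static import std.ascii;
static import std.uni;

void main()
{
    // The six ASCII whitespace characters: both functions accept them.
    foreach (dchar c; " \t\v\f\r\n"d)
    {
        assert(std.ascii.isWhite(c));
        assert(std.uni.isWhite(c));
    }

    // Non-ASCII whitespace (illustrative sample):
    // U+00A0 NO-BREAK SPACE, U+2003 EM SPACE, U+2028 LINE SEPARATOR.
    foreach (dchar c; "\u00A0\u2003\u2028"d)
    {
        assert(!std.ascii.isWhite(c)); // outside ASCII, so always false
        assert(std.uni.isWhite(c));    // whitespace according to Unicode
    }

    // An ordinary letter is whitespace for neither function.
    assert(!std.ascii.isWhite('a'));
    assert(!std.uni.isWhite('a'));
}
```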
- Jonathan M Davis
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Meta | On Friday, 26 July 2013 at 04:09:46 UTC, Meta wrote:
> I'm confused about which isWhite function I should use. Aren't all chars in D (char, wchar, dchar) Unicode characters?
They are.
> Should I always use std.uni.isWhite, unless I'm working with bytes and byte arrays?
No, char vs byte isn't necessarily a thing here.
> The documentation doesn't give me much to go on, besides "All of the functions in std.ascii accept unicode characters but effectively ignore them. All isX functions return false for unicode characters, and all toX functions do nothing to unicode characters."
You should use std.uni.isWhite unless you want to match only ASCII white space.
That could be the case when ...
* You have data that is not in Unicode, but some other superset of ASCII. Then you shouldn't use std.uni.isWhite, of course. std.ascii.isWhite might be fine. In this case, you'd actually use u{byte,short,int} instead of {,w,d}char.
* You're dealing with a grammar where ASCII white space is a thing, while Unicode white space is not.
* There's really only ASCII white space in your data, and you want every bit of speed, and you've verified that std.ascii.isWhite is indeed faster than std.uni.isWhite.
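As a rough sketch of the second case above (a grammar where only ASCII white space counts), the helper below skips leading ASCII white space and treats everything else, including Unicode white space, as content. The name `skipAsciiWhite` is hypothetical and not part of Phobos; the sketch also assumes the input is a UTF-8 `string`, where every code unit of a non-ASCII character is 0x80 or above and so is never matched by std.ascii.isWhite:

```d
import std.ascii : isWhite;

// Hypothetical helper: drops leading ASCII whitespace only.
string skipAsciiWhite(string s)
{
    size_t i = 0;
    // s[i] is a UTF-8 code unit; std.ascii.isWhite is false for any
    // value outside the six ASCII whitespace characters.
    while (i < s.length && isWhite(s[i]))
        ++i;
    return s[i .. $];
}

void main()
{
    assert(skipAsciiWhite(" \t\nfoo") == "foo");
    // U+00A0 NO-BREAK SPACE is not ASCII whitespace, so it is kept.
    assert(skipAsciiWhite("\u00A0foo") == "\u00A0foo");
    assert(skipAsciiWhite("   ") == "");
}
```

Swapping in std.uni.isWhite here would need decoding to `dchar` first, because that function expects whole code points rather than UTF-8 code units.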
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Jonathan M Davis | On Friday, 26 July 2013 at 05:06:45 UTC, Jonathan M Davis wrote:
> Unicode contains ASCII, but very few Unicode characters are ASCII, because
> there just aren't very many ASCII characters and there are a _ton_ of Unicode
> characters. The std.ascii functions return true for certain sets of ASCII
> characters and false for everything else. The std.uni functions return true
> for many Unicode characters as well. You wouldn't normally use std.ascii if
> you're operating on non-ASCII Unicode characters, but it ignores them if it
> does run into them.
>
> std.ascii.isWhite only cares about ASCII whitespace, which the documentation
> explicitly lists as the space, tab, vertical tab, form feed, carriage return,
> and linefeed characters. Those characters will return true. All other
> characters will return false.
>
> std.uni.isWhite returns true for all of the characters that std.ascii.isWhite
> does plus a whole bunch of other non-ASCII characters that the Unicode
> standard considers to be whitespace.
>
> Which function you use depends on what you're trying to do.
>
> - Jonathan M Davis
That makes sense. I know that the first 128 Unicode characters are equivalent to the 7-bit ASCII charset, but it confused me that the module is named std.ascii when it actually operates on Unicode characters, I guess.
Another question: I'm not all that familiar with Unicode, so what is the difference between std.uni.isNumber and std.ascii.isNumber? Am I right in thinking that std.uni.isNumber will match things outside of the basic 0..9?
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to anonymous | On Friday, 26 July 2013 at 05:26:33 UTC, anonymous wrote:
>> Should I always use std.uni.isWhite, unless I'm working with bytes and byte arrays?
>
> No, char vs byte isn't necessarily a thing here.
I realized after I posted this that I was being stupid in even suggesting that, seeing as all the functions in std.ascii take dchars.
>> The documentation doesn't give me much to go on, besides "All of the functions in std.ascii accept unicode characters but effectively ignore them. All isX functions return false for unicode characters, and all toX functions do nothing to unicode characters."
>
> You should use std.uni.isWhite unless you want to match only ASCII white space.
>
> That could be the case when ...
> * You have data that is not in Unicode, but some other superset of ASCII. Then you shouldn't use std.uni.isWhite, of course. std.ascii.isWhite might be fine. In this case, you'd actually use u{byte,short,int} instead of {,w,d}char.
> * You're dealing with a grammar where ASCII white space is a thing, while Unicode white space is not.
> * There's really only ASCII white space in your data, and you want every bit of speed, and you've verified that std.ascii.isWhite is indeed faster than std.uni.isWhite.
Thank you for the informative answer.
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Meta | On Friday, 26 July 2013 at 05:54:50 UTC, Meta wrote:
> Another question: I'm not all that familiar with Unicode, so what is the difference between std.uni.isNumber and std.ascii.isNumber? Am I right in thinking that std.uni.isNumber will match things outside of the basic 0..9?
Starting at <http://dlang.org/phobos/std_uni#.isNumber>:
> general Unicode category: Nd, Nl, No
Via <http://www.google.com/search?q=general+Unicode+category:+Nd,+Nl,+No> to <http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#General_Category>:
> Number (N)
> Decimal digit (Nd)
> Letter (Nl): Numerals composed of letters or letterlike symbols (e.g., Roman numerals)
> Other (No): Includes vulgar fractions and superscript and subscript digits.
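A small sketch of what those three categories mean in practice. One assumption to flag: the ASCII-only counterpart used here is std.ascii.isDigit, since std.ascii provides isDigit rather than a function called isNumber. The sample characters are chosen only to cover one member of each category:

```d
static import std.ascii;
static import std.uni;

void main()
{
    // Plain ASCII digit: both functions agree.
    assert(std.ascii.isDigit('7'));
    assert(std.uni.isNumber('7'));

    // U+0663 ARABIC-INDIC DIGIT THREE (category Nd).
    assert(!std.ascii.isDigit('\u0663'));
    assert(std.uni.isNumber('\u0663'));

    // U+2167 ROMAN NUMERAL EIGHT (category Nl).
    assert(std.uni.isNumber('\u2167'));

    // U+00BD VULGAR FRACTION ONE HALF (category No).
    assert(std.uni.isNumber('\u00BD'));

    // A letter is a number for neither function.
    assert(!std.ascii.isDigit('x'));
    assert(!std.uni.isNumber('x'));
}
```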
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Meta | On Friday, July 26, 2013 07:54:42 Meta wrote:
> Am I right in thinking that std.uni.isNumber
> will match things outside of the basic 0..9?
Yes. Expect all of the isX functions in std.uni to return true for characters outside of ASCII.
- Jonathan M Davis
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Jonathan M Davis | Jonathan M Davis:
> Which function you use depends on what you're trying to do.
Right. I have just added this:
http://d.puremagic.com/issues/show_bug.cgi?id=10717
Bye,
bearophile
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Meta | 26-Jul-2013 09:54, Meta wrote:
> On Friday, 26 July 2013 at 05:06:45 UTC, Jonathan M Davis wrote:
[snip]
>
> That makes sense. I know that the first 128 Unicode characters are
> equivalent to the 7-bit ASCII charset, but it confused me that the
> module is named std.ascii when it actually operates on Unicode
> characters, I guess.
>
> Another question: I'm not all that familiar with Unicode, so what is the
> difference between std.uni.isNumber and std.ascii.isNumber? Am I right
> in thinking that std.uni.isNumber will match things outside of the basic
> 0..9?
You are spot on. In case you want to further dig into Unicode characters and properties, there is this nice tool:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AN%3A%5D&g=
(e.g. this link shows all of 'N' = Number characters)
--
Dmitry Olshansky
July 26, 2013 Re: Should I Use std.ascii.isWhite or std.uni.isWhite?
Posted in reply to Dmitry Olshansky | On Friday, 26 July 2013 at 17:58:21 UTC, Dmitry Olshansky wrote:
> You are spot on. In case you want to further dig into Unicode characters and properties, there is this nice tool:
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AN%3A%5D&g=
> (e.g. this link shows all of 'N' = Number characters)
That is indeed a helpful link. Thanks.