January 04, 2013
04-Jan-2013 21:48, monarch_dodra wrote:
> On Friday, 4 January 2013 at 13:18:48 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 15:58, Jonathan M Davis wrote:
>>> On Thursday, January 03, 2013 20:40:47 monarch_dodra wrote:
>>>> So... do we agree on
>>>> ascii: int - not found => -1
>>>> uni: double - not found => nan
>>>
>>> I'm not a fan of the ASCII version returning -1, but I don't really
>>> have a
>>> better suggestion. I suppose that you could throw instead, but I
>>> don't know if
>>> that's a good idea or not. It _would_ be more consistent with our other
>>> conversion functions however.
>>>
>>> - Jonathan M Davis
>>
>> I find low-level stuff that throws to be overly awkward to deal with
>> (not to mention performance problems).
>>
>> Hm... I've found a brilliant primitive, Expected!T, that could be of
>> great help in the error code vs exceptions problem. See Andrei's
>> recent talk that went live not long ago:
>>
>> http://channel9.msdn.com/Shows/Going+Deep/C-and-Beyond-2012-Andrei-Alexandrescu-Systematic-Error-Handling-in-C
>>
>>
>> Time to put the analogous stuff into Phobos?
>
> I finished an implementation:
>
> https://github.com/D-Programming-Language/phobos/pull/1052
>
> It is not "pull ready", so we can still discuss it.
>

Well, for a start it features tons of code duplication. But I'm replacing the whole std.uni anyway...

> I raised a couple of issues in the pull, which I'll copy here:
>
> //----
> I did run into a couple of issues, namely that I'm not getting 100%
> equivalence between chars that are numeric, and chars with numeric
> value... Is this normal...?
>

Yes, it's called Unicode ;)

> * There's a fair bit of chars that have numeric value, but aren't
> isNumber. I think they might be new in 6.1.0. But I'm not sure. I
> decided it was best to have them return nan, instead of having
> inconsistent behavior.

You might also be using 6.2. It was released in the fall of 2012.

> * There's a couple characters in tableLo that have numeric values. These
> aren't considered in isNumber either. I think this might be a bug though.
> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
> SIGN". These return wild values, and in particular two of them return
> -1. I *think* this should actually return nan for us, because (AFAIK),
> -1 is just wild for invalid :/

Some have a numeric value of '-1', I think. The truth of the matter is that, as usual with Unicode, things are rather complicated.
So 'numeric character' is a category (general) and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with category.

Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
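
To make the category/property distinction concrete, here is a minimal sketch in Python (not D, chosen only because its standard `unicodedata` module exposes both pieces of data). `numeric_value` and `is_number` are hypothetical helpers, not Phobos functions; `numeric_value` follows the convention proposed earlier in the thread of returning nan when there is no value. Results depend on the Unicode version bundled with the interpreter.

```python
import math
import unicodedata

def numeric_value(ch):
    """Numeric value of ch as a float, or nan when it has none (hypothetical helper)."""
    return unicodedata.numeric(ch, math.nan)

def is_number(ch):
    """True when the General_Category of ch is one of Nd, Nl or No."""
    return unicodedata.category(ch).startswith("N")

# Digit five, Roman numeral eight, vulgar fraction one half, CJK ideograph three, letter A.
for ch in ["5", "\u2167", "\u00bd", "\u4e09", "A"]:
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
          f"is_number={is_number(ch)} value={numeric_value(ch)}")
```

The ideograph U+4E09 is the interesting row: its category is a letter category, yet it carries a numeric value.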

>
> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
> input file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> It doesn't have a separate field for isNumber/numericValue, so it is
> forced to write a wild number. Maybe these four chars should return nan?

Nope. Does letter 'A' return a wild number?

> //----
>
> Oh yeah, I also added isNumber to std.ascii. Feels wrong to not have it
> if we have numericValue.


-- 
Dmitry Olshansky
January 04, 2013
On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
> 04-Jan-2013 21:48, monarch_dodra wrote:
>>
>> I finished an implementation:
>>
>> https://github.com/D-Programming-Language/phobos/pull/1052
>>
>> It is not "pull ready", so we can still discuss it.
>>
>
> Well, for a start it features tons of code duplication. But I'm replacing the whole std.uni anyway...

Well, I wrote that with duplication, keeping in mind you would
probably replace both. I thought it would be cleaner to have some duplication than a warped single implementation. I could also make the extra effort. I was really concerned with first having an implementation that is Unicode correct.

I also thought that, at worst, you could use my parsed data ;) to submit your own (superior?) pull.

>> * There's a couple characters in tableLo that have numeric values. These
>> aren't considered in isNumber either. I think this might be a bug though.
>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
>> SIGN". These return wild values, and in particular two of them return
>> -1. I *think* this should actually return nan for us, because (AFAIK),
>> -1 is just wild for invalid :/
>
> Some have a numeric value of '-1', I think. The truth of the matter is that, as usual with Unicode, things are rather complicated.
> So 'numeric character' is a category (general) and 'has numeric value' is some other property of a codepoint that may or may not correlate directly with category.
>
> Thus I think (looking ahead into your other post) that isNumber is correct as it follows its documented behavior.
>
>>
>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
>> input file:
>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>> It doesn't have a separate field for isNumber/numericValue, so it is
>> forced to write a wild number. Maybe these four chars should return nan?
>
> Nope. Does letter 'A' return a wild number?
>

Well, the thing is that I'm getting contradictory info from the
consortium itself:
Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
According to the "UnicodeData.txt", its numeric value is -1.
According to the "Unicode utilities", it is not a numeric type,
and its value is null:
http://unicode.org/cldr/utility/character.jsp?a=12456

Also according to the consortium: "-1" is an illegal numeric
value.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]

Really, all the info seems to indicate a bug in UnicodeData.txt:
They really seem like 4 entries in Nl that aren't numbers.

I've found a couple of people on the internet discussing this, but no
hard conclusion :/

****

Anyways, those 4 CUNEIFORM aside, what do you make of the
entries in Lo:
http://unicode.org/cldr/utility/character.jsp?a=F96B
These appear to be numeric, but aren't inside Nd/No/Nl. They
should return true to isNumber, no?

Maybe isNumber's "documented behavior" is wrong?
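
One way to see how many such entries exist is to scan every code point and keep those that have a numeric value but fall outside Nd/Nl/No. This is a rough Python sketch, not the parsing code from the pull; `numeric_outside_number_categories` is a hypothetical name, and the count it reports depends on the Unicode version shipped with the interpreter.

```python
import sys
import unicodedata

def numeric_outside_number_categories():
    """Code points that have a numeric value but whose General_Category is not N*."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.numeric(ch, None) is None:
            continue  # no numeric value at all
        if not unicodedata.category(ch).startswith("N"):
            found.append(cp)
    return found

outliers = numeric_outside_number_categories()
print(len(outliers), "code points, e.g.", [f"U+{cp:04X}" for cp in outliers[:5]])
```

The list includes U+F96B from the link above, along with other ideographs that carry a value while being categorised as letters.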
January 04, 2013
05-Jan-2013 00:51, monarch_dodra wrote:
> On Friday, 4 January 2013 at 20:33:12 UTC, Dmitry Olshansky wrote:
>> 04-Jan-2013 21:48, monarch_dodra wrote:
>>>
>>> I finished an implementation:
>>>
>>> https://github.com/D-Programming-Language/phobos/pull/1052
>>>
>>> It is not "pull ready", so we can still discuss it.
>>>
>>
>> Well, for a start it features tons of code duplication. But I'm
>> replacing the whole std.uni anyway...
>
> Well, I wrote that with duplication, keeping in mind you would
> probably replace both. I thought it would be cleaner to have some duplication
> than a warped single implementation. I could also make the extra effort.
> I was really concerned with first having an implementation that is
> Unicode correct.
>
> I also thought that, at worst, you could use my parsed data ;) to submit
> your module that is well due for peer review.

Fixed ;)

>>> * There's a couple characters in tableLo that have numeric values. These
>>> aren't considered in isNumber either. I think this might be a bug
>>> though.
>>> * There are 4 "non-number numeric" characters in "CUNEIFORM NUMERIC
>>> SIGN". These return wild values, and in particular two of them return
>>> -1. I *think* this should actually return nan for us, because (AFAIK),
>>> -1 is just wild for invalid :/
>>
>> Some have a numeric value of '-1', I think. The truth of the matter is
>> that, as usual with Unicode, things are rather complicated.
>> So 'numeric character' is a category (general) and 'has numeric value'
>> is some other property of a codepoint that may or may not correlate
>> directly with category.
>>
>> Thus I think (looking ahead into your other post) that isNumber is
>> correct as it follows its documented behavior.
>>
>>>
>>> Maybe we should just return -1 on invalid unicode? Or maybe it's just my
>>> input file:
>>> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>>> It doesn't have a separate field for isNumber/numericValue, so it is
>>> forced to write a wild number. Maybe these four chars should return nan?
>>
>> Nope. Does letter 'A' return a wild number?
>>
>
> Well, the thing is that I'm getting contradictory info from the
> consortium itself:
> Given 0x12456: "CUNEIFORM NUMERIC SIGN NIGIDAMIN"
> According to the "UnicodeData.txt", its numeric value is -1.
> According to the "Unicode utilities", it is not a numeric type,
> and its value is null:
> http://unicode.org/cldr/utility/character.jsp?a=12456
>
> Also according to the consortium: "-1" is an illegal numeric
> value.
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Numeric_Value=-1:]
>
> Really, all the info seems to indicate a bug in UnicodeData.txt:
> They really seem like 4 entries in Nl that aren't numbers.
>
> I've found a couple of people on the internet discussing this, but no
> hard conclusion :/

Basically check the bottom of that page:
....
See also: Unicode Display Problems.
Version 3.6; ICU version: 50.0.1.0; Unicode version: 6.1.0.0

So it's not up to date. The file is. I can test with ICU 51 to see what it reports.

>
> ****
>
> Anyways, those 4 CUNEIFORM aside, what do you make of the
> entries in Lo:
> http://unicode.org/cldr/utility/character.jsp?a=F96B
> These appear to be numeric, but aren't inside Nd/No/Nl. They
> should return true to isNumber, no?

Hmmm. Take a look here:
http://unicode.org/cldr/utility/properties.jsp

There is a section called Numeric that has 3 properties,
and then there is a General section.
The General has Category which in turn has 'Number' category.

Bottom line is that I believe that std.uni isXXX queries the category of a symbol and not some other property. Let any mismatches between the properties and the general category be the consortium's headache.

>
> Maybe isNumber's "documented behavior" is wrong?

Problem is I can't come up with a good description of some other behavior. Maybe this one [^[:Numeric_Type=None:]]
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=


-- 
Dmitry Olshansky
January 04, 2013
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
> 05-Jan-2013 00:51, monarch_dodra wrote:
>> Anyways, those 4 CUNEIFORM aside, what do you make of the
>> entries in Lo:
>> http://unicode.org/cldr/utility/character.jsp?a=F96B
>> These appear to be numeric, but aren't inside Nd/No/Nl. They
>> should return true to isNumber, no?
>
> Hmmm. Take a look here:
> http://unicode.org/cldr/utility/properties.jsp
>
> There is a section called Numeric that has 3 properties,
> and then there is a General section.
> The General has Category which in turn has 'Number' category.
>
> Bottom line is that I believe that std.uni isXXX queries the category of a symbol and not some other property. Let any mismatches between the properties and the general category be the consortium's headache.
>>
>> Maybe isNumber's "documented behavior" is wrong?
>
> Problem is I can't come up with a good description of some other behavior. Maybe this one [^[:Numeric_Type=None:]]
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5E%5B%3ANumeric_Type%3DNone%3A%5D%5D&g=

Sounds like the root of the problem is that isNumber != Numeric_Type[Decimal, Digit, Numeric]

Ergo, there is no correlation between isNumber and numericValue.

Feels like there is a lot missing from std.uni, but at the same time, unicode is really huge.

At the very least, I think we should have a Category enum, along with a (get) "category" function.
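
As a rough illustration of the shape such an API could take, here is a Python sketch built on `unicodedata.category`. The `Category` enum and the `category` function are hypothetical names for illustration only; nothing like them existed in std.uni at the time of writing, and a D version would presumably look different.

```python
import unicodedata
from enum import Enum

class Category(Enum):
    """Major General_Category classes, keyed by the first letter of the two-letter code."""
    LETTER = "L"
    MARK = "M"
    NUMBER = "N"
    PUNCTUATION = "P"
    SYMBOL = "S"
    SEPARATOR = "Z"
    OTHER = "C"

def category(ch):
    """Return the major category of ch (hypothetical API, not a std.uni function)."""
    return Category(unicodedata.category(ch)[0])

print(category("5"), category("A"), category(" "))
```

Keying the enum on the major class keeps it small; a finer-grained version would enumerate the two-letter codes (Nd, Nl, No and so on) instead.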

I was just saying to jmdavis in the pull that std.ascii had "isDigit", but that uni didn't. In truth, both also lack isDecimal and isNumeric.

There would just be a bit of ambiguity now between the broad "isNumeric", and "all the chars that have a numeric value"... :/

Damn. Unicode is complicated.

Anyways, taking my weekend break.
January 05, 2013
On Fri, Jan 04, 2013 at 11:48:39PM +0100, monarch_dodra wrote: [...]
> Sounds like the root of the problem is that isNumber != Numeric_Type[Decimal, Digit, Numeric]
> 
> Ergo, there is no correlation between isNumber and numericValue.

Yikes. That's pretty ... nasty. :-(


> Feels like there is a lot missing from std.uni, but at the same time, unicode is really huge.

Yeah, Unicode is a lot more complex than most people realize. Recently I read through TR14 (proper line-breaking in Unicode), and I was gaping in awe at the insane complexity of such a seemingly-simple task.


> At the very least, I think we should have a Category enum, along with a
> (get) "category" function.

Yes! We need that!!


> I was just saying to jmdavis in the pull that std.ascii had "isDigit", but that uni didn't. In truth, both also lack isDecimal and isNumeric.
> 
> There would just be a bit of ambiguity now between the broad "isNumeric", and "all the chars that have a numeric value"... :/
> 
> Damn. Unicode is complicated.
[...]

I, for one, would love to know why isNumeric != hasNumericValue.


T

-- 
Valentine's Day: an occasion for florists to reach into the wallets of nominal lovers in dire need of being reminded to profess their hypothetical love for their long-forgotten.
January 07, 2013
On Friday, 4 January 2013 at 22:00:02 UTC, Dmitry Olshansky wrote:
> [SNIP]

Thank you for all your feedback.

*everything* makes sense now.

However, the conclusion I'm coming to is that some groundwork is needed before doing numeric value, and that groundwork is what I am currently doing.
January 07, 2013
On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
> [...]
>
> I, for one, would love to know why isNumeric != hasNumericValue.
>
>
> T

I guess it's just bad wording from the standard.

The standard defined 3 groups that make up Number:
[Nd] 	Number, Decimal Digit
[Nl] 	Number, Letter
[No] 	Number, Other

However, there are a couple of characters that *are* numbers, but aren't in those groups.

The "good" news is that the standard *does* define a Numeric_Type property to classify the kind of number a char is:
* None: Not a number
* Decimal: A digit used in positional decimal notation, such as '5'
* Digit: A digit-like character that is not used positionally, such as superscript '²'
* Numeric: Everything else that has a numeric value.

So they used "Numeric" as wild, and "Number" as their general category.

This leaves us with ambiguity when choosing our word:
Technically '5' does not classify as "numeric", although you could consider it "has a numeric value".

I hope that makes sense.
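
For reference, the Numeric_Type of a character can be approximated in Python from the three value fields that `unicodedata` exposes. This is a sketch under the assumption that a decimal value implies Decimal, a digit value without a decimal value implies Digit, and any other numeric value implies Numeric; `numeric_type` is a hypothetical helper, and the results depend on the interpreter's Unicode version.

```python
import unicodedata

def numeric_type(ch):
    """Approximate Numeric_Type of ch: 'Decimal', 'Digit', 'Numeric' or 'None'."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"   # usable in positional decimal notation
    if unicodedata.digit(ch, None) is not None:
        return "Digit"     # digit-like, but not used positionally
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"   # any other character with a numeric value
    return "None"

# Digit five, superscript two, vulgar fraction one half, letter A.
for ch in ["5", "\u00b2", "\u00bd", "A"]:
    print(f"U+{ord(ch):04X} {numeric_type(ch)}")
```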
January 09, 2013
On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
> >[...]
> >I, for one, would love to know why isNumeric != hasNumericValue.
[...]
> I guess it's just bad wording from the standard.
> 
> The standard defined 3 groups that make up Number:
> [Nd] 	Number, Decimal Digit
> [Nl] 	Number, Letter
> [No] 	Number, Other
> 
> However, there are a couple of characters that *are* numbers, but aren't in those groups.
> 
> The "good" news is that the standard *does* define a Numeric_Type
> property to classify the kind of number a char is:
> * None: Not a number
> * Decimal: A digit used in positional decimal notation, such as '5'
> * Digit: A digit-like character that is not used positionally, such as superscript '²'
> * Numeric: Everything else that has a numeric value.
> 
> So they used "Numeric" as wild, and "Number" as their general category.
> 
> This leaves us with ambiguity when choosing our word: Technically '5' does not classify as "numeric", although you could consider it "has a numeric value".
> 
> I hope that makes sense.

Hmph. I guess we need to differentiate between the unicode category called "numeric", and the property of having a numerical value. So we'd need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's what the standard is, then that's what it is.

Anyway, I'd love to see std.uni cover all unicode categories.

Offhanded note: should we unify the various isX() functions into:

	bool inCategory(string category)(dchar ch)

where category is the Unicode designation, say "Nl", "Nd", etc.? That way, it's more future-proof in case the Unicode guys add more categories. It also makes it easier to remember which function to call; otherwise you'd always have to remember "N" -> isNumeric, "L" -> isAlpha, etc.

The current names of course can be left as aliases.
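
A rough Python analogue of the idea, for illustration only: the D signature above takes the category as a compile-time template argument, whereas this sketch takes it as an ordinary runtime argument. `in_category`, `is_number` and `is_alpha` are hypothetical names here, and a one-letter designation is treated as matching every subcategory under it.

```python
import unicodedata

def in_category(ch, category):
    """True when ch's General_Category equals or falls under the designation.

    "Nd" matches only decimal digits; "N" matches Nd, Nl and No.
    """
    return unicodedata.category(ch).startswith(category)

# The familiar names can stay as thin aliases over the generic query.
def is_number(ch):
    return in_category(ch, "N")

def is_alpha(ch):
    return in_category(ch, "L")

print(in_category("5", "Nd"), in_category("\u2167", "Nd"), is_number("\u2167"))
```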


T

-- 
The fact that anyone still uses AOL shows that even the presence of options doesn't stop some people from picking the pessimal one. - Mike Ellis
January 10, 2013
10-Jan-2013 03:21, H. S. Teoh wrote:
> On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
>> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
>>> [...]
>>> I, for one, would love to know why isNumeric != hasNumericValue.
> [...]
>> I guess it's just bad wording from the standard.
>>
>> The standard defined 3 groups that make up Number:
>> [Nd] 	Number, Decimal Digit
>> [Nl] 	Number, Letter
>> [No] 	Number, Other
>>
>> However, there are a couple of characters that *are* numbers, but
>> aren't in those groups.
>>
>> The "good" news is that the standard *does* define a Numeric_Type
>> property to classify the kind of number a char is:
>> * None: Not a number
>> * Decimal: A digit used in positional decimal notation, such as '5'
>> * Digit: A digit-like character that is not used positionally, such as superscript '²'
>> * Numeric: Everything else that has a numeric value.
>>
>> So they used "Numeric" as wild, and "Number" as their general
>> category.
>>
>> This leaves us with ambiguity when choosing our word:
>> Technically '5' does not classify as "numeric", although you could
>> consider it "has a numeric value".
>>
>> I hope that makes sense.
>
> Hmph. I guess we need to differentiate between the unicode category
> called "numeric", and the property of having a numerical value. So we'd
> need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
> what the standard is, then that's what it is.

isNumber - _Number_ General category (as defined by Unicode 1:1)

isNumeric - as having NumericType != None (again going by the definition of Unicode properties)

And that's all, correct and to the letter.

>
> Anyway, I'd love to see std.uni cover all unicode categories.
>
> Offhanded note: should we unify the various isX() functions into:
>
> 	bool inCategory(string category)(dchar ch)
>

No, no, no! It's a horrible idea. The main problem with it is that a huge catalog of data has to be stored in Phobos (object code) that is of no (even niche) use. Also, to be practical for use cases other than casual observation it has to be fast... and it can't be for any of the useful cases.

Just count the number of bits to store per codepoint and the fairly irregular structure of the whole set of properties (unlike individual combinations that do have a nice distribution, e.g. Scripts as in Cyrillic).

I've been shoulder-deep in Unicode for about half a year now, reading through the TR-xx algorithms, and *none* of them requires queries of the sort that test all (more than 1-2?) of the properties.

In all cases the algorithm itself defines a set(s) of codepoints with different meanings/values for this use case. These (useful) sets could be compressed to a fast multi-stage table; the whole catalog of properties can't, as it packs enormous heaps of unused junk (Unicode_Age anyone??). This junk is not fit for the std library, but the goal is to provide tools for the user to work with sets/data beyond the commonly useful ones in std.
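
To illustrate why a single set compresses well, here is a sketch of a two-stage lookup table in Python. It is not the std.uni implementation: the page size of 256, the names `build_two_stage` and `lookup`, and the use of the Number category as the example set are all assumptions made for the illustration.

```python
import unicodedata

BLOCK = 256  # code points per second-stage page (an arbitrary choice)

def build_two_stage(predicate, limit=0x110000):
    """Pack a code point set into an index table plus deduplicated pages."""
    pages, index, seen = [], [], {}
    for start in range(0, limit, BLOCK):
        page = tuple(predicate(chr(cp)) for cp in range(start, start + BLOCK))
        if page not in seen:          # identical pages are stored only once
            seen[page] = len(pages)
            pages.append(page)
        index.append(seen[page])
    return index, pages

def lookup(table, cp):
    """Membership test: one read from the index, then one read from a page."""
    index, pages = table
    return pages[index[cp // BLOCK]][cp % BLOCK]

def is_number(ch):
    return unicodedata.category(ch).startswith("N")

table = build_two_stage(is_number)
print(len(table[0]), "index entries,", len(table[1]), "unique pages")
```

Most pages of the code space contain no member of the set at all, so they collapse into one shared all-false page. A table that had to answer for every property at once would lose most of that sharing.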

> where category is the Unicode designation, say "Nl", "Nd", etc.? That
> way, it's more future-proof in case the Unicode guys add more
> categories.

I'm posting my work on std.uni as ready for review today or tomorrow.
It includes a type for a set of codepoints and a ton of predefined sets for Nl, Nd and almost everything sensible (blocks, scripts, properties).
The user can then conjure whatever combination required.
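
As a rough picture of what combining such sets might look like, here is a small Python sketch of a code point set stored as an inversion list (a sorted list of interval boundaries). The class `CodepointSet` and the helper `by_category` are hypothetical and are not the types from the upcoming std.uni; the set operations are implemented naively for brevity.

```python
import sys
import unicodedata
from bisect import bisect_right

class CodepointSet:
    """A set of code points stored as sorted [start, end) boundaries."""

    def __init__(self, codepoints=()):
        self.bounds = []
        for cp in sorted(set(codepoints)):
            if self.bounds and self.bounds[-1] == cp:
                self.bounds[-1] = cp + 1     # extend the current interval
            else:
                self.bounds += [cp, cp + 1]  # open a new interval

    def __contains__(self, cp):
        # An odd count of boundaries at or below cp means cp is inside an interval.
        return bisect_right(self.bounds, cp) % 2 == 1

    def __iter__(self):
        for start, end in zip(self.bounds[::2], self.bounds[1::2]):
            yield from range(start, end)

    def __or__(self, other):
        return CodepointSet(set(self) | set(other))

    def __sub__(self, other):
        return CodepointSet(set(self) - set(other))

def by_category(code):
    """Build the set for one General_Category code such as 'Nd' (hypothetical helper)."""
    return CodepointSet(cp for cp in range(sys.maxunicode + 1)
                        if unicodedata.category(chr(cp)) == code)

nd, nl, no = by_category("Nd"), by_category("Nl"), by_category("No")
number = nd | nl | no         # everything in the Number category
non_decimal = number - nd     # numbers that are not decimal digits
print(len(number.bounds) // 2, "intervals in the Number set")
```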

And it's still way smaller than having the full 'query the database' thing. To check the full madness of all of the properties just use the web interface of unicode.org.

P.S. Hopefully, nobody raises the point of codepoint _names_; they are, after all, also part of the Unicode standard (and character database).

-- 
Dmitry Olshansky
January 10, 2013
On Thursday, 10 January 2013 at 18:09:31 UTC, Dmitry Olshansky wrote:
> 10-Jan-2013 03:21, H. S. Teoh wrote:
>> On Mon, Jan 07, 2013 at 07:51:19PM +0100, monarch_dodra wrote:
>>> On Saturday, 5 January 2013 at 00:47:14 UTC, H. S. Teoh wrote:
>>>> [...]
>>>> I, for one, would love to know why isNumeric != hasNumericValue.
>> [...]
>>> I guess it's just bad wording from the standard.
>>>
>>> The standard defined 3 groups that make up Number:
>>> [Nd] 	Number, Decimal Digit
>>> [Nl] 	Number, Letter
>>> [No] 	Number, Other
>>>
>>> However, there are a couple of characters that *are* numbers, but
>>> aren't in those groups.
>>>
>>> The "good" news is that the standard *does* define a Numeric_Type
>>> property to classify the kind of number a char is:
>>> * None: Not a number
>>> * Decimal: A digit used in positional decimal notation, such as '5'
>>> * Digit: A digit-like character that is not used positionally, such as superscript '²'
>>> * Numeric: Everything else that has a numeric value.
>>>
>>> So they used "Numeric" as wild, and "Number" as their general
>>> category.
>>>
>>> This leaves us with ambiguity when choosing our word:
>>> Technically '5' does not classify as "numeric", although you could
>>> consider it "has a numeric value".
>>>
>>> I hope that makes sense.
>>
>> Hmph. I guess we need to differentiate between the unicode category
>> called "numeric", and the property of having a numerical value. So we'd
>> need both isNumeric and hasNumericValue. Ugh. It's ugly but if that's
>> what the standard is, then that's what it is.
>
> isNumber - _Number_ General category (as defined by Unicode 1:1)
>
> isNumeric - as having NumericType != None (again going by the definition of Unicode properties)
>
> And that's all, correct and to the letter.

Are you sure about that? The four values of Numeric_Type are:
* Decimal
* Digit
* None
* Numeric <= !!!
http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Type#Numeric_Type

Hopefully, we'll have "isDecimal", "isDigit", and eventually "isNumeric", which according to definition, would simply be "Numeric_Type == Numeric_Type.Numeric"

The problem is that by the definitions of Unicode properties, there is no name for "not in Numeric_Type.None"

"hasNumericValue" is the best name I could come up with to differentiate between "Not Numeric_Type.None" and "Numeric_Type.Numeric"
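
The naming scheme can be sketched as four predicates. This is Python rather than D, all four names are hypothetical, and the mapping from `unicodedata`'s decimal/digit/numeric fields to Numeric_Type is an approximation that depends on the interpreter's Unicode version.

```python
import unicodedata

def is_decimal(ch):
    """Numeric_Type == Decimal."""
    return unicodedata.decimal(ch, None) is not None

def is_digit(ch):
    """Numeric_Type == Digit: has a digit value but no decimal value."""
    return not is_decimal(ch) and unicodedata.digit(ch, None) is not None

def has_numeric_value(ch):
    """Numeric_Type != None: any of Decimal, Digit or Numeric."""
    return unicodedata.numeric(ch, None) is not None

def is_numeric(ch):
    """Numeric_Type == Numeric: has a value but is neither Decimal nor Digit."""
    return has_numeric_value(ch) and unicodedata.digit(ch, None) is None

for ch in ["5", "\u00b2", "\u00bd", "A"]:
    print(f"U+{ord(ch):04X}", is_decimal(ch), is_digit(ch),
          is_numeric(ch), has_numeric_value(ch))
```

With these definitions '5' satisfies `has_numeric_value` but not `is_numeric`, which is exactly the ambiguity the two names are meant to separate.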