December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #10 from monarchdodra@gmail.com 2012-12-21 07:53:36 PST ---
(In reply to comment #5)
> 
> I'm wrapping up a revamp of std.uni that makes it a piece of cake to create character sets. And maps are converted to multi-staged tables that are faster than binary search on a large set. I'd suggest waiting a bit on it (so as to not duplicate work) and introducing only the std.ascii version as the most useful.
> 
> The ongoing polishing, fixing and testing against ICU is going on here: https://github.com/blackwhale/gsoc-bench-2012

OK. The thing I was having trouble with, though, is that the existing binary search returns a bool, whereas I need the actual entry, so I can do "value - entry[0]", e.g.:

//----
    static immutable dchar[2][] table1 = [
    [ 0x0030,  0x0039], //
    [ 0x0660,  0x0669], //ARABIC-INDIC
    [ 0x06F0,  0x06F9], //EXTENDED ARABIC-INDIC

...
//---
That's because all the entries in [Nd] are consecutive numerals starting at 0.
I can also cram a select couple of entries from [Nl] and [Po] that also use
this scheme.

So if I have the code point 0x0665 (ARABIC-INDIC DIGIT FIVE), I'd want to find [ 0x0660,  0x0669], and then "return 0x0665 - 0x0660", which is 5.

Well, I don't need the entire pair, but at least the lhs of the pair.
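
The lookup described above could be sketched roughly as follows. This is only an illustration: it assumes a table holding just the three ranges quoted earlier, and `digitRanges` and `decimalDigitValue` are hypothetical names, not anything in std.uni or std.ascii. The -1 for "not found" is likewise just a placeholder choice.

```d
// Hypothetical table: only the three [Nd] ranges quoted in this comment.
// It must stay sorted by the first element for the binary search to work.
static immutable dchar[2][] digitRanges = [
    [0x0030, 0x0039], // ASCII digits
    [0x0660, 0x0669], // ARABIC-INDIC
    [0x06F0, 0x06F9], // EXTENDED ARABIC-INDIC
];

// Binary search that finds the enclosing range (not just a bool) and
// returns the offset from its start, i.e. "value - entry[0]".
// Returns -1 when no range contains c (placeholder error convention).
int decimalDigitValue(dchar c) @safe pure nothrow
{
    size_t lo = 0, hi = digitRanges.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < digitRanges[mid][0])
            hi = mid;
        else if (c > digitRanges[mid][1])
            lo = mid + 1;
        else
            return cast(int)(c - digitRanges[mid][0]);
    }
    return -1;
}
```

With this table, `decimalDigitValue(0x0665)` lands in the ARABIC-INDIC range and yields 0x0665 - 0x0660.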

If you could keep that in mind during your re-write. Or not. Just throwing it out there.

For all other entries in [Nl] and [Po], I'd have:
    static immutable dchar[2][] table1 = [
    [ 0x216D,  100], //ROMAN NUMERAL ONE HUNDRED

So that's just a basic dictionary. But I don't think you can statically allocate an AA. So yeah, just throwing that in your direction too.
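
One way to get a statically allocated "dictionary" without an AA is a sorted immutable array of key/value pairs searched with binary search. The sketch below assumes that approach; `NumEntry`, `romanValues` and `lookupValue` are made-up names, the four entries are only a sample, and -1 is again a placeholder for "not found".

```d
// Hypothetical entry type: a code point and its integer numeric value.
struct NumEntry
{
    dchar ch;
    int value;
}

// Sample entries only; must stay sorted by `ch` for the search below.
static immutable NumEntry[] romanValues = [
    NumEntry(0x216C, 50),   // ROMAN NUMERAL FIFTY
    NumEntry(0x216D, 100),  // ROMAN NUMERAL ONE HUNDRED
    NumEntry(0x216E, 500),  // ROMAN NUMERAL FIVE HUNDRED
    NumEntry(0x216F, 1000), // ROMAN NUMERAL ONE THOUSAND
];

// Lower-bound binary search on the key, then an exact-match check.
int lookupValue(dchar c) @safe pure nothrow
{
    size_t lo = 0, hi = romanValues.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (romanValues[mid].ch < c)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < romanValues.length && romanValues[lo].ch == c)
        return romanValues[lo].value;
    return -1; // placeholder "not found" convention
}
```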

> > The file is too large for std.xml to handle, so it's back to C++ for me :/
> > 
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> 
> Same thing but no useless XML trash. Description of fields is somewhere in the
> middle of this document
> http://www.unicode.org/reports/tr44/

Nice, TY.

> > The only questions I have is:
> > Return value: int or double?
> 
> Should be rational to accurately represent things like "1/5" character ;)
> I do suspect some simple custom type could do (2 shorts packed in one struct
> etc.).
> 
> > Input is not numeric: -1 or exception?
> 
> -1 is fine I think as this is rather low level (per character) and it's not at all
> convenient to throw (and then catch).

The only issue I have with returning -1 is that it is a magic value. The fact
that there is no unicode for -1 is pure coincidence, and not by design. In
particular, any attempt to write "if (numericValue(c) < 0) fail" would also be
wrong because:
http://unicode.org/cldr/utility/character.jsp?a=0F33
The TIBETAN DIGIT HALF ZERO returns -0.5

Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)"
?
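
To make the pitfall concrete, here is a toy sketch. `numericValueSketch` is a stand-in that only knows ASCII digits and U+0F33, not the proposed function; it just shows why a "< 0" test and a "== -1" test disagree once a legitimate negative value exists.

```d
// Toy stand-in: ASCII digits, TIBETAN DIGIT HALF ZERO, and a -1 sentinel.
double numericValueSketch(dchar c) @safe pure nothrow
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c == 0x0F33)
        return -0.5; // TIBETAN DIGIT HALF ZERO
    return -1;       // magic "not numeric" value
}

// Naive check: treats every negative result as an error.
bool looksNumericNaive(dchar c) @safe pure nothrow
{
    return numericValueSketch(c) >= 0;
}

// Exact check: only the sentinel itself means "not numeric".
bool looksNumericExact(dchar c) @safe pure nothrow
{
    return numericValueSketch(c) != -1;
}
```

The naive check rejects U+0F33 even though it has a numeric value, while the exact check depends on callers remembering the precise sentinel.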

...

Damn you unicode!

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

--- Comment #11 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-12-21 08:00:56 PST ---
(In reply to comment #10)
> (In reply to comment #5)
> > 
> > I'm wrapping up a revamp of std.uni that makes it a piece of cake to create character sets. And maps are converted to multi-staged tables that are faster than binary search on a large set. I'd suggest waiting a bit on it (so as to not duplicate work) and introducing only the std.ascii version as the most useful.
> > 
> > The ongoing polishing, fixing and testing against ICU is going on here: https://github.com/blackwhale/gsoc-bench-2012
> 
> OK. The thing I was having trouble with, though, is that the existing binary search returns a bool, whereas I need the actual entry, so I can do "value - entry[0]", e.g.:
> 
> //----
>     static immutable dchar[2][] table1 = [
>     [ 0x0030,  0x0039], //
>     [ 0x0660,  0x0669], //ARABIC-INDIC
>     [ 0x06F0,  0x06F9], //EXTENDED ARABIC-INDIC
> 
> ...
> //---
> That's because all the entries in [Nd] are consecutive numerals starting at 0.
> I can also cram a select couple of entries from [Nl] and [Po] that also use
> this scheme.
> 

Sometimes I was able to abuse the natural format of the data and sometimes I failed. But what proved to work quite well is varying the sizes of the multi-staged table to match the "periods" of the data. In the end, if the data has a lot of common "rows", a multi-staged table with the right size per stage is bound to hit a sweet spot.
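
For readers unfamiliar with the idea, a generic two-stage table can be sketched as below. This is not the code from the std.uni revamp, only an illustration of the technique under simple assumptions: a fixed block size of 256, a ubyte property per code point, and a linear scan to deduplicate blocks. `TwoStage`, `buildTwoStage` and `digitPlusOne` are hypothetical names.

```d
// Generic two-stage lookup table for a per-code-point ubyte property.
struct TwoStage
{
    enum blockBits = 8;
    enum blockSize = 1 << blockBits;

    ushort[] stage1; // block number -> index of a shared block in stage2
    ubyte[] stage2;  // concatenated unique blocks ("rows")

    // Valid only for code points below the limit used when building.
    ubyte opIndex(dchar c) const
    {
        immutable block = stage1[c >> blockBits];
        return stage2[block * blockSize + (c & (blockSize - 1))];
    }
}

// Builds the table for code points in [0, limit); identical blocks are
// stored once, which is where the space saving comes from.
TwoStage buildTwoStage(alias prop)(dchar limit)
{
    TwoStage t;
    auto buf = new ubyte[TwoStage.blockSize];
    for (uint start = 0; start < limit; start += TwoStage.blockSize)
    {
        foreach (i; 0 .. TwoStage.blockSize)
            buf[i] = prop(cast(dchar)(start + i));

        // Look for an identical block that is already stored.
        size_t found = size_t.max;
        for (size_t b = 0; b * TwoStage.blockSize < t.stage2.length; ++b)
        {
            if (t.stage2[b * TwoStage.blockSize .. (b + 1) * TwoStage.blockSize] == buf)
            {
                found = b;
                break;
            }
        }
        if (found == size_t.max)
        {
            found = t.stage2.length / TwoStage.blockSize;
            t.stage2 ~= buf; // appends a copy of the block
        }
        t.stage1 ~= cast(ushort) found;
    }
    return t;
}

// Sample property: 0 for "not a digit", otherwise digit value + 1.
ubyte digitPlusOne(dchar c)
{
    if (c >= '0' && c <= '9')
        return cast(ubyte)(c - '0' + 1);
    if (c >= 0x0660 && c <= 0x0669)
        return cast(ubyte)(c - 0x0660 + 1);
    return 0;
}
```

Built with `buildTwoStage!digitPlusOne(0x0800)`, the eight 256-entry blocks collapse to three stored blocks, because every all-zero block is shared. Tuning the block size to the data is the knob the paragraph above refers to.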

> So if I have the code point 0x0665 (ARABIC-INDIC DIGIT FIVE), I'd want to find [ 0x0660,  0x0669], and then "return 0x0665 - 0x0660", which is 5.
> 
> Well, I don't need the entire pair, but at least the lhs of the pair.
> 
> If you could keep that in mind during your re-write. Or not. Just throwing it out there.
> 
> For all other entries in [Nl] and [Po], I'd have:
>     static immutable dchar[2][] table1 = [
>     [ 0x216D,  100], //ROMAN NUMERAL ONE HUNDRED
> 
> So that's just a basic dictionary. But I don't think you can statically allocate an AA. So yeah, just throwing that in your direction too.
> 

Well, AA is a fat pig w.r.t RAM usage. But thanks anyway.

> > > The file is too large for std.xml to handle, so it's back to C++ for me :/
> > > 
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> > 
> > Same thing but no useless XML trash. Description of fields is somewhere in the
> > middle of this document
> > http://www.unicode.org/reports/tr44/
> 
> Nice, TY.
> 
> > > The only questions I have is:
> > > Return value: int or double?
> > 
> > > Should be rational to accurately represent things like "1/5" character ;)
> > I do suspect some simple custom type could do (2 shorts packed in one struct
> > etc.).
> > 
> > > Input is not numeric: -1 or exception?
> > 
> > > -1 is fine I think as this is rather low level (per character) and it's not at all
> > convenient to throw (and then catch).
> 
> The only issue I have with returning -1 is that it is a magic value. The fact
> that there is no unicode for -1 is pure coincidence, and not by design. In
> particular, any attempt to write "if (numericValue(c) < 0) fail" would also be
> wrong because:
> http://unicode.org/cldr/utility/character.jsp?a=0F33
> The TIBETAN DIGIT HALF ZERO returns -0.5
> 
> Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)"
> ?
> 
> ...
> 
> Damn you unicode!

Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
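
As a rough idea of what such a custom type might look like (purely a sketch, with hypothetical names and field widths chosen only so that -1/2, 1/5 and 1e12 all fit):

```d
// Hypothetical rational type for per-character numeric values.
struct NumericValue
{
    long num;    // numerator; wide enough for 1_000_000_000_000
    int den = 0; // denominator; 0 marks "not a numeric character"

    bool isNumeric() const @safe pure nothrow
    {
        return den != 0;
    }

    // Only meaningful when isNumeric is true.
    double toDouble() const @safe pure nothrow
    {
        assert(isNumeric, "character has no numeric value");
        return cast(double) num / den;
    }
}
```

Using a zero denominator for "not numeric" sidesteps the magic -1, at the cost of a 16-byte value rather than two packed shorts.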

--- Comment #12 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 08:04:19 PST ---
(In reply to comment #9)
> int numericValue(dchar c) @safe pure nothrow

What about int->dchar?

We could call it toNumericChar or something, but it would probably have to throw on invalid input? Or can we also return -1? E.g.

char toNumericChar(int i) @safe pure nothrow
{
    return cast(char)((0 <= i && i <= 9) ? (i + '0') : -1);
}

--- Comment #13 from monarchdodra@gmail.com 2012-12-21 08:08:20 PST ---
(In reply to comment #12)
> (In reply to comment #9)
> > int numericValue(dchar c) @safe pure nothrow
> 
> What about int->dchar?
> 
> We could call it toNumericChar or something, but it would probably have to throw on invalid input? Or can we also return -1? E.g.
> 
> char toNumericChar(int i) @safe pure nothrow
> {
>     return cast(char)((0 <= i && i <= 9) ? (i + '0') : -1);
> }

cast(char)-1 is 0xFF, which is char.init, so seems good to me. Although I'd actually write it as "char.init" explicitly in the code, so as to limit any possible confusion.
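
Spelled out, that variant would look something like this (same hypothetical `toNumericChar` name as in the quoted snippet):

```d
// Same behaviour as the quoted version, but the failure value is named
// explicitly instead of relying on cast(char)(-1) being 0xFF.
char toNumericChar(int i) @safe pure nothrow
{
    return (0 <= i && i <= 9) ? cast(char)(i + '0') : char.init;
}
```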

--- Comment #14 from monarchdodra@gmail.com 2012-12-21 08:11:21 PST ---
(In reply to comment #11)
> 
> Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.

Really? According to:

http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value

They only go from
-0.5 // TIBETAN DIGIT HALF ZERO
to
100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND

So I figured we were in the range where there is a perfect "int <=> double" correlation. If this is not the case...

--- Comment #15 from bearophile_hugs@eml.cc 2012-12-21 09:54:26 PST ---
Having functions in std.ascii (and elsewhere) seems acceptable. But I think the names of such functions shouldn't be too long.


to!int raises exceptions. Returning -1 in case of errors seems likely to cause some problems. One common use case for the char->int conversion:

auto s = "123x456";
auto digits = s.map!numericValue().array();

Now I have to scan digits again looking for any -1.
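
A runnable version of that scenario, assuming a stand-in `asciiDigitValue` (hypothetical, ASCII only) in place of the proposed function:

```d
// run with: dmd -unittest -main -run file.d
import std.algorithm : canFind, map;
import std.array : array;

// Hypothetical stand-in: 0..9 for ASCII digits, -1 otherwise.
int asciiDigitValue(dchar c) @safe pure nothrow
{
    return (c >= '0' && c <= '9') ? cast(int)(c - '0') : -1;
}

unittest
{
    auto s = "123x456";
    auto digits = s.map!asciiDigitValue().array();
    // The bad character silently becomes -1 in the middle of the result...
    assert(digits == [1, 2, 3, -1, 4, 5, 6]);
    // ...so a second pass is needed to notice it.
    assert(digits.canFind(-1));
}
```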

--- Comment #16 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 10:10:37 PST ---
(In reply to comment #15)
> Having functions in std.ascii (and elsewhere) seems acceptable. But I think the names of such functions shouldn't be too long.
> 
> 
> to!int raises exceptions. Returning -1 in case of errors seems likely to cause some problems. One common use case for the char->int conversion:
> 
> auto s = "123x456";
> auto digits = s.map!numericValue().array();
> 
> Now I have to scan digits again looking for any -1.

*But* you can wrap it inside a function which throws on -1 (pseudocode):

auto s = "123x456";
auto thr = (int a) { if (a == -1) throw new ConvException("not a digit"); return a; };
auto digits = s.map!numericValue().map!thr().array();

Whereas if it threw to begin with you're forced to catch exceptions.
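
For completeness, a compilable sketch of the same idea. `asciiDigitValue` and `orThrow` are hypothetical helpers; only `ConvException`, `map`, `array` and `assertThrown` come from Phobos.

```d
// run with: dmd -unittest -main -run file.d
import std.algorithm : map;
import std.array : array;
import std.conv : ConvException;
import std.exception : assertThrown;

// Hypothetical stand-in: 0..9 for ASCII digits, -1 otherwise.
int asciiDigitValue(dchar c) @safe pure nothrow
{
    return (c >= '0' && c <= '9') ? cast(int)(c - '0') : -1;
}

// Turns the -1 sentinel into an exception for callers who want one.
int orThrow(int v)
{
    if (v == -1)
        throw new ConvException("not a digit");
    return v;
}

unittest
{
    // Valid input passes through unchanged.
    assert("123456".map!asciiDigitValue.map!orThrow.array == [1, 2, 3, 4, 5, 6]);
    // Invalid input throws once the lazy range is evaluated by array().
    assertThrown!ConvException("123x456".map!asciiDigitValue.map!orThrow.array);
}
```

The nothrow core stays usable on its own, and the throwing behaviour is opt-in.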

--- Comment #17 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-12-21 10:20:15 PST ---
(In reply to comment #14)
> (In reply to comment #11)
> > 
> > Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
> 
> Really? According to:
> 
> http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value
> 
> They only go from
> -0.5 // TIBETAN DIGIT HALF ZERO
> to
> 100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND
> 
> So I figured we were in the range where there is a perfect "int <=> double" correlation. If this is not the case...

You missed the nice and cool 1.0e12 !

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AnumericValue%3D1.0E12%3A%5D&g=

--- Comment #18 from bearophile_hugs@eml.cc 2012-12-21 10:24:12 PST ---
(In reply to comment #16)

> Whereas if it threw to begin with you're forced to catch exceptions.

There is no perfect solution. Exceptions are safer than error codes because if you forget to test for a negative result, your program stops. On the other hand exceptions are less efficient, less handy to use in nothrow functions, and often require some try-catch wrapping.

In this enhancement request I was originally asking for an overload of to!(), which means a solution that throws exceptions when the input is wrong.

Efficiency is not a significant problem for me here, because where I need to convert char digits to numerical digits with maximum efficiency I use a '0' subtraction (or a vectorized version of it). So with this overload of to!() I was looking for safety.

--- Comment #19 from monarchdodra@gmail.com 2012-12-21 10:53:11 PST ---
(In reply to comment #17)
> (In reply to comment #14)
> > (In reply to comment #11)
> > > 
> > > Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
> > 
> > Really? According to:
> > 
> > http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value
> > 
> > They only go from
> > -0.5 // TIBETAN DIGIT HALF ZERO
> > to
> > > 100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND
> > > 
> > > So I figured we were in the range where there is a perfect "int <=> double" correlation. If this is not the case...
> 
> You missed the nice and cool 1.0e12 !
> 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AnumericValue%3D1.0E12%3A%5D&g=

Well, that still fits in both a long and a double with no loss, so we're still good. Crisis averted.
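
A quick sanity check of that claim (`roundTripsThroughDouble` is just a throwaway helper name). Note that 1e12 does not fit in a 32-bit int, so it is long, not int, that pairs up with double here.

```d
// run with: dmd -unittest -main -run file.d

// True when v survives a conversion to double and back unchanged.
bool roundTripsThroughDouble(long v) @safe pure nothrow
{
    immutable double d = v;
    return cast(long) d == v;
}

unittest
{
    enum long trillion = 1_000_000_000_000L;
    static assert(trillion > int.max);                   // too big for int
    static assert(trillion < (1L << double.mant_dig));   // below 2^53
    assert(roundTripsThroughDouble(trillion));
    assert(-0.5 * 2 == -1.0);                            // -0.5 is exact in binary
}
```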
