December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #10 from monarchdodra@gmail.com 2012-12-21 07:53:36 PST ---
(In reply to comment #5)
>
> I'm wrapping up a revamp of std.uni that makes it a piece of cake to create character sets. Maps are converted to multi-staged tables that are faster than binary search on a large set. I'd suggest waiting a bit on it (so as not to duplicate work) and introducing only the std.ascii version as the most useful.
>
> The ongoing polishing, fixing and testing against ICU is going on here: https://github.com/blackwhale/gsoc-bench-2012

OK. The thing I was having trouble with, though, is that the existing binary search returns a bool, whereas I need the actual entry, so I can do "value - entry[0]", e.g.:

//----
static immutable dchar[2][] table1 = [
    [ 0x0030, 0x0039], //
    [ 0x0660, 0x0669], //ARABIC-INDIC
    [ 0x06F0, 0x06F9], //EXTENDED ARABIC-INDIC

...
//---
That's because all the entries in [Nd] are consecutive numerals starting at 0. I can also cram in a select couple of entries from [Nl] and [No] that also use this scheme.

So if I have the code point 0x0665 (the ARABIC-INDIC numeral '5'), I'd want to find [ 0x0660, 0x0669], and then "return 0x0665 - 0x0660".

Well, I don't need the entire pair, but at least the lhs of the pair.

If you could keep that in mind during your re-write. Or not. Just throwing it out there.

For all other entries in [Nl] and [No], I'd have:
static immutable dchar[2][] table1 = [
    [ 0x216D, 100], //ROMAN NUMERAL ONE HUNDRED

So that's just a basic dictionary. But I don't think you can statically allocate an AA. So yeah, just throwing that in your direction too.

> > The file is too large for std.xml to handle, so it's back to C++ for me :/
> >
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> Same thing but no useless XML trash. Description of fields is somewhere in the middle of this document
> http://www.unicode.org/reports/tr44/

Nice, TY.

> > The only questions I have are:
> > Return value: int or double?
>
> Should be rational to accurately represent things like "1/5" character ;)
> I do suspect some simple custom type could do (2 shorts packed in one struct etc.).
>
> > Input is not numeric: -1 or exception?
>
> -1 is fine I think as this is rather low level (per character) and it's not at all convenient to throw (and then catch).

The only issue I have with returning -1 is that it is a magic value. The fact that no Unicode character has the value -1 is pure coincidence, and not by design. In particular, any attempt to write "if (numericValue(c) < 0) fail" would also be wrong because:
http://unicode.org/cldr/utility/character.jsp?a=0F33
The TIBETAN DIGIT HALF ZERO returns -0.5

Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)" ?

...

Damn you unicode!
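The range-table lookup described in comment #10 can be sketched as follows. This is only an illustration under assumptions: the names `digitRanges` and `decimalDigitValue` are hypothetical, the table holds just the three ranges quoted in the comment, and the -1 return for non-digits is one of the options still under discussion in this thread, not a settled design.

```d
import std.stdio : writeln;

// Hypothetical table: each row is the [first, last] code point of a run of
// ten consecutive decimal digits. Rows must stay sorted by first code point.
static immutable dchar[2][] digitRanges = [
    [0x0030, 0x0039], // ASCII digits
    [0x0660, 0x0669], // ARABIC-INDIC
    [0x06F0, 0x06F9], // EXTENDED ARABIC-INDIC
];

// Binary search that locates the containing range (not just a bool) and
// returns the offset from its start. Returns -1 if c is in no range.
int decimalDigitValue(dchar c) @safe pure nothrow @nogc
{
    size_t lo = 0;
    size_t hi = digitRanges.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (c < digitRanges[mid][0])
            hi = mid;
        else if (c > digitRanges[mid][1])
            lo = mid + 1;
        else
            return cast(int)(c - digitRanges[mid][0]); // "value - entry[0]"
    }
    return -1;
}

void main()
{
    writeln(decimalDigitValue('\u0665')); // 0x0665 - 0x0660, prints 5
    writeln(decimalDigitValue('7'));      // prints 7
    writeln(decimalDigitValue('x'));      // prints -1
}
```

Only the left-hand side of each pair is used to compute the value; the right-hand side is needed solely to decide whether the code point falls inside the range.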
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #11 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-12-21 08:00:56 PST ---
(In reply to comment #10)
> (In reply to comment #5)
> >
> > I'm wrapping up a revamp of std.uni that makes it a piece of cake to create character sets. Maps are converted to multi-staged tables that are faster than binary search on a large set. I'd suggest waiting a bit on it (so as not to duplicate work) and introducing only the std.ascii version as the most useful.
> >
> > The ongoing polishing, fixing and testing against ICU is going on here: https://github.com/blackwhale/gsoc-bench-2012
>
> OK. The thing I was having trouble with, though, is that the existing binary search returns a bool, whereas I need the actual entry, so I can do "value - entry[0]", e.g.:
>
> //----
> static immutable dchar[2][] table1 = [
>     [ 0x0030, 0x0039], //
>     [ 0x0660, 0x0669], //ARABIC-INDIC
>     [ 0x06F0, 0x06F9], //EXTENDED ARABIC-INDIC
>
> ...
> //---
> That's because all the entries in [Nd] are consecutive numerals starting at 0.
> I can also cram in a select couple of entries from [Nl] and [No] that also use this scheme.

Sometimes I was able to abuse the natural format of the data and sometimes I failed. What proved to be quite good is varying the sizes of the multi-staged table to match "periods" in the data. In the end, if the data has a lot of common "rows", a multi-staged table of a certain size per stage is bound to hit a sweet spot.

> So if I have the code point 0x0665 (the ARABIC-INDIC numeral '5'), I'd want to find [ 0x0660, 0x0669], and then "return 0x0665 - 0x0660".
>
> Well, I don't need the entire pair, but at least the lhs of the pair.
>
> If you could keep that in mind during your re-write. Or not. Just throwing it out there.
>
> For all other entries in [Nl] and [No], I'd have:
> static immutable dchar[2][] table1 = [
>     [ 0x216D, 100], //ROMAN NUMERAL ONE HUNDRED
>
> So that's just a basic dictionary. But I don't think you can statically allocate an AA. So yeah, just throwing that in your direction too.

Well, an AA is a fat pig w.r.t. RAM usage. But thanks anyway.

> > > The file is too large for std.xml to handle, so it's back to C++ for me :/
> > >
> > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> >
> > Same thing but no useless XML trash. Description of fields is somewhere in the middle of this document
> > http://www.unicode.org/reports/tr44/
>
> Nice, TY.
>
> > > The only questions I have are:
> > > Return value: int or double?
> >
> > Should be rational to accurately represent things like "1/5" character ;)
> > I do suspect some simple custom type could do (2 shorts packed in one struct etc.).
> >
> > > Input is not numeric: -1 or exception?
> >
> > -1 is fine I think as this is rather low level (per character) and it's not at all convenient to throw (and then catch).
>
> The only issue I have with returning -1 is that it is a magic value. The fact that no Unicode character has the value -1 is pure coincidence, and not by design. In particular, any attempt to write "if (numericValue(c) < 0) fail" would also be wrong because:
> http://unicode.org/cldr/utility/character.jsp?a=0F33
> The TIBETAN DIGIT HALF ZERO returns -0.5
>
> Do we *really* want to standardize the syntax of "if (numericValue(c) < -0.7)" ?
>
> ...
>
> Damn you unicode!

Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #12 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 08:04:19 PST ---
(In reply to comment #9)
> int numericValue(dchar c) @safe pure nothrow

What about int->dchar?

We could call it toNumericChar or something, but it would probably have to throw on invalid input? Or can we also return -1? E.g.

char toNumericChar(int i) @safe pure nothrow
{
    return cast(char)((0 <= i && i <= 9) ? (i + '0') : -1);
}
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #13 from monarchdodra@gmail.com 2012-12-21 08:08:20 PST ---
(In reply to comment #12)
> (In reply to comment #9)
> > int numericValue(dchar c) @safe pure nothrow
>
> What about int->dchar?
>
> We could call it toNumericChar or something, but it would probably have to throw on invalid input? Or can we also return -1? E.g.
>
> char toNumericChar(int i) @safe pure nothrow
> {
>     return cast(char)((0 <= i && i <= 9) ? (i + '0') : -1);
> }

-1 cast to char is char.init (0xFF), so this seems good to me. I'd actually write it as "char.init" explicitly in the code, though, to limit any possible confusion.
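For illustration, the variant suggested in comment #13 might look like the sketch below. `toNumericChar` is the hypothetical name proposed in this thread and is not an existing Phobos function; the only change from comment #12 is that the failure value is spelled `char.init` instead of a cast of -1.

```d
// Hypothetical helper following the thread's proposal; not part of Phobos.
// Maps 0..9 to '0'..'9' and anything else to char.init (0xFF).
char toNumericChar(int i) @safe pure nothrow @nogc
{
    return (0 <= i && i <= 9) ? cast(char)('0' + i) : char.init;
}

void main()
{
    assert(toNumericChar(7) == '7');
    assert(toNumericChar(10) == char.init);
    // The two spellings are the same byte, 0xFF.
    assert(char.init == cast(char)(-1));
}
```

Since 0xFF is never a valid UTF-8 code unit, callers can test the result against `char.init` without colliding with a real digit.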
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #14 from monarchdodra@gmail.com 2012-12-21 08:11:21 PST ---
(In reply to comment #11)
>
> Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.

Really? According to:

http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value

They only go from
-0.5 // TIBETAN DIGIT HALF ZERO
to
100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND

So I figured we were in the range where there is a perfect "int <=> double" correspondence. If this is not the case...
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #15 from bearophile_hugs@eml.cc 2012-12-21 09:54:26 PST ---
Having functions in std.ascii (and elsewhere) seems acceptable. But I think the names of such functions shouldn't be too long.

to!int raises exceptions. Returning -1 in case of errors seems likely to cause some problems. One common use case for the char->int conversion:

auto s = "123x456";
auto digits = s.map!numericValue().array();

Now I have to scan digits again looking for any -1.
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #16 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 10:10:37 PST ---
(In reply to comment #15)
> Having functions in std.ascii (and elsewhere) seems acceptable. But I think the names of such functions shouldn't be too long.
>
>
> to!int raises exceptions. Returning -1 in case of errors seems likely to cause some problems. One common use case for the char->int conversion:
>
> auto s = "123x456";
> auto digits = s.map!numericValue().array();
>
> Now I have to scan digits again looking for any -1.

*But* you can wrap it inside a function which throws on -1 (pseudocode):

auto s = "123x456";
auto thr = (a) => a == -1 ? throw new ConvException("not a digit") : a;
auto digits = s.map!numericValue().map!thr().array();

Whereas if it threw to begin with you're forced to catch exceptions.
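The layering argued for in comment #16 can be sketched in compilable form. This is a minimal sketch under assumptions: `asciiDigitValue` and `checkedDigitValue` are hypothetical names, and the first is an ASCII-only stand-in for the sentinel-returning function under discussion, not the proposed library function itself.

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.conv : ConvException;
import std.exception : assertThrown;

// Hypothetical stand-in for the low-level function discussed in the thread:
// returns the digit value, or -1 for any non-digit. It can stay nothrow.
int asciiDigitValue(dchar c) @safe pure nothrow @nogc
{
    return ('0' <= c && c <= '9') ? cast(int)(c - '0') : -1;
}

// Throwing wrapper layered on top of the sentinel-returning function.
int checkedDigitValue(dchar c) @safe pure
{
    immutable v = asciiDigitValue(c);
    if (v == -1)
        throw new ConvException("not an ASCII digit");
    return v;
}

void main()
{
    // All digits: the wrapper is transparent.
    assert("123456".map!checkedDigitValue.array == [1, 2, 3, 4, 5, 6]);
    // A stray 'x' surfaces as an exception instead of a -1 hidden in the result.
    assertThrown!ConvException("123x456".map!checkedDigitValue.array);
}
```

The design point is the direction of the wrapping: a throwing function is easy to build from a sentinel-returning one, while the reverse requires a try-catch around every call.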
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #17 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-12-21 10:20:15 PST ---
(In reply to comment #14)
> (In reply to comment #11)
> >
> > Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
>
> Really? According to:
>
> http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value
>
> They only go from
> -0.5 // TIBETAN DIGIT HALF ZERO
> to
> 100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND
>
> So I figured we were in the range where there is a perfect "int <=> double" correspondence. If this is not the case...

You missed the nice and cool 1.0e12!

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AnumericValue%3D1.0E12%3A%5D&g=
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #18 from bearophile_hugs@eml.cc 2012-12-21 10:24:12 PST ---
(In reply to comment #16)
> Whereas if it threw to begin with you're forced to catch exceptions.

There is no perfect solution. Exceptions are safer than error codes because if you forget to test for a negative result, your program stops. On the other hand, exceptions are less efficient, less handy to use in nothrow functions, and often require some try-catch wrapping.

In this enhancement request I was originally asking for an overload of to!(), which means a solution that throws exceptions when the input is wrong.

Efficiency is not a significant problem for me here, because where I need to convert char digits to numerical digits with maximum efficiency I use a '0' subtraction (or a vectorized version of it). So with this overload of to!() I was looking for safety.
December 21, 2012 [Issue 5543] to!int to see a char as a single-char string | ||||
---|---|---|---|---|
| ||||
Posted in reply to bearophile_hugs@eml.cc | http://d.puremagic.com/issues/show_bug.cgi?id=5543

--- Comment #19 from monarchdodra@gmail.com 2012-12-21 10:53:11 PST ---
(In reply to comment #17)
> (In reply to comment #14)
> > (In reply to comment #11)
> > >
> > > Aye, and given there are things like "1e12" I don't think packing it would work any better... some kind of custom type is required.
> >
> > Really? According to:
> >
> > http://unicode.org/cldr/utility/properties.jsp?a=Numeric_Value#Numeric_Value
> >
> > They only go from
> > -0.5 // TIBETAN DIGIT HALF ZERO
> > to
> > 100_000 // ROMAN NUMERAL ONE HUNDRED THOUSAND
> >
> > So I figured we were in the range where there is a perfect "int <=> double" correspondence. If this is not the case...
>
> You missed the nice and cool 1.0e12!
>
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AnumericValue%3D1.0E12%3A%5D&g=

Well, that still fits in both a long and a double with no loss, so we're still good. Crisis averted.
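The claim in comment #19 is easy to check. The sketch below (the constant name `trillion` is just for illustration) confirms that 1.0e12 overflows a 32-bit int but is represented exactly by both a long and a double.

```d
void main()
{
    enum long trillion = 1_000_000_000_000; // the 1.0e12 numeric value

    static assert(trillion <= long.max);
    static assert(trillion > int.max); // too big for a 32-bit int

    // A double has a 53-bit significand (double.mant_dig), so every
    // integer below 2^53 is represented exactly.
    static assert(trillion < (1L << double.mant_dig));

    immutable double d = trillion;
    assert(cast(long) d == trillion); // round trip loses nothing
    assert(d == 1e12);
}
```

This only covers the integral end of the range; a fractional value such as -0.5 still needs a double (or a rational type) rather than a long.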
Copyright © 1999-2021 by the D Language Foundation