[Issue 5543] New: to!int to see a char as a single-char string
February 07, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5543

           Summary: to!int to see a char as a single-char string
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: bearophile_hugs@eml.cc


--- Comment #0 from bearophile_hugs@eml.cc 2011-02-07 14:34:44 PST ---
In DMD 2.051 to!int acts as cast(int) on chars:

import std.conv: to;
void main() {
    assert(to!int("1") == 1);
    assert(cast(int)'1' == 49);
    assert(to!int('1') == 49);
}


But I think this is more handy:

import std.conv: to;
void main() {
    assert(to!int("1") == 1);
    assert(cast(int)'1' == 49);
    assert(to!int('1') == 1);
}
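
For comparison, a minimal sketch of two ways to get the requested result without changing to!int, assuming the character is already known to be an ASCII digit. Nothing here is part of the proposal itself.

```d
import std.conv : to;

void main()
{
    char c = '7';

    // Round-trip through a one-character string, which to!int parses as text.
    assert(to!int(to!string(c)) == 7);

    // Subtract the code unit of '0'; only meaningful when c is in '0' .. '9'.
    assert(c - '0' == 7);
}
```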

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
December 18, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543


Andrej Mitrovic <andrej.mitrovich@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |pull
                 CC|                            |andrej.mitrovich@gmail.com
         AssignedTo|nobody@puremagic.com        |andrej.mitrovich@gmail.com


--- Comment #1 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-18 09:58:10 PST ---
https://github.com/D-Programming-Language/phobos/pull/1017

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #2 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 06:37:34 PST ---
@bear: Please see the comments here: https://github.com/D-Programming-Language/phobos/pull/1017

The feature can be implemented but to!() was rejected, so we need to come up with some alternative function names and put them somewhere other than std.conv.

Personally I don't see how people will be expected to find an obscure function name like 'codePointIdx'. This isn't related to Unicode representation at all. There should be no confusion with Unicode when it comes to representing 0-9; it's always the same regardless of encoding.

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543


monarchdodra@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |monarchdodra@gmail.com


--- Comment #3 from monarchdodra@gmail.com 2012-12-21 06:58:42 PST ---
(In reply to comment #2)
> @bear: Please see the comments here: https://github.com/D-Programming-Language/phobos/pull/1017
> 
> The feature can be implemented but to!() was rejected, so we need to come up with some alternative function names and put them somewhere other than std.conv.
> 
> Personally I don't see how people will be expected to find an obscure function name like 'codePointIdx'. This isn't related to Unicode representation at all. There should be no confusion with Unicode when it comes to representing 0-9; it's always the same regardless of encoding.

Well, that's why we have std.ascii, no? For all char operations when we don't care about unicode.

In all fairness, unicode defines "is numeric" (which we already have) and
"numeric value" (which we *should* have).

C# and Java both implement a "getNumericValue" method. Java even implements one taking char, and another taking int (dchar): http://msdn.microsoft.com/en-us/library/system.char.getnumericvalue.aspx http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html

I'd say we should just add:
std.ascii.getNumericValue
std.uni.getNumericValue
(or plain numericValue)

I already wrote the ascii version (easy as pie), and support for the [Nd] group, using a binary search, followed by an offset from the lower bound.
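
The [Nd] approach described above (binary search, then an offset from the lower bound) could look roughly like this. It is only a sketch, not the code referred to in the comment: the table digitZeros and the name decimalDigitValue are made up for illustration, and the table lists just four of the many ten-digit runs in [Nd].

```d
// Hypothetical table: first code point of a few [Nd] digit runs, sorted.
// Each run holds ten consecutive code points for the values 0 through 9.
immutable uint[] digitZeros = [
    0x0030, // ASCII digits
    0x0660, // Arabic-Indic digits
    0x06F0, // Extended Arabic-Indic digits
    0x0966, // Devanagari digits
];

/// Returns the value 0 .. 9 of a decimal digit found in the table, -1 otherwise.
int decimalDigitValue(dchar c) @safe pure nothrow
{
    // Binary search for the last run start that is <= c.
    size_t lo = 0, hi = digitZeros.length;
    while (lo < hi)
    {
        immutable mid = lo + (hi - lo) / 2;
        if (digitZeros[mid] <= c)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return -1; // c sorts before every run

    // The offset from the lower bound is the digit value if it stays inside the run.
    immutable offset = c - digitZeros[lo - 1];
    return offset < 10 ? cast(int) offset : -1;
}

void main()
{
    assert(decimalDigitValue('7') == 7);
    assert(decimalDigitValue('\u0663') == 3); // ARABIC-INDIC DIGIT THREE
    assert(decimalDigitValue('\u096F') == 9); // DEVANAGARI DIGIT NINE
    assert(decimalDigitValue('a') == -1);
    assert(decimalDigitValue(' ') == -1);
}
```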

[Nl] and [No] require a straight-up mapping of code point to value, but I'm still writing the parser that extracts the data from the raw UCD (http://www.unicode.org/Public/6.2.0/ucdxml/).

The file is too large for std.xml to handle, so it's back to C++ for me :/

The only questions I have are:
Return value: int or double?
Input is not numeric: -1 or exception?

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #4 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 07:08:12 PST ---
(In reply to comment #3)
> Well, that's why we have std.ascii, no? For all char operations when we don't care about unicode.
> 
> In all fairness, unicode defines "is numeric" (which we already have) and
> "numeric value" (which we *should* have).

Damn Unicode, why does it need to have 10 different ways to represent something? :)

> The only questions I have are:
> Return value: int or double?

int, because int is implicitly convertible to double, not vice-versa. At least for the ASCII part. If Unicode has code points that represent floating-point values... then I really don't understand what Unicode is about anymore.

> Input is not numeric: -1 or exception?

Hmm.. although exceptions are preferred I think for performance reasons we might consider using -1.

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543


Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com


--- Comment #5 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2012-12-21 07:17:53 PST ---
>Java even implements
> one taking chars, and another taking int (dchar)

That's because Java folks used to have only 16-bit chars. Now true code points are passed in the form of 'int'.

> http://msdn.microsoft.com/en-us/library/system.char.getnumericvalue.aspx http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/Character.html
> 
> I'd say we should just add:
> std.ascii.getNumericValue
> std.uni.getNumericValue
> (or plain numericValue)
> 

Agreed and the name should be numericValue.

> I already wrote the ascii version (easy as pie), and support for the [Nd] group, using a binary search, followed by an offset from the lower bound.
> 
> [Nl] and [No] require a straight-up mapping of code point to value, but I'm still writing the parser that extracts the data from the raw UCD (http://www.unicode.org/Public/6.2.0/ucdxml/).
> 

I'm wrapping up a revamp of std.uni that makes it a piece of cake to create character sets. Maps are converted to multi-staged tables that are faster than a binary search on a large set. I'd suggest waiting a bit for it (so as not to duplicate work) and introducing only the std.ascii version as the most useful.
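
To make the multi-staged table idea concrete, here is a rough two-stage sketch. It is not the layout used in the std.uni revamp; TwoStageTable, build and the 256-entry block size are assumptions chosen for illustration. A lookup costs two array reads instead of a search.

```d
enum blockSize = 256;

// Hypothetical two-stage table mapping a code point to a digit value or -1.
struct TwoStageTable
{
    ubyte[] stage1; // for each high part (c >> 8): which block of stage2 to use
    byte[] stage2;  // concatenated blocks of blockSize values; block 0 is all -1

    byte opIndex(dchar c) const
    {
        immutable hi = c >> 8;
        if (hi >= stage1.length)
            return -1; // beyond the covered range
        return stage2[stage1[hi] * blockSize + (c & 0xFF)];
    }
}

// Builds the table from the first code points of ten-digit runs.
TwoStageTable build(const uint[] digitZeros, uint maxCodePoint)
{
    TwoStageTable t;
    t.stage1 = new ubyte[(maxCodePoint >> 8) + 1]; // all entries start at block 0
    t.stage2 = new byte[blockSize];
    t.stage2[] = -1;                               // shared "no value" block
    foreach (zero; digitZeros)
        foreach (uint i; 0 .. 10)
        {
            immutable c = zero + i;
            immutable hi = c >> 8;
            if (t.stage1[hi] == 0)
            {
                // First value in this 256-code-point range: give it its own block.
                t.stage1[hi] = cast(ubyte)(t.stage2.length / blockSize);
                immutable start = t.stage2.length;
                t.stage2.length += blockSize;
                t.stage2[start .. $] = -1;
            }
            t.stage2[t.stage1[hi] * blockSize + (c & 0xFF)] = cast(byte) i;
        }
    return t;
}

void main()
{
    uint[] zeros = [0x0030, 0x0660, 0x06F0, 0x0966];
    auto table = build(zeros, 0x0FFF);
    assert(table['7'] == 7);
    assert(table['\u06F4'] == 4); // EXTENDED ARABIC-INDIC DIGIT FOUR
    assert(table['a'] == -1);
    assert(table['\u2000'] == -1); // outside the covered range
}
```

Ranges with no numeric characters all share block 0, which is where the space saving over a flat per-code-point array comes from.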

The ongoing polishing, fixing and testing against ICU is going on here: https://github.com/blackwhale/gsoc-bench-2012

> The file is too large for std.xml to handle, so it's back to C++ for me :/
> 
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Same thing but no useless XML trash. Description of fields is somewhere in the
middle of this document
http://www.unicode.org/reports/tr44/

> The only questions I have are:
> Return value: int or double?

Should be rational to accurately represent things like the "1/5" character ;)
I do suspect some simple custom type could do (2 shorts packed in one struct
etc.).

> Input is not numeric: -1 or exception?

-1 is fine I think, as this is rather low level (per character) and it's not at all
convenient to throw (and then catch).
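
A sketch of what the "2 shorts packed in one struct" idea might look like. Everything here is hypothetical: the names Rational and numericValueOf, the handful of code points covered, and the use of a zero denominator as the "not numeric" marker in place of -1.

```d
// Hypothetical return type: a fraction packed into two shorts (4 bytes total).
struct Rational
{
    short num;     // numerator
    short den = 1; // denominator; 0 marks "not numeric" in this sketch

    bool isValid() const @safe pure nothrow { return den != 0; }
    double toDouble() const @safe pure nothrow { return cast(double) num / den; }
}

// Hypothetical lookup covering ASCII digits and three fraction characters only.
Rational numericValueOf(dchar c) @safe pure nothrow
{
    if ('0' <= c && c <= '9')
        return Rational(cast(short)(c - '0'), 1);
    switch (c)
    {
        case '\u00BC': return Rational(1, 4); // VULGAR FRACTION ONE QUARTER
        case '\u00BD': return Rational(1, 2); // VULGAR FRACTION ONE HALF
        case '\u00BE': return Rational(3, 4); // VULGAR FRACTION THREE QUARTERS
        default:       return Rational(0, 0); // not numeric
    }
}

void main()
{
    static assert(Rational.sizeof == 4);
    assert(numericValueOf('5') == Rational(5, 1));
    assert(numericValueOf('\u00BD').toDouble == 0.5);
    assert(!numericValueOf('x').isValid);
}
```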

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #6 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 07:26:08 PST ---
Ok I think there are two enhancements here, one for the simple ascii int->char, char->int, and the other more complicated Unicode implementation which monarch/dmitry know more about.

I think we should split up the Unicode enhancement into a new bugzilla entry since the ASCII one can be implemented right now so this issue can be closed soon.

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543


hsteoh@quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hsteoh@quickfur.ath.cx


--- Comment #7 from hsteoh@quickfur.ath.cx 2012-12-21 07:29:58 PST ---
It would be nice to have a separate issue filed for tracking Unicode support progress. It can maybe include things like issue 9173 too.

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #8 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2012-12-21 07:32:31 PST ---
(In reply to comment #7)
> It would be nice to have a separate issue filed for tracking Unicode support progress. It can maybe include things like issue 9173 too.

Reporters could add "Unicode" into the Keywords box for these types of issues so we can filter them out.

December 21, 2012
http://d.puremagic.com/issues/show_bug.cgi?id=5543



--- Comment #9 from monarchdodra@gmail.com 2012-12-21 07:34:14 PST ---
> Ok I think there are two enhancements here, one for the simple ascii int->char, char->int, and the other more complicated Unicode implementation which monarch/dmitry know more about.
> 
> I think we should split up the Unicode enhancement into a new bugzilla entry since the ASCII one can be implemented right now so this issue can be closed soon.

I'm a bit too busy to do the actual pull, but I wrote code, doc and test for this already.

//----
/++
    If $(D c) is an ASCII digit, returns the
    corresponding numeric value. Returns -1 otherwise.
  +/
int numericValue(dchar c) @safe pure nothrow
{
    return ('0' <= c && c <= '9') ? (c - '0') : -1;
}
unittest
{
    int counter = 0;
    foreach (char c; 0 .. 80)
    {
        if (isDigit(c))
            assert(numericValue(c) == counter++);
        else
            assert(numericValue(c) == -1);
    }
}
//----

Not much, but there is never any reason to do the same work twice...
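
As a usage sketch, here is how a caller might build on the -1 convention. numericValue is repeated (same logic, with an explicit cast) so the example stands alone; parseDigits is a hypothetical helper, and it deliberately ignores overflow.

```d
// Same logic as the numericValue above, repeated so this sketch is self-contained.
int numericValue(dchar c) @safe pure nothrow
{
    return ('0' <= c && c <= '9') ? cast(int)(c - '0') : -1;
}

// Hypothetical helper: parses a run of ASCII digits.
// Returns -1 for an empty string or on any non-digit character.
long parseDigits(string s) @safe pure nothrow
{
    if (s.length == 0)
        return -1;
    long result = 0;
    foreach (char c; s)
    {
        immutable d = numericValue(c);
        if (d < 0)
            return -1; // sentinel check, no try/catch needed
        result = result * 10 + d;
    }
    return result;
}

void main()
{
    assert(parseDigits("2012") == 2012);
    assert(parseDigits("007") == 7);
    assert(parseDigits("12a") == -1);
    assert(parseDigits("") == -1);
}
```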
