[Issue 11017] New: std.string/uni.toLower is very slow
September 12, 2013
http://d.puremagic.com/issues/show_bug.cgi?id=11017

           Summary: std.string/uni.toLower is very slow
           Product: D
           Version: D2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Phobos
        AssignedTo: nobody@puremagic.com
        ReportedBy: peter.alexander.au@gmail.com


--- Comment #0 from Peter Alexander <peter.alexander.au@gmail.com> 2013-09-12 10:52:33 PDT ---
char[] s = new char[10_000_000];
s[] = 'A';
auto s2 = s.toLower;

This takes 4.3 seconds on my machine.


char[] s = new char[10_000_000];
s[] = 'A';
auto s2 = s.map!toLower.to!string;

This only takes 1.1 seconds.

Looking at the code for std.uni.toLower, it appears the string is constructed using repeated ~=. It should use an Appender of some sort.
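
A minimal sketch of the Appender idea, not the actual Phobos code. It assumes a per-code-point conversion, so it ignores one-to-many mappings, and `lowerViaAppender` is a made-up name:

```d
import std.array : appender;
import std.uni : toLower;

// Hypothetical helper: lowercases code point by code point and collects
// the output in an Appender instead of growing a string with ~=.
string lowerViaAppender(const(char)[] s)
{
    auto app = appender!string();
    foreach (dchar c; s)        // foreach over dchar decodes the UTF-8 input
        app.put(toLower(c));    // put(dchar) re-encodes the code point as UTF-8
    return app.data;
}
```

Appender tracks its own capacity, so each `put` should avoid the per-append capacity lookup that repeated `~=` on a slice goes through.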

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
September 12, 2013

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com


--- Comment #1 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-12 11:59:08 PDT ---
(In reply to comment #0)
> char[] s = new char[10_000_000];
> s[] = 'A';
> auto s2 = s.toLower;
> 
> This takes 4.3 seconds on my machine.
> 
> 
> char[] s = new char[10_000_000];
> s[] = 'A';
> auto s2 = s.map!toLower.to!string;
> 
> This only takes 1.1 seconds.
> 

There are 2 things to consider here. First, the 2nd one is not correct in general (1 code point can map to many, e.g. German sharp S).
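
A small sketch of why a per-code-point map cannot be correct in general: it always yields exactly one output code point per input code point, so a full mapping that expands one code point into several has no way to come out of it. `mappedCodePoints` is a made-up helper name:

```d
import std.algorithm : map;
import std.range : walkLength;
import std.uni : toUpper;

// Hypothetical helper: counts the code points produced by a
// per-code-point toUpper, in the style of s.map!toUpper.
size_t mappedCodePoints(string s)
{
    // Strings iterate as dchar, so the lambda receives one code point
    // and returns one code point.
    return s.map!(c => toUpper(c)).walkLength;
}
```

Whatever the case tables contain, `mappedCodePoints("ß")` is 1, while an expansion to "SS" would need 2.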

> Looking at the code for std.uni.toLower, it appears the string is constructed using repeated ~=. It should use an Appender of some sort.

This could indeed be fixed. I suspect that putting an optimistic reserve(original.length) there would work even better. See also issue 10864: http://d.puremagic.com/issues/show_bug.cgi?id=10864
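
Roughly what the optimistic reserve could look like, as a sketch only. `lowerWithReserve` is a made-up name, and the per-code-point loop is an assumption that stands in for the real conversion:

```d
import std.exception : assumeUnique;
import std.uni : toLower;

// Hypothetical helper: keeps the ~= loop but reserves the input length
// up front, guessing that the output is about as long as the input.
string lowerWithReserve(const(char)[] s)
{
    char[] result;
    result.reserve(s.length);   // one allocation in the common case
    foreach (dchar c; s)        // decode UTF-8 one code point at a time
        result ~= toLower(c);   // appending a dchar to char[] encodes it as UTF-8
    return assumeUnique(result); // no other reference to the buffer exists
}
```

The reserve is only a guess: if a mapping makes the output longer than the input, `~=` still grows the buffer as usual.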

September 12, 2013

--- Comment #2 from Peter Alexander <peter.alexander.au@gmail.com> 2013-09-12 12:45:45 PDT ---
(In reply to comment #1)
> There are 2 things to consider here. First, the 2nd one is not correct in general (1 code point can map to many, e.g. German sharp S).

Good point, although std.uni.toUpper doesn't handle it either :-)

assert("ß".toUpper == "ß"); // passes

September 12, 2013

--- Comment #3 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-12 12:50:37 PDT ---
(In reply to comment #2)
> (In reply to comment #1)
> > There are 2 things to consider here. First, the 2nd one is not correct in general (1 code point can map to many, e.g. German sharp S).
> 
> Good point, although std.uni.toUpper doesn't handle it either :-)
> 
> assert("ß".toUpper == "ß"); // passes

toLower will do. Sharp S is capital ;)

September 12, 2013

--- Comment #4 from Peter Alexander <peter.alexander.au@gmail.com> 2013-09-12 12:52:31 PDT ---
(In reply to comment #3)
> toLower will do. Sharp S is capital ;)

assert("ß".toLower == "ß");
assert("ß".toUpper == "ß");

Both pass.

September 12, 2013

--- Comment #5 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-12 14:01:05 PDT ---
(In reply to comment #4)
> (In reply to comment #3)
> > toLower will do. Sharp S is capital ;)
> 
> assert("ß".toLower == "ß");
> assert("ß".toUpper == "ß");
> 
> Both pass.

Something wicked has happened.
I see that I messed up toUpper in the table generator while introducing
toTitleCase (which isn't even exposed yet!). toLower is fine; toUpper is
apparently broken in half of the cases.
How I missed that, I have no idea... I've got to expand the test coverage
around toLower/toUpper.

September 12, 2013

--- Comment #6 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-12 14:07:17 PDT ---
(In reply to comment #5)

P.S. And there are two kinds of sharp S: \u1E9E (capital) and \u00DF (small).
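
A quick sketch of one practical consequence, assuming UTF-8 strings: the two code points do not encode to the same number of bytes, so a case conversion between them changes the string's length. The names `capitalSharpS`, `smallSharpS` and `utf8Units` are made up for illustration:

```d
import std.utf : codeLength;

enum dchar capitalSharpS = '\u1E9E'; // LATIN CAPITAL LETTER SHARP S
enum dchar smallSharpS   = '\u00DF'; // LATIN SMALL LETTER SHARP S

// Hypothetical helper: number of UTF-8 code units needed to encode c.
size_t utf8Units(dchar c)
{
    return codeLength!char(c);
}
```

Here `utf8Units(capitalSharpS)` is 3 and `utf8Units(smallSharpS)` is 2, which is one reason a reserve of the original length can only ever be an estimate.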
