[Issue 10203] New: std.string.toUpperInPlace is... not in place

May 29, 2013

monarchdodra@gmail.com

May 29, 2013

monarchdodra@gmail.com

May 30, 2013

monarchdodra@gmail.com

May 30, 2013

monarchdodra@gmail.com

Sep 21, 2013

Dmitry Olshansky

Sep 21, 2013

bearophile_hugs@eml.cc

Sep 21, 2013

Dmitry Olshansky

Sep 21, 2013

w0rp

http://d.puremagic.com/issues/show_bug.cgi?id=10203 Summary: std.string.toUpperInPlace is... not in place Product: D Version: unspecified Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Phobos AssignedTo: nobody@puremagic.com ReportedBy: monarchdodra@gmail.com --- Comment #0 from monarchdodra@gmail.com 2013-05-29 09:53:34 PDT --- As reported by manu in his talk. All in the title. toUpperInPlace will actually allocated a new string. As a matter of fact, it will allocate, a LOT: a new string per char. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 --- Comment #1 from monarchdodra@gmail.com 2013-05-29 12:18:46 PDT --- Fixed here https://github.com/D-Programming-Language/phobos/pull/1322 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

May 30, 2013

[Issue 10203] std.string.toUpperInPlace is... not in place

Posted by monarchdodra@gmail.com
in reply to monarchdodra@gmail.com

Permalink

monarchdodra@gmail.com

Posted in reply to monarchdodra@gmail.com

Permalink

http://d.puremagic.com/issues/show_bug.cgi?id=10203



--- Comment #2 from monarchdodra@gmail.com 2013-05-30 02:13:50 PDT ---
(In reply to comment #1)
> Fixed here
> 
> https://github.com/D-Programming-Language/phobos/pull/1322

Turns out, upon investigation, that my fix is incorrect. What is more problematic though, is that the review revealed that there are cases where toUpperInPlace simply *cannot* be inplace: there characters out there, whose UTF representation is not the same length for upper case and lower case.

What this means is that the lower case representation could be longer than the original length.

How could this be done in place?

It also reveals more problems: If the resulting string is actually smaller, than slices that aliased the same data will now reference garbage values. For example:

'İ', which is u0130, and a two byte encoding will become 'i'.

So when I take the string "İa":
auto a = "\xC4\xB0\x61";
auto b = a;
toLowerInPlace(a);

//Now:
//a == "\x69\xB0"
//b == "\x69\xB0\x61" Oops: Trailing code unit :/

Or: say, I have "aİa", and want to reduce in place "İ"
auto a = "\x61\xC4\xB0\x61";
auto b = a[1 .. 3];
toLowerInPlace(b);

//Now:
//b == "\x69" //OK
//a == "\x61\x69\xB0\x61" //Wait, what is that B0 doinh here?

--------
It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior.

I think it should either be tagged with a massive red unsafe, or deprecated.

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 monarchdodra@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- See Also| |http://d.puremagic.com/issu | |es/show_bug.cgi?id=9629 AssignedTo|monarchdodra@gmail.com |nobody@puremagic.com --- Comment #3 from monarchdodra@gmail.com 2013-05-30 02:19:53 PDT --- (In reply to comment #2)> So when I take the string "İa": > auto a = "\xC4\xB0\x61"; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x69\xB0" > //b == "\x69\xB0\x61" Oops: Trailing code unit :/ Wait, this example is wrong, corrected as: take the string "aİ" auto a = "\x61\xC4\xB0"; auto b = a; toLowerInPlace(a); //Now: //a == "\x61\x69" //b == "\x61\x69\xB0" Oops: Trailing code unit :/ Sorry. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 Dmitry Olshansky <dmitry.olsh@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |dmitry.olsh@gmail.com --- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-21 13:58:19 PDT --- (In reply to comment #3) > Wait, this example is wrong, corrected as: > > take the string "aİ" > auto a = "\x61\xC4\xB0"; auto a = "\x61\xC4\xB0".dup; > auto b = a; > toLowerInPlace(a); > > //Now: > //a == "\x61\x69" > //b == "\x61\x69\xB0" Oops: Trailing code unit :/ > > Sorry. I belive we now should have solid treatment of toUpper/toLower (pending a bugfix in the works). For me this gives now: //a == 61 69 CC 87 //b == 61 C4 B0 i.e. toLowerInPlace fails to do this in place because resulting length increases. Should it try to extend if a.capacity allows it? (I bet it has about 15 bytes of storage) -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 bearophile_hugs@eml.cc changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |bearophile_hugs@eml.cc --- Comment #5 from bearophile_hugs@eml.cc 2013-09-21 14:13:19 PDT --- (In reply to comment #2) > It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. > > I think it should either be tagged with a massive red unsafe, or deprecated. Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 --- Comment #6 from Dmitry Olshansky <dmitry.olsh@gmail.com> 2013-09-21 14:26:34 PDT --- (In reply to comment #5) > (In reply to comment #2) > > > It seems to me that, overall, toLowerInPlace is a function that is broken, that cannot respect the specs it promises, and violates the principal of least surprise in regards to behavior. > > > > I think it should either be tagged with a massive red unsafe, or deprecated. > > Even if it's impossible for Unicode strings, I'd like to keep a version of it for just arrays of ASCII chars, that is a common enough use case. It also works quite well for UTF-16. And it does now. And I would have kept it if only because of backwards compatibility perspective. There is no other primitive yet, but a version(s) with output range for all string transformations is something to look forward to. The only question is whether or not should this function try to use extra capacity beyond the length if it's present (that would make InPlace suffix look saner). -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

http://d.puremagic.com/issues/show_bug.cgi?id=10203 w0rp <devw0rp@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |devw0rp@gmail.com --- Comment #7 from w0rp <devw0rp@gmail.com> 2013-09-21 14:35:51 PDT --- It may be worth considering presenting only what is possible, so offer toUpperInPlace or toLowerInPlace only on ASCII data. As you say, Unicode strings change in byte length dramatically, but the ASCII range stays the same. So I think it would be worth offering functions with different names, or perhaps with locale template arguments. (Locale.ascii, or whatever.) Just so there can be a guarantee of something never allocating when you want to avoid that, but still let you get at a "almost there" function when you want that. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: -------

Forums