Thread overview | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
March 02, 2013 [Issue 9629] New: toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
http://d.puremagic.com/issues/show_bug.cgi?id=9629 Summary: toUpperInPlace doesn't work properly with unicode characters Product: D Version: D2 Platform: All OS/Version: All Status: NEW Severity: normal Priority: P2 Component: Phobos AssignedTo: nobody@puremagic.com ReportedBy: turkeyman@gmail.com --- Comment #0 from Manu <turkeyman@gmail.com> 2013-03-01 19:46:49 PST --- wchar[] test = "hello þ world"w.dup; toUpperInPlace(test[6..7]); assert(test == "hello Þ world"); The unicode letter Thorn is not converted in place. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
March 02, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 Andrej Mitrovic <andrej.mitrovich@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |andrej.mitrovich@gmail.com --- Comment #1 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2013-03-01 20:24:24 PST --- (In reply to comment #0) > wchar[] test = "hello þ world"w.dup; > toUpperInPlace(test[6..7]); > assert(test == "hello Þ world"); > > The unicode letter Thorn is not converted in place. I'm noticing something strange: import std.string; import std.stdio; void main() { wchar[] test = "abþdefg"w.dup; toUpperInPlace(test[0 .. $]); writeln(test); } writes: ABþdefg Note how it stops uppercasing after that character. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
March 05, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 --- Comment #2 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2013-03-04 20:55:59 PST --- This seems like a codegen bug: import std.ascii; import std.conv; import std.stdio; import std.utf; void upper(C)(ref C[] s) { for (size_t i = 0; i < s.length; ) { immutable c = s[i]; if ('a' <= c && c <= 'z') { s[i++] = cast(C) (c - (cast(C)'a' - 'A')); } else if (!std.ascii.isASCII(c)) { size_t j = i; dchar dc = decode(s, j); auto toAdd = to!(C[])(std.uni.toUpper(dc)); s = s[0 .. i] ~ toAdd ~ s[j .. $]; i += toAdd.length; } else { ++i; } } writefln("Inside: %s", s); } void main() { wchar[] s1 = "þ abcdef"w.dup; upper(s1[]); writefln("Outside: %s\n", s1); // Þ ABCDEF } If you change the call "upper(s1[]);" to "upper(s1[0..$]);" you get "Þ abcdef". The assembly looks significantly different between the two calls (removing all writefln calls first). The [] version: __Dmain:; Function begin, communal push ebp ; 0000 _ 55 mov ebp, esp ; 0001 _ 8B. EC sub esp, 8 ; 0003 _ 83. EC, 08 push dword [?_0003] ; 0006 _ FF. 35, 0000001C(segrel) push dword [?_0002] ; 000C _ FF. 35, 00000018(segrel) mov eax, FLAT:_D12TypeInfo_Ayu6__initZ ; 0012 _ B8, 00000000(segrel) push eax ; 0017 _ 50 call __adDupT ; 0018 _ E8, 00000000(rel) mov dword [ebp-8H], eax ; 001D _ 89. 45, F8 mov dword [ebp-4H], edx ; 0020 _ 89. 55, FC lea eax, [ebp-8H] ; 0023 _ 8D. 45, F8 call _D5upper12__T5upperTuZ5upperFKAuZv ; 0026 _ E8, 00000000(rel) xor eax, eax ; 002B _ 31. C0 add esp, 12 ; 002D _ 83. C4, 0C leave ; 0030 _ C9 ret ; 0031 _ C3 ; __Dmain End of function And the [0..$] version: __Dmain:; Function begin, communal push ebp ; 0000 _ 55 mov ebp, esp ; 0001 _ 8B. EC sub esp, 24 ; 0003 _ 83. EC, 18 push ebx ; 0006 _ 53 push dword [?_0003] ; 0007 _ FF. 35, 0000001C(segrel) push dword [?_0002] ; 000D _ FF. 35, 00000018(segrel) mov eax, FLAT:_D12TypeInfo_Ayu6__initZ ; 0013 _ B8, 00000000(segrel) push eax ; 0018 _ 50 call __adDupT ; 0019 _ E8, 00000000(rel) mov dword [ebp-18H], eax ; 001E _ 89. 45, E8 mov dword [ebp-14H], edx ; 0021 _ 89. 55, EC mov ecx, dword [ebp-18H] ; 0024 _ 8B. 4D, E8 mov dword [ebp-10H], ecx ; 0027 _ 89. 4D, F0 cmp ecx, ecx ; 002A _ 39. C9 jbe ?_0417 ; 002C _ 76, 0A mov eax, 37 ; 002E _ B8, 00000025 call _D5upper7__arrayZ ; 0033 _ E8, 00000000(rel) ?_0417: mov ebx, dword [ebp-10H] ; 0038 _ 8B. 5D, F0 mov edx, dword [ebp-14H] ; 003B _ 8B. 55, EC mov eax, dword [ebp-18H] ; 003E _ 8B. 45, E8 mov dword [ebp-8H], ebx ; 0041 _ 89. 5D, F8 mov dword [ebp-4H], edx ; 0044 _ 89. 55, FC lea eax, [ebp-8H] ; 0047 _ 8D. 45, F8 call _D5upper12__T5upperTuZ5upperFKAuZv ; 004A _ E8, 00000000(rel) xor eax, eax ; 004F _ 31. C0 add esp, 12 ; 0051 _ 83. C4, 0C pop ebx ; 0054 _ 5B leave ; 0055 _ C9 ret ; 0056 _ C3 ; __Dmain End of function -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
March 05, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 --- Comment #3 from Manu <turkeyman@gmail.com> 2013-03-04 23:26:13 PST --- At a glance, it looks to me like the problem is this line: s = s[0 .. i] ~ toAdd ~ s[j .. $]; See, it's not overwriting any memory, it's allocating and writing into new memory... that contradicts the 'InPlace' specification. Shouldn't that line be more like: s[i .. j] = toAdd[]; And I don't think there's any reason for the function to receive a 'ref'. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
March 05, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 --- Comment #4 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2013-03-04 23:51:22 PST --- (In reply to comment #3) > At a glance, it looks to me like the problem is this line: > > s = s[0 .. i] ~ toAdd ~ s[j .. $]; > > See, it's not overwriting any memory, it's allocating and writing into new memory... that contradicts the 'InPlace' specification. > > Shouldn't that line be more like: > s[i .. j] = toAdd[]; > > And I don't think there's any reason for the function to receive a 'ref'. Hmm yeah that's the problem. It's kind of odd that a slice which only really lives inside the function is allowed to be passed by ref. What I mean is: void foo(ref int[] a) { } int[] a = [1, 2]; foo(a[0..2]); It seems like this kind of slice should be treated as an rvalue, because ref semantics make no sense in this case as they won't propagate to the call site. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
March 06, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 Andrej Mitrovic <andrej.mitrovich@gmail.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |k.hara.pg@gmail.com --- Comment #5 from Andrej Mitrovic <andrej.mitrovich@gmail.com> 2013-03-06 05:51:40 PST --- (In reply to comment #4) > It seems like this kind of slice should be treated as an rvalue, because ref semantics make no sense in this case as they won't propagate to the call site. See also Kenji's comment about slices: https://github.com/D-Programming-Language/dmd/pull/1343#issuecomment-14482830 @Kenji: Is there a bug opened for slices not being rvalues? -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
May 30, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 monarchdodra@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |monarchdodra@gmail.com See Also| |http://d.puremagic.com/issu | |es/show_bug.cgi?id=10203 --- Comment #6 from monarchdodra@gmail.com 2013-05-30 02:17:28 PDT --- (In reply to comment #0) > wchar[] test = "hello þ world"w.dup; > toUpperInPlace(test[6..7]); > assert(test == "hello Þ world"); > > The unicode letter Thorn is not converted in place. Apart from the fact that currently, toUpperInPlace is not actually InPlace, which means this code does not work, and the whole r-value issue thingy, there is something dreadfully wrong with this code: UTF simply does not work this way, as the upper or lower of a string may not have the same length as the original string. Under this situation, doing what you are doing is aboslutly unsafe, and may end up either not working (forced relocation), or filling the string with garbage. -------- I was under the impression that this bug was mostly about the codegen issue, so I did a new entry specifically for fixing the implementation of toUpperInPlace. You can read in more details the issue there: http://d.puremagic.com/issues/show_bug.cgi?id=10203 -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
September 27, 2013 [Issue 9629] toUpperInPlace doesn't work properly with unicode characters | ||||
---|---|---|---|---|
| ||||
Posted in reply to Manu | http://d.puremagic.com/issues/show_bug.cgi?id=9629 monarchdodra@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEW |RESOLVED Resolution| |FIXED --- Comment #7 from monarchdodra@gmail.com 2013-09-27 07:11:31 PDT --- This was fixed in the new std.uni, and a regression introduced was also just fixed. So closing. -- Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- |
Copyright © 1999-2021 by the D Language Foundation