Thread overview
[Issue 391] New: .sort and .reverse break utf8 encoding
Oct 02, 2006
d-bugmail
Oct 03, 2006
Stewart Gordon
Oct 03, 2006
Derek Parnell
Oct 04, 2006
Walter Bright
.sort and .reverse break utf8 encoding
Oct 04, 2006
Sean Kelly
Oct 05, 2006
Walter Bright
Oct 05, 2006
Lionello Lunesu
Oct 04, 2006
Thomas Kuehne
Oct 10, 2006
d-bugmail
Dec 23, 2006
d-bugmail
Jan 24, 2007
d-bugmail
Apr 21, 2009
d-bugmail
Nov 20, 2012
Walter Bright
Dec 28, 2012
Walter Bright
October 02, 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391

           Summary: .sort and .reverse break utf8 encoding
           Product: D
           Version: unspecified
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: major
          Priority: P2
         Component: DMD
        AssignedTo: bugzilla@digitalmars.com
        ReportedBy: ddparnell@bigpond.com


import std.utf;
import std.stdio;
void main()
{
    char[] a;
    a = "\u3026\u2021\u3061\n";
    writefln("plain");    validate(a);
    writefln("sorted");   validate(a.sort);  // fails
    writefln("reversed"); validate(a.reverse); // fails
}


-- 

October 03, 2006
d-bugmail@puremagic.com wrote:
<snip>
> import std.utf;
> import std.stdio;
> void main()
> {
>     char[] a;
>     a = "\u3026\u2021\u3061\n";
>     writefln("plain");    validate(a);
>     writefln("sorted");   validate(a.sort);  // fails
>     writefln("reversed"); validate(a.reverse); // fails
> }

AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string.  But hmm....
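To illustrate the element-wise behaviour described above, here is a minimal sketch, assuming a current D2 compiler and Phobos rather than the D1-era built-in properties used in the report. The helper name `reverseCodeUnits` is made up for this example. It reverses the raw UTF-8 code units and shows that the result no longer validates.

```d
import std.algorithm.mutation : reverse;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

/// Hypothetical helper: reverses the raw UTF-8 code units (bytes) of `s`,
/// ignoring character boundaries.
char[] reverseCodeUnits(const(char)[] s)
{
    auto bytes = cast(ubyte[]) s.dup; // treat a private copy as plain bytes
    reverse(bytes);                   // in-place reversal, byte by byte
    return cast(char[]) bytes;
}

void main()
{
    string text = "\u3026\u2021\u3061\n";
    validate(text); // the original is valid UTF-8

    // After reversal a continuation byte appears where a lead byte is
    // expected, so validation throws.
    assertThrown!UTFException(validate(reverseCodeUnits(text)));
}
```

Each of the three non-ASCII characters in the test string occupies three bytes, so reversing byte by byte puts the trailing bytes of every sequence in front of its lead byte.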

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:-@ C++@ a->--- UB@ P+ L E@ W++@ N+++ o K-@ w++@ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on the 'group where everyone may benefit.
October 03, 2006
On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:

> d-bugmail@puremagic.com wrote:
> <snip>
>> import std.utf;
>> import std.stdio;
>> void main()
>> {
>>     char[] a;
>>     a = "\u3026\u2021\u3061\n";
>>     writefln("plain");    validate(a);
>>     writefln("sorted");   validate(a.sort);  // fails
>>     writefln("reversed"); validate(a.reverse); // fails
>> }
> 
> AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string.  But hmm....

Yes, I realize that but it makes Walter's statements that char[] is all we need and we do not need a 'string' a bit weaker.

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
October 04, 2006
Derek Parnell wrote:
> On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
> 
>> d-bugmail@puremagic.com wrote:
>>>     writefln("sorted");   validate(a.sort);  // fails
>>>     writefln("reversed"); validate(a.reverse); // fails
>> AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string.  But hmm....
> 
> Yes, I realize that but it makes Walter's statements that char[] is all we
> need and we do not need a 'string' a bit weaker.

.sort and .reverse should sort/reverse the Unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.

Both behaviors will be fixed in the next update.
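As a rough sketch of the character-level behaviour described here (not the actual DMD runtime fix, which this thread does not show), one can decode to code points, operate on those, and re-encode. This assumes a current D2 compiler and Phobos; `reverseCodePoints` and `sortCodePoints` are hypothetical names chosen for the example.

```d
import std.algorithm.mutation : reverse;
import std.algorithm.sorting : sort;
import std.conv : to;
import std.utf : validate;

/// Hypothetical helper: returns `s` with its code points in reverse order.
string reverseCodePoints(string s)
{
    auto cps = s.to!(dchar[]); // decode UTF-8 into one dchar per code point
    reverse(cps);
    return cps.to!string;      // re-encode as UTF-8
}

/// Hypothetical helper: returns `s` with its code points in ascending order.
string sortCodePoints(string s)
{
    auto cps = s.to!(dchar[]);
    sort(cps);                 // orders by numeric code point value
    return cps.to!string;
}

void main()
{
    string text = "\u3026\u2021\u3061\n";
    auto reversed = reverseCodePoints(text);
    auto sorted = sortCodePoints(text);

    // Both results are still well-formed UTF-8.
    validate(reversed);
    validate(sorted);

    assert(reversed == "\n\u3061\u2021\u3026");
    assert(sorted == "\n\u2021\u3026\u3061");
}
```

Going through `dchar[]` costs an extra allocation, but it keeps every multi-byte sequence intact.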
October 04, 2006
d-bugmail@puremagic.com schrieb am 2006-10-02:
> http://d.puremagic.com/issues/show_bug.cgi?id=391

> import std.utf;
> import std.stdio;
> void main()
> {
>     char[] a;
>     a = "\u3026\u2021\u3061\n";
>     writefln("plain");    validate(a);
>     writefln("sorted");   validate(a.sort);  // fails
>     writefln("reversed"); validate(a.reverse); // fails
> }

Added to DStress as
http://dstress.kuehne.cn/run/r/reverse_08_A.d
http://dstress.kuehne.cn/run/r/reverse_08_B.d
http://dstress.kuehne.cn/run/r/reverse_08_C.d
http://dstress.kuehne.cn/run/s/sort_16_A.d
http://dstress.kuehne.cn/run/s/sort_16_B.d
http://dstress.kuehne.cn/run/s/sort_16_C.d

Thomas


October 04, 2006
Walter Bright wrote:
> Derek Parnell wrote:
>> On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
>>
>>> d-bugmail@puremagic.com wrote:
>>>>     writefln("sorted");   validate(a.sort);  // fails
>>>>     writefln("reversed"); validate(a.reverse); // fails
>>> AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string.  But hmm....
>>
>> Yes, I realize that but it makes Walter's statements that char[] is all we
>> need and we do not need a 'string' a bit weaker.
> 
> .sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.

Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning.  And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII).  I'm starting to feel like people are harping on Unicode issues just for the sake of doing so rather than because these are actual problems.  Can someone please explain what I'm missing?
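The point about binary ordering can be seen in a small sketch, assuming a current D2 compiler and Phobos. The function names are hypothetical. The default sort orders by code point value, so uppercase ASCII letters come before lowercase ones. A predicate is needed to get anything closer to dictionary order, and the simple case-folding predicate shown here is only a stand-in for real collation.

```d
import std.algorithm.sorting : sort;
import std.conv : to;
import std.uni : toLower;

/// Hypothetical helper: sorts code points by their numeric value.
string sortByCodePoint(string s)
{
    auto cps = s.to!(dchar[]);
    sort(cps); // default predicate "a < b" on the dchar values
    return cps.to!string;
}

/// Hypothetical helper: sorts code points ignoring case.
string sortIgnoringCase(string s)
{
    auto cps = s.to!(dchar[]);
    sort!((a, b) => toLower(a) < toLower(b))(cps);
    return cps.to!string;
}

void main()
{
    // 'Z' is U+005A, 'a' is U+0061, 'b' is U+0062.
    assert(sortByCodePoint("bZa") == "Zab");
    assert(sortIgnoringCase("bZa") == "abZ");
}
```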


Sean
October 05, 2006
Sean Kelly wrote:
> Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning.  And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII).  I'm starting to feel like people are harping on Unicode issues just for the sake of doing so rather than because these are actual problems.  Can someone please explain what I'm missing?

Collecting character usage frequency statistics is one such use. Read a text file into a buffer, sort the buffer, and dump the result!
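A minimal sketch of that sort-then-count idea, assuming a current D2 compiler and Phobos and using an in-memory string instead of a file. `charFrequencies` is a hypothetical name. Sorting the code points makes equal characters adjacent, so each run can be counted in one pass.

```d
import std.algorithm.iteration : group;
import std.algorithm.sorting : sort;
import std.array : array;
import std.conv : to;
import std.stdio : writefln;

/// Hypothetical helper: returns (code point, count) pairs in code point order.
auto charFrequencies(string text)
{
    auto cps = text.to!(dchar[]); // one element per code point
    sort(cps);                    // equal characters become adjacent
    return cps.group.array;       // collapse each run into (value, run length)
}

void main()
{
    foreach (entry; charFrequencies("a\u3061b\u3061a"))
        writefln("U+%04X occurs %s time(s)", cast(uint) entry[0], entry[1]);
}
```

Sorting raw bytes instead would still group equal bytes, but the counts would then describe UTF-8 code units rather than characters.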

I don't mind the harping on it. Getting the details right is important, even if the details themselves aren't. Besides, it's an easy fix.
October 05, 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Derek Parnell wrote:
>>> On Tue, 03 Oct 2006 21:43:46 +0100, Stewart Gordon wrote:
>>>
>>>> d-bugmail@puremagic.com wrote:
>>>>>     writefln("sorted");   validate(a.sort);  // fails
>>>>>     writefln("reversed"); validate(a.reverse); // fails
>>>> AIUI sort and reverse are defined to sort/reverse the individual elements of the array, rather than the Unicode characters that make up a string.  But hmm....
>>>
>>> Yes, I realize that but it makes Walter's statements that char[] is all we
>>> need and we do not need a 'string' a bit weaker.
>>
>> .sort and .reverse should reverse the unicode characters. If you want to reverse/sort the individual bytes, you should cast it to a ubyte[] first.
> 
> Changing the behavior of .reverse kind of makes sense, but I don't understand the reason for changing .sort aside from consistency. Personally, I've never had a reason to sort a char array in the first place unless the chars were intended to represent something other than their lexical meaning.  And that aside, sorting chars in a string without a comparison predicate will do so using the char's binary value, which has no lexical significance beyond the 26 letters of the English alphabet (as represented in ASCII). 

What if you want to use a quick binary search look-up to see if a text contains a given character? ;)
Not that I've ever needed it, but it makes sense to just fix it.
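The binary search idea could look roughly like this, assuming a current D2 compiler and Phobos. `textContains` is a hypothetical name. Phobos `sort` returns a `SortedRange`, whose `contains` method performs a binary search.

```d
import std.algorithm.sorting : sort;
import std.conv : to;

/// Hypothetical helper: reports whether `needle` occurs in `text`
/// by binary search over the sorted code points.
bool textContains(string text, dchar needle)
{
    auto sorted = sort(text.to!(dchar[])); // SortedRange over the code points
    return sorted.contains(needle);        // binary search
}

void main()
{
    string text = "\u3026\u2021\u3061\n";
    dchar present = '\u3061';
    dchar absent = 'x';

    assert(textContains(text, present));
    assert(!textContains(text, absent));
}
```

For repeated look-ups one would sort once and keep the sorted range around, since sorting on every call costs more than a linear scan.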

How often do you .reverse a string, for that matter?

L.
October 10, 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391


bugzilla@digitalmars.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Comment #2 from bugzilla@digitalmars.com  2006-10-10 03:29 -------
Fixed in DMD 0.169


-- 

December 23, 2006
http://d.puremagic.com/issues/show_bug.cgi?id=391


thomas-dloop@kuehne.cn changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |




------- Comment #3 from thomas-dloop@kuehne.cn  2006-12-23 07:10 -------
Process terminating with default action of signal 11 (SIGSEGV)
Bad permissions for mapped region at address 0x805A0EC

at 0x80544A3: _D3std8typeinfo8ti_dchar10TypeInfo_w4swapMFPvPvZv (in run/s/sort_16_A.d.exe)
by 0x8050ACD: _adSort (in run/s/sort_16_A.d.exe)
by 0x804A0F4: _Dmain (in run/s/sort_16_A.d:17)
by 0x804BBE6: main (in run/s/sort_16_A.d.exe)


-- 
