strings in D
February 19, 2005
Is there a string class for D?
Or are there any plans to create one?

char[], wchar[] and dchar[] cannot serve as strings because they hold UTF encodings, which are "transport" encodings and in most cases cannot be used as strings.

A string as an entity is a sequence of "code points": ASCII, UCS-2 (the Basic Multilingual Plane) or UCS-4, so that operator[] always returns a whole character (for the given supported plane).
The same should apply to foreach().

Personally, I would like to see something similar to Java strings (UCS-2), with methods like fromByteArray(encoding), fromUtf8(), etc.

Such strings should probably use a copy-on-write implementation.
I think that UCS-2 (an unsigned word) as the string character would be enough for all active languages.

Any other ideas, gentlemen?

Andrew Fedoniouk.
http://terrainformatica.com





February 19, 2005
In article <cv6d5q$19al$1@digitaldaemon.com>, Andrew Fedoniouk says...
>
>Is there any string class for the D?
>Or are there any plans to create string for D?
>
>[snip]
>

You're walking upon graves with that one, Andrew! I'm afraid there's been a lot of conflicting opinion around that particular subject.

Best bet is to get hold of a 'non-standard' library for such things, and go from there. The mango.icu package is a wrapper around the extensive ICU project, and may suit your needs ~ you can find that over at dsource.org: http://dsource.org/forums/viewtopic.php?t=420


February 19, 2005
On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:

> Is there any string class for the D?
> Or are there any plans to create string for D?
>
> [snip]

This question has been asked many times in the D groups.  If there ever were a "big three" in the D debates department, I think this one would rank as one of them.

From what I gather, the opinions have settled into three groups:

1) Those that want a String class in D and think it is a critical addition to the language.
2) Those that consider a String class contrary to the D methodology; they think char[], wchar[] and dchar[] are sufficient.
3) Those that think a String class could be a useful addition, but that it should be added to D for optional use.

If you do a search of this newsgroup and the old D newsgroup, I think you'll find how big the discussion has been!

- John R.
February 19, 2005
Forgive me, but isn't UCS-2 *essentially* 16 bit Unicode without the bom and maybe a few other things?  I may be wrong, but I would think that, if you want that, you can just use dchar[] or even wchar[]...

I'm not saying that strings are or aren't necessary, but if I do this: (let's see if I can post unicode on this newsgroup...)

wchar[] test = "ウェブ全体から検索";

foreach (wchar c; test)
   writef("%s ", c);

You'll get one iteration for each character (there are nine.)  Yes, this uses twice the memory, but it gives you the "character in full" you're asking for.  No replacement for a string class, and I'm not arguing either way on that, but foreach and [] (called opIndex, I believe, in D) work fine.

As for byte conversions, you can at least do that with unicode (simple casting between byte[] and char[], etc.) and I'm sure iconv could be useful if you need charset conversion.

-[Unknown]


> I think that ucs-2 (unsigned word) as a string character whould be enough for all active
> languages.
February 19, 2005
OK. It seems I did not explain this clearly. Let's try again from a different point of view (more technical this time).

A UTF-16 sequence cannot be treated as a UCS-2 sequence (especially in D with its built-in conversion). This is just technically wrong.

See:
ushort[] utf16string =
[
  0x0041,       // 'A' - Latin-1
  0x0020,       // ' ' - Latin-1
  0xD800,       // high-half zone part
  0xDC00,       // low-half zone part - value
  0xD800,       // high-half zone part
  0xDC01        // low-half zone part - value
];

This example text contains 4 coded characters. The first two are BMP (Basic Multilingual Plane) characters, each coded with a single UCS-2 (BMP) code value; the last two are non-BMP characters coded with two words each, a high-half code and a low-half code. Translating this to UCS-4 code values would produce the following:
uint[] ucs4string =
[
  0x00000041,   // 'A' // Latin-1
  0x00000020,   // ' ' // Latin-1
  0x00010000,   // hieroglyph foo
  0x00010001    // hieroglyph bar
];

What is the meaning of strlen() in the utf16string case: 4 or 6? D thinks that utf16string is a sequence of wchars. I wouldn't say so. These are not characters in the common sense but just parts of a sequence of 16-bit units. You cannot treat them as characters, e.g. you cannot insert a new wchar at position 3 of utf16string.
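
Both answers can be checked directly. The following is only a sketch, written for a current D compiler and Phobos (so it uses wstring and std.utf.toUTF32, which is an assumption about the toolchain rather than anything from the 2005 library), with the same four characters written as string escapes:

```d
import std.utf : toUTF32;

void main()
{
    // The example text: 'A', ' ', U+10000, U+10001, stored as UTF-16.
    wstring utf16string = "A \U00010000\U00010001"w;

    // .length counts 16-bit code units, so each surrogate pair counts twice.
    assert(utf16string.length == 6);

    // Decoding to UTF-32 gives one element per code point.
    assert(toUTF32(utf16string).length == 4);

    // Indexing returns a code unit: position 2 holds a high surrogate,
    // which is only half of the third character.
    assert(utf16string[2] == 0xD800);
}
```

So "6" is the array length and "4" is the number of code points; which of the two a strlen() should report is exactly the point under discussion.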

Only dchar could be considered a real Unicode character (UCS-4).
But modern computers are not ready for UCS-4 yet: it needs too much memory.

The practical solution is to use UCS-2, i.e. two-byte characters. (Again, UCS-2 is the BMP, http://www.unicode.org/roadmaps/bmp/, and it includes all the active languages that civilization now uses for writing.)

typedef wchar char2; // new type, ucs-2 codes
typedef char2[] string2; // brand new type, strict ucs-2 string

Conversion from UTF-16 wchar[] -> char2[] *must* interpret UTF-16 pairs such as (0xD800,0xDC00) and produce *one* char2 code with the value '?' (or any other code meaning "unsupported character"). Thus codes in the surrogate range D800 - DFFF *must* not appear in a char2[] string.

Since D has built-in conversion routines, the list of character types should look like this:

char    - element of a UTF-8 sequence. char[] - UTF-8 encoded Unicode sequence.
wchar   - element of a UTF-16 sequence. wchar[] - UTF-16 encoded Unicode sequence.
dchar   - UCS-4 character, a full Unicode character. dchar[] - UCS-4 string.
char2   - UCS-2 (BMP) character. Codes D800 - DBFF do not represent the start of
          a UTF-16 sequence and are not expanded into UCS-4 by the system.
char2[] - UCS-2 string, a sequence of characters.
          Can be manipulated arbitrarily, e.g. characters (char2) can
          be inserted or deleted at any given position.

Let me highlight again:

/////
/////   elements of utf sequence *are not* characters.
/////

So such functions as strchr(string,char) must be declared either as

int strchr(char1[], char1 c) // latin-1 string
--or--
int strchr(char2[], char2 c) // ucs-2 string and char
--or--
int strchr(char4[], char4 c) // ucs-4 string or 'dchar'

This message has one sole reason: to make D close to perfect.

Andrew Fedoniouk.
http://terrainformatica.com


February 19, 2005
Yes, all true.  I know.  UCS-2 and UTF-16 are not exactly the same, but they are quite similar for many intents and purposes.

Again, you can get the conversion you want (Latin1 -> UCS-4, etc.) using iconv or similar.  Even if this was built in, it would have to be done using such a tool or a custom written one - it's not like it's an interrupt call or something :P.

And a strlen that counted unique characters in a UTF-8/UTF-16/etc. string would be expensive performance-wise.  Instead of just returning the array's length, which is lightning quick and quite possibly (imho) why D performs better with strings, you'd end up traversing the entire string looking for characters.  Yes, this length could be (I would hope!) cached by the class to speed up sequential strpos's, substr's, etc.
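
To make the caching idea concrete, here is a rough sketch for a current D compiler. The type CountedText and its members are hypothetical names made up for illustration; nothing like it is proposed elsewhere in this thread:

```d
// Hypothetical wrapper: keeps UTF-8 data and remembers its code point count.
struct CountedText
{
    private string data;
    private size_t cachedCount;
    private bool counted;

    this(string s) { data = s; }

    // Number of UTF-8 code units (bytes): O(1), just the array length.
    size_t byteLength() const { return data.length; }

    // Number of code points: O(n) on the first call, O(1) afterwards.
    size_t length()
    {
        if (!counted)
        {
            cachedCount = 0;
            foreach (dchar cp; data)   // foreach decodes UTF-8 to dchar
                ++cachedCount;
            counted = true;
        }
        return cachedCount;
    }
}

void main()
{
    auto t = CountedText("na\u00EFve");  // U+00EF takes two bytes in UTF-8
    assert(t.byteLength == 6);
    assert(t.length == 5);
    assert(t.length == 5);               // second call uses the cached value
}
```

The cache only stays valid while the text is not modified, which is one reason this would have to be a class or struct rather than a bare array.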

But, if it had to traverse like that it would be so much better to use wchar, at least just for textual strings that might contain such characters, because then you could use the speedy method, instead of searching the whole string like C did.

Several other languages have these same problems: C, PHP, Perl, SQL, etc.  I'm quite sure most people who understand UTF-8 are aware that the number of bits divided by eight may or may not have anything to do with the actual length in characters of the string, though.  It's essential - and sometimes, you just have to know it.  Not everything can be abstracted to the point where you just type "do my homework" and hit compile...

Still, I don't think, personally, that using a whole bunch of char types would solve this.  That's several times uglier than a string class, and since a char array is just an array, there isn't really any clean way to override its .length... it'd have to be a class.  And anyway, you're ignoring ISO-8859-2, Shift_JIS, and similar encodings. Why should ISO-8859-1 (Latin1) be special?

Anyway, I can just see an "i18n_length(char[] x)" function.... because sometimes, you really just want the number of bytes, not characters.

-[Unknown]
February 19, 2005
On Sat, 19 Feb 2005 01:14:46 -0800, Unknown W. Brackets wrote:


[snip]

> Anyway, I can just see a "i18n_length(char[] x)" function.... because sometimes, you really just want the number of bytes, not characters.

I submit this sample code ...
<code>
module i18n;
private import std.utf;
debug(1) private import std.stdio;

uint i18n_length( char[] x)
{
    return toUTF32(x).length;
}

uint i18n_length( wchar[] x)
{
    return toUTF32(x).length;
}

uint i18n_length( dchar[] x)
{
    return x.length;
}

unittest
{
    char[] tchar;
    wchar[] twchar;
    dchar[] tdchar;


    tdchar ~= 0x00000041;   // 'A' // Latin-1
    tdchar ~= 0x00000020;   // ' ' // Latin-1
    tdchar ~= 0x00010000;   // hieroglyph foo
    tdchar ~= 0x00010001;   // hieroglyph bar

    twchar = toUTF16(tdchar);
    tchar  = toUTF8(tdchar);

    // Every encoding holds 4 code points; the raw .length values are 4, 6 and 10.
    debug(1) { writefln("dchar.length = %d (%d)", i18n_length(tdchar), tdchar.length); }
    assert( i18n_length(tdchar) == 4);
    debug(1) { writefln("wchar.length = %d (%d)", i18n_length(twchar), twchar.length); }
    assert( i18n_length(twchar) == 4);
    debug(1) { writefln(" char.length = %d (%d)", i18n_length(tchar), tchar.length); }
    assert( i18n_length(tchar)  == 4);
}

debug(2)
{
    void main()
    {
    }
}

</code>

This can be compiled using "build i18n -debug=2" to generate the unittests and then run i18n to run the unittests.

Of course, if you want to, you can create a doctored version of toUTFxx that just counts code points rather than doing an actual conversion.
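
One way such a counting-only version might look, as a sketch for a current D compiler. The name codePointCount is made up for illustration, and the code assumes the input is already well-formed UTF:

```d
// Hypothetical helpers: count code points without building a dchar[] copy.

size_t codePointCount(const(char)[] s)
{
    size_t n;
    foreach (char c; s)            // char loop variable: raw bytes, no decoding
        if ((c & 0xC0) != 0x80)    // skip continuation bytes (10xxxxxx)
            ++n;
    return n;
}

size_t codePointCount(const(wchar)[] s)
{
    size_t n;
    foreach (wchar c; s)           // raw 16-bit units, no decoding
        if (c < 0xDC00 || c > 0xDFFF)  // skip the low half of a surrogate pair
            ++n;
    return n;
}

size_t codePointCount(const(dchar)[] s)
{
    return s.length;               // one element per code point already
}

void main()
{
    // 'A', ' ', U+10000, U+10001 in all three encodings.
    assert(codePointCount("A \U00010000\U00010001"c) == 4);  // 10 bytes
    assert(codePointCount("A \U00010000\U00010001"w) == 4);  // 6 units
    assert(codePointCount("A \U00010000\U00010001"d) == 4);  // 4 units
}
```

Each loop only counts the units that start a code point, so no memory is allocated; the price is that malformed input is not detected.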

-- 
Derek
Melbourne, Australia
February 19, 2005
>String as an entity is a sequence of "code points" - ascii, ucs-2(basic
>multilang plane)
>and ucs-4 so operator[] always returns character in full (for the given
>supported plane).
>The same should apply to foreach().

foreach already iterates over code points if you declare the loop variable as dchar. Try something like
char[] str = ...some non-ascii string...
foreach(int n, dchar cp; str) {
  // cp is the next code point of str; n is the index in str at which its encoding starts
}
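
A small runnable sketch of the same loop, assuming a current D compiler (hence string and size_t rather than char[] and int). The helper name codePointOffsets is hypothetical:

```d
import std.stdio : writefln;

// Hypothetical helper: the index foreach reports for each decoded code point.
size_t[] codePointOffsets(const(char)[] str)
{
    size_t[] offsets;
    foreach (size_t i, dchar cp; str)
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a' (1 byte), U+00E9 (2 bytes), U+10000 (4 bytes) in UTF-8.
    string str = "a\u00E9\U00010000";

    foreach (size_t i, dchar cp; str)
        writefln("U+%04X starts at index %s", cast(uint) cp, i);

    // Three code points, but the indices follow the byte positions.
    size_t[] expected = [0, 1, 3];
    assert(codePointOffsets(str) == expected);
}
```

The loop body runs once per code point, while the index counts array elements, so the two drift apart as soon as the text leaves ASCII.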

-Ben


February 19, 2005
"John Reimer" <brk_6502@yahoo.com> wrote in message news:pan.2005.02.19.05.02.08.170345@yahoo.com...
> On Fri, 18 Feb 2005 19:52:24 -0800, Andrew Fedoniouk wrote:
>
> > Is there any string class for the D?
> > ...

> This question has been asked many times in the D groups.  If there ever were a "big three" in the D debates department, I think this one would rank as one of them.

The D newsgroup could probably use a FAQ.  I also don't know where the land mines are buried!



February 19, 2005
"you're ignoring ISO-8859-2, Shift_JIS, and similar encodings."

Where am I ignoring them?

"Still, I don't think, personally, using a whole bunch of char types ...."

In fact I am not proposing new top level character types.

My point is simple:

'string' as an entity (or class) is different from wchar[], a sequence of UTF-16 characters, in the following terms:

class string  // string which supports only ucs-2 code points
{
    typedef wchar char2; // ucs-2 code points only.
    char2[] chars;

    this( wchar[] utf16 )
    {
        // thanks to Ben Hinkle: foreach with dchar decodes surrogate pairs
        foreach(dchar cp; utf16)
        {
            if( cp > 0xFFFF )
                chars ~= cast(char2) '?'; // ignoramus et ignorabimus
            else
                chars ~= cast(char2) cp;
        }
    }

    int length() {  return chars.length;  }
            // as chars ALWAYS contains code points.

    void set(int pos, wchar wc)
    {
        // surrogate halves are not ucs-2 code points
        if( wc >= 0xD800 && wc <= 0xDFFF )
            throw new Exception("invalid ucs-2 code point");
        else
            chars[pos] = cast(char2) wc;
    }

}

AFAIK this is the approach used in java.lang.String.

I think that existing names of entities in D are misleading.

'char' is in fact not a character but an element of a UTF-8 sequence: a ubyte.
'wchar' is in fact not a "wide" character but an element of a UTF-16 sequence: a ushort.
Only 'dchar' has the meaning of a character.

Keeping this in mind, a declaration like

wchar a;

is technical nonsense. The way it is implemented and treated by D now, wchar (and char) can be used *ONLY* as members of arrays (in a sequence).


