Thread overview
String Type Usage. String vs DString vs WString
Jan 15, 2018  Chris P
Jan 15, 2018  rikki cattermole
Jan 15, 2018  Tony
Jan 15, 2018  Jonathan M Davis
Jan 15, 2018  Patrick Schluter
Jan 15, 2018  Nicholas Wilson
Jan 15, 2018  Chris P
Jan 15, 2018  Jonathan M Davis
Jan 15, 2018  SimonN
Jan 15, 2018  Adam D. Ruppe
Jan 15, 2018  SimonN
Jan 15, 2018  Kagamin
Jan 15, 2018  Jonathan M Davis
January 15, 2018
Hello,

I'm extremely new to D and have a quick question regarding common practice when using strings. Is usage of one type over the others encouraged? When using 'string' it appears there is a length mismatch between the string length and the char array if large Unicode characters are used. So I figured I'd ask.

Thanks in advance,

Chris P - Tampa
January 15, 2018
On 15/01/2018 2:05 AM, Chris P wrote:
> Hello,
> 
> I'm extremely new to D and have a quick question regarding common practice when using strings. Is usage of one type over the others encouraged? When using 'string' it appears there is a length mismatch between the string length and the char array if large Unicode characters are used. So I figured I'd ask.
> 
> Thanks in advance,
> 
> Chris P - Tampa

D's strings are Unicode.

Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
The size of a code point is 1, 2 or 4 bytes.
But here is the thing: what is displayed (a character) could be multiple code points, and these can be combined to form a grapheme.

So yes, there will be length mismatches between them :)
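
To make that concrete, here is a small sketch (my illustration, not part of the original post) comparing the lengths for 'e' followed by a combining acute accent, which is one grapheme built from two code points:

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string  s = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT
    wstring w = "e\u0301"w;
    dstring d = "e\u0301"d;

    writeln(s.length);                // 3 -- UTF-8 code units (bytes)
    writeln(w.length);                // 2 -- UTF-16 code units
    writeln(d.length);                // 2 -- UTF-32 code units (code points)
    writeln(s.byGrapheme.walkLength); // 1 -- what the reader sees as one character
}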
January 15, 2018
On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Hello,
>
> I'm extremely new to D and have a quick question regarding common practice when using strings. Is usage of one type over the others encouraged? When using 'string' it appears there is a length mismatch between the string length and the char array if large Unicode characters are used. So I figured I'd ask.
>
> Thanks in advance,
>
> Chris P - Tampa

 string == immutable( char)[],  char == UTF-8
wstring == immutable(wchar)[], wchar == UTF-16
dstring == immutable(dchar)[], dchar == UTF-32

Unless you are dealing with Windows, in which case you may need to consider using wstring, there is very little reason to use anything but string.

N.B. when you iterate over a string there are a number of different "flavours" (for want of a better term) you can iterate over: bytes, Unicode code points, and graphemes (I'm possibly forgetting some). Have a look in std.uni and related modules. Iteration in Phobos defaults to code points, I think.

TLDR use string.
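
For illustration, a small sketch of those flavours (the example string is my own; it uses std.utf.byCodeUnit and std.uni.byGrapheme):

import std.stdio;
import std.range : walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

void main()
{
    string s = "a\u00E9b";  // 'é' is a single code point taking two UTF-8 code units

    writeln(s.byCodeUnit.walkLength); // 4 -- bytes (UTF-8 code units)
    writeln(s.walkLength);            // 3 -- code points (Phobos auto-decodes to dchar)
    writeln(s.byGrapheme.walkLength); // 3 -- graphemes
}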

January 15, 2018
On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
>> [...]
>
>  string == immutable( char)[],  char == UTF-8
> wstring == immutable(wchar)[], wchar == UTF-16
> dstring == immutable(dchar)[], dchar == UTF-32
>
> Unless you are dealing with Windows, in which case you may need to consider using wstring, there is very little reason to use anything but string.
>
> N.B. when you iterate over a string there are a number of different "flavours" (for want of a better term) you can iterate over: bytes, Unicode code points, and graphemes (I'm possibly forgetting some). Have a look in std.uni and related modules. Iteration in Phobos defaults to code points, I think.
>
> TLDR use string.

Thank you (and rikki) for replying. Actually, I am using Windows (Doh!) but I now understand. Cheers!
January 14, 2018
On Monday, January 15, 2018 02:22:09 Chris P via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:15:55 UTC, Nicholas Wilson wrote:
> > On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> >> [...]
> >>
> >  string == immutable( char)[],  char == UTF-8
> > wstring == immutable(wchar)[], wchar == UTF-16
> > dstring == immutable(dchar)[], dchar == UTF-32
> >
> > Unless you are dealing with Windows, in which case you may need to consider using wstring, there is very little reason to use anything but string.
> >
> > N.B. when you iterate over a string there are a number of different "flavours" (for want of a better term) you can iterate over: bytes, Unicode code points, and graphemes (I'm possibly forgetting some). Have a look in std.uni and related modules. Iteration in Phobos defaults to code points, I think.
> >
> > TLDR use string.
>
> Thank you (and rikki) for replying. Actually, I am using Windows
> (Doh!) but I now understand. Cheers!

Even with Windows, there usually isn't any reason to use wstring. The only reason that wstring might be more desirable on Windows is that you need UTF-16 when dealing with the Windows API calls, and that's normally only going to come up if you're not writing platform-independent code. The common stuff such as file access is already wrapped by Phobos (e.g. in std.file and std.stdio), so most programs don't need to worry about the Windows API calls. And even if you do, the best practice generally is to use string everywhere in your code and then only convert to a zero-terminated wchar* when making the Windows API calls (either by actually allocating a zero-terminated wchar* or using a static array with the appropriate wchar set to 0, depending on the context).

If you have to do a ton with Windows API calls, at some point, it arguably becomes better to just keep them as wstrings to avoid the conversions, but even then, because strings in D aren't zero-terminated, and the C API calls usually require them to be, you're often forced to copy the string to pass it to a Windows API call anyway, in which case, you lose most of the benefit of keeping stuff around in wstrings instead of just using strings everywhere.

If you do need to worry about calling a Windows API function, then check out toUTFz in std.utf, since it will allow you to easily convert to zero-terminated strings of any character type (std.string.toStringz handles zero-terminated strings as well, but only for string).
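
For example, a minimal sketch of that pattern (MessageBoxW is used purely as an illustrative wide Windows API call):

import std.utf : toUTFz;

version(Windows)
{
    import core.sys.windows.winuser : MessageBoxW, MB_OK;

    void showMessage(string title, string text)
    {
        // Keep string everywhere; convert to zero-terminated UTF-16 only at the API boundary.
        MessageBoxW(null,
                    text.toUTFz!(const(wchar)*),
                    title.toUTFz!(const(wchar)*),
                    MB_OK);
    }
}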

- Jonathan M Davis

January 15, 2018
On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole wrote:

>
> Unicode has three main variants, UTF-8, UTF-16 and UTF-32.
> The size of a code point is 1, 2 or 4 bytes.

I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4 (UTF-32) bytes are referred to as "code units" and the size of a code point varies in UTF-8 and UTF-16.
January 14, 2018
On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole
>
> wrote:
> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32. The size of a code point is 1, 2 or 4 bytes.
>
> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
> (UTF-32) bytes are referred to as "code units" and the size of a
> code point varies in UTF-8 and UTF-16.

Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them (IIRC) in a code point. For UTF-16, a code unit is 16 bits, and there are either 1 or 2 code units per code point. For UTF-32, a code unit is 32 bits, and there is always 1 code unit per code point.

For better or worse (mostly worse), ranges then treat all strings as ranges of code points and decode them such that you get a range of dchar (which means fun things like isRandomAccessRange!string and hasLength!string are false). As I understand it, each code point is then something which can be physically printed, but either way, it's not necessarily a full character.

Multiple code points can then be combined to make a grapheme cluster (which then corresponds to what we'd normally consider a full character - e.g. a letter and an accent can each be a code point which are then combined to create an accented character). std.uni provides the functionality for operating on graphemes.

And std.utf.byCodeUnit can be used to treat strings as ranges of code units instead of code points (and a fair bit of Phobos takes the solution of specializing range-based code for strings to avoid the auto-decoding).
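
A small sketch of what that means for the range traits (compile-time checks only, added for illustration):

import std.range.primitives : hasLength, isRandomAccessRange;
import std.utf : byCodeUnit;

// With auto-decoding, a string is presented as a range of dchar, so the range
// traits report no random access and no length.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// byCodeUnit exposes the underlying code units and gets those properties back.
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));
static assert(hasLength!(typeof("".byCodeUnit)));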

All in all, the whole thing is annoyingly complicated, though at least D is much more explicit about it than most languages, and I suspect that your average D programmer is better educated about Unicode than your average programmer. And having to figure out why the heck strings and wstrings act so bizarrely as ranges does have the positive side effect of putting it even more in your face than it would be otherwise, making it that much more likely that folks are going to learn about Unicode - though I still think that we'd be better off if we could ever figure out how to treat all strings as ranges of code units without breaking everything in the process. :|

- Jonathan M Davis

January 15, 2018
On Monday, 15 January 2018 at 02:05:32 UTC, Chris P wrote:
> Is usage of one type over the others encouraged?

I would use string (UTF-8) throughout the program, but there seems to be no style guideline for this. Keep in mind two gotchas:

D's foreach and D's ranges will autodecode and silently iterate over dchar, not char, even when the input is string, not dstring. (It's also possible to explicitly decode strings, see std.utf and std.uni.)
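
For explicit decoding, here is a minimal sketch with std.utf.decode (the helper function is my own illustration):

import std.stdio : writefln;
import std.utf : decode;

void printCodePoints(string s)
{
    size_t i = 0;
    while (i < s.length)
    {
        // decode returns one code point and advances i past its UTF-8 code units.
        dchar c = decode(s, i);
        writefln("U+%04X", cast(uint) c);
    }
}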

If you call into the Windows API, some functions require extra care if everything in your program is UTF-8. But I still agree with the approach to keep everything as string in your program, and then wrap the Windows API calls, as the UTF-8 Everywhere manifesto suggests:
http://utf8everywhere.org/

-- Simon
January 15, 2018
On Monday, 15 January 2018 at 04:27:15 UTC, Jonathan M Davis wrote:
> On Monday, January 15, 2018 03:14:02 Tony via Digitalmars-d-learn wrote:
>> On Monday, 15 January 2018 at 02:09:25 UTC, rikki cattermole
>>
>> wrote:
>> > Unicode has three main variants, UTF-8, UTF-16 and UTF-32. The size of a code point is 1, 2 or 4 bytes.
>>
>> I think to be technically correct, 1 (UTF-8), 2 (UTF-16) or 4
>> (UTF-32) bytes are referred to as "code units" and the size of a
>> code point varies in UTF-8 and UTF-16.
>
> Yes, for UTF-8, a code unit is 8 bits, and there can be up to 6 of them (IIRC) in a code point.

Nooooooooooo!!! Only 4 maximum for Unicode. Beyond that it's obsolete crap that hasn't been Unicode since version 2 of Unicode.



January 15, 2018
On Monday, 15 January 2018 at 06:18:27 UTC, SimonN wrote:
> D's foreach [...] will autodecode and silently iterate over dchar, not char, even when the input is string


That's not true. foreach will only decode on demand:

string s;

foreach(c; s) { /* c is a char here, it goes over bytes */ }
foreach(char c; s) { /* c is a char here, same as above */ }
foreach(dchar c; s) { /* c is a dchar - this decodes */ }



Autodecoding is a Phobos library artifact, NOT something in the D language itself.