Thread overview
UTF-8 issues
Sep 15, 2008
Eldar Insafutdinov
Sep 15, 2008
Walter Bright
Sep 15, 2008
Chris R. Miller
Sep 16, 2008
Lutger
Sep 16, 2008
Benji Smith
Sep 16, 2008
Eldar Insafutdinov
Sep 16, 2008
Benji Smith
Sep 16, 2008
Oskar Linde
September 15, 2008
I have run into some issues with UTF-8 support in D.
As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, indexing doesn't work, and neither does slicing.
But foreach works correctly, so UTF-8 support is only partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?
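
A minimal sketch of the behaviour described above, assuming a current D2 compiler and a UTF-8 source file. The helper name codePointCount is made up for illustration: .length reports UTF-8 code units (bytes), while a foreach that asks for dchar decodes the string as it goes.

```d
import std.stdio : writeln;

// Hypothetical helper: count code points by letting foreach decode
// the UTF-8 on the fly.
size_t codePointCount(string s)
{
    size_t n = 0;
    foreach (dchar c; s)   // asking for dchar makes foreach decode
        ++n;
    return n;
}

void main()
{
    string s = "привет";            // six Cyrillic letters, two bytes each
    writeln(s.length);              // counts UTF-8 code units (bytes)
    writeln(codePointCount(s));     // counts decoded code points
    assert(s.length == 12);
    assert(codePointCount(s) == 6);
}
```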
September 15, 2008
Eldar Insafutdinov wrote:
> I have run into some issues with UTF-8 support in D. As stated in
> http://www.digitalmars.com/d/2.0/cppstrings.html, strings support
> slicing and length calculation. Since strings are char arrays, this
> is correct only for Latin strings. When a string contains, for
> example, Cyrillic characters, the length is wrong, indexing doesn't
> work, and neither does slicing. But foreach works correctly, so
> UTF-8 support is only partial. Maybe there are functions in the
> standard library that do this work? I checked the new features in D2
> and found no improvements to UTF-8 support. Am I wrong?


This should help:

http://www.digitalmars.com/d/2.0/phobos/std_utf.html
September 15, 2008
Eldar Insafutdinov wrote:
> I have run into some issues with UTF-8 support in D.
> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, indexing doesn't work, and neither does slicing.
> But foreach works correctly, so UTF-8 support is only partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?

IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable.  Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid.

If you used a wchar or dchar things would be different.
September 15, 2008
On Mon, Sep 15, 2008 at 2:38 PM, Chris R. Miller <lordsauronthegreat@gmail.com> wrote:
> Eldar Insafutdinov wrote:
>> I have run into some issues with UTF-8 support in D.
>> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, indexing doesn't work, and neither does slicing.
>> But foreach works correctly, so UTF-8 support is only partial. Maybe there are functions in the standard library that do this work? I checked the new features in D2 and found no improvements to UTF-8 support. Am I wrong?
>
> IIRC a char array in D will compress itself for ASCII-encodable characters, which destroys the integrity of the length variable.  Well, it's still valid in terms of how long in words the array is, but in terms of real characters it's no longer valid.

It's called UTF-8, and it's supposed to work like that.  That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's.

(Though it could be argued that multibyte encodings are stupid as
hell, and I would agree with that.)

> If you used a wchar or dchar things would be different.
>

If he used dchar it'd be different.  wchar still has multi-element encodings (surrogate pairs) for codepoints outside the BMP.  Which, admittedly, are not that common, but it can still happen.
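
To illustrate the point about surrogate pairs, here is a small sketch assuming current Phobos, where std.utf provides toUTF16 and toUTF32. The helper name unitCounts is hypothetical. It reports how many array elements the same text needs as char[], wchar[] and dchar[].

```d
import std.utf : toUTF16, toUTF32;

// Hypothetical helper: number of code units the same text occupies
// in UTF-8, UTF-16 and UTF-32, in that order.
size_t[3] unitCounts(string s)
{
    return [s.length, toUTF16(s).length, toUTF32(s).length];
}

void main()
{
    // A Cyrillic letter is inside the BMP: one wchar is enough.
    auto bmp = unitCounts("ж");
    assert(bmp[0] == 2 && bmp[1] == 1 && bmp[2] == 1);

    // U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP:
    // four bytes in UTF-8, a surrogate pair in UTF-16, one dchar.
    auto clef = unitCounts("\U0001D11E");
    assert(clef[0] == 4 && clef[1] == 2 && clef[2] == 1);
}
```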
September 16, 2008
Eldar Insafutdinov wrote:
> I have run into some issues with UTF-8 support in D.

The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such.

As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions.

D1/Tango:

http://dsource.org/projects/tango/docs/current/tango.text.Util.html
http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html

D1/Phobos:

http://digitalmars.com/d/1.0/phobos/std_utf.html

D2/Phobos:

http://digitalmars.com/d/2.0/phobos/std_utf.html

Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.

> As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, indexing doesn't work, and neither does slicing.

Indexing, slicing, and length calculation of D strings are based on byte position, not character position.

Character-position indexing and slicing is only possible by iterating from the beginning of the string, decoding the characters on-the-fly, and keeping track of the number of bytes used by each character.

That's what the standard library functions basically do.

Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end).

The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation.

--benji
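
The walk described above (iterate from the start, track the bytes used by each character) might look roughly like this. It is a sketch only, assuming current Phobos and valid UTF-8 input; byteOffset and sliceByCodePoint are made-up names, and std.utf.stride is used to get the byte width of the code point at a given offset. std.utf also offers toUTFindex, which as far as I know performs the same index conversion.

```d
import std.utf : stride;

// Hypothetical helper: convert a code-point index into a byte offset
// by walking from the start of the string. Cost is O(n).
size_t byteOffset(string s, size_t cpIndex)
{
    size_t i = 0;
    while (cpIndex > 0 && i < s.length)
    {
        i += stride(s, i);   // bytes used by the code point starting at i
        --cpIndex;
    }
    return i;
}

// Hypothetical helper: slice by code-point positions [from, to).
string sliceByCodePoint(string s, size_t from, size_t to)
{
    return s[byteOffset(s, from) .. byteOffset(s, to)];
}

void main()
{
    string s = "привет мир";
    assert(sliceByCodePoint(s, 0, 6) == "привет");
    assert(sliceByCodePoint(s, 7, 10) == "мир");
}
```

Every call rescans the string from the beginning, which is the hidden cost being discussed.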
September 16, 2008
Benji Smith Wrote:

> Eldar Insafutdinov wrote:
> > I have run into some issues with UTF-8 support in D.
> 
> The important thing to remember is that a string is absolutely NOT an array of characters, and you can't treat it as such.
> 
> As you've noticed, a char[] string is actually an array of UTF-8 encoded bytes. Iterating directly through that array is extremely touchy and error-prone. Instead, always use the standard library functions.
> 
> D1/Tango:
> 
> http://dsource.org/projects/tango/docs/current/tango.text.Util.html http://dsource.org/projects/tango/docs/current/tango.text.convert.Utf.html
> 
> D1/Phobos:
> 
> http://digitalmars.com/d/1.0/phobos/std_utf.html
> 
> D2/Phobos:
> 
> http://digitalmars.com/d/2.0/phobos/std_utf.html
> 
> Although the libraries do a decent job of hiding the ugly details, my opinion (which is not very popular around here) is that D's string processing is a major design flaw.
> 
> > As stated in http://www.digitalmars.com/d/2.0/cppstrings.html, strings support slicing and length calculation. Since strings are char arrays, this is correct only for Latin strings. When a string contains, for example, Cyrillic characters, the length is wrong, indexing doesn't work, and neither does slicing.
> 
> Indexing, slicing, and length calculation of D strings are based on byte position, not character position.
> 
> Character-position indexing and slicing is only possible by iterating from the beginning of the string, decoding the characters on-the-fly, and keeping track of the number of bytes used by each character.
> 
> That's what the standard library functions basically do.
> 
> Calculating the actual character length of the string is fundamentally the same as in C, where strings are null-terminated (i.e., you can't determine the actual length of the string until you've iterated from the beginning to the end).
> 
> The Phobos & Tango libraries handle all of those details for you, but I think it's important to know what's going on behind the scenes, so that you have a rough idea of the true cost of each operation.
> 
> --benji

Yeah, I know that these operations work on bytes rather than characters. But it is stated explicitly at http://www.digitalmars.com/d/2.0/cppstrings.html that strings support slicing:

>D has the array slice syntax, not possible with C++:

>char[] s1 = "hello world";
>char[] s2 = s1[6 .. 11];	// s2 is "world"

So this example is correct only for Latin characters; in general it is wrong for UTF-8 strings.
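
A small sketch of what goes wrong, assuming current Phobos (std.utf.validate throws UTFException on malformed input). The helper isValidUtf8 is a made-up name. The same byte indices that give "world" in the Latin example cut a Cyrillic string in the middle of a character.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string latin = "hello world";
    assert(latin[6 .. 11] == "world");   // one byte per character

    string cyr = "привет мир";           // two bytes per Cyrillic letter
    assert(cyr[13 .. 19] == "мир");      // byte offsets on character boundaries
    assert(!isValidUtf8(cyr[6 .. 11]));  // ends in the middle of a character
}
```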
September 16, 2008
Eldar Insafutdinov wrote:
> Yeah, I know that these operations work on bytes rather than characters. But it is stated explicitly at http://www.digitalmars.com/d/2.0/cppstrings.html that strings support slicing:
> 
>> D has the array slice syntax, not possible with C++:
> 
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11];	// s2 is "world"
> 
> So this example is correct only for Latin characters; in general it is wrong for UTF-8 strings.

That's my understanding.

--benji
September 16, 2008
Eldar Insafutdinov wrote:
> Benji Smith Wrote:
>> D has the array slice syntax, not possible with C++:
> 
>> char[] s1 = "hello world";
>> char[] s2 = s1[6 .. 11];	// s2 is "world"
> 
> So this example is correct only for Latin characters; in general it is wrong for UTF-8 strings.

It is not wrong for UTF-8 strings. It just won't work for arbitrary indices. But I don't think you will ever use arbitrary indices. All indices will be the result of other string functions (such as find) which behave correctly for UTF-8 strings. Incrementing/decrementing can be done using std.utf or similar. UTF-8 also makes it very easy to determine if an arbitrary position in a UTF-8 sequence lies at the start or in the middle of a multi-byte encoded character.
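
As a rough sketch of that workflow, assuming current Phobos, where the search function in std.string is called indexOf and std.utf.stride gives the byte width of a code point: the helper names after and dropFirstCodePoint are hypothetical. Indices obtained this way always fall on character boundaries, so plain slicing is safe.

```d
import std.string : indexOf;
import std.utf : stride;

// Hypothetical helper: everything after the first occurrence of
// needle, or null if needle does not occur.
string after(string haystack, string needle)
{
    auto pos = indexOf(haystack, needle);   // byte offset, or -1
    if (pos < 0)
        return null;
    return haystack[pos + needle.length .. $];
}

// Hypothetical helper: step past one code point using its byte width.
string dropFirstCodePoint(string s)
{
    return s.length ? s[stride(s, 0) .. $] : s;
}

void main()
{
    assert(after("привет мир", " ") == "мир");
    assert(after("привет", "x") is null);
    assert(dropFirstCodePoint("жук") == "ук");
}
```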

Indexing a UTF-8 string by character rather than byte index is horribly inefficient. As others have said, if you really need to do that, use dchar[] (1). Although, I've never personally come across a place where I needed that.

1) Be aware that you will need to make sure your data is in a composed Unicode normal form, otherwise it could still use several code points (2) to represent a single grapheme.

2) A code point is a point in the Unicode codespace, which is what a dchar encodes.
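
A sketch of footnote 1, assuming a current Phobos in which std.uni provides normalize and byGrapheme (these are not in the 2008 libraries linked in this thread). The helper name graphemeCount is made up. Even in a dchar[] an accented letter can occupy two elements until the text is normalized to a composed form.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

// Hypothetical helper: number of user-perceived characters.
size_t graphemeCount(dstring s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    dstring decomposed = "e\u0301"d;
    assert(decomposed.length == 2);          // two code points
    assert(graphemeCount(decomposed) == 1);  // one grapheme

    // NFC composes the pair into the single code point U+00E9
    dstring composed = normalize!NFC(decomposed);
    assert(composed.length == 1);
    assert(composed == "\u00E9"d);
}
```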

-- 
Oskar
September 16, 2008
Jarrett Billingsley wrote:
...
> 
> It's called UTF-8, and it's supposed to work like that.  That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's.

There's also std.string of course. What do you find so lacking? (just
curious)
September 16, 2008
On Tue, Sep 16, 2008 at 4:57 PM, Lutger <lutger.blijdestijn@gmail.com> wrote:
> Jarrett Billingsley wrote:
> ...
>>
>> It's called UTF-8, and it's supposed to work like that.  That D does not provide some kind of interface for dealing with multibyte encodings (other than foreach and the encode/decode functions) is a failing on its part, not Unicode's.
>
> There's also std.string of course. What do you find so lacking? (just
> curious)
>

The lack of any way to index or slice a string according to codepoint indices (instead of byte/short indices), get the length of a string in codepoints, or to find the nearest beginning character given an arbitrary character index.  (std.string is also embarrassingly missing any functionality for wchar[] or dchar[] but that's a slightly different issue.)
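
For the last of those, a hand-rolled helper is short. This is a sketch that assumes well-formed UTF-8 and relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The names isContinuation and codePointStart are made up and are not part of Phobos.

```d
// Hypothetical helper: true if b is a UTF-8 continuation byte (10xxxxxx).
bool isContinuation(char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: move an arbitrary byte index back to the start
// of the code point that contains it. Assumes s is valid UTF-8.
size_t codePointStart(string s, size_t i)
{
    while (i > 0 && i < s.length && isContinuation(s[i]))
        --i;
    return i;
}

void main()
{
    string s = "привет";                 // each letter is two bytes
    assert(codePointStart(s, 2) == 2);   // already at the start of a letter
    assert(codePointStart(s, 3) == 2);   // second byte of that letter
    assert(codePointStart(s, 0) == 0);
}
```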