March 09, 2014
On 3/8/2014 9:15 PM, Michel Fortin wrote:
>
> Text is an interesting topic for never-ending discussions.
>

It's also a good example for when non-programmers are surprised to hear that I *don't* see the world as binary "black and white" *because* of my programming experience ;)

Problems like text-handling make it [painfully] obvious to programmers that reality is shades-of-grey - laymen don't often expect that!

March 09, 2014
On 3/9/2014 7:47 AM, w0rp wrote:
>
> My knowledge of Unicode pretty much just comes from having
> to deal with foreign language customers and discovering the problems
> with the code unit abstraction most languages seem to use. (Java and
> Python suffer from similar issues, but they don't really have algorithms
> in the way that we do.)
>

Python 2 or 3 (out of curiosity)? If you're including Python 3, that somewhat surprises me, as I thought greatly improved Unicode support was one of the biggest reasons for the jump from 2 to 3. (Although it isn't *completely* surprising since, as we all know far too well here, fully correct Unicode is *not* easy.)

March 09, 2014
On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
> Also, `byCodeUnit` and `byCodePoint` would probably be better names than `raw`
> and `decode`, to match the already existing `byGrapheme` in std.uni.

I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string, wstring, dstring, and InputRange!char, etc.

March 09, 2014
On 3/9/2014 6:34 AM, Jakob Ovrum wrote:
> `byCodeUnit` is essentially std.string.representation.

Not at all. std.string.representation takes a string and casts it to an array of the corresponding unsigned integer type (ubyte, ushort, or uint).

It doesn't work at all with InputRange!char.
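
For illustration, here is a minimal sketch of what that cast looks like in practice. The helper name `utf8Bytes` is made up for this example; only `std.string.representation` comes from Phobos.

```d
import std.string : representation;

/// Hypothetical helper: the raw UTF-8 bytes of s, with no copy and no decoding.
immutable(ubyte)[] utf8Bytes(string s)
{
    return s.representation;
}

void main()
{
    // Each string type maps to the unsigned integer type of the same width.
    static assert(is(typeof("a".representation)  == immutable(ubyte)[]));
    static assert(is(typeof("a"w.representation) == immutable(ushort)[]));
    static assert(is(typeof("a"d.representation) == immutable(uint)[]));

    // 'ñ' (U+00F1) is encoded as the two bytes C3 B1 in UTF-8.
    immutable(ubyte)[] expected = [0x61, 0xC3, 0xB1, 0x62];
    assert(utf8Bytes("añb") == expected);
}
```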

March 10, 2014
On 3/9/2014 6:31 PM, Walter Bright wrote:
> On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>> than `raw`
>> and `decode`, to match the already existing `byGrapheme` in std.uni.
>
> I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string,
> wstring, dstring, and InputRange!char, etc.

'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely different from anything else:

string  str;
wstring wstr;
dstring dstr;

(str|wchar|dchar).byChar  // Always range of char
(str|wchar|dchar).byWchar // Always range of wchar
(str|wchar|dchar).byDchar // Always range of dchar

str.representation  // Range of ubyte
wstr.representation // Range of ushort
dstr.representation // Range of uint

str.byCodeUnit  // Range of char
wstr.byCodeUnit // Range of wchar
dstr.byCodeUnit // Range of dchar
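
A rough sketch of the distinction, assuming the `std.utf` functions `byCodeUnit`, `byChar`, `byWchar` and `byDchar` as they exist in present-day Phobos (at the time of this thread they were only proposed names). The helper names `nativeUnits`, `utf8Units` and `codePoints` are invented for the example.

```d
import std.range.primitives : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byChar, byCodeUnit, byDchar, byWchar;

/// Code units in s, in whatever encoding s already uses.
size_t nativeUnits(S)(S s) { return s.byCodeUnit.walkLength; }

/// UTF-8 code units s would need, regardless of its own encoding.
size_t utf8Units(S)(S s) { return s.byChar.walkLength; }

/// Code points in s.
size_t codePoints(S)(S s) { return s.byDchar.walkLength; }

void main()
{
    // byCodeUnit follows the encoding of its input...
    static assert(is(Unqual!(ElementType!(typeof("a".byCodeUnit)))  == char));
    static assert(is(Unqual!(ElementType!(typeof("a"w.byCodeUnit))) == wchar));
    static assert(is(Unqual!(ElementType!(typeof("a"d.byCodeUnit))) == dchar));

    // ...while by*char fixes the output encoding, transcoding if needed.
    static assert(is(Unqual!(ElementType!(typeof("a"w.byChar)))  == char));
    static assert(is(Unqual!(ElementType!(typeof("a".byWchar)))  == wchar));
    static assert(is(Unqual!(ElementType!(typeof("a".byDchar)))  == dchar));

    assert(nativeUnits("añb")  == 4); // UTF-8: 'ñ' takes two code units
    assert(nativeUnits("añb"w) == 3); // UTF-16: one code unit each
    assert(utf8Units("añb"w)   == 4); // transcoded to UTF-8
    assert(codePoints("añb")   == 3);
}
```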

March 10, 2014
On 3/10/2014 12:19 AM, Nick Sabalausky wrote:
>
> (str|wchar|dchar).byChar  // Always range of char
> (str|wchar|dchar).byWchar // Always range of wchar
> (str|wchar|dchar).byDchar // Always range of dchar
>

Erm, naturally I meant "(str|wstr|dstr)"

March 10, 2014
On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
> On 3/9/2014 6:31 PM, Walter Bright wrote:
>> On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>>> than `raw`
>>> and `decode`, to match the already existing `byGrapheme` in std.uni.
>>
>> I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string,
>> wstring, dstring, and InputRange!char, etc.
>
> 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is completely
> different from anything else:
>
> string  str;
> wstring wstr;
> dstring dstr;
>
> (str|wchar|dchar).byChar  // Always range of char
> (str|wchar|dchar).byWchar // Always range of wchar
> (str|wchar|dchar).byDchar // Always range of dchar
>
> str.representation  // Range of ubyte
> wstr.representation // Range of ushort
> dstr.representation // Range of uint
>
> str.byCodeUnit  // Range of char
> wstr.byCodeUnit // Range of wchar
> dstr.byCodeUnit // Range of dchar

I don't see much point to the latter 3.

March 10, 2014
On 3/10/2014 12:23 AM, Walter Bright wrote:
> On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
>> On 3/9/2014 6:31 PM, Walter Bright wrote:
>>> On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>>>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>>>> than `raw`
>>>> and `decode`, to match the already existing `byGrapheme` in std.uni.
>>>
>>> I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string,
>>> wstring, dstring, and InputRange!char, etc.
>>
>> 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is
>> completely
>> different from anything else:
>>
>> string  str;
>> wstring wstr;
>> dstring dstr;
>>
>> (str|wchar|dchar).byChar  // Always range of char
>> (str|wchar|dchar).byWchar // Always range of wchar
>> (str|wchar|dchar).byDchar // Always range of dchar
>>
>> str.representation  // Range of ubyte
>> wstr.representation // Range of ushort
>> dstr.representation // Range of uint
>>
>> str.byCodeUnit  // Range of char
>> wstr.byCodeUnit // Range of wchar
>> dstr.byCodeUnit // Range of dchar
>
> I don't see much point to the latter 3.
>

Do you mean:

1. You don't see the point to iterating by code unit?
2. You don't see the point to 'byCodeUnit' if we have 'representation'?
3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

Responses:

1. Iterating by code unit: Useful for tweaking performance any time decoding is unnecessary. For example, parsing a grammar where the bulk of the keywords and operators are ASCII. (Occasional uses of Unicode, like Unicode whitespace, can of course be handled easily enough by the lexer FSM.)

2. 'byCodeUnit' if we have 'representation': This one I have trouble answering since I'm still unclear on the purpose of 'representation' (I wasn't even aware of it until a few days ago). I've been assuming there's some specific use-case I've overlooked where it's useful to iterate by code unit *while* treating the code units as if they weren't UTF-8/16/32 at all. But since 'representation' is called *on* a string/wstring/dstring, the data should already be UTF-8/16/32 anyway, not some other encoding that would necessitate using integer types. Or maybe it's just for working around problems with the auto-verification being too eager (I've run into those)? I admit I don't quite get 'representation'.

3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if" chain every time you want to use code units inside generic code. It also lets you change your data type in non-generic code without updating every instance of 'by*char'.

4. Having 'byCodeUnit' work on UTF-32 dstrings: So generic code working on code units doesn't have to special-case UTF-32.
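
To make points 1, 3 and 4 concrete, here is a small sketch of generic code that scans by code unit without decoding. It assumes `std.utf.byCodeUnit` as found in current Phobos; `countAsciiSpaces` is a made-up name for the example. It is safe to test for ASCII this way because ASCII values never occur inside a multi-unit UTF-8 or UTF-16 sequence.

```d
import std.traits : isSomeString;
import std.utf : byCodeUnit;

/// Hypothetical helper: counts ASCII space, tab and newline in any string
/// type without decoding to code points.
size_t countAsciiSpaces(S)(S text)
if (isSomeString!S)
{
    size_t n = 0;
    // One template body serves string, wstring and dstring alike:
    // no "static if" chain choosing between byChar/byWchar/byDchar.
    foreach (unit; text.byCodeUnit)
    {
        if (unit == ' ' || unit == '\t' || unit == '\n')
            ++n;
    }
    return n;
}

void main()
{
    assert(countAsciiSpaces("a b\tc")  == 2);
    assert(countAsciiSpaces("ñ ñ\n"w)  == 2);
    assert(countAsciiSpaces("ñ ñ"d)    == 1);
}
```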


March 10, 2014
On 3/10/2014 12:09 AM, Nick Sabalausky wrote:
> On 3/10/2014 12:23 AM, Walter Bright wrote:
>> On 3/9/2014 9:19 PM, Nick Sabalausky wrote:
>>> On 3/9/2014 6:31 PM, Walter Bright wrote:
>>>> On 3/9/2014 6:08 AM, "Marc Schütz" <schuetzm@gmx.net> wrote:
>>>>> Also, `byCodeUnit` and `byCodePoint` would probably be better names
>>>>> than `raw`
>>>>> and `decode`, to match the already existing `byGrapheme` in std.uni.
>>>>
>>>> I'd vastly prefer 'byChar', 'byWchar', 'byDchar' for each of string,
>>>> wstring, dstring, and InputRange!char, etc.
>>>
>>> 'byCodePoint' and 'byDchar' are the same. However, 'byCodeUnit' is
>>> completely
>>> different from anything else:
>>>
>>> string  str;
>>> wstring wstr;
>>> dstring dstr;
>>>
>>> (str|wchar|dchar).byChar  // Always range of char
>>> (str|wchar|dchar).byWchar // Always range of wchar
>>> (str|wchar|dchar).byDchar // Always range of dchar
>>>
>>> str.representation  // Range of ubyte
>>> wstr.representation // Range of ushort
>>> dstr.representation // Range of uint
>>>
>>> str.byCodeUnit  // Range of char
>>> wstr.byCodeUnit // Range of wchar
>>> dstr.byCodeUnit // Range of dchar
>>
>> I don't see much point to the latter 3.
>>
>
> Do you mean:
>
> 1. You don't see the point to iterating by code unit?
> 2. You don't see the point to 'byCodeUnit' if we have 'representation'?
> 3. You don't see the point to 'byCodeUnit' if we have 'byChar/byWchar/byDchar'?
> 4. You don't see the point to having 'byCodeUnit' work on UTF-32 dstrings?

(3)

> 3. 'byCodeUnit' if we have 'byChar/byWchar/byDchar': To avoid a "static if"
> chain every time you want to use code units inside generic code. Also, so in
> non-generic code you can change your data type without updating instances of
> 'by*char'.

Just not sure I see a use for that.

March 10, 2014
On Sunday, 9 March 2014 at 21:14:30 UTC, Nick Sabalausky wrote:
>>> With all due respect, D string type is exclusively for UTF-8 strings.
>>> If it is not valid UTF-8, it should never have been a D string in the
>>> first place. In the other cases, ubyte[] is there.
>>
>> This is an arbitrary self-imposed limitation caused by the choice in how
>> strings are handled in Phobos.
>
> Yea, I've had problems before - completely unnecessary problems that were *not* helpful or indicative of latent bugs - which were a direct result of Phobos being overly pedantic and eager about UTF validation. And yet the implicit UTF validation has never actually *helped* me in any way.


>> self-imposed limitation
For the greater good.

I find this article very telling about why strings should be converted to UTF-8 as often as possible.
http://www.utf8everywhere.org/

I agree 100% with its content. It's impossibly hard to have sane handling of encodings on Windows (even more so in a team) without following the drastic rules the article lays out.

This happens to be what Phobos gently mandates. UTF validation is certainly the lesser evil compared to the mess everything becomes without it. How is mandating valid UTF-8 overly pedantic? It is the sanest behaviour. Just use sanitizeUTF8 (http://vibed.org/api/vibe.utils.string/sanitizeUTF8) or an equivalent.
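
As a rough sketch of checking untrusted bytes before treating them as a D string, the following uses Phobos's `std.utf.validate`, which throws `UTFException` on malformed input. The wrapper `isValidUtf8` is a name invented for this example, and it only checks the data; it does not repair it the way a sanitizing function would.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if raw is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // Reinterpret the bytes as chars; validate throws on bad sequences.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    assert(isValidUtf8([0x61, 0xC3, 0xB1])); // "añ"
    assert(!isValidUtf8([0xFF]));            // 0xFF never appears in UTF-8
    assert(!isValidUtf8([0xC3]));            // truncated two-byte sequence
}
```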