March 03, 2012
On Saturday, March 03, 2012 14:57:59 Piotr Szturmaj wrote:
> This discrepancy pushes the range writer to handle special string cases.

Yes it does. And there's _no_ way around that if you want to handle unicode both correctly and efficiently. To handle it correctly, you must operate on code points (or even better, graphemes), but to handle them efficiently, you must take the encoding into account. Phobos has gone with the default of correctness while giving you the tools to special case stuff for efficiency. Phobos itself uses static if all over the place to special case pieces of functions on string type. Stuff like isNarrowString and ElementEncodingType exist specifically for that.
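As a rough illustration of that kind of special-casing, here is a minimal sketch. The function name countAscii is made up for this example and is not part of Phobos; only isNarrowString, ElementType, ElementEncodingType and isInputRange are real Phobos names. It assumes a Phobos where ranges over narrow strings decode to dchar.

```d
import std.range : ElementEncodingType, ElementType, isInputRange;
import std.traits : isNarrowString;

// The discrepancy being discussed: a string stores char but ranges see dchar.
static assert(is(ElementType!string == dchar));
static assert(is(ElementEncodingType!string == immutable(char)));

/// Hypothetical helper: counts occurrences of a 7-bit ASCII character.
size_t countAscii(R)(R haystack, char needle)
if (isInputRange!R)
{
    assert(needle < 0x80, "needle must be 7-bit ASCII");
    size_t n = 0;
    static if (isNarrowString!R)
    {
        // Fast path: an ASCII value never occurs inside a multi-unit
        // UTF-8 or UTF-16 sequence, so scanning code units is safe here.
        foreach (ElementEncodingType!R unit; haystack)
            if (unit == needle)
                ++n;
    }
    else
    {
        // Generic path: use whatever element type the range provides.
        foreach (e; haystack)
            if (e == needle)
                ++n;
    }
    return n;
}

unittest
{
    assert(countAscii("a,b,ü,c", ',') == 3);  // string: code-unit path
    assert(countAscii("a,b"d, ',') == 1);     // dstring: generic path
}
```

The fast path is only valid because the needle is restricted to ASCII; searching for an arbitrary code point would need decoding or a multi-unit comparison.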

> The problem with that range is when it takes a string type, it aliases this type with itself, because ElementType!R yields dchar. This is why I'm talking about "bad consequences", I just want to iterate string by _char_, not _dchar_.

If you want to iterate by char, then use foreach or use a wrapper range (or cast to ubyte[] and operate on that). Phobos specifically does not do that, because it breaks unicode. It doesn't stop you from iterating by char or wchar if you really want to, but it operates on ranges of dchar by default, because it's more correct.
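A small sketch of those options, for illustration only. The function names are invented for this example; std.string.representation is used as a cast-free way to get the ubyte view, which assumes a Phobos release recent enough to provide it.

```d
import std.string : representation;

/// Counts UTF-8 code units: asking foreach for char disables decoding.
size_t unitsViaForeach(string s)
{
    size_t n = 0;
    foreach (char c; s)
        ++n;
    return n;
}

/// Counts code points: asking foreach for dchar decodes on the fly.
size_t pointsViaForeach(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

/// Returns the same memory viewed as raw bytes, with no copy.
immutable(ubyte)[] rawBytes(string s)
{
    return s.representation;
}

unittest
{
    // 'é' is one code point but two UTF-8 code units (0xC3 0xA9).
    assert(unitsViaForeach("héllo") == 6);
    assert(pointsViaForeach("héllo") == 5);
    assert(rawBytes("hé").length == 3);
}
```

Range-based algorithms given rawBytes(s) see ubyte elements rather than dchar, which is what makes the ubyte view useful when per-char iteration is really wanted.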

- Jonathan M Davis
March 03, 2012
On 03/03/2012 08:46 PM, Jonathan M Davis wrote:
> On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
>> On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
>>> ...  but operating on
>>> code points is _far_ more correct than operating on code units. It's also
>>> more efficient.
>>> [snip.]
>>
>> No, it is less efficient.
>
> Operating on code points is more efficient than operating on graphemes is what I
> meant. I can see that I wasn't clear enough on that.
>

Makes sense.

> It's more correct than operating on code units and less correct than operating
> on graphemes, while it's less efficient than operating on code units and more
> efficient than operating on graphemes.
>
> - Jonathan M Davis

When the code actually only cares about some characters that have 7-bit ASCII values, most of the time there are no correctness issues when operating on code units directly.
March 03, 2012
On Saturday, March 03, 2012 21:05:40 Timon Gehr wrote:
> On 03/03/2012 08:46 PM, Jonathan M Davis wrote:
> > On Saturday, March 03, 2012 18:38:44 Timon Gehr wrote:
> >> On 03/03/2012 09:40 AM, Jonathan M Davis wrote:
> >>> ...  but operating on
> >>> code points is _far_ more correct than operating on code units. It's
> >>> also
> >>> more efficient.
> >>> [snip.]
> >> 
> >> No, it is less efficient.
> > 
> > Operating on code points is more efficient than operating on graphemes is what I meant. I can see that I wasn't clear enough on that.
> 
> Makes sense.
> 
> > It's more correct than operating on code units and less correct than operating on graphemes, while it's less efficient than operating on code units and more efficient than operating on graphemes.
> > 
> > - Jonathan M Davis
> 
> When the code actually only cares about some characters that have 7-bit ASCII values, most of the time there are no correctness issues when operating on code units directly.

True, but writing code without caring about unicode frequently leads to bugs when you actually _do_ have to deal with unicode (the fact that an American programmer runs into unicode less just makes it worse, because they're less likely to catch their bugs), and char is UTF-8 by definition.

So, operating specifically on ASCII is an optimization and should be coded for specifically rather than being generally encouraged. And having ranges over strings be code units rather than code points would encourage incorrect usage. The current solution encourages correct usage (or at least usage which is closer to correct, since it still isn't at the grapheme level) without disallowing more optimized code.
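To make the three levels concrete, here is a hedged sketch. It relies on std.uni.byGrapheme, which only exists in Phobos releases later than this thread, and on ranges over string decoding to dchar; TextSizes and measure are names invented for the example.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical result type: the size of a text at each level.
struct TextSizes
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

/// Measures a UTF-8 string at the code unit, code point and grapheme level.
TextSizes measure(string s)
{
    return TextSizes(
        s.length,                 // code units: array length, no decoding
        s.walkLength,             // code points: walks the decoded dchar range
        s.byGrapheme.walkLength   // graphemes: user-perceived characters
    );
}

unittest
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as one character.
    assert(measure("e\u0301") == TextSizes(3, 2, 1));
    // Pure ASCII: all three levels agree.
    assert(measure("abc") == TextSizes(3, 3, 3));
}
```

Pure ASCII input is the case where all three counts coincide, which is why code that only cares about ASCII can get away with code units.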

- Jonathan M Davis
March 03, 2012
On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote: [...]
> The current solution encourages correct usage (or at least usage which is closer to correct, since it still isn't at the grapheme level) without disallowing more optimized code.
[...]

Speaking of graphemes, is anyone interested in implementing Unicode normalization for D? I looked at the specs briefly, and it seems to be something that is straightforward to implement, albeit somewhat tedious.

It would be nice if D string types are normalized (needs slight change to string concatenation). Or at least, if there's a guaranteed normalized string type for those who care about it.
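For readers coming to this later: Phobos releases after this thread gained std.uni.normalize, so the following is a sketch of why normalization matters for comparison, not of a normalized string type (which is still hypothetical). The helper name sameText is made up for the example.

```d
import std.uni;

/// Hypothetical helper: compares two strings after bringing both to NFC.
bool sameText(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

unittest
{
    string precomposed = "\u00E9";   // single code point: e with acute
    string decomposed  = "e\u0301";  // 'e' plus a combining acute accent

    // Plain == compares code units, so the two spellings differ...
    assert(precomposed != decomposed);
    // ...but they are the same text once both are normalized.
    assert(sameText(precomposed, decomposed));
    // NFD goes the other way and splits the precomposed form.
    assert(normalize!NFD(precomposed) == decomposed);
}
```

Concatenation is the awkward part mentioned above: joining two normalized strings does not always yield a normalized result, because a combining mark at the start of the second string can compose with the end of the first.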


T

-- 
You have to expect the unexpected. -- RL
March 03, 2012
On Sat, Mar 03, 2012 at 11:53:41AM -0800, Jonathan M Davis wrote: [...]
> If you want to iterate by char, then use foreach or use a wrapper range (or cast to ubyte[] and operate on that).

Or use:

	string str = ...;
	for (size_t i=0; i < str.length; i++) {
		/* do something with str[i] */
	}


> Phobos specifically does not do that, because it breaks unicode. It doesn't stop you from iterating by char or wchar if you really want to, but it operates on ranges of dchar by default, because it's more correct.
[...]

I think this is the correct approach. Always err on the side of correct and/or safe, but give the programmer the option of getting under the hood if he wants otherwise.


T

-- 
A linguistics professor was lecturing to his class one day.
"In English," he said, "A double negative forms a positive. In some
languages, though, such as Russian, a double negative is still a
negative. However, there is no language wherein a double positive can
form a negative."
A voice from the back of the room piped up, "Yeah, yeah."
March 03, 2012
On 03/03/2012 01:42 PM, H. S. Teoh wrote:
> On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote:
> [...]
>> The current solution encourages correct usage (or at least usage which
>> is closer to correct, since it still isn't at the grapheme level)
>> without disallowing more optimized code.
> [...]
>
> Speaking of graphemes, is anyone interested in implementing Unicode
> normalization for D? I looked at the specs briefly, and it seems to be
> something that is straightforward to implement, albeit somewhat tedious.
>
> It would be nice if D string types are normalized (needs slight change
> to string concatenation). Or at least, if there's a guaranteed
> normalized string type for those who care about it.
>
>
> T
>

Denis Spir was working on solving that problem but unfortunately we haven't heard from him for almost a year now. I think this is his site:

  http://spir.wikidot.com

Ali
March 04, 2012
On Saturday, March 03, 2012 13:46:16 Ali Çehreli wrote:
> On 03/03/2012 01:42 PM, H. S. Teoh wrote:
> > On Sat, Mar 03, 2012 at 12:42:53PM -0800, Jonathan M Davis wrote: [...]
> > 
> >> The current solution encourages correct usage (or at least usage which is closer to correct, since it still isn't at the grapheme level) without disallowing more optimized code.
> > 
> > [...]
> > 
> > Speaking of graphemes, is anyone interested in implementing Unicode normalization for D? I looked at the specs briefly, and it seems to be something that is straightforward to implement, albeit somewhat tedious.
> > 
> > It would be nice if D string types are normalized (needs slight change to string concatenation). Or at least, if there's a guaranteed normalized string type for those who care about it.
> > 
> > 
> > T
> 
> Denis Spir was working on solving that problem but unfortunately we haven't heard from him for almost a year now. I think this is his site:
> 
>    http://spir.wikidot.com

The new std.regex has some stuff which was added to enhance its unicode support. It's currently completely internal to std.regex, and it may end up being the basis for more, but AFAIK Dmitry hasn't yet worked on creating a version of it for more general consumption. I'm not quite sure what he did, though, since I'm not familiar with std.regex.

- Jonathan M Davis