March 12, 2011 Ranges | ||||
---|---|---|---|---|
| ||||
Hi,

I'm working a bit with ranges atm. but there are definitely some things that are not clear to me yet. Can anyone tell me why the char arrays cannot be copied but the int arrays can?

import std.stdio;
import std.algorithm;

void main(string[] args) {

    // This works
    int[] a1 = [1,2,3,4];
    int[] a2 = [5,6,7,8];
    copy(a1, a2);

    // This does not!
    char[] a3 = ['1','2','3','4'];
    char[] a4 = ['5','6','7','8'];
    copy(a3, a4);

}

Error message:

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) does not match any function template declaration

test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1))) cannot deduce template function from argument types !()(char[],char[])

Thanks,
Jonas |
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonas Drewsen | On Saturday 12 March 2011 14:02:00 Jonas Drewsen wrote:
> Hi,
>
> I'm working a bit with ranges atm. but there are definitely some
> things that are not clear to me yet. Can anyone tell me why the char
> arrays cannot be copied but the int arrays can?
>
> import std.stdio;
> import std.algorithm;
>
> void main(string[] args) {
>
> // This works
> int[] a1 = [1,2,3,4];
> int[] a2 = [5,6,7,8];
> copy(a1, a2);
>
> // This does not!
> char[] a3 = ['1','2','3','4'];
> char[] a4 = ['5','6','7','8'];
> copy(a3, a4);
>
> }
>
> Error message:
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> does not match any function template declaration
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> cannot deduce template function from argument types !()(char[],char[])
Character arrays / strings are not exactly normal. And there's a very good reason for it: Unicode.
In Unicode, a character is generally a single code point (there are also graphemes, which involve combining code points to add accents and superscripts and whatnot to create a single character, but we'll ignore that in this discussion - it's complicated enough as it is). Depending on the encoding, that code point may be made up of one - or more - code units. UTF-8 uses 8-bit code units. UTF-16 uses 16-bit code units. And UTF-32 uses 32-bit code units. char is a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is a UTF-32 code unit. UTF-32 is the _only_ one of those three which _always_ has one code unit per code point.
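To make the code unit counts concrete, here is a minimal sketch (not part of the original post). The helper name codePoints is made up for illustration; walkLength is the Phobos function that counts elements by iterating, which for a string means counting code points.

```d
import std.range : walkLength;

/// Hypothetical helper: number of code points in a UTF-8 string.
size_t codePoints(string s)
{
    // walkLength iterates the string as a range of dchar.
    return s.walkLength;
}

void main()
{
    // U+00E9 (e with acute accent), written as an escape.
    string  s8  = "\u00E9";
    wstring s16 = "\u00E9"w;
    dstring s32 = "\u00E9"d;

    assert(s8.length == 2);   // two UTF-8 code units
    assert(s16.length == 1);  // one UTF-16 code unit
    assert(s32.length == 1);  // one UTF-32 code unit
    assert(codePoints(s8) == 1);

    // U+1F600 lies outside the Basic Multilingual Plane.
    assert("\U0001F600".length == 4);   // four UTF-8 code units
    assert("\U0001F600"w.length == 2);  // a UTF-16 surrogate pair
    assert("\U0001F600"d.length == 1);
}
```

Note that .length on an array always counts code units, never code points.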
With an array of integers you can index it and slice it and be sure that everything that you're doing is valid. If you look at a single element, you know that it's a valid int. If you slice it, you know that every int in there is valid. If you're dealing with a dstring or dchar[], then the same still holds.
A dstring or dchar[] is an array of UTF-32 code units. Every code point is a single code unit, so every element in the array is a valid code point. You can take an arbitrary element in that array and know that it's a valid code point. You can slice it wherever you want and you still have a valid dstring or dchar[]. The same does _not_ hold for char[] and wchar[].
char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In both of those encodings, multiple code units may be required to create a single code point. So, for instance, a code point in UTF-8 could have 4 code units. That means that _4_ elements of that char[] make up a _single_ code point. You'd need _all_ 4 of those elements to create a single, valid character. So, you _can't_ just take an arbitrary element in a char[] or wchar[] and expect it to be valid. You _can't_ just slice it anywhere. The resulting array stands a good chance of being invalid. You have to slice on code point boundaries - otherwise you could slice characters in half and end up with an invalid string. So, unlike other arrays, it just doesn't work to treat char[] and wchar[] as random access ranges of their element type. What the programmer cares about is characters - dchars - not chars or wchars.
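A small sketch of the slicing problem, added for illustration and not taken from the original post. The helper isValidUtf8 is a made-up name; stride and validate are existing functions in std.utf.

```d
import std.utf : stride, validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 'a', then U+00E9 (two UTF-8 code units), then 'b': four code units.
    string s = "a\u00E9b";
    assert(s.length == 4);

    // stride reports how many code units the code point at an index occupies.
    assert(stride(s, 0) == 1);
    assert(stride(s, 1) == 2);

    // Slicing on a code point boundary keeps the string valid.
    assert(isValidUtf8(s[0 .. 3]));

    // Slicing through the middle of U+00E9 does not.
    assert(!isValidUtf8(s[0 .. 2]));
}
```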
So, the way this is handled is that char[], wchar[], and dchar[] are all treated as ranges of dchar. In the case of dchar[], this is nothing special. You can index it and slice it as normal. So, it is a random access range. However, in the case of char[] and wchar[], that means that when you're iterating over them, you're not dealing with a single element of the array at a time. front returns a dchar, and popFront() pops off however many elements made up front. It's like with foreach. If you iterate a char[] with auto or char, then each individual element is given:
foreach(c; myStr) {}
But if you iterate over with dchar, then each code point is given as a dchar:
foreach(dchar c; myStr) {}
If you were to try and iterate over a char[] by char, then you would be looking at code units rather than code points which is _rarely_ what you want. If you're dealing with anything other than pure ASCII, you _will_ have bugs if you do that. You're supposed to use dchar with foreach and character arrays. That way, each value you process is a valid character. Ranges do the same, only you don't give them an iteration type, so they're _always_ iterating over dchar.
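The difference between the two iteration styles can be sketched as follows. This is an illustrative example, not code from the original post, and codePointsOf is a hypothetical helper name.

```d
import std.range : ElementType, empty, front, popFront;

/// Hypothetical helper: collect the code points of `s` via the range primitives.
dchar[] codePointsOf(const(char)[] s)
{
    dchar[] result;
    while (!s.empty)
    {
        result ~= s.front;  // front decodes one code point
        s.popFront();       // popFront skips all code units of that code point
    }
    return result;
}

void main()
{
    // As a range, char[] has dchar elements.
    static assert(is(ElementType!(char[]) == dchar));

    string s = "a\u00E9b";
    assert(codePointsOf(s) == "a\u00E9b"d);

    // foreach with an inferred type visits code units.
    size_t units;
    foreach (c; s)
        ++units;
    assert(units == 4);

    // foreach with dchar visits code points.
    size_t points;
    foreach (dchar c; s)
        ++points;
    assert(points == 3);
}
```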
So, when you're using a range of char[] or wchar[], you're really using a range of dchar. These ranges are bi-directional. They can't be sliced, and they can't be indexed (since doing so would likely be invalid). This generally works very well. It's exactly what you want in most cases. The problem is that that means that the range that you're iterating over is effectively of a different type than the original char[] or wchar[].
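These range categories can be checked at compile time with the traits in std.range. The sketch below is an added illustration; lastCodePoint is a hypothetical helper name.

```d
import std.range : back, hasLength, hasSlicing, isBidirectionalRange,
                   isInputRange, isRandomAccessRange;

// char[] as a range: bidirectional, but no indexing, slicing, or length.
static assert(isInputRange!(char[]));
static assert(isBidirectionalRange!(char[]));
static assert(!isRandomAccessRange!(char[]));
static assert(!hasSlicing!(char[]));
static assert(!hasLength!(char[]));

// dchar[] keeps the full random access interface.
static assert(isRandomAccessRange!(dchar[]));
static assert(hasSlicing!(dchar[]));

/// Hypothetical helper: last code point of a non-empty UTF-8 string.
dchar lastCodePoint(const(char)[] s)
{
    return s.back;  // back decodes backwards from the end
}

void main()
{
    assert(lastCodePoint("a\u00E9") == 0xE9);
}
```

The array itself can still be indexed and sliced with the built-in operators; it is only the range interface that hides them.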
You can't just take two ranges of dchar of the same length and necessarily have them fit in the same char[] or wchar[]. They have the same length, because they have the same number of code points. However, they could have a different number of code _units_, so the lengths of the actual arrays could differ. So, you can't just take an arbitrary dchar range and copy it to another arbitrary dchar range.
The way that this is dealt with in the case of a function like copy is that what you're copying _to_ must be an output range. char[] and wchar[] are _not_ output ranges, because of their differing number of code units per code point. So, they don't work with copy. You need to use a dchar[] as the output range if you want to use strings with copy.
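A minimal sketch of the dchar[] suggestion, using the arrays from the original question. It is an added illustration, and copyToDchars is a hypothetical helper name. The post describes the Phobos of the time; whether char[] itself is accepted as a target may differ between releases, so the sketch only relies on a dchar[] target.

```d
import std.algorithm : copy;

/// Hypothetical helper: copy `src` into a freshly allocated dchar[].
dchar[] copyToDchars(const(char)[] src)
{
    // One dchar per code unit is always enough room, since a code point
    // never takes fewer than one code unit.
    auto buf = new dchar[](src.length);
    auto rest = copy(src, buf);        // copy returns the unfilled part of buf
    return buf[0 .. $ - rest.length];
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    assert(copyToDchars(a3) == "1234"d);
}
```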
Now, in some cases, it might be possible to special case some of the range functions to treat char[] and wchar[] as arrays instead of ranges (in the case of copy, that's probably possible if both arguments are of the same type), but that can't be done in the general case. You could open an enhancement request for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are of the same type.
- Jonathan M Davis
|
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonas Drewsen | On 3/12/2011 2:02 PM, Jonas Drewsen wrote:
> Error message:
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> does not match any function template declaration
>
> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> cannot deduce template function from argument types !()(char[],char[])
I haven't checked (could be completely off here), but I don't think that char[] counts as an input range; you would normally want to use dchar instead.
|
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bekenn | Or, better yet, just read Jonathan's post. |
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
On Saturday 12 March 2011 16:05:37 Jonathan M Davis wrote:
> You could open an
> enhancement request for copy to treat char[] and wchar[] as arrays if
> _both_ of the arguments are of the same type.
Actually, on reflection, I'd have to say that there's not much point to that. If you really want to copy one array to another (rather than a range), just use the array copy syntax:
void main()
{
    auto i = [1, 2, 3, 4];
    auto j = [3, 4, 5, 6];
    assert(i == [1, 2, 3, 4]);
    assert(j == [3, 4, 5, 6]);
    i[] = j[];
    assert(i == [3, 4, 5, 6]);
    assert(j == [3, 4, 5, 6]);
}
copy is of benefit, because it works on generic ranges, not for copying arrays (arrays already allow you to do that quite nicely), so if all you're looking at copying is arrays, then just use the array copy syntax.
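Applied to the char arrays from the original question, the same syntax might look like the sketch below. It is an added illustration; copyUnits is a hypothetical helper name.

```d
/// Hypothetical helper: copy the code units of `src` into the start of `dst`.
/// `dst` must be at least as long as `src`.
void copyUnits(const(char)[] src, char[] dst)
{
    dst[0 .. src.length] = src[];
}

void main()
{
    char[] a3 = ['1', '2', '3', '4'];
    char[] a4 = ['5', '6', '7', '8'];

    // Element-wise copy of code units; both slices must have the same length.
    a4[] = a3[];
    assert(a4 == "1234");

    // Copying into part of a larger buffer: slice the destination first.
    auto buf = new char[](8);
    copyUnits(a3, buf);
    assert(buf[0 .. 4] == "1234");
}
```

This copies raw code units, so the result is valid UTF-8 only if the whole source string is copied or the copy ends on a code point boundary.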
- Jonathan M Davis
|
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bekenn | On Saturday 12 March 2011 16:11:20 Bekenn wrote:
> On 3/12/2011 2:02 PM, Jonas Drewsen wrote:
> > Error message:
> >
> > test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> > (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> > does not match any function template declaration
> >
> > test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
> > (isInputRange!(Range1) && isOutputRange!(Range2,ElementType!(Range1)))
> > cannot deduce template function from argument types !()(char[],char[])
>
> I haven't checked (could be completely off here), but I don't think that char[] counts as an input range; you would normally want to use dchar instead.
char[] _does_ count as an input range (of dchar). It just doesn't count as an _output_ range (since it doesn't really hold dchar).
- Jonathan M Davis
|
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
What Jonathan said really needs to be put up on the D website, maybe under the articles section. Heck, I'd just put a link to that recent UTF thread on the website, it's really informative (the one on UTF and meaning of glyphs, etc). And UTF will only get more important, just like multicore. Speaking of which, a description on ranges should be put up there as well. There's that article Andrei once wrote, but we should put it on the D site and discuss D's implementation of ranges in more detail. And by 'we' I mean someone who's well versed in ranges. :p |
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | Hi Jonathan,
Thank you very much for your in-depth answer!
It should indeed go into a FAQ somewhere, I think. I did know about the code point/unit stuff but had no idea that ranges of char are handled using dchar internally. This makes sense but is an easy pitfall for newcomers trying to use std.{algorithm,array,range} for char[].
Thanks
Jonas
On 13/03/11 01.05, Jonathan M Davis wrote:
> On Saturday 12 March 2011 14:02:00 Jonas Drewsen wrote:
>> Hi,
>>
>> I'm working a bit with ranges atm. but there are definitely some
>> things that are not clear to me yet. Can anyone tell me why the char
>> arrays cannot be copied but the int arrays can?
>>
>> import std.stdio;
>> import std.algorithm;
>>
>> void main(string[] args) {
>>
>> // This works
>> int[] a1 = [1,2,3,4];
>> int[] a2 = [5,6,7,8];
>> copy(a1, a2);
>>
>> // This does not!
>> char[] a3 = ['1','2','3','4'];
>> char[] a4 = ['5','6','7','8'];
>> copy(a3, a4);
>>
>> }
>>
>> Error message:
>>
>> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
>> (isInputRange!(Range1)&& isOutputRange!(Range2,ElementType!(Range1)))
>> does not match any function template declaration
>>
>> test2.d(13): Error: template std.algorithm.copy(Range1,Range2) if
>> (isInputRange!(Range1)&& isOutputRange!(Range2,ElementType!(Range1)))
>> cannot deduce template function from argument types !()(char[],char[])
>
> Character arrays / strings are not exactly normal. And there's a very good
> reason for it: unicode.
>
> In unicode, a character is generally a single code point (there are also
> graphemes which involve combining code points to add accents and superscripts
> and whatnot to create a single character, but we'll ignore that in this
> discussion - it's complicated enough as it is). Depending on the encoding, that
> code point may be made up of one - or more - code units. UTF-8 uses 8 bit code
> units. UTF-16 uses 16 bit code units. And UTF-32 uses 32-bit code units. char is
> a UTF-8 code unit. wchar is a UTF-16 code unit. dchar is a UTF-32 code unit.
> UTF-32 is the _only_ one of those three which _always_ has one code unit per
> code point.
>
> With an array of integers you can index it and slice it and be sure that
> everything that you're doing is valid. If you look at a single element, you know
> that it's a valid int. If you slice it, you know that every int in there is
> valid. If you're dealing with a dstring or dchar[], then the same still holds.
>
> A dstring or dchar[] is an array of UTF-32 code units. Every code point is a
> single code unit, so every element in the array is a valid code point. You can
> take an arbitrary element in that array and know that it's a valid code point.
> You can slice it wherever you want and you still have a valid dstring or
> dchar[]. The same does _not_ hold for char[] and wchar[].
>
> char[] and wchar[] are arrays of UTF-8 and UTF-16 code units respectively. In
> both of those encodings, multiple code units are required to create a single
> code point. So, for instance, a code point could have 4 code units. That means
> that _4_ elements of that char[] make up a _single_ code point. You'd need _all_
> 4 of those elements to create a single, valid character. So, you _can't_ just
> take an arbitrary element in a char[] or wchar[] and expect it to be valid. You
> _can't_ just slice it anywhere. The resulting array stands a good chance of
> being invalid. You have to slice on code point boundaries - otherwise you could
> slice characters in half and end up with an invalid string. So, unlike other
> arrays, it just doesn't work to treat char[] and wchar[] as random access ranges
> of their element type. What the programmer cares about is characters - dchars -
> not chars or wchars.
>
> So, the way this is handled is that char[], wchar[], and dchar[] are all treated
> as ranges of dchar. In the case of dchar[], this is nothing special. You can
> index it and slice it as normal. So, it is a random access range. However, in
> the case of char[] and wchar[], that means that when you're iterating over them
> that you're not dealing with a single element of the array at a time. front
> returns a dchar, and popFront() pops off however many elements made up front.
> It's like with foreach. If you iterate a char[] with auto or char, then each
> individual element is given
>
> foreach(c; myStr) {}
>
> But if you iterate over with dchar, then each code point is given as a dchar:
>
> foreach(dchar c; myStr) {}
>
> If you were to try and iterate over a char[] by char, then you would be looking
> at code units rather than code points which is _rarely_ what you want. If you're
> dealing with anything other than pure ASCII, you _will_ have bugs if you do
> that. You're supposed to use dchar with foreach and character arrays. That way,
> each value you process is a valid character. Ranges do the same, only you don't
> give them an iteration type, so they're _always_ iterating over dchar.
>
> So, when you're using a range of char[] or wchar[], you're really using a range
> of dchar. These ranges are bi-directional. They can't be sliced, and they can't
> be indexed (since doing so would likely be invalid). This generally works very
> well. It's exactly what you want in most cases. The problem is that that means
> that the range that you're iterating over is effectively of a different type than
> the original char[] or wchar[].
>
> You can't just take two ranges of dchar of the same length and necessarily have
> them fit in the same char[] or wchar[]. They have the same length, because they
> have the same number of code points. However, they could have a different number
> of code _units_, so the lengths of the actual arrays could differ. So, you can't
> just take an arbitrary dchar range and copy it to another arbitrary dchar range.
>
> The way that this is dealt with in the case of a function like copy is that what
> you're copying _to_ must be an output range. char[] and wchar[] are _not_ output
> ranges, because of their differing number of code units per code point. So, they
> don't work with copy. You need to use a dchar[] as the output range if you want
> to use strings with copy.
>
> Now, in some cases, it might be possible to special case some of the range
> functions to treat char[] and wchar[] as arrays instead of ranges (in the case
> of copy, that's probably possible if both arguments are of the same type), but
> that can't be done in the general case. You could open an enhancement request
> for copy to treat char[] and wchar[] as arrays if _both_ of the arguments are of
> the same type.
>
> - Jonathan M Davis
|
March 13, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
On 03/13/2011 01:05 AM, Jonathan M Davis wrote:
> If you were to try and iterate over a char[] by char, then you would be looking
> at code units rather than code points which is _rarely_ what you want. If you're
> dealing with anything other than pure ASCII, you _will_ have bugs if you do
> that. You're supposed to use dchar with foreach and character arrays. That way,
> each value you process is a valid character. Ranges do the same, only you don't
> give them an iteration type, so they're _always_ iterating over dchar.
Side-note: you can be sure the source is pure ASCII if, and only if, it is mechanically produced. (As soon as an end-user touches it, it may hold anything, since OSes and apps offer users means to introduce characters which are not on their keyboards.) This can also easily be checked in UTF-8 (which has been designed for that): all ASCII chars are coded using the same code as in ASCII, thus all codes should be < 128.
Denis
--
_________________
vita es estrany
spir.wikidot.com
|
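The check described in this post can be sketched in a few lines. This is an added illustration, and isPureAscii is a hypothetical helper name.

```d
/// Hypothetical helper: true if every UTF-8 code unit in `s` is below 128.
bool isPureAscii(const(char)[] s)
{
    // Iterating with an explicit char type visits code units, not code points.
    foreach (char c; s)
    {
        if (c >= 0x80)
            return false;
    }
    return true;
}

void main()
{
    assert(isPureAscii("1234"));
    assert(!isPureAscii("caf\u00E9"));
    assert(isPureAscii(""));
}
```

This works on code units directly because every code unit of a multi-byte UTF-8 sequence has its high bit set.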
March 18, 2011 Re: Ranges | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jonathan M Davis | On 13/03/11 12:05 AM, Jonathan M Davis wrote:
> So, when you're using a range of char[] or wchar[], you're really using a range
> of dchar. These ranges are bi-directional. They can't be sliced, and they can't
> be indexed (since doing so would likely be invalid). This generally works very
> well. It's exactly what you want in most cases. The problem is that that means
> that the range that you're iterating over is effectively of a different type than
> the original char[] or wchar[].
This has to be the worst language design decision /ever/.
You can't just mess around with fundamental principles like "the first element in an array of T has type T" for the sake of a minor convenience. How are we supposed to do generic programming if common sense reasoning about types doesn't hold?
This is just std::vector<bool> from C++ all over again. Can we not learn from mistakes of the past?
|
Copyright © 1999-2021 by the D Language Foundation