May 10, 2008
On 10/05/2008, Yigal Chripun <yigal100@gmail.com> wrote:
>  suppose the bit layout of a is illegal utf-32 encoding. would you prefer
>  D allowed storing such an illegal value in a dchar?

I have to say yes.

It's a question of levels. Higher level code should never store invalid UTF-32 in dchars or dstrings. But lower level code must be able to work with them.

For example, I wrote std.encoding. It has a function
isValidCodePoint() which takes a dchar and tells you whether or not it
contains a valid value. It also has a function, sanitize(), which
takes possibly invalid UTF as input, and emits guaranteed valid UTF as
output. It has a function, safeDecode(), which takes possibly invalid
UTF as input, removes the first UTF sequence, regardless of whether
valid or malformed, and returns either the decoded character or the
constant INVALID_SEQUENCE, which is (cast(dchar)(-1)).
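
To make that concrete, here is a minimal sketch of those functions in use. It assumes a present-day Phobos in which std.encoding still exposes these names with these signatures; the test string is made up for illustration.

```d
import std.encoding : INVALID_SEQUENCE, isValid, isValidCodePoint,
                      safeDecode, sanitize;

void main()
{
    // A dchar is 32 bits wide, so it can physically hold values that
    // are not Unicode code points at all.
    dchar euro = '\u20AC';
    dchar bogus = cast(dchar) 0x110000; // one past the last code point
    assert(isValidCodePoint(euro));
    assert(!isValidCodePoint(bogus));

    // 0x82 is a UTF-8 continuation byte with no lead byte before it.
    string broken = "a\x82z";
    assert(!isValid(broken));

    // sanitize() returns a well-formed copy of its input.
    string repaired = sanitize(broken);
    assert(isValid(repaired));

    // safeDecode() consumes one sequence per call, malformed or not.
    string rest = broken;
    assert(safeDecode(rest) == 'a');
    assert(safeDecode(rest) == INVALID_SEQUENCE);
    assert(safeDecode(rest) == 'z');
    assert(rest.length == 0);
}
```

None of this could be written if a dchar or a string were forbidden from ever holding an invalid value.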

The lowest level code has to be allowed not merely to get /close/ to the metal, but to actually turn the nuts and bolts. That's what it means to be a systems programming language. Without that ability, no low level code could ever be written without resorting to assembler.

So yes,

    int n = anything;
    dchar c = cast!(dchar)n;

must always succeed. The exclamation mark means "I know what I'm doing", which is exactly why it should be used with caution.


>  IMO, a strongly typed language (like D) must enforce at all times that
>  its variables are valid. I do not want D to allow storing illegal values
>  like that. that must be an error.

Consider this:

    string s = "\u20AC"; /* s contains exactly one Unicode character */
    string t = s[1..2];

Do you want to ban slicing?

Do you want slicing always to invoke a call to std.encoding.isValid(), just to make sure the slice is valid? If so, you must see that std.encoding itself needs to be allowed to do low-level stuff.

Higher level code is ultimately written in terms of lower level code, so you can't ban the lower level code.
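
A short sketch of the slicing case, using std.utf.validate as the checker. The helper name isWellFormed is made up for this example; validate() and UTFException are the only library names assumed.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if s is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "\u20AC";      // one character, three UTF-8 code units
    assert(s.length == 3);
    assert(isWellFormed(s));

    string t = s[1 .. 2];     // slicing works on code units, unchecked
    assert(t.length == 1);
    assert(!isWellFormed(t)); // a lone continuation byte
}
```

The slice itself never fails. Only the explicit validation call notices the problem, and that call has to be able to read the malformed data in order to report it.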

However, I would be more than happy with:

    int n;
    dchar c = cast(dchar)n; /* may throw */
    dchar d = cast!(dchar)n; /* always succeeds */
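
Note that cast!(...) is only the syntax being proposed in this thread. In D as it stands, cast(dchar)n never checks anything. The two behaviours can be approximated with ordinary functions, though. This is a rough sketch under that assumption; both function names are hypothetical.

```d
import std.encoding : isValidCodePoint;
import std.exception : assertThrown, enforce;

/// Hypothetical checked conversion: throws unless n is a valid code point.
dchar toDcharChecked(int n)
{
    auto c = cast(dchar) n;
    enforce(isValidCodePoint(c), "not a Unicode code point");
    return c;
}

/// Hypothetical unchecked conversion: keeps whatever bits n holds.
dchar toDcharUnchecked(int n)
{
    return cast(dchar) n;
}

void main()
{
    assert(toDcharChecked(0x20AC) == '\u20AC');
    assertThrown(toDcharChecked(0xD800)); // surrogate, not a code point
    assertThrown(toDcharChecked(-1));
    assert(toDcharUnchecked(-1) == cast(dchar) 0xFFFF_FFFF);
}
```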
May 10, 2008
Janice Caron Wrote:

> Consider this:
> 
>     string s = "\u20AC"; /* s contains exactly one Unicode character */
>     string t = s[1..2];

This makes string an array of bytes rather than chars.
May 10, 2008
On 10/05/2008, terranium <spam@here.lot> wrote:
> Janice Caron Wrote:
>
>  > Consider this:
>  >
>  >     string s = "\u20AC"; /* s contains exactly one Unicode character */
>  >     string t = s[1..2];
>
> This makes string an array of bytes rather than chars.

You are incorrect. Indisputably, typeof(t) == invariant(char)[]. It is an array of invariant chars - that is, an array of invariant UTF-8 code units. Each code unit is individually valid, but the complete string consists of two malformed code unit sequences, each of which is an isolated continuation byte.

You are also missing the point. This thread is about casting, not Unicode. If you want to talk Unicode, I'm happy to do so, but please let's take that to another thread. I only brought up slicing as an example of why low level stuff must be permitted, and in specific response to a point made by Yigal.
May 10, 2008
On 10/05/2008, Janice Caron <caron800@googlemail.com> wrote:
> You are incorrect. Indisputably, typeof(t) == invariant(char)[]. It is
>  an array of invariant chars - that is, an array of invariant UTF-8
>  code units. Each code unit is individually valid, but the complete
>  string consists of two malformed code unit sequences, each of which is
>  an isolated continuation byte.

My apologies. The complete string consists of /one/ malformed code unit sequence, consisting of one isolated continuation byte, not two. For some reason I misread s[1..2] as s[1..3] (which is pretty dumb really, seeing as I wrote it). :-)


>  If you want to talk Unicode, I'm happy to do so, but please
>  let's take that to another thread.

This still stands.
May 10, 2008
IMHO, your reply makes perfect sense for C/C++ but not for D.
specifically because D has other facilities to handle those cases.
a dchar (or [w]char) _must_ always contain valid data. if you need to
store other encodings you can use ubyte instead which does not limit you
to a specific bit pattern (this is why D has it in the first place...)
the above example of slices can be easily dealt with since (unlike in
C/C++) D arrays know their length. this is similar to the fact that D
checks bounds on arrays and throws on error (in debug mode) whereas
C/C++ does not. IMO, the D implementation itself (both the compiler and
the runtime) needs to make sure chars are always valid. this should not
be something optional added via a library.

I agree with your notion of levels, I just think D provides much better facilities for low-level coding compared to using unsafe C/C++ conventions.

    int n;
    dchar c = cast(dchar)n;
    dchar d = cast!(dchar)n;

in the above code, the second one should be used and it might throw. the first simply does not make any sense and should produce a compiler error because you cannot convert an int value to a dchar (unless it's a one digit int)

<off topic rant>
What worries me most about D is the fact that D becomes an extension to C++.
The whole idea behind D was to create a new language without all the
baggage and backward compatibility issues of C++.
I don't want a slightly more readable version of C++ since I'll get that
with C++0x.
c++ programmers want D to have a D "stl" and a D boost. that's wrong!
STL is badly designed and employs massive amounts of black magic that
ordinary people do not understand. (I suffer at work while writing in
C++). in what world does it make sense to mark an abstract method with
"=0;" at the end, especially when the method is horizontally long and
that gets to be off screen!
D should be written with a D mindset which should be the best
ingredients extracted from all those languages D got its influences
from: java, C#, python, ruby, c/c++, etc. Tango is a good example of
designing such a new D mindset, IMO. Phobos is not, since it's merely C
code written with D syntax, with all that shiny new code Andrei added
which is C++ code written with D syntax. I appreciate his great
expertise in C++, but I already can use C++ libraries in C++ without
learning a new language. D needs to be better. *much* better.
</rant>

--Yigal
May 10, 2008
Yigal Chripun Wrote:

> <off topic rant>
> What worries me most about D is the fact that D becomes an extension to C++.
> The whole idea behind D was to create a new language without all the
> baggage and backward compatibility issues of C++.
> I don't want a slightly more readable version of C++ since I'll get that
> with C++0x.
> c++ programmers want D to have a D "stl" and a D boost. that's wrong!
> STL is badly designed and employs massive amounts of black magic that
> ordinary people do not understand. (I suffer at work while writing in
> C++). in what world does it make sense to mark an abstract method with
> "=0;" at the end, especially when the method is horizontally long and
> that gets to be off screen!
> D should be written with a D mindset which should be the best
> ingredients extracted from all those languages D got its influences
> from: java, C#, python, ruby, c/c++, etc. Tango is a good example of
> designing such a new D mindset, IMO. Phobos is not, since it's merely C
> code written with D syntax, with all that shiny new code Andrei added
> which is C++ code written with D syntax. I appreciate his great
> expertise in C++, but I already can use C++ libraries in C++ without
> learning a new language. D needs to be better. *much* better.
> </rant>

I want to understand what you said because it can change my decision to use D1 or D2 or not at all.

From what I read in Phobos, some good examples of an interesting D mindset that are hard to do in C++ or other languages are std.typecons and std.algorithm. There are parts of Phobos that look very ugly, for example the streams.

I have used STL and it is useful to me. If that is true for many people, then why should D not take the good parts of it? If you know of bad designs in STL, they could be avoided in D. Also, STL has a lot of black magic, but that is because of C++ imperfections. The definition of STL is mathematically very clean.

Why do you mention the =0 syntax? D does not have it. And I do not think this problem is related to STL.

What are good examples that show Tango is a good example of designing with a new D mindset? Thank you. Dee Girl

May 10, 2008
Yigal Chripun wrote:
> IMHO, your reply makes perfect sense for C/C++ but not for D.
> specifically because D has other facilities to handle those cases.
> a dchar (or [w]char) _must_ always contain valid data. if you need to
> store other encodings you can use ubyte instead which does not limit you
> to a specific bit pattern (this is why D has it in the first place...)
> the above example of slices can be easily dealt with since (unlike in
> C/C++) D arrays know their length. this is similar to the fact that D
> checks bounds on arrays and throws on error (in debug mode) whereas
> C/C++ does not. IMO, the D implementation itself (both the compiler and
> the runtime) needs to make sure chars are always valid. this should not
> be something optional added via a library.
> 
> I agree with your notion of levels, I just think D provides much better facilities for low-level coding compared to using unsafe C/C++ conventions.
> 
>     int n;
>     dchar c = cast(dchar)n;
>     dchar d = cast!(dchar)n;
> 
> in the above code, the second one should be used and it might throw. the first simply does not make any sense and should produce a compiler error because you cannot convert an int value to a dchar (unless it's a one digit int)
> 
> <off topic rant>
> What worries me most about D is the fact that D becomes an extension to C++.
> The whole idea behind D was to create a new language without all the
> baggage and backward compatibility issues of C++.
> I don't want a slightly more readable version of C++ since I'll get that
> with C++0x.
> c++ programmers want D to have a D "stl" and a D boost. that's wrong!
> STL is badly designed and employs massive amounts of black magic that
> ordinary people do not understand. (I suffer at work while writing in
> C++). in what world does it make sense to mark an abstract method with
> "=0;" at the end, especially when the method is horizontally long and
> that gets to be off screen!
> D should be written with a D mindset which should be the best
> ingredients extracted from all those languages D got its influences
> from: java, C#, python, ruby, c/c++, etc. Tango is a good example of
> designing such a new D mindset, IMO. Phobos is not, since it's merely C
> code written with D syntax, with all that shiny new code Andrei added
> which is C++ code written with D syntax. I appreciate his great
> expertise in C++, but I already can use C++ libraries in C++ without
> learning a new language. D needs to be better. *much* better.
> </rant>
> 
> --Yigal

I think I misread your example so I want to clarify:
chars should contain only valid utf code points and not any other
bit-pattern. since code-points need to be ordered in specific ways it
makes sense that the D standard library would provide methods that
validate and/or fix utf strings.
However, any other encoding must use ubyte arrays instead.

What if:

   int num = ...;
   dchar ch = cast(dchar)num;
   dchar ch1 = cast!(dchar)num;

ch would contain the bit pattern of the num-th code-point in the utf
standard (throwing for numbers not in the utf encoding table)
and the second cast would operate on the bit level (like a
reinterpret_cast) and throw if the resulting dchar bit pattern is not valid.

also, I'm leaning towards reporting all cast run time errors with exceptions (it's more consistent, since you cannot return null for primitives). no need for that special return null case. (if D had attributes, i would have suggested making a suppress-exceptions attribute for that purpose)

--Yigal
May 10, 2008
== Quote from Dee Girl (deegirl@noreply.com)'s article
> What are good examples that show Tango is a good example of designing with a new D mindset?

I'm somewhat biased since I created the module, but I'd consider tango.core.Array
to be a good example of a D-oriented mindset.  It's an array-specific algorithm
module intended to leverage the D slice syntax for range specification.  For example:

    import tango.core.Array;
    import tango.stdc.stdio;

    void main()
    {
        int[] buf = [1,6,2,5,9,2,3,2,4].dup;

        // calls Array.sort with optional predicate
        buf[0 .. 4].sort( (int x, int y) { return x < y; } );
        assert( buf[0 .. 4] == [1,2,5,6]);
        buf.sort(); // full sort of buf with default predicate
        // below is equivalent to equal_range in C++
        printf( "there are %d 2s in buf\n",
                     buf[buf.lbound(2) .. buf.ubound(2)].length );
        // more fun stuff
        printf( "there are %d 5s between index 2 and 6\n",
                    buf[2 .. 6].count( 5 ) );
    }

etc. (I'm using printf for the sake of illustration, not because I suggest you actually use it in your app)
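
For anyone wondering where the method-style calls come from: D lets a free function whose first parameter is an array be called as if it were a member of that array, so no compiler change is needed. Here is a minimal sketch of the idea. The function countOf is a made-up stand-in for this post, not the actual Tango source.

```d
/// Hypothetical helper: counts the elements of haystack equal to needle.
size_t countOf(const(int)[] haystack, int needle)
{
    size_t hits = 0;
    foreach (x; haystack)
    {
        if (x == needle)
            ++hits;
    }
    return hits;
}

void main()
{
    int[] buf = [1, 6, 2, 5, 9, 2, 3, 2, 4];

    assert(countOf(buf, 2) == 3);        // ordinary call
    assert(buf.countOf(2) == 3);         // same call, method style
    assert(buf[2 .. 6].countOf(5) == 1); // a slice works the same way
}
```

Because a slice is just another array, every such function gets subrange support for free.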


Sean
May 10, 2008
Sean Kelly Wrote:

> == Quote from Dee Girl (deegirl@noreply.com)'s article
> > What are good examples that show Tango is a good example of designing with a new D mindset?
> 
> I'm somewhat biased since I created the module, but I'd consider tango.core.Array
> to be a good example of a D-oriented mindset.  It's an array-specific algorithm
> module intended to leverage the D slice syntax for range specification.  For example:
> 
>     import tango.core.Array;
>     import tango.stdc.stdio;
> 
>     void main()
>     {
>         int[] buf = [1,6,2,5,9,2,3,2,4].dup;
> 
>         // calls Array.sort with optional predicate
>         buf[0 .. 4].sort( (int x, int y) { return x < y; } );
>         assert( buf[0 .. 4] == [1,2,5,6]);
>         buf.sort(); // full sort of buf with default predicate
>         // below is equivalent to equal_range in C++
>         printf( "there are %d 2s in buf\n",
>                      buf[buf.lbound(2) .. buf.ubound(2)].length );
>         // more fun stuff
>         printf( "there are %d 5s between index 2 and 6\n",
>                     buf[2 .. 6].count( 5 ) );
>     }
> 
> etc. (I'm using printf for the sake of illustration, not because I suggest you actually use it in your app)
> 
> 
> Sean

Nice example! How did you do it? Did Tango change the compiler and add more methods to arrays? Thank you, Dee Girl
May 10, 2008
On 10/05/2008, Yigal Chripun <yigal100@gmail.com> wrote:
>  a dchar (or [w]char) _must_ always contain valid data.

That has never been the case, ever. If you think that D should change, so as always to enforce Permanently Valid Unicode, then start a new thread and make a proposal. We'll discuss.