October 18, 2012
On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:
>
> Yeah, that makes sense too. I'll try to toy around on my end and see if I can write a "hex".

That was actually relatively easy!

Here are some use cases:

//----
void main()
{
    enum a = hex!"01 ff 7f";
    enum b = hex!0x01_ff_7f;
    ubyte[] c = hex!"0123456789abcdef";
    immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
    immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;

    a.writeln();
    b.writeln();
    c.writeln();
    bearophile1.writeln();
    bearophile2.writeln();
}
//----

And corresponding output:

//----
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//----

hex! was a very good idea actually, imo. I'll post my current impl in the next post.

That said, I don't know if I'd deprecate x"", as it serves a different role, as you have already pointed out, in that it *will* validate the code points.
October 18, 2012
On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
> hex! was a very good idea actually, imo. I'll post my current impl in the next post.
>

//----
import std.stdio;
import std.conv;
import std.ascii;


template hex(string s)
{
    enum hex = decode(s);
}


template hex(ulong ul)
{
    enum hex = decode(ul);
}

ubyte[] decode(string s)
{
    ubyte[] ret;
    size_t p;
    while(p < s.length)
    {
        while( s[p] == ' ' || s[p] == '_' )
        {
            ++p;
            if (p == s.length) assert(0, text("Premature end of string at index ", p, "."));
        }

        char c1 = s[p];
        if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, " at index ", p, "."));
        c1 = cast(char)std.ascii.toUpper(c1);

        ++p;
        if (p == s.length) assert(0, text("Premature end of string after ", c1, "."));

        char c2 = s[p];
        if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " at index ", p, "."));
        c2 = cast(char)std.ascii.toUpper(c2);
        ++p;


        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}

ubyte[] decode(ulong ul)
{
    //NOTE: This is not efficient AT ALL (push front),
    //but it is CTFE, so we can live with it for now ^^
    //I'll optimize it if I try to push it.
    ubyte[] ret;
    while(ul)
    {
        ubyte t = ul%256;
        ret = t ~ ret;
        ul /= 256;
    }
    return ret;
}
//----

NOT a final version.
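For reference, the sample outputs from my previous post can be turned into quick checks of decode; a minimal sketch of mine, assuming the decode overloads from the listing above are in the same module:

```d
// Compile-time and run-time checks for the decode overloads above,
// using the expected results from the earlier usage post.
static assert(decode("01 ff 7f") == [0x01, 0xff, 0x7f]);
static assert(decode(0x01_ff_7f) == [0x01, 0xff, 0x7f]);

unittest
{
    assert(decode("0123456789abcdef") ==
           [0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef]);
    assert(decode(0xA1_B2_C3_D4) == [0xa1, 0xb2, 0xc3, 0xd4]);
}
```

The static asserts double as a check that decode actually runs under CTFE, which is the whole point of the template.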
October 18, 2012
On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:
>
> NOT a final version.

With more correct UTF string support. In theory, non-ASCII characters are illegal anyway, but this makes for safer code and better diagnostics.

//----
// Note: also needs `import std.array;` for front/popFront on strings.
ubyte[] decode(string s)
{
    ubyte[] ret;
    while(s.length)
    {
        while( s.front == ' ' || s.front == '_' )
        {
            s.popFront();
            if (!s.length) assert(0, text("Premature end of string."));
        }

        dchar c1 = s.front;
        if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, "."));
        c1 = std.ascii.toUpper(c1);

        s.popFront();
        if (!s.length) assert(0, text("Premature end of string after ", c1, "."));

        dchar c2 = s.front;
        if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " after ", c1, "."));
        c2 = std.ascii.toUpper(c2);
        s.popFront();

        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}
//----
October 18, 2012
monarch_dodra:

> hex! was a very good idea actually, imo.

It must scale up to "real world" usages. Try it with a program
composed of 3 modules each one containing a 100 KB long string.
Then try it with a program with two hundred medium-sized
literals, and let's see compilation times and binary sizes.
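One quick way to set up such a stress test (a hypothetical generator of mine, not something from this thread; the `hexutil` module name is made up):

```d
import std.stdio;

void main()
{
    // Emit a module containing one ~100 KB hex string literal, to measure
    // how hex! scales in compile time and memory. Run a few times with
    // different module names to get the 3-module case.
    auto f = File("stress1.d", "w");
    f.writeln("module stress1;");
    f.writeln("import hexutil; // wherever hex! lives (hypothetical name)");
    f.write("enum data = hex!\"");
    foreach (i; 0 .. 50_000)   // 50_000 byte pairs -> a 100 KB string
        f.write("ab");
    f.writeln("\";");
}
```

Then compile the generated modules together and watch dmd's time and memory with your usual tools.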

Bye,
bearophile
October 18, 2012
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> Have you actually ever written code that requires using code points? This feature is a *huge* convenience for when you do. Just compare:
>
> string nihongo1 = x"e697a5 e69cac e8aa9e";
> string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 0xaa, 0x9e];

You should use Unicode directly here, that's the whole point of supporting it.
string nihongo = "日本語";
October 18, 2012
On 18/10/12 10:58, foobar wrote:
> On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
>> On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
>> [...]
>>> hex strings are useful, but I think they were invented in D1 when
>>> strings were convertible to char[]. But today they are an array of
>>> immutable UTF-8, so I think this default type is not so useful:
>>>
>>> void main() {
>>>     string data1 = x"A1 B2 C3 D4"; // OK
>>>     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
>>> }
>>>
>>>
>>> test.d(3): Error: cannot implicitly convert expression
>>> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
>> [...]
>>
>> Yeah I think hex strings would be better as ubyte[] by default.
>>
>> More generally, though, I think *both* of the above lines should be
>> equally accepted.  If you write x"A1 B2 C3" in the context of
>> initializing a string, then the compiler should infer the type of the
>> literal as string, and if the same literal occurs in the context of,
>> say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
>> string.
>>
>>
>> T
>
> IMO, this is a redundant feature that complicates the language for no
> benefit and should be deprecated.
> strings already have an escape sequence for specifying code-points "\u"
> and for ubyte arrays you can simply use:
> immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];
>
> So basically this feature gains us nothing.

That is not the same. Array literals are not the same as string literals, they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).
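A minimal sketch of the difference (my own illustration, not from that thread): a string literal can be referenced in place, while initializing a mutable array from an array literal copies it at run time:

```d
void main()
{
    string s = "abc";                 // refers to the literal itself; no allocation
    ubyte[] a = [0xA1, 0xB2, 0xC3];   // mutable target: the literal's contents are
                                      // copied into a fresh GC allocation each time
                                      // the statement runs (the implicit .dup)
    a[0] = 0;                         // legal precisely because of that copy
}
```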

October 18, 2012
On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
> monarch_dodra:
>
>> hex! was a very good idea actually, imo.
>
> It must scale up to "real world" usages. Try it with a program
> composed of 3 modules each one containing a 100 KB long string.
> Then try it with a program with two hundred of medium sized
> literals, and let's see compilation times and binary sizes.
>
> Bye,
> bearophile

Hum... The compilation is pretty fast actually, about 1 second, provided it doesn't choke.

It works for strings up to a length of 400 lines @ 80 chars per line, which results in approximately 16K of data. After that, I get a DMD out-of-memory error.

DMD memory usage spikes quite quickly. To compile those 400 lines (16K), I use 800MB of memory (!). If I reach about 1GB, then it crashes.

I tried using a RefAppender instead of `ret ~=`, but that changed nothing.

Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse one 400-line string, but if I try to parse 3 of them, DMD will choke on the second one. :(
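For what it's worth, one variant that might ease the CTFE memory pressure (an untested sketch of mine, not the code posted earlier): preallocate the result and index into it, since every `~=` under CTFE can end up allocating a fresh array:

```d
import std.ascii : isHexDigit;

ubyte[] decodePrealloc(string s)
{
    // At most one output byte per two input characters.
    auto ret = new ubyte[](s.length / 2);
    size_t n;
    int hi = -1; // pending high nibble, -1 if none
    foreach (char c; s)
    {
        if (c == ' ' || c == '_')
            continue;
        assert(isHexDigit(c), "unexpected character");
        int v = ('0' <= c && c <= '9') ? c - '0'
              : ('a' <= c && c <= 'f') ? c - 'a' + 10
              :                          c - 'A' + 10;
        if (hi < 0)
            hi = v;
        else
        {
            ret[n++] = cast(ubyte)(hi * 16 + v);
            hi = -1;
        }
    }
    assert(hi < 0, "premature end of string");
    return ret[0 .. n];
}
```

Whether dmd's CTFE interpreter actually reuses that single allocation instead of copying on every element store is exactly what would need measuring here.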
October 18, 2012
On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
> On 18/10/12 10:58, foobar wrote:
>> On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
>>> On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
>>> [...]
>>>> hex strings are useful, but I think they were invented in D1 when
>>>> strings were convertible to char[]. But today they are an array of
>>>> immutable UTF-8, so I think this default type is not so useful:
>>>>
>>>> void main() {
>>>>    string data1 = x"A1 B2 C3 D4"; // OK
>>>>    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
>>>> }
>>>>
>>>>
>>>> test.d(3): Error: cannot implicitly convert expression
>>>> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
>>> [...]
>>>
>>> Yeah I think hex strings would be better as ubyte[] by default.
>>>
>>> More generally, though, I think *both* of the above lines should be
>>> equally accepted.  If you write x"A1 B2 C3" in the context of
>>> initializing a string, then the compiler should infer the type of the
>>> literal as string, and if the same literal occurs in the context of,
>>> say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
>>> string.
>>>
>>>
>>> T
>>
>> IMO, this is a redundant feature that complicates the language for no
>> benefit and should be deprecated.
>> strings already have an escape sequence for specifying code-points "\u"
>> and for ubyte arrays you can simply use:
>> immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];
>>
>> So basically this feature gains us nothing.
>
> That is not the same. Array literals are not the same as string literals, they have an implicit .dup.
> See my recent thread on this issue (which unfortunately seems to have died without a resolution; people got hung up about trailing null characters without apparently noticing the more important issue of the dup).

I don't see how that detail is relevant to this discussion as I was not arguing against string literals or array literals in general.

We can still have both (assuming the code points are valid...):
string foo = "\u00a1\u00b2\u00c3"; // no .dup
and:
ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dup
October 18, 2012
On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
> On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> > Have you actually ever written code that requires using code points? This feature is a *huge* convenience for when you do. Just compare:
> > 
> > string nihongo1 = x"e697a5 e69cac e8aa9e";
> > string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> > ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8,
> > 0xaa, 0x9e];
> 
> You should use unicode directly here, that's the whole point to
> support it.
> string nihongo = "日本語";

It's a nice feature, but there are plenty of cases where it makes more sense to use the unicode values rather than the characters themselves (e.g. your keyboard doesn't have the characters in question). It's valuable to be able to do it both ways.

- Jonathan M Davis
October 18, 2012
Your keyboard doesn't have ready Unicode values for all characters either.