October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:
>
> Yeah, that makes sense too. I'll try to toy around on my end 
> and see if I can write an "hex".

That was actually relatively easy!

Here is a small use case:

//----
import std.stdio;

void main()
{
    enum a = hex!"01 ff 7f";
    enum b = hex!0x01_ff_7f;
    ubyte[] c = hex!"0123456789abcdef";
    immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
    immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;

    a.writeln();
    b.writeln();
    c.writeln();
    bearophile1.writeln();
    bearophile2.writeln();
}
//----

And corresponding output:

//----
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//----

hex! was a very good idea actually, imo. I'll post my current 
impl in the next post.

That said, I don't know if I'd deprecate x"", since, as you have 
already pointed out, it serves a different role: it *will* validate 
the code points.
October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
> hex! was a very good idea actually, imo. I'll post my current 
> impl in the next post.
>

//----
import std.stdio;
import std.conv;
import std.ascii;


template hex(string s)
{
    enum hex = decode(s);
}


template hex(ulong ul)
{
    enum hex = decode(ul);
}

ubyte[] decode(string s)
{
    ubyte[] ret;
    size_t p;
    while(p < s.length)
    {
        while( s[p] == ' ' || s[p] == '_' )
        {
            ++p;
            if (p == s.length) assert(0, text("Premature end of 
string at index ", p, "."));;
        }

        char c1 = s[p];
        if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected 
character ", c1, " at index ", p, "."));
        c1 = cast(char)std.ascii.toUpper(c1);

        ++p;
        if (p == s.length) assert(0, text("Premature end of 
string after ", c1, "."));

        char c2 = s[p];
        if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected 
character ", c2, " at index ", p, "."));
        c2 = cast(char)std.ascii.toUpper(c2);
        ++p;


        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}

ubyte[] decode(ulong ul)
{
    //NOTE: This is not efficient AT ALL (push front),
    //but it is CTFE, so we can live with it for now ^^
    //I'll optimize it if I try to push it
    ubyte[] ret;
    while(ul)
    {
        ubyte t = cast(ubyte)(ul % 256);
        ret = t ~ ret;
        ul /= 256;
    }
    return ret;
}
//----

NOT a final version.
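
As an aside, here is one way the prepend in decode(ulong) could go away 
later (just a sketch at this point, not the version above): count the 
bytes first, then fill the buffer back to front.

//----
ubyte[] decodeNoPrepend(ulong ul)
{
    // Count how many bytes the value needs.
    size_t n;
    for (ulong t = ul; t; t /= 256) ++n;

    // Fill from the least significant byte backwards, so the most
    // significant byte ends up first, same result as the push-front loop.
    auto ret = new ubyte[](n);
    foreach_reverse (i; 0 .. n)
    {
        ret[i] = cast(ubyte)(ul % 256);
        ul /= 256;
    }
    return ret;
}
//----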
October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:
>
> NOT a final version.

With more correct UTF string support. In theory, non-ASCII 
characters are illegal in the input, but handling them properly makes 
for safer code and better diagnostics.

//----
ubyte[] decode(string s)
{
    ubyte[] ret;
    while(s.length)
    {
        while( s.front == ' ' || s.front == '_' )
        {
            s.popFront();
            if (!s.length) assert(0, text("Premature end of 
string."));;
        }

        dchar c1 = s.front;
        if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected 
character ", c1, "."));
        c1 = std.ascii.toUpper(c1);

        s.popFront();
        if (!s.length) assert(0, text("Premature end of string 
after ", c1, "."));

        dchar c2 = s.front;
        if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected 
character ", c2, " after ", c1, "."));
        c2 = std.ascii.toUpper(c2);
        s.popFront();

        ubyte val;
        if('0' <= c2 && c2 <= '9') val += (c2 - '0');
        if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
        if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
        if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
        ret ~= val;
    }
    return ret;
}
//----
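
For example (an illustration of the "better diagnostics" point, not 
code from my test file), a stray non-ASCII character in the literal now 
shows up as a whole code point in the error message instead of a 
garbled byte:

//----
// enum oops = hex!"01 é 7f";
// fails during CTFE with something along the lines of:
//     "Unexpected character é."
//----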
October 18, 2012
Re: Regarding hex strings
monarch_dodra:

> hex! was a very good idea actually, imo.

It must scale up to "real world" usages. Try it with a program
composed of 3 modules each one containing a 100 KB long string.
Then try it with a program with two hundred of medium sized
literals, and let's see compilation times and binary sizes.
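
To make the test concrete, something along these lines in each of the
modules (just a sketch; the module and helper names are invented, and
it assumes your hex! template is importable):

//----
// stress1.d
import hexctfe;   // hypothetical module holding the hex! template

// Build a ~100 KB hex string at compile time (50_000 "ff" pairs).
string makeBigLiteral()
{
    string s;
    foreach (i; 0 .. 50_000)
        s ~= "ff";
    return s;
}

enum data = hex!(makeBigLiteral());
static assert(data.length == 50_000);
//----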

Bye,
bearophile
October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> Have you actually ever written code that requires using code 
> points? This feature is a *huge* convenience for when you do. 
> Just compare:
>
> string nihongo1 = x"e697a5 e69cac e8aa9e";
> string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
> 0xaa, 0x9e];

You should use Unicode directly here; that's the whole point of 
supporting it.
string nihongo = "日本語";
October 18, 2012
Re: Regarding hex strings
On 18/10/12 10:58, foobar wrote:
> On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
>> On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
>> [...]
>>> hex strings are useful, but I think they were invented in D1 when
>>> strings were convertible to char[]. But today they are an array of
>>> immutable UFT-8, so I think this default type is not so useful:
>>>
>>> void main() {
>>>     string data1 = x"A1 B2 C3 D4"; // OK
>>>     immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
>>> }
>>>
>>>
>>> test.d(3): Error: cannot implicitly convert expression
>>> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
>> [...]
>>
>> Yeah I think hex strings would be better as ubyte[] by default.
>>
>> More generally, though, I think *both* of the above lines should be
>> equally accepted.  If you write x"A1 B2 C3" in the context of
>> initializing a string, then the compiler should infer the type of the
>> literal as string, and if the same literal occurs in the context of,
>> say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
>> string.
>>
>>
>> T
>
> IMO, this is a redundant feature that complicates the language for no
> benefit and should be deprecated.
> strings already have an escape sequence for specifying code-points "\u"
> and for ubyte arrays you can simply use:
> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>
> So basically this feature gains us nothing.

That is not the same. Array literals are not the same as string 
literals; they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have 
died without a resolution; people got hung up on trailing null 
characters without apparently noticing the more important issue of the dup).
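
To illustrate (my own sketch, not from that earlier thread): inside a 
function, the string literal refers to static data, while the array 
literal is effectively dup'd, i.e. a fresh GC allocation, every time 
the statement runs.

//----
void eachCall()
{
    // Refers to the literal's static storage; no allocation per call.
    string s = x"A1 B2 C3 D4";

    // Allocates (implicitly dups) a new array on every call.
    immutable(ubyte)[] a = [0xA1, 0xB2, 0xC3, 0xD4];
}
//----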
October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
> monarch_dodra:
>
>> hex! was a very good idea actually, imo.
>
> It must scale up to "real world" usages. Try it with a program
> composed of 3 modules each one containing a 100 KB long string.
> Then try it with a program with two hundred of medium sized
> literals, and let's see compilation times and binary sizes.
>
> Bye,
> bearophile

Hum... The compilation is pretty fast actually, about 1 second, 
provided it doesn't choke.

It works for strings up to a length of about 400 lines at 80 chars per 
line, which amounts to approximately 16 KB of decoded data. After that, 
I get a DMD out-of-memory error.

DMD's memory usage spikes quite quickly. To compile those 400 lines 
(16 KB), it uses 800 MB of memory (!). If it reaches about 1 GB, then 
it crashes.

I tried using a RefAppender instead of ret ~=, but that changed 
nothing.

Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse one 400-line 
string, but if I try to parse three of them, DMD will choke on the 
second one. :(
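
For reference, the RefAppender variant I mean looks roughly like this 
(a compact sketch of the idea, not the exact code I tried):

//----
import std.array : appender;
import std.conv : text;
import std.ascii : isHexDigit;

// Same idea as decode() above, but appending through a RefAppender
// instead of `ret ~= val`.
ubyte[] decodeAppender(string s)
{
    ubyte[] ret;
    auto app = appender(&ret);
    size_t p;

    ubyte nibble(char c, size_t at)
    {
        if (!isHexDigit(c))
            assert(0, text("Unexpected character ", c, " at index ", at, "."));
        if (c >= 'a') return cast(ubyte)(c - 'a' + 10);
        if (c >= 'A') return cast(ubyte)(c - 'A' + 10);
        return cast(ubyte)(c - '0');
    }

    while (p < s.length)
    {
        if (s[p] == ' ' || s[p] == '_') { ++p; continue; }
        immutable hi = nibble(s[p], p);
        if (p + 1 == s.length)
            assert(0, text("Premature end of string after ", s[p], "."));
        immutable lo = nibble(s[p + 1], p + 1);
        app.put(cast(ubyte)(hi * 16 + lo));
        p += 2;
    }
    return ret;
}
//----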
October 18, 2012
Re: Regarding hex strings
On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:
> On 18/10/12 10:58, foobar wrote:
>> On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:
>>> On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
>>> [...]
>>>> hex strings are useful, but I think they were invented in D1 
>>>> when
>>>> strings were convertible to char[]. But today they are an 
>>>> array of
>>>> immutable UFT-8, so I think this default type is not so 
>>>> useful:
>>>>
>>>> void main() {
>>>>    string data1 = x"A1 B2 C3 D4"; // OK
>>>>    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
>>>> }
>>>>
>>>>
>>>> test.d(3): Error: cannot implicitly convert expression
>>>> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
>>> [...]
>>>
>>> Yeah I think hex strings would be better as ubyte[] by 
>>> default.
>>>
>>> More generally, though, I think *both* of the above lines 
>>> should be
>>> equally accepted.  If you write x"A1 B2 C3" in the context of
>>> initializing a string, then the compiler should infer the 
>>> type of the
>>> literal as string, and if the same literal occurs in the 
>>> context of,
>>> say, passing a ubyte[], then its type should be inferred as 
>>> ubyte[], NOT
>>> string.
>>>
>>>
>>> T
>>
>> IMO, this is a redundant feature that complicates the language 
>> for no
>> benefit and should be deprecated.
>> strings already have an escape sequence for specifying 
>> code-points "\u"
>> and for ubyte arrays you can simply use:
>> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>>
>> So basically this feature gains us nothing.
>
> That is not the same. Array literals are not the same as string 
> literals, they have an implicit .dup.
> See my recent thread on this issue (which unfortunately seems 
> have to died without a resolution, people got hung up about 
> trailing null characters without apparently noticing the more 
> important issue of the dup).

I don't see how that detail is relevant to this discussion as I 
was not arguing against string literals or array literals in 
general.

We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
and:
ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dup
October 18, 2012
Re: Regarding hex strings
On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
> On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> > Have you actually ever written code that requires using code
> > points? This feature is a *huge* convenience for when you do.
> > Just compare:
> > 
> > string nihongo1 = x"e697a5 e69cac e8aa9e";
> > string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> > ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8,
> > 0xaa, 0x9e];
> 
> You should use unicode directly here, that's the whole point to
> support it.
> string nihongo = "日本語";

It's a nice feature, but there are plenty of cases where it makes more sense 
to use the unicode values rather than the characters themselves (e.g. your 
keyboard doesn't have the characters in question). It's valuable to be able to 
do it both ways.

- Jonathan M Davis
October 18, 2012
Re: Regarding hex strings
Your keyboard doesn't have ready Unicode values for all 
characters either.