Regarding hex strings (page 4) - D Programming Language Discussion Forum

October 19, 2012

Re: Regarding hex strings

Posted by foobar
in reply to Don Clugston

On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>
>> We can still have both (assuming the code points are valid...):
>> string foo = "\ua1\ub2\uc3"; // no .dup
>
> That doesn't compile.
> Error: escape hex sequence has 2 hex digits instead of 4

Come on, "assuming the code points are valid". It says so 4 lines above!

October 19, 2012

Re: Regarding hex strings

Posted by Don Clugston
in reply to foobar

On 19/10/12 16:07, foobar wrote:
> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>>
>>> We can still have both (assuming the code points are valid...):
>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>
>> That doesn't compile.
>> Error: escape hex sequence has 2 hex digits instead of 4
>
> Come on, "assuming the code points are valid". It says so 4 lines above!

It isn't the same.
Hex strings are the raw bytes, e.g. UTF-8 code units (i.e., they include the high bits that indicate the length of each char).
\u makes dchars.

"\u00A1" is not the same as x"A1", nor is it x"00 A1". In UTF-8 it's two non-zero bytes (C2 A1).

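To make the byte-level difference concrete, here is a minimal D sketch (an illustration added for clarity, not code from the thread). It assumes only that D `string` literals are stored as UTF-8; the array literal stands in for the single raw byte that x"A1" denotes, so the sketch does not depend on how a given compiler version types hex strings.

```d
import std.stdio : writefln;
import std.string : representation;

void main()
{
    // "\u00A1" names the code point U+00A1; inside a UTF-8 string it is
    // stored as the two code units C2 A1.
    string escaped = "\u00A1";
    immutable(ubyte)[] encoded = escaped.representation;
    assert(encoded.length == 2);
    assert(encoded[0] == 0xC2 && encoded[1] == 0xA1);

    // The raw data described by x"A1" is just the one byte A1.
    immutable(ubyte)[] raw = [0xA1];
    assert(raw.length == 1);
    assert(encoded != raw);

    // Print the encoded bytes in hex for inspection.
    writefln("%(%02X %)", encoded);
}
```
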
October 19, 2012

Re: Regarding hex strings

Posted by foobar
in reply to Don Clugston

On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
> On 19/10/12 16:07, foobar wrote:
>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>>>
>>>> We can still have both (assuming the code points are valid...):
>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>
>>> That doesn't compile.
>>> Error: escape hex sequence has 2 hex digits instead of 4
>>
>> Come on, "assuming the code points are valid". It says so 4 lines above!
>
> It isn't the same.
> Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char).
> \u makes dchars.
>
> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.

Yes, \u takes code points rather than code units of a specific UTF encoding, and, as you correctly point out, it requires four hex digits, not two.
This is a very reasonable choice to prevent/reduce Unicode encoding errors.

http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters."

I _already_ said that I consider this a major semantic bug, as it violates the principle of least surprise: the programmer expects the D string types, which are Unicode according to the spec, to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code-points (not code units) in hex form.

As for general-purpose binary data (i.e. _not_ UTF-encoded Unicode text), I also _already_ said that IMO it should be stored either as ubyte[] or, better yet, in its own types that ensure the correct invariants for the data type, be it audio, video, or just a different text encoding.

In neither case is the hex string relevant IMO. In the former it potentially violates the type's invariant, and in the latter we already have array literals.

Using a malformed _string_ to initialize ubyte[] is IMO simply less readable. What did that article call such features, "WAT"?

October 19, 2012

Re: Regarding hex strings

Posted by foobar
in reply to foobar

On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:
> On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:
>> On 19/10/12 16:07, foobar wrote:
>>> On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:
>>>>>
>>>>> We can still have both (assuming the code points are valid...):
>>>>> string foo = "\ua1\ub2\uc3"; // no .dup
>>>>
>>>> That doesn't compile.
>>>> Error: escape hex sequence has 2 hex digits instead of 4
>>>
>>> Come on, "assuming the code points are valid". It says so 4 lines above!
>>
>> It isn't the same.
>> Hex strings are the raw bytes, eg UTF8 code points. (ie, it includes the high bits that indicate the length of each char).
>> \u makes dchars.
>>
>> "\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero bytes.
>
> Yes, the \u requires code points and not code-units for a specific UTF encoding, which you are correct in pointing out are four hex digits and not two.
> This is a very reasonable choice to prevent/reduce Unicode encoding errors.
>
> http://dlang.org/lex.html#HexString states:
> "Hex strings allow string literals to be created using hex data. The hex data need not form valid UTF characters."
>
> I _already_ said that I consider this a major semantic bug as it violates the principle of least surprise - the programmer's expectation that the D string types which are Unicode according to the spec to, well, actually contain _valid_ Unicode and _not_ arbitrary binary data.
> Given the above, the design of \u makes perfect sense for _strings_ - you can use _valid_ code-points (not code units) in hex form.
>
> For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[] or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding.
>
> In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals.
>
> Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?

I just re-checked, and to clarify: string literals support _three_ hex escape sequences:
\x__ - a single byte (two hex digits)
\u____ - a code point given as four hex digits
\U________ - a code point given as eight hex digits

So raw bytes _can_ be directly specified and I hope the compiler still verifies the string literal is valid Unicode.
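
The three escapes can be compared directly. The following is a minimal sketch, not something posted in the thread. It assumes the compiler accepts the \x literal as written, without checking it, and it uses `std.utf.validate` to perform the check at run time instead. Note that the number of bytes a \u or \U escape occupies depends on the code point and on the encoding of the string type.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string oneByte = "\xA1";        // one raw byte, A1
    string shortForm = "\u00A1";    // code point U+00A1, four hex digits
    string longForm = "\U000000A1"; // the same code point, eight hex digits

    // \u and \U both name a code point, so the two literals are identical.
    assert(shortForm == longForm);
    assert(shortForm.length == 2);  // UTF-8: code units C2 A1
    assert("\u00A1"w.length == 1);  // UTF-16: a single code unit
    assert("\u00A1"d.length == 1);  // UTF-32: a single code unit

    // The \x escape inserts the byte itself, with no encoding step.
    assert(oneByte.length == 1);
    assert(oneByte != shortForm);

    // Run-time validation: the \u form is well-formed UTF-8,
    // while a lone A1 byte is not.
    validate(shortForm);
    assertThrown!UTFException(validate(oneByte));
}
```
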

October 20, 2012

Re: Regarding hex strings

Posted by Denis Shelomovskij
in reply to foobar

18.10.2012 12:58, foobar wrote:
> IMO, this is a redundant feature that complicates the language for no
> benefit and should be deprecated.
> strings already have an escape sequence for specifying code-points "\u"
> and for ubyte arrays you can simply use:
> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>
> So basically this feature gains us nothing.
>

Maybe. Just an example of real-world code:

Arrays:
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

vs

Hex strings:
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

By the way, current code isn't affected by the topic issue.

-- 
Денис В. Шеломовский
Denis V. Shelomovskij

October 20, 2012

Re: Regarding hex strings

Posted by foobar
in reply to Denis Shelomovskij

On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:
> 18.10.2012 12:58, foobar пишет:
>> IMO, this is a redundant feature that complicates the language for no
>> benefit and should be deprecated.
>> strings already have an escape sequence for specifying code-points "\u"
>> and for ubyte arrays you can simply use:
>> immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];
>>
>> So basically this feature gains us nothing.
>>
>
> Maybe. Just an example of a real world code:
>
> Arrays:
> https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
>
> vs
>
> Hex strings:
> https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
>
> By the way, current code isn't affected by the topic issue.

I personally find the former more readable, but I guess there will always be someone who disagrees. As they say, YMMV.

October 20, 2012

Re: Regarding hex strings

Posted by monarch_dodra
in reply to Marco Leise

On Friday, 19 October 2012 at 03:14:54 UTC, Marco Leise wrote:
>
> Hehe, I assume most of the regulars know this: DMD used to
> use a garbage collector that is disabled. Memory just isn't
> freed! Also it has copy on write semantics during CTFE:
>
> int bug6498(int x)
> {
>     int n = 0;
>     while (n < x)
>         ++n;
>     return n;
> }
> static assert(bug6498(10_000_000)==10_000_000);
>
> --> Fails with an 'out of memory' error.
>
> http://d.puremagic.com/issues/show_bug.cgi?id=6498
>
> So, as strange as it sounds, for now try not to write often or
> into large blocks. Using this knowledge I was sometimes able
> to bring down the memory consumption considerably by caching
> recurring concatenations of two strings or to!string calls.
>
> That said, appending single elements to an array may actually
> be better than using a fixed-sized one and have DMD duplicate
> it on every write. :p
>
> Please remember to give Don a cookie when he manages to change
> the compiler to modify in-place where appropriate.

I should have read your post in more detail. I thought you were saying that allocations are never freed, but it is indeed more than that: Every write allocates.

I just spent the last hour trying to "optimize" my code, only to realize that at its "simplest" (Walk the string counting elements), I run out of memory :/

Can't do much more about it at this point.

October 20, 2012

Re: Regarding hex strings

Posted by Nick Sabalausky
in reply to foobar

On Fri, 19 Oct 2012 20:46:06 +0200
> 
> For general purpose binary data (i.e. _not_ UTF encoded Unicode text) I also _already_ said IMO should be either stored as ubyte[]

Problem is, x"..." is FAR better syntax for that.

> or better yet their own types that would ensure the correct invariants for the data type, be it audio, video, or just a different text encoding.

Using x"..." doesn't prevent anyone from doing that:

auto a = SomeAudioType(x"...");

> 
> In neither case the hex-string is relevant IMO. In the former it potentially violates the type's invariant and in the latter we already have array literals.
> 
> Using a malformed _string_ to initialize ubyte[] IMO is simply less readable. How did that article call such features, "WAT"?

The only thing ridiculous about x"..." is that somewhere along the line it was decided that it must be a string instead of the arbitrary binary data that it *is*.
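
As an illustration of how raw bytes and a type with its own invariant can coexist, here is a rough sketch. The struct body and its "RIFF" tag check are hypothetical, invented only to give `SomeAudioType` from the line above something to verify. The bytes are passed as an array literal so the sketch does not depend on how a particular compiler version types x"...".

```d
import std.string : representation;

/// Hypothetical wrapper; the name follows the example above, the
/// contents are made up for illustration.
struct SomeAudioType
{
    immutable(ubyte)[] data;

    this(immutable(ubyte)[] bytes)
    {
        data = bytes;
    }

    // The invariant states what a bare string or ubyte[] cannot:
    // the data must start with a (hypothetical) four-byte "RIFF" tag.
    // It is checked in non-release builds after the constructor runs.
    invariant()
    {
        assert(data.length >= 4, "data too short");
        assert(data[0 .. 4] == "RIFF".representation, "missing RIFF tag");
    }
}

void main()
{
    // "RIFF" followed by four made-up bytes.
    auto a = SomeAudioType([0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00]);
    assert(a.data.length == 8);
}
```
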

October 20, 2012

Re: Regarding hex strings

Posted by Nick Sabalausky
in reply to foobar

On Sat, 20 Oct 2012 14:59:27 +0200
"foobar" <foo@bar.com> wrote:
> On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:
> >
> > Maybe. Just an example of a real world code:
> >
> > Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> >
> > vs
> >
> > Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> >
> > By the way, current code isn't affected by the topic issue.
> 
> I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.

Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.

October 20, 2012

Re: Regarding hex strings

Posted by H. S. Teoh
in reply to Nick Sabalausky

On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
> On Sat, 20 Oct 2012 14:59:27 +0200
> "foobar" <foo@bar.com> wrote:
> > On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij wrote:
> > >
> > > Maybe. Just an example of a real world code:
> > >
> > > Arrays: https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> > >
> > > vs
> > >
> > > Hex strings: https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> > >
> > > By the way, current code isn't affected by the topic issue.
> > 
> > I personally find the former more readable but I guess there would always be someone to disagree. As the say, YMMV.
> 
> Honestly, I can't imagine how anyone wouldn't find the latter vastly more readable.

If you want vastly human readable, you want heredoc hex syntax, something like this:

	ubyte[] data = x"<<END
	32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
	32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
	2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
	74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
	6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
	74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
	20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
	22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
	20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
	6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
	END";

(I just made that syntax up, so the details are not final, but you get the idea.) I would propose supporting this in D, but then D already has way too many different ways of writing strings, some of questionable utility, so I will refrain.

Of course, the above syntax might actually be implementable with a suitable mixin template that takes a compile-time string. Maybe we should lobby for such a template to go into Phobos -- that might motivate people to fix CTFE in dmd so that it doesn't consume unreasonable amounts of memory when the size of CTFE input gets moderately large (see other recent thread on this topic).
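
A rough sketch of that idea follows. It assumes a hypothetical helper named `hexBytes` (not a Phobos function) and uses D's existing delimited strings in place of the made-up x"<<END syntax. Initializing a module-level `static immutable` forces the parsing to happen at compile time through CTFE.

```d
import std.stdio : writeln;

/// Hypothetical helper (not part of Phobos): parses pairs of hex digits,
/// ignoring whitespace, into bytes. Usable at compile time through CTFE.
immutable(ubyte)[] hexBytes(string text) pure
{
    // Returns the value of one hex digit, or -1 if c is not a hex digit.
    static int digit(char c) pure
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    immutable(ubyte)[] result;
    int high = -1; // pending high nibble, or -1 if none
    foreach (char c; text)
    {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
            continue;
        immutable d = digit(c);
        assert(d >= 0, "non-hex character in input");
        if (high < 0)
        {
            high = d;
        }
        else
        {
            result ~= cast(ubyte)(high * 16 + d);
            high = -1;
        }
    }
    assert(high < 0, "odd number of hex digits");
    return result;
}

// Evaluated at compile time; a malformed dump becomes a compile error.
static immutable ubyte[] message = hexBytes(q"END
32 2b 32 3d 34 2e
END");

void main()
{
    static assert(message.length == 6);
    assert(message[0] == 0x32 && message[5] == 0x2e);
    writeln(cast(string) message);
}
```

Whether this scales to large inputs depends on the CTFE memory behaviour discussed earlier in the thread.
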

T

-- 
Without effort, you won't even pull a fish out of the pond.
