September 02, 2017
On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner wrote:
> 
> In UTF8:
>
> --- utfmangle.d ---
> void fun_ༀ() {}
> pragma(msg, fun_ༀ.mangleof);
> -------------------
>
> ---
> $ dmd -c utfmangle.d
> _D6mangle7fun_ༀFZv
> ---
>
> Only universal character names for identifiers are allowed, though, as per [1]
>
> [1] https://dlang.org/spec/lex.html#identifiers

What I intend to do is this though:

void fun(string s)() {}
pragma(msg, fun!"ༀ".mangleof);

which gives:
_D7mainMod21__T3funVAyaa3_e0bc80Z3funFNaNbNiNfZv

where "e0bc80" is the 3 bytes of "ༀ".

The function will be internal to my library. The only thing provided from outside will be the string template argument, which is meant to represent a fully qualified type name.


September 02, 2017
On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner wrote:
> [...]

Code will eventually look something like the following.
The point is to be able to retrieve the exported function at runtime only by knowing what the template arg would have been.

import std.array : replaceFirst;
import std.conv : to;
import std.stdio : writeln;

// Reflection is a library type of mine; the body is elided here.
export extern(C) const(Reflection) dummy(string fqn)(){ ... }

int main(string[] argv)
{
    enum ARG = "AAAAAA";
    auto hex = toAsciiHex(ARG); // hex-encoding of ARG's bytes: "414141414141"

    // original
    writeln(dummy!ARG.mangleof);

    // reconstructed at runtime, starting from the mangling of dummy!""
    auto remangled = dummy!"".mangleof;

    // patch the length prefix of the template-instance segment
    remangled = remangled.replaceFirst(
        "_D7mainMod17", "_D7mainMod" ~ (17 + hex.length).to!string);

    // splice in the value argument: byte count, then its hex digits
    remangled = remangled.replaceFirst(
        "VAyaa0_", "VAyaa" ~ ARG.length.to!string ~ "_" ~ hex);

    writeln(remangled);

    return 0;
}
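`toAsciiHex` is not shown in the thread; a minimal sketch consistent with how it is used above (hypothetical implementation):

import std.format : format;
import std.string : representation;

// Hex-encode a string's UTF-8 bytes, e.g. "AAAAAA" -> "414141414141".
string toAsciiHex(string s)
{
    return format("%(%02x%)", s.representation);
}

Note that the patching is somewhat fragile: 17 is the length of the "__T5dummyVAyaa0_Z" segment of the empty instantiation, and adding hex.length to it only stays correct while the argument's byte count remains a single digit.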


September 02, 2017
On Saturday, 2 September 2017 at 20:02:37 UTC, bitwise wrote:
> On Saturday, 2 September 2017 at 18:28:02 UTC, Moritz Maxeiner wrote:
>> 
>> In UTF8:
>>
>> --- utfmangle.d ---
>> void fun_ༀ() {}
>> pragma(msg, fun_ༀ.mangleof);
>> -------------------
>>
>> ---
>> $ dmd -c utfmangle.d
>> _D6mangle7fun_ༀFZv
>> ---
>>
>> Only universal character names for identifiers are allowed, though, as per [1]
>>
>> [1] https://dlang.org/spec/lex.html#identifiers
>
> What I intend to do is this though:
>
> void fun(string s)() {}
> pragma(msg, fun!"ༀ".mangleof);
>
> which gives:
> _D7mainMod21__T3funVAyaa3_e0bc80Z3funFNaNbNiNfZv
>
> where "e0bc80" is the 3 bytes of "ༀ".

Interesting, I wasn't aware of that, thanks! Though after thinking about it, it does make sense: identifiers can only contain visible characters, while a string could contain things such as control characters. That behaviour is defined here [1], btw (the line `CharWidth Number _ HexDigits`).

[1] https://dlang.org/spec/abi.html#Value
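Spelled out against that grammar, the `VAyaa3_e0bc80` part of the quoted mangling breaks down as follows (annotation added, using the spec terms from [1]):

---
V         a value template argument follows
Aya       its type: A (dynamic array) of y (immutable) a (char), i.e. string
a         CharWidth: one-byte (UTF-8) code units
3         Number: 3 code units
_e0bc80   HexDigits: the UTF-8 bytes of "ༀ"
---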
September 02, 2017
On 09/02/2017 11:02 AM, lithium iodate wrote:
> On Saturday, 2 September 2017 at 17:41:34 UTC, Ali Çehreli wrote:
>> You're right but I think there is no intention of interpreting the
>> result as UTF-8. "f62026" is just to be used as "f62026", which can be
>> converted byte-by-byte back to "ö…". That's how I understand the
>> requirement anyway.
>>
>> Ali
>
> That is not possible, because you cannot know whether "f620" and "26" or
> "f6" and "2026" (or any other combination) should form a code point
> each. Additional padding to constant width (8 hex chars) is needed.

Ok, I see that I made a mistake, but I still don't think the conversion is one-way. If we can convert byte-by-byte, we should be able to convert back byte-by-byte, right? What I failed to ensure was to iterate by code units. The following is able to get the same string back:

import std.stdio;
import std.string;
import std.algorithm;
import std.range;
import std.utf;
import std.conv;

// Hex-encode each code unit as two lowercase hex digits.
auto toHex(R)(R input) {
    // As Moritz Maxeiner says, this format is expensive
    return input.byCodeUnit.map!(c => format!"%02x"(c)).joiner;
}

// Numeric value of a single lowercase hex digit.
int hexValue(C)(C c) {
    switch (c) {
    case '0': .. case '9':
        return c - '0';
    case 'a': .. case 'f':
        return c - 'a' + 10;
    default:
        assert(false);
    }
}

// Decode pairs of hex digits back to code units of type Dst.
auto fromHex(R, Dst = char)(R input) {
    return input.chunks(2).map!((ch) {
            auto high = ch.front.hexValue * 16;
            ch.popFront();
            return high + ch.front.hexValue;
        }).map!(value => cast(Dst)value);
}

void main() {
    assert("AAA".toHex.fromHex.equal("AAA"));

    assert("ö…".toHex.fromHex.equal("ö…".byCodeUnit));
    // Alternative check:
    assert("ö…".toHex.fromHex.text.equal("ö…"));
}

Ali

September 03, 2017
On 09/03/2017 01:39 AM, Ali Çehreli wrote:
> Ok, I see that I made a mistake, but I still don't think the conversion is one-way. If we can convert byte-by-byte, we should be able to convert back byte-by-byte, right?

You weren't converting byte-by-byte. You were only converting the significant bytes of the code points, throwing away leading zeroes.

> What I failed to ensure was to iterate by code units.

A UTF-8 code unit is a byte, so "%02x" is enough, yes. But for UTF-16 and UTF-32 code units, it's not. You need to match the format width to the size of the code unit.

Or maybe just convert everything to UTF-8 first. That also sidesteps any endianess issues.
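Both fixes are easy to sketch. The following is an illustration, not code from the thread; `toHexPadded`, `fromHexPadded`, `toHexUtf8`, and `hexDigit` are made-up names:

----
import std.algorithm : fold, map;
import std.conv : to;
import std.format : format;
import std.range : chunks, joiner;
import std.range.primitives : ElementEncodingType;
import std.utf : byCodeUnit, byUTF;

// Digit-to-value helper (same idea as the hexValue above).
int hexDigit(dchar c) { return cast(int)(c <= '9' ? c - '0' : c - 'a' + 10); }

// Width-matched encoder: two hex digits per byte of the code unit type,
// so char code units get 2 digits, wchar 4, and dchar 8.
auto toHexPadded(R)(R input) {
    enum width = ElementEncodingType!R.sizeof * 2;
    enum fmt = "%0" ~ width.to!string ~ "x";
    return input.byCodeUnit.map!(c => format(fmt, c)).joiner;
}

// Matching decoder: consume the same fixed number of digits per code unit.
auto fromHexPadded(Dst = char, R)(R input) {
    enum width = Dst.sizeof * 2;
    return input.chunks(width)
        .map!(ch => cast(Dst) ch.map!hexDigit.fold!((a, b) => a * 16 + b));
}

// Alternative: transcode to UTF-8 first; then two digits per code unit
// is always enough, and endianness never comes up.
auto toHexUtf8(R)(R input) {
    return input.byUTF!char.map!(c => format!"%02x"(c)).joiner;
}
----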

> The following is able to get the same string back:
> 
> [...]

Still fails with UTF-16 and UTF-32 strings:

----
writeln("…"w.toHex.fromHex.text); /* prints " &" */
writeln("…"d.toHex.fromHex.text); /* prints " &" */
----
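With the width-matched sketch above, the same cases round-trip (hypothetical usage):

----
writeln("…"w.toHexPadded.fromHexPadded!wchar.text); /* expected: "…" */
writeln("…"d.toHexPadded.fromHexPadded!dchar.text); /* expected: "…" */
----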
September 03, 2017
On 09/03/2017 03:03 AM, ag0aep6g wrote:
> On 09/03/2017 01:39 AM, Ali Çehreli wrote:
>> If we can convert byte-by-byte, we should be able to
>> convert back byte-by-byte, right?
>
> You weren't converting byte-by-byte.

In my mind I was! :o)

> Or maybe just convert everything to UTF-8 first. That also sidesteps any
> endianess issues.

Good point.

> Still fails with UTF-16 and UTF-32 strings:

I think I can make it work with a few more iterations but I'll leave it as an exercise for the author.

Ali
