[Issue 15382] std.uri has an incorrect set of reserved characters

Apr 08, 2016

Dec 23, 2019

Jan 24, 2021

https://issues.dlang.org/show_bug.cgi?id=15382

Eugene Wissner <belka@caraus.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |belka@caraus.de

--- Comment #1 from Eugene Wissner <belka@caraus.de> ---
Look at "2.4. When to Encode or Decode":
"the only time when octets within a URI are percent-encoded is during the
process of producing the URI from its component parts. This is when an
implementation determines which of the reserved characters are to be used as
subcomponent delimiters and which can be safely used as data."

So reserved characters can be encoded, but it isn't a must. Only characters used
as delimiters in a particular URL scheme must be encoded. Wikipedia distinguishes
between reserved characters with and without a reserved meaning. I tested it
quickly in Firefox, and Firefox doesn't seem to encode characters like * or ().

The behavior of encodeComponent is actually exactly the same as that of
encodeURIComponent from JavaScript. The behavior described in the issue is how
PHP's urlencode works, which encodes all reserved characters.

--
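The behavior described in comment #1 can be checked with a short program. This is only a minimal sketch: the input strings are made up for illustration, and the expected values in the assertions follow from the flag table in std/uri.d quoted later in this thread.

```d
import std.uri : encode, encodeComponent;

void main()
{
    // '*', '(' and ')' are "mark" characters in std.uri and pass through unchanged.
    assert(encodeComponent("f(x)*2") == "f(x)*2");

    // Delimiters such as '&', '=' and '/' are percent-encoded inside a component.
    assert(encodeComponent("a&b=c/d") == "a%26b%3Dc%2Fd");

    // encode() targets whole URIs: delimiters stay as they are, the space is escaped.
    assert(encode("http://example.com/?q=a b") == "http://example.com/?q=a%20b");
}
```

The split between encode and encodeComponent mirrors encodeURI and encodeURIComponent in JavaScript.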

https://issues.dlang.org/show_bug.cgi?id=15382

berni44 <bugzilla@d-ecke.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |bugzilla@d-ecke.de
         Resolution|---                         |INVALID

--- Comment #2 from berni44 <bugzilla@d-ecke.de> ---
This is more a question of how std.uri works than a bug report. Please use the
forum [1] for such questions in the future.

[1] https://forum.dlang.org/group/learn

--

January 24, 2021

https://issues.dlang.org/show_bug.cgi?id=15382

Stefan <kdevel@vogtner.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
                 CC|                            |kdevel@vogtner.de
         Resolution|INVALID                     |---

--- Comment #3 from Stefan <kdevel@vogtner.de> ---
According to §§ 2.2 and 2.3 of RFC 3986 [1], there are the following character classes:

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

The code in phobos/std/uri.d defines a different set of character classes (source line numbers on the left):

     62         uflags['#'] |= URI_Hash;
     66             uflags[c] |= URI_Alpha;
     67             uflags[c + 0x20] |= URI_Alpha;   // lowercase letters
     69         foreach (c; '0' .. '9' + 1) uflags[c] |= URI_Digit;
     70         foreach (c; ";/?:@&=+$,")   uflags[c] |= URI_Reserved;
     71         foreach (c; "-_.!~*'()")    uflags[c] |= URI_Mark;

If encodeComponent is used, URI_Encode is invoked with unescapedSet = URI_Alpha | URI_Digit | URI_Mark. As a result, some characters that RFC 3986 classifies as reserved are not encoded, e.g. ! or (.
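To see exactly which RFC 3986 reserved characters pass through encodeComponent unescaped, one can run each of them through the function and keep those that come back unchanged. The following is a minimal sketch; the names rfc3986Reserved and reservedLeftUnescaped are hypothetical and not part of Phobos.

```d
import std.stdio : writeln;
import std.uri : encodeComponent;

// Reserved set from RFC 3986 section 2.2: gen-delims followed by sub-delims.
enum rfc3986Reserved = ":/?#[]@" ~ "!$&'()*+,;=";

/// Hypothetical helper: returns the reserved characters that
/// encodeComponent leaves unescaped, in the order listed above.
string reservedLeftUnescaped()
{
    string result;
    foreach (char c; rfc3986Reserved)
    {
        const(char)[] s = [c];
        // An unchanged result means the character was not percent-encoded.
        if (encodeComponent(s) == s)
            result ~= c;
    }
    return result;
}

void main()
{
    // With the URI_Mark set "-_.!~*'()" this is the overlap with sub-delims.
    assert(reservedLeftUnescaped() == "!'()*");
    writeln(reservedLeftUnescaped());
}
```

The five characters reported are exactly the sub-delims that RFC 2396 still listed as "mark" characters.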

The notion of mark characters stems from the obsoleted RFC 2396 [2]. RFC 3986 explains the changes in its Appendix D.2 [3].

[1] https://tools.ietf.org/html/rfc3986#section-2
[2] https://tools.ietf.org/html/rfc2396#section-2.3
[3] https://tools.ietf.org/html/rfc3986#appendix-D.2
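
For callers who want every RFC 3986 reserved character escaped (the PHP urlencode-like behavior discussed in this thread), a strict encoder can be written on top of the "unreserved" class alone. This is a sketch under stated assumptions, not a proposed Phobos patch: encodeComponentStrict is a hypothetical name, and the function works on UTF-8 code units, so non-ASCII input is encoded byte by byte.

```d
import std.array : appender;
import std.ascii : isAlphaNum;
import std.format : formattedWrite;

/// Hypothetical helper, not part of Phobos: percent-encodes every byte
/// that is not in the RFC 3986 "unreserved" set (ALPHA / DIGIT / - . _ ~).
string encodeComponentStrict(const(char)[] component)
{
    auto buf = appender!string();
    foreach (char c; component) // iterate over UTF-8 code units
    {
        if (isAlphaNum(c) || c == '-' || c == '.' || c == '_' || c == '~')
            buf.put(c);
        else
            buf.formattedWrite("%%%02X", cast(ubyte) c); // e.g. '!' becomes %21
    }
    return buf.data;
}

void main()
{
    assert(encodeComponentStrict("a!b(c)*") == "a%21b%28c%29%2A");
    assert(encodeComponentStrict("x-y.z_~") == "x-y.z_~");
}
```

Escaping more than strictly necessary is always safe for data, since RFC 3986 treats a percent-encoded reserved character as plain data rather than as a delimiter.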

--
