[Issue 15382] std.uri has an incorrect set of reserved characters
April 08, 2016
https://issues.dlang.org/show_bug.cgi?id=15382

Eugene Wissner <belka@caraus.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |belka@caraus.de

--- Comment #1 from Eugene Wissner <belka@caraus.de> ---
Look at "2.4.  When to Encode or Decode":
"the only time when octets within a URI are percent-encoded is during the
process of producing the URI from its component parts.  This is when an
implementation determines which of the reserved characters are to be used as
subcomponent delimiters and which can be safely used as data."

So reserved characters may be encoded, but they don't have to be. Only characters
used as delimiters in a particular URL scheme must be encoded. Wikipedia
distinguishes between reserved characters with and without reserved meaning.
I tested it quickly in Firefox, and Firefox doesn't seem to encode characters
like * or ().
The behavior of encodeComponent is actually exactly the same as that of
encodeURIComponent in JavaScript. The behavior described in the issue is how
PHP's urlencode works, which encodes all reserved characters.
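
The behavior described in this comment can be checked with a short program. This is only a minimal sketch, assuming std.uri behaves as described here; the sample strings are made up for illustration.

```d
import std.uri : encodeComponent;

void main()
{
    // '*', '(' and ')' pass through encodeComponent unchanged.
    assert(encodeComponent("a*(b)") == "a*(b)");

    // '=' and '&' are percent-encoded, using uppercase hex digits.
    assert(encodeComponent("k=v&x") == "k%3Dv%26x");

    // A space is not allowed in a URI at all and is always encoded.
    assert(encodeComponent("a b") == "a%20b");
}
```

If all assertions pass, the program exits silently.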

--
December 23, 2019
https://issues.dlang.org/show_bug.cgi?id=15382

berni44 <bugzilla@d-ecke.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
                 CC|                            |bugzilla@d-ecke.de
         Resolution|---                         |INVALID

--- Comment #2 from berni44 <bugzilla@d-ecke.de> ---
This is more a question about how std.uri works than a bug report. Please use the forum [1] for such questions in the future.

[1] https://forum.dlang.org/group/learn

--
January 24, 2021
https://issues.dlang.org/show_bug.cgi?id=15382

Stefan <kdevel@vogtner.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
                 CC|                            |kdevel@vogtner.de
         Resolution|INVALID                     |---

--- Comment #3 from Stefan <kdevel@vogtner.de> ---
According to § 2.2 of RFC 3986 [1] there are the following character classes:

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

The code in phobos/std/uri.d references these character classes instead:

     uflags['#'] |= URI_Hash;
     // ...
         uflags[c] |= URI_Alpha;
         uflags[c + 0x20] |= URI_Alpha;   // lowercase letters
     // ...
     foreach (c; '0' .. '9' + 1) uflags[c] |= URI_Digit;
     foreach (c; ";/?:@&=+$,")   uflags[c] |= URI_Reserved;
     foreach (c; "-_.!~*'()")    uflags[c] |= URI_Mark;

If encodeComponent is used, URI_Encode is invoked with unescapedSet = URI_Alpha | URI_Digit | URI_Mark. This leaves some reserved characters unencoded, e.g. ! or (.

The notion of mark characters stems from the obsoleted RFC 2396 [2]. RFC 3986 explains the changes in its Appendix D.2 [3].
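
For illustration, encoding against the RFC 3986 classes could look roughly like the sketch below. The helper name encodeComponentStrict is hypothetical and not part of Phobos; it assumes the component should be treated purely as data, so every byte outside the unreserved set is percent-encoded.

```d
import std.array : appender;
import std.ascii : isAlphaNum;
import std.format : format;
import std.uri : encodeComponent;

/// Hypothetical helper (not part of Phobos): percent-encodes every UTF-8
/// byte that is not in the RFC 3986 "unreserved" set.
string encodeComponentStrict(string component)
{
    auto result = appender!string();
    // foreach over a string with an explicit char type visits UTF-8 code units.
    foreach (char c; component)
    {
        // unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
        if (isAlphaNum(c) || c == '-' || c == '.' || c == '_' || c == '~')
            result.put(c);
        else
            result.put(format("%%%02X", cast(ubyte) c));
    }
    return result.data;
}

void main()
{
    // std.uri keeps the RFC 2396 mark characters as they are ...
    assert(encodeComponent("a!(b)*") == "a!(b)*");
    // ... while the strict sketch encodes them as reserved characters.
    assert(encodeComponentStrict("a!(b)*") == "a%21%28b%29%2A");
    // Unreserved characters are never encoded.
    assert(encodeComponentStrict("A-z_0.9~") == "A-z_0.9~");
}
```

Encoding everything outside the unreserved set is the most conservative choice for data. Per § 2.4, a producer that knows which delimiters its scheme actually uses may leave more characters unencoded.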

[1] https://tools.ietf.org/html/rfc3986#section-2
[2] https://tools.ietf.org/html/rfc2396#section-2.3
[3] https://tools.ietf.org/html/rfc3986#appendix-D.2

--