Thread overview
[dmd-internals] named character entities, the spec, and dmd
Jul 31, 2012
Jonathan M Davis
Jul 31, 2012
Jonathan M Davis
Jul 31, 2012
Walter Bright
Jul 31, 2012
Jonathan M Davis
Jul 31, 2012
Don Clugston
July 30, 2012
This is the D spec page on named character entities:

http://dlang.org/entity.html

From what I can tell, it's essentially (if not exactly) what HTML 4 lists:

http://www.htmlhelp.com/reference/html40/entities/
http://www.w3.org/TR/html4/sgml/entities.html

However, what dmd seems to be using is essentially what HTML 5 lists:

http://www.w3.org/TR/html5/named-character-references.html

though it appears to have taken its list from here:

http://www.w3.org/2003/entities/2007/w3centities-f.ent

The problem is (aside from the fact that the D spec and dmd don't match) that we appear to be dealing with a moving target here, since HTML is a moving target. Should the D spec and dmd target a specific version of HTML? If so, can that list change later (e.g. moving from HTML 4 to 5 or 5 to 6 whenever 6 comes along)? And if it's HTML 5, HTML 5 itself isn't finalized yet, so _that_'s potentially a moving target. Or should the D spec just make its own list and stick with that (in which case, it would presumably match one of the HTML specs initially but may not do so in the long run)? I assume that we _don't_ want to take the approach of letting the implementation define whatever entities it feels like, even with the caveat that they're supposed to have come from one of the HTML specs.

From what I can tell, two names were redefined (&lang; and &rang;) in HTML 5, but other than that, it's purely additive. I ran into this because I'm working on a lexer for D, and I created a unit test to check all of the named entities I had against what dmd did, and those two didn't match. Looking at entity.c in dmd, it's worse than that, in that there are far more defined there than in the spec, but regardless, it's clearly a potential implementation issue if anything following the spec is going to match dmd - especially if the spec is a moving target in this case.
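
The comparison described above can be sketched roughly as follows. This is only an illustration in Python, chosen because its standard library happens to ship both tables: `html.entities.name2codepoint` (the HTML 4 names) and `html.entities.html5` (the HTML 5 names). Those tables are stand-ins here; they are not dmd's entity.c or the D spec page, and the variable names `redefined` and `added` are made up for this sketch.

```python
from html.entities import html5, name2codepoint

# name2codepoint: HTML 4 entity name (no ';') -> single code point.
# html5: HTML 5 entity name (normally ending in ';') -> replacement string.
redefined = {}
for name, cp in name2codepoint.items():
    old = chr(cp)
    new = html5.get(name + ";")  # None if HTML 5 lacks the name
    if new != old:
        redefined[name] = (old, new)

# HTML 5 names that have no HTML 4 counterpart (the additive part).
html4_names = {name + ";" for name in name2codepoint}
added = [n for n in html5 if n.endswith(";") and n not in html4_names]

for name, (old, new) in sorted(redefined.items()):
    new_text = "missing" if new is None else " ".join(f"U+{ord(c):04X}" for c in new)
    print(f"&{name}; HTML 4: U+{ord(old):04X}, HTML 5: {new_text}")
print(len(added), "names exist only in the HTML 5 table")
```

A lexer test suite could run the same kind of diff against whichever list the spec ends up naming, so that any redefinition shows up as a failing case rather than a silent mismatch.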

What _I_ would be tempted to go for is to have the D spec specifically state that it supports the list of named entities that HTML 5 does, giving a link to the current HTML 5 spec and then update that link and dmd (and the changelog) whenever the HTML 5 spec changes and then leave it at the final draft of HTML 5 once that comes around. That way, we don't have to list every single entity in the spec ourselves and it's clearly defined what we currently support. It _does_ present a slightly moving target for the moment that way, but I suspect that the named character entities aren't changing much in the HTML 5 spec, and since dmd is _already_ supporting them, I'm not sure that it's reasonable to say that we're supporting HTML 4 (which is what the spec currently seems to match though it doesn't say so).

Thoughts? I'm perfectly willing to go and create whatever pull requests are necessary for dmd and d-programming-language.org to fix this, but we need a decision of some kind on how we want to proceed.

- Jonathan M Davis
_______________________________________________
dmd-internals mailing list
dmd-internals@puremagic.com
http://lists.puremagic.com/mailman/listinfo/dmd-internals

July 30, 2012
On Monday, July 30, 2012 22:51:58 Jonathan M Davis wrote:
> Thoughts? I'm perfectly willing to go and create whatever pull requests are necessary for dmd and d-programming-language.org to fix this, but we need a decision of some kind on how we want to proceed.

By the way, it does look like some named entities in HTML 5 are two code points in length rather than just one, and dmd does not support those (and I don't know how it could given that a named entity can be used in a character literal as well as a string literal), so we'll need to say that we don't support those regardless. But that's easy enough to do if we went with the approach of saying in the D spec that we followed the HTML 5 spec - you just say that we support the HTML 5 list of named character entities which are a single code point. But it _does_ mean that we have to say more than that we follow the HTML 5 spec for named character entities.
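
To see which names are affected, one could filter the HTML 5 table for entities that expand to more than one code point. The sketch below again uses Python's `html.entities.html5` table as a stand-in for the HTML 5 list (an assumption; it is not the list dmd uses), and `fits_in_char_literal` is a hypothetical helper named for this example only.

```python
from html.entities import html5

# len() on a Python str counts code points, so a length above 1 means
# the entity cannot be held in a single dchar-sized value.
multi = {name: text for name, text in html5.items() if len(text) > 1}

def fits_in_char_literal(name):
    """Return True if the entity (name includes the trailing ';') is one code point."""
    return len(html5[name]) == 1

# Show a few of the multi-code-point entities with their code points.
for name in sorted(multi)[:5]:
    points = " ".join(f"U+{ord(c):04X}" for c in multi[name])
    print(f"&{name} -> {points}")
print(len(multi), "entities expand to more than one code point")
```

A spec rule of the form "HTML 5 entities that are a single code point" would correspond to accepting exactly the names for which a check like `fits_in_char_literal` holds.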

- Jonathan M Davis

July 30, 2012
On 7/30/2012 11:02 PM, Jonathan M Davis wrote:
> On Monday, July 30, 2012 22:51:58 Jonathan M Davis wrote:
>> Thoughts? I'm perfectly willing to go and create whatever pull requests are
>> necessary for dmd and d-programming-language.org to fix this, but we need a
>> decision of some kind on how we want to proceed.
> By the way, it does look like some named entities in HTML 5 are two code
> points in length rather than just one, and dmd does not support those (and I
> don't know how it could given that a named entity can be used in a character
> literal as well as a string literal), so we'll need to say that we don't
> support those regardless. But that's easy enough to do if we went with the
> approach of saying in the D spec that we followed the HTML 5 spec - you just
> say that we support the HTML 5 list of named character entities which are a
> single code point. But it _does_ mean that we have to say more than that we
> follow the HTML 5 spec for named character entities.
>

Simply say that we support the HTML 5 spec. We can get the two code point ones to work.

July 30, 2012
On Monday, July 30, 2012 23:37:50 Walter Bright wrote:
> Simply say that we support the HTML 5 spec.

Okay.

> We can get the two code point ones to work.

How? A dchar is a single code point. We could theoretically make them work in string literals (though that does complicate things a bit), but I don't see how that would be possible with character literals.

- Jonathan M Davis

July 31, 2012
On 31 July 2012 08:43, Jonathan M Davis <jmdavisProg@gmx.com> wrote:
> On Monday, July 30, 2012 23:37:50 Walter Bright wrote:
>> Simply say that we support the HTML 5 spec.
>
> Okay.
>
>> We can get the two code point ones to work.
>
> How? A dchar is a single code point. We could theoretically make them work in string literals (though that does complicate things a bit), but I don't see how that would be possible with character literals.

It just doesn't fit into a dchar. We have that situation already, char x = 'ä', doesn't compile.
It's OK.
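
The analogy can be made concrete by counting code units. In D, char is a UTF-8 code unit, wchar a UTF-16 code unit, and dchar a UTF-32 code unit. Assuming the character in the example is ä (U+00E4), this small Python sketch counts how many units of each size it needs; the `units` dictionary and its labels are illustrative only.

```python
text = "\u00e4"  # ä, assumed to be the character in the example above

# Encode without a byte-order mark and count code units of each width.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

units = {
    "char (UTF-8)": len(utf8),         # 1 byte per code unit
    "wchar (UTF-16)": len(utf16) // 2, # 2 bytes per code unit
    "dchar (UTF-32)": len(utf32) // 4, # 4 bytes per code unit
}

for kind, count in units.items():
    print(f"{kind}: {count} code unit(s)")
```

The character needs two UTF-8 code units, so it cannot be stored in a single char, in the same way that a two-code-point entity cannot be stored in a single dchar.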