January 29, 2011
http://d.puremagic.com/issues/show_bug.cgi?id=5221



--- Comment #10 from Iain Buclaw <ibuclaw@ubuntu.com> 2011-01-29 02:43:45 PST ---
By the way, given the larger list size, I'd recommend a binary search (bsearch) to look up the entities.
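
A minimal sketch of what such a lookup could look like, using the standard C bsearch() over a table sorted in strcmp() order. The struct layout, the function name entity_lookup and the six-entry excerpt are assumptions for illustration only; the real entity.c may be organised differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table entry; the real entity.c layout may differ. */
struct NameId {
    const char *name;
    unsigned value;   /* Unicode code point */
};

/* Tiny excerpt, sorted in strcmp() order (uppercase sorts before lowercase). */
static const struct NameId entities[] = {
    { "AElig", 0x00C6 },
    { "amp",   0x0026 },
    { "gt",    0x003E },
    { "lt",    0x003C },
    { "nbsp",  0x00A0 },
    { "quot",  0x0022 },
};

/* bsearch() passes the key first and the table element second. */
static int compare_entry(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct NameId *)elem)->name);
}

/* Returns the code point for `name`, or -1 if it is not in the table. */
long entity_lookup(const char *name)
{
    assert(name != NULL);
    const struct NameId *hit = bsearch(name, entities,
        sizeof entities / sizeof entities[0], sizeof entities[0],
        compare_entry);
    return hit ? (long)hit->value : -1;
}
```

The table has to stay sorted with the same comparison that the search uses, otherwise some names silently stop being found.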

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------

--- Comment #11 from Aziz Köksal <aziz.koeksal@gmail.com> 2011-01-29 08:53:01 PST ---
Good to see somebody is working on this. :)

Here's some info I could gather on the entities in that file:

Almost all entities look like this:

<!ENTITY AElig            "&#x000C6;" ><!--LATIN CAPITAL LETTER AE -->

The odd ones that stand out are:

<!ENTITY AMP              "&#38;#38;" ><!--AMPERSAND -->
// The name has a dot (you already noticed.)
<!ENTITY b.Delta          "&#x1D6AB;" ><!--MATHEMATICAL BOLD CAPITAL DELTA -->
// Notice the leading space in the value.
<!ENTITY DotDot           " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->
// The entity has two characters in its value.
<!ENTITY bne              "&#x0003D;&#x020E5;" ><!--EQUALS SIGN with reverse slash -->
// Double chars + initial &#38;.
<!ENTITY nvlt             "&#38;#x0003C;&#x020D2;" ><!--LESS-THAN SIGN with vertical line -->
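
To make the different value shapes above concrete, here is a rough sketch of a decoder that turns such a value into a list of code points. It is a simplification rather than a real XML processor: it treats a leading "&#38;" as an escaped ampersand that starts the next reference, and it skips spaces, which assumes the leading space in DotDot carries no meaning. The function name decode_entity_value is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Decodes a DTD entity value such as "&#38;#x0003C;&#x020D2;" into code
 * points. Returns how many were stored, or -1 on malformed input or when
 * `out` is too small. Simplified sketch, not a conforming XML parser. */
int decode_entity_value(const char *s, unsigned long *out, int max)
{
    assert(s != NULL && out != NULL);
    int n = 0;
    for (;;) {
        while (*s == ' ')
            s++;                    /* assumption: padding is insignificant */
        if (*s == '\0')
            return n;
        if (strncmp(s, "&#38;#", 6) == 0)
            s += 5;                 /* "&#38;" is an escaped '&'; now at '#' */
        else if (strncmp(s, "&#", 2) == 0)
            s += 1;                 /* plain reference; now at '#' */
        else
            return -1;
        s++;                        /* skip '#' */
        int base = 10;
        if (*s == 'x') {
            base = 16;
            s++;
        }
        char *end;
        unsigned long cp = strtoul(s, &end, base);
        if (end == s || *end != ';' || n == max)
            return -1;
        out[n++] = cp;
        s = end + 1;                /* continue after ';' */
    }
}
```

With this, AMP decodes to the single code point 0x26, while bne and nvlt each decode to two code points.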


I was wondering what constitutes valid names for HTML entities and after a long
and hard search I found this page:
http://www.w3.org/TR/REC-xml/#NT-Name

Basically, an entity name may contain many more characters than just alphanumeric ones. However, I would not recommend adjusting the lexer to recognise all of them. We should allow only the characters that actually occur in the list, because that keeps things a lot simpler. Therefore the "." char should be allowed as well.
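
As a sketch of that restricted rule, a name check might look like the following. The function name is_entity_name is hypothetical, and the rule that a name must start with a letter is my assumption about the list, not something taken from the DMD lexer.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if `name` starts with an ASCII letter and otherwise contains
 * only ASCII letters, digits and '.', else 0. Assumes the "C" locale. */
int is_entity_name(const char *name)
{
    assert(name != NULL);
    if (!isalpha((unsigned char)name[0]))
        return 0;                       /* also rejects the empty string */
    for (size_t i = 1; name[i] != '\0'; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '.')
            return 0;
    }
    return 1;
}
```

This accepts names such as b.Delta while still rejecting everything else the XML Name production would allow.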

But what to do about those entities that define two replacement characters? Again, to keep things simple and efficient, let's just leave them out of the lookup-table in the compiler.


--- Comment #12 from Iain Buclaw <ibuclaw@ubuntu.com> 2011-01-29 11:59:07 PST ---
(In reply to comment #11)
> The odd ones that stand out are:
> 
> <!ENTITY AMP              "&#38;#38;" ><!--AMPERSAND -->

amp is 0x0026.

> // The name has a dot (you already noticed.)
> <!ENTITY b.Delta          "&#x1D6AB;" ><!--MATHEMATICAL BOLD CAPITAL DELTA -->

Yep, the DMD parser can't handle it.

> // Notice the leading space in the value.
> <!ENTITY DotDot           " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->

Means nothing.

> // The entity has two characters in its value.
> <!ENTITY bne              "&#x0003D;&#x020E5;" ><!--EQUALS SIGN with reverse slash -->

They are called composite characters: a base character followed by a combining mark.

Try this inside main(), with std.stdio imported:
writeln(cast(dchar)0x003D);
writeln(cast(dchar)0x003D, cast(dchar)0x20E5);

and spot the difference.
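
For anyone without a D compiler at hand, the same point can be seen at the byte level. The sketch below is a plain UTF-8 encoder in C (the name utf8_encode is made up here): the single '=' is one byte, while the bne pair becomes four bytes that a terminal renders as one glyph.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes one code point as UTF-8 into buf (room for at least 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    assert(buf != NULL);
    if (cp < 0x80) {
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding 0x003D gives the byte 3D, and 0x20E5 gives E2 83 A5, so bne is the byte sequence 3D E2 83 A5.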

> // Double chars + initial &#38;.
> <!ENTITY nvlt             "&#38;#x0003C;&#x020D2;" ><!--LESS-THAN SIGN with vertical line -->
> 

nvlt is 0x003C,0x20D2.

Regards


Iain Buclaw <ibuclaw@ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #887 is|0                           |1
           obsolete|                            |


--- Comment #13 from Iain Buclaw <ibuclaw@ubuntu.com> 2011-01-30 06:28:20 PST ---
Created an attachment (id=889)
new entity.c source based off new w3 list

Dang diggity, just realised that some items are in the wrong order.

Uploading corrected source.
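
Ordering mistakes like this are easy to catch mechanically. A small sketch of such a check, assuming the table is meant to be in strict strcmp() order as a binary search (see comment #10) would require; the function name first_unsorted is hypothetical and not part of the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first name that is not strictly greater than its
 * predecessor in strcmp() order, or 0 if the whole array is sorted.
 * Index 0 can never be out of order, so 0 is free to mean "all good". */
size_t first_unsorted(const char *const names[], size_t count)
{
    assert(names != NULL);
    for (size_t i = 1; i < count; i++)
        if (strcmp(names[i - 1], names[i]) >= 0)
            return i;
    return 0;
}
```

Because the comparison is strict, the check also flags duplicate names.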


Iain Buclaw <ibuclaw@ubuntu.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #888 is|0                           |1
           obsolete|                            |


--- Comment #14 from Iain Buclaw <ibuclaw@ubuntu.com> 2011-01-30 06:29:42 PST ---
Created an attachment (id=890)
diff between file and donc/dmd/src/entity.c

And corrected diff between file and Don's copy.


--- Comment #15 from Don <clugdbug@yahoo.com.au> 2011-01-31 01:08:57 PST ---
The DMD test suite chokes on &lang;:
the test expects 9001 (== U+2329), but in the new list it is U+27E8.

This really scared me, because I found a few web references that listed &lang;
as U+2329.
http://www.fileformat.info/info/unicode/char/2329/index.htm

Turns out that U+2329 and U+27E8 are visually almost identical.

I found this helpful note in http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references#endnote_lang

lang: 'mathematical left angle bracket' is NOT the same character as U+003C 'less than', or U+2039 'single left-pointing angle quotation mark', or U+2329 'left-pointing angle bracket', or U+3008 'left angle bracket'.

I finally found what has happened: U+27E8 was added in Unicode 3.2.0.
In the book "Unicode Explained", p. 423, it says that U+27E8 is poorly supported
(because it was a recent addition to Unicode) and that U+2329 is a more
practical choice. But U+2329 is canonically equivalent to U+3008, which is
intended for use with Chinese-Japanese-Korean ideographs, so it can look wrong
if it goes through a normalization process.
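
To illustrate the normalization hazard, the sketch below applies only the canonical singleton mappings for the two old angle brackets and leaves every other code point alone. It is nowhere near a real normalizer, and the function name normalize_angle_bracket is invented for this example.

```c
#include <assert.h>

/* Applies the canonical singleton mappings for the two angle brackets only;
 * every other code point is returned unchanged. Not a real normalizer. */
unsigned long normalize_angle_bracket(unsigned long cp)
{
    assert(cp <= 0x10FFFF);
    switch (cp) {
    case 0x2329: return 0x3008; /* LEFT-POINTING ANGLE BRACKET  -> LEFT ANGLE BRACKET */
    case 0x232A: return 0x3009; /* RIGHT-POINTING ANGLE BRACKET -> RIGHT ANGLE BRACKET */
    default:     return cp;     /* U+27E8 has no such mapping and stays as it is */
    }
}
```

So a &lang; stored as 9001 (0x2329) comes out of normalization as the CJK bracket U+3008, while U+27E8 survives unchanged.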

That book was published in 2006. Can we assume that Unicode support is widespread enough now that we should change to the more correct value?


--- Comment #16 from Iain Buclaw <ibuclaw@ubuntu.com> 2011-01-31 03:36:26 PST ---
(In reply to comment #15)
> That book was published in 2006. Can we assume that Unicode support is widespread enough now that we should change to the more correct value?

I would assume it is safe to change. Having looked up the reference myself, it appears to have been added as part of the HTML5 standard?


--- Comment #17 from Don <clugdbug@yahoo.com.au> 2011-01-31 14:17:09 PST ---
(In reply to comment #16)
> (In reply to comment #15)
> > That book was published in 2006. Can we assume that Unicode support is widespread enough now that we should change to the more correct value?
> 
> I would assume it is safe to change. Having looked up the reference myself, it appears to have been added as part of the HTML5 standard?

Yes. The old definition was in HTML 4.01, back in 1999. And here's the problem: HTML5 still hasn't been ratified, and the documents all say "draft".

I wrote these next two lines:
"So, I'm not sure that we can use these entity values yet. Are we certain that
they're not going to change them before HTML5 becomes official?"

and then I read this:
http://blog.whatwg.org/html-is-the-new-html5
indicating that HTML5 will never happen, and the draft documents are as
"standard" as it's ever going to get. That's a spectacular failure.

So I guess there's no reason to hold off on the patch.


--- Comment #18 from Aziz Köksal <aziz.koeksal@gmail.com> 2011-02-04 05:03:04 PST ---
(In reply to comment #12)
> > // Notice the leading space in the value.
> > <!ENTITY DotDot           " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->
> 
> Means nothing.

I guess you're right, since in the following list the value of each of those combining characters consists of only one Unicode code point:

http://www.w3.org/2003/entities/2007doc/byalpha.html

For &DotDot; it's U+20DC.


Don <clugdbug@yahoo.com.au> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED


--- Comment #19 from Don <clugdbug@yahoo.com.au> 2011-02-06 13:41:15 PST ---
Fixed: https://github.com/D-Programming-Language/dmd/commit/b46fe402cff4618f5d49f99d71b8fefb764e16e5
