November 02, 2009
[Issue 3465] New: isIdeographic can be wrong in std.xml
http://d.puremagic.com/issues/show_bug.cgi?id=3465

          Summary: isIdeographic can be wrong in std.xml
          Product: D
          Version: 2.035
         Platform: Other
       OS/Version: All
           Status: NEW
         Severity: minor
         Priority: P2
        Component: Phobos
       AssignedTo: nobody@puremagic.com
       ReportedBy: y0uf00bar@gmail.com


--- Comment #0 from hed010gy <y0uf00bar@gmail.com> 2009-11-01 21:51:25 PST ---
The std.xml function isIdeographic caused my parser to fail one of the XML
conformance tests for the character 0x4E00.

// As implemented in XML Piece Parser Project,  http://source.miryn.org/
// but I took it from std.xml

//WRONG in std.xml
//invariant IdeographicTable=[0x4E00,0x9FA5,0x3007,0x3007,0x3021,0x3029];

//RIGHT, because for the lookup function,
// the table's range pairs must be in ascending order!
dchar[] IdeographicTable=[0x3007,0x3007,0x3021,0x3029,0x4E00,0x9FA5];
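
To illustrate why the ordering matters, here is a minimal sketch. The body of lookup is not shown in this thread, so lookupSketch below is a hypothetical stand-in that assumes a binary search over a flat array of [low, high] pairs. Under that assumption the unsorted table misses 0x4E00, while the sorted one finds it.

```d
// Hypothetical stand-in for lookup (an assumption, not the std.xml source):
// binary search over a flat array of [low, high] pairs sorted ascending.
bool lookupSketch(const(dchar)[] table, dchar c)
{
    while (table.length != 0)
    {
        // Index of the low end of the middle pair (always even).
        auto m = (table.length >> 1) & ~cast(size_t) 1;
        if (c < table[m])
            table = table[0 .. m];       // continue in the lower pairs
        else if (c > table[m + 1])
            table = table[m + 2 .. $];   // continue in the upper pairs
        else
            return true;                 // table[m] <= c <= table[m + 1]
    }
    return false;
}

void main()
{
    immutable dchar[] unsorted = [0x4E00, 0x9FA5, 0x3007, 0x3007, 0x3021, 0x3029];
    immutable dchar[] sorted   = [0x3007, 0x3007, 0x3021, 0x3029, 0x4E00, 0x9FA5];

    // With the unsorted table the search walks past the 0x4E00 pair.
    assert(!lookupSketch(unsorted, 0x4E00));
    // With the sorted table the same character is found.
    assert(lookupSketch(sorted, 0x4E00));
    // The small ranges happen to be found either way.
    assert(lookupSketch(unsorted, 0x3007));
    assert(lookupSketch(sorted, 0x3021));
}
```

A binary search discards half of the pairs at each step based on a comparison with the middle pair, so a pair stored out of order can end up in the discarded half.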

// PERFORMANCE SUGGESTION
// A table lookup is best for larger tables. For a smaller table like
// this one, a hard-coded search will surely be faster.


// Not much more code is generated for this, and it is faster,
// since there is no function call to lookup and no array slices are used.

bool isIdeographic(dchar c)
{
   if (c == 0x3007)
       return true;
   if (c >= 0x3007 && c <= 0x3029)
       return true;
   if (c >= 0x4E00 && c <= 0x9FA5)
       return true;
   return false;
}

// Only a suggestion here.
// isChar has to be called for every single character in the document,
//    so it must be worth a bit of optimisation,
//     especially for the common cases.

/**
* Returns true if the character is a character according to the XML standard
* Character references must refer to one of these.
* Any Unicode character, excluding the surrogate blocks, FFFE, and FFFF.
* #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
* Avoid [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
* Standards: $(LINK2 http://www.w3.org/TR/1998/REC-xml-19980210, XML 1.0)
*
* Params:
*    c = the character to be tested
*    The standard ASCII case gets at most 3 value comparisons.
 */
bool isChar(dchar c) 
{
   if (c <= 0xD7FF)
   {
       if (c >= 0x20)
       {
           if (c >= 0x7F)
           {
               if (c <= 0x84)
                   return false;
               if (c >= 0x86)
               {
                   if (c <= 0x9F)
                       return false;
               }
           }
           return true;
       }
       switch(c)
       {
       case 0xA:
       case 0x9:
       case 0xD:
           return true;
       default:
           return false;
       }
   }
   else if (c >= 0xE000)
   {
       if (c < 0xFFFE)
       {
           if (c >= 0xFDD0 && c <= 0xFDEF)
               return false;
           return true;
       }
       if (c >= 0x10000)
       {
           if (c <= 0x10FFFF)
           {
       /* some conformance tests have the 0x10FFFF
               if ((c & 0xFFFE) == 0xFFFE)
               {
                   return false; 
               }
       */
               return true;
           }
       }
   }
   return false;
}
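
For comparison, the Char production quoted in the doc comment above can be transcribed directly as a chain of range tests. This is only a sketch: the name isCharPlain is hypothetical, and it deliberately leaves out the extra "Avoid" ranges that the posted isChar rejects, so the two functions disagree on characters such as 0x7F.

```d
// Direct transcription of the quoted production (isCharPlain is a
// hypothetical name, not part of std.xml):
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
bool isCharPlain(dchar c)
{
    return c == 0x9 || c == 0xA || c == 0xD
        || (c >= 0x20 && c <= 0xD7FF)
        || (c >= 0xE000 && c <= 0xFFFD)
        || (c >= 0x10000 && c <= 0x10FFFF);
}

void main()
{
    assert(isCharPlain('A'));        // ordinary ASCII
    assert(isCharPlain(0x9));        // tab
    assert(!isCharPlain(0x0));       // other control characters are excluded
    assert(!isCharPlain(0xD800));    // surrogate block
    assert(!isCharPlain(0xFFFE));    // excluded by the production
    assert(isCharPlain(0x10FFFF));   // top of the last range
    assert(isCharPlain(0x7F));       // allowed here; the posted isChar rejects it
}
```

The nested version above does the same work with fewer comparisons for the ASCII case; the flat version is easier to check against the production.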

// Most digits are expected to be ASCII ones
bool isDigit(dchar c)
{
   if (c <= 0x0039 && c >= 0x0030)
       return true;
   else
       return lookup(DigitTable,c);
}

-- 
Configure issuemail: http://d.puremagic.com/issues/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
November 02, 2009



--- Comment #1 from hed010gy <y0uf00bar@gmail.com> 2009-11-01 21:58:11 PST ---
// A check on my code indicates afternoon doziness, so here is the better version

bool isIdeographic(dchar c)
{
   if (c == 0x3007)
       return true;
   if (c <= 0x3029 && c >= 0x3021 )
       return true;
   if (c <= 0x9FA5 && c >= 0x4E00)
       return true;
   return false;
}
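
A quick boundary check of this version, as a sketch: the function body is copied from the comment above, and the test values are my own choice, picked at the edges of each range. The first version in Comment #0 tested c >= 0x3007 for the second range, so it also accepted 0x3008 through 0x3020.

```d
// Corrected version, copied from Comment #1.
bool isIdeographic(dchar c)
{
    if (c == 0x3007)
        return true;
    if (c <= 0x3029 && c >= 0x3021)
        return true;
    if (c <= 0x9FA5 && c >= 0x4E00)
        return true;
    return false;
}

void main()
{
    assert(isIdeographic(0x3007));
    assert(!isIdeographic(0x3008));  // in the gap; the Comment #0 version accepted this
    assert(!isIdeographic(0x3020));  // last character of the gap
    assert(isIdeographic(0x3021));
    assert(isIdeographic(0x3029));
    assert(!isIdeographic(0x302A));
    assert(isIdeographic(0x4E00));   // the character from the failing conformance test
    assert(isIdeographic(0x9FA5));
    assert(!isIdeographic(0x9FA6));
}
```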

May 24, 2010


Shin Fujishiro <rsinfu@gmail.com> changed:

          What    |Removed                     |Added
----------------------------------------------------------------------------
            Status|NEW                         |ASSIGNED
                CC|                            |rsinfu@gmail.com
        AssignedTo|nobody@puremagic.com        |rsinfu@gmail.com


May 24, 2010


Shin Fujishiro <rsinfu@gmail.com> changed:

          What    |Removed                     |Added
----------------------------------------------------------------------------
            Status|ASSIGNED                    |RESOLVED
        Resolution|                            |FIXED


--- Comment #2 from Shin Fujishiro <rsinfu@gmail.com> 2010-05-23 21:36:54 PDT ---
Fixed in svn r1552.
Thanks for your contribution!

Excuse me: I removed a certain part of your code from the actual commit. The
contributed code took care of newer Unicode standards. I like new things, but
as far as supporting XML 1.0, we have to stick to Unicode 2.0.
