January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kühne | Thomas Kühne wrote:
> | It *should* just stop dead when it finds one, e.g.:
> | error("invalid UTF-8 sequence");
>
> I don't think the compiler should try to check the comment's content.
Why not? It checks the rest of the file...
> What is an "invalid" UTF-8 sequence?
I just think it should treat comments the same way it treats identifiers and literals.
That is, call utf_decodeChar and follow whatever error it returns... (utf.c)
--anders
|
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sebastian Beschke | I see. Thanks, Sebastian.
But text with erroneous utf sequences (not "non UTF character", sic! ) will not be compiled anyway.
So, regarding "... other things like text/code editors. What is supposed to happen when a non UTF character is encountered?...": a good editor should mark them as "bad string literal" or the like.
Andrew Fedoniouk.
http://terrainformatica.com
"Sebastian Beschke" <s.beschke@gmx.de> wrote in message news:ctgkta$48a$1@digitaldaemon.com...
> Andrew Fedoniouk wrote:
>> What does "non UTF character" mean?
>> UTFs are the forms of representing/encoding the full Unicode table - 21-bit
>> characters (code points).
>>
>> So "non UTF character" sounds to me like "non Unicode character". And what
>> is that?
>> Some new alphabet?
>
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by producing malformed UTF-8 or UTF-16 sequences.
>
> -Sebastian
|
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrew Fedoniouk | Andrew Fedoniouk wrote:
> But text with erroneous utf sequences (not "non UTF character", sic! ) will not be compiled anyway.
A bug in the current DMD makes it allow almost everything, in comments.
--anders
|
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I dont think the compiler should try to check the comment's content.
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)
The current checks for identifiers are:
1) shortest possible byte sequence for UTF-8
   OK
2) no lone surrogate part
   That might clash with pre-1.5 Java output. This is a Java bug, thus can be ignored.
3) c <= 0x10FFFF
   OK
4) c != 0xFFFE && c != 0xFFFF
   That's the only check I reject. Those codepoints can occur if a non-Unicode document is converted to UTF-encoded Unicode. Inside of comments they shouldn't stop the parsing.
The checks above are - except for the 4th - reasonable for comments.
Thomas
|
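For illustration only, the four checks listed in the post above can be sketched as a small C decoder. This is not the code in dmd/utf.c: the names `utf8_decode`, `enum utf8_status` and the `allow_nonchar` flag are hypothetical, and the flag merely stands in for the proposal to skip check 4 inside comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for this sketch; not taken from dmd/utf.c. */
enum utf8_status {
    UTF8_OK = 0,
    UTF8_TRUNCATED,     /* sequence runs past the end of the buffer */
    UTF8_BAD_BYTE,      /* invalid lead byte or invalid continuation byte */
    UTF8_OVERLONG,      /* check 1: not the shortest possible encoding */
    UTF8_SURROGATE,     /* check 2: U+D800..U+DFFF */
    UTF8_OUT_OF_RANGE,  /* check 3: above U+10FFFF */
    UTF8_NONCHARACTER   /* check 4: U+FFFE or U+FFFF */
};

/* Decode one code point starting at s[*idx]. On UTF8_OK, *idx is advanced
 * past the sequence. *c is written as soon as the payload bits have been
 * assembled, so it is also set for the errors of checks 1 to 4.
 * allow_nonchar != 0 skips check 4 (the relaxation proposed for comments). */
enum utf8_status utf8_decode(const unsigned char *s, size_t len,
                             size_t *idx, uint32_t *c, int allow_nonchar)
{
    /* Smallest code point that legitimately needs n bytes. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t i, n, k;
    uint32_t v;
    unsigned char b;

    assert(s != NULL && idx != NULL && c != NULL);
    i = *idx;
    if (i >= len) return UTF8_TRUNCATED;

    b = s[i];
    if (b < 0x80)                { n = 1; v = b; }
    else if ((b & 0xE0) == 0xC0) { n = 2; v = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { n = 3; v = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { n = 4; v = b & 0x07; }
    else return UTF8_BAD_BYTE;   /* stray continuation byte or 0xF8..0xFF */

    if (n > len - i) return UTF8_TRUNCATED;
    for (k = 1; k < n; k++) {
        if ((s[i + k] & 0xC0) != 0x80) return UTF8_BAD_BYTE;
        v = (v << 6) | (uint32_t)(s[i + k] & 0x3F);
    }
    *c = v;

    if (v < min_for_len[n])          return UTF8_OVERLONG;      /* check 1 */
    if (v >= 0xD800 && v <= 0xDFFF)  return UTF8_SURROGATE;     /* check 2 */
    if (v > 0x10FFFF)                return UTF8_OUT_OF_RANGE;  /* check 3 */
    if (!allow_nonchar && (v == 0xFFFE || v == 0xFFFF))
        return UTF8_NONCHARACTER;                               /* check 4 */

    *idx = i + n;
    return UTF8_OK;
}
```

In this sketch a lexer would pass `allow_nonchar = 0` for identifiers and literals and `allow_nonchar = 1` while scanning a comment, so that U+FFFE and U+FFFF are tolerated there but all other checks still apply.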
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kühne | Thomas Kühne wrote:
> 4) c != 0xFFFE && c != 0xFFFF
> That's the only check I reject. Those codepoints can occur if a
> non-Unicode document is converted to UTF encoded Unicode. Inside of
> comments they shouldn't stop the parsing.
>
> Those checks above are - except for the 4th - reasonable for comments.
If needed, that can be hacked around for comments, for those two.
> s = utf_decodeChar(octet, ndigits, &idx, &c);
> if (s || idx != ndigits)
can be changed into:
    s = utf_decodeChar(octet, ndigits, &idx, &c);
    if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
Would that make it more reasonable? (I have the DMD patch ready...)
--anders
|
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | I wrote:
>> Looks like DMD allows that in comments and I don't think it's a good
>> idea.
[...]
> Would that make it more reasonable ? (have the DMD patch ready...)
Hilarious, the new patch made Phobos fail:
> ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence
Due to this little comment line, from GDC:
> Modified by David Friedman, October 2004 (applied patches from Anders F Björklund.)
(as the ö here was in Latin-1, you see...)
--anders
|
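To make the failure above concrete: in Latin-1 the letter ö is the single byte 0xF6, whereas UTF-8 encodes U+00F6 as the two bytes 0xC3 0xB6. A byte 0xF6 followed by an ASCII 'r' cannot be well-formed UTF-8, because 'r' is not a continuation byte. The following is a minimal sketch of the conversion that would have made such a comment acceptable; `latin1_to_utf8` is a hypothetical helper, not part of DMD, GDC or Phobos.

```c
#include <assert.h>
#include <stddef.h>

/* Convert Latin-1 (ISO-8859-1) bytes to UTF-8.
 * Every Latin-1 byte b is the code point U+00b, so bytes below 0x80 are
 * copied unchanged and all others become a two-byte UTF-8 sequence.
 * The caller must provide room for 2 * len bytes in out.
 * Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t i, n = 0;

    assert(in != NULL && out != NULL);
    for (i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b < 0x80) {
            out[n++] = b;                                   /* ASCII */
        } else {
            out[n++] = (unsigned char)(0xC0 | (b >> 6));    /* lead byte */
            out[n++] = (unsigned char)(0x80 | (b & 0x3F));  /* continuation */
        }
    }
    return n;
}
```

Running the four Latin-1 bytes of "Björ" through this function yields five bytes, with 0xF6 replaced by 0xC3 0xB6.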
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Anders F Björklund wrote:
| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF That's the only check I reject. Those
|> codepoints can occur if a non-Unicode document is converted to UTF
|> encoded Unicode. Inside of comments they shouldn't stop the
|> parsing.
|>
|> Those checks above are - except for the 4th - reasonable for
|> comments.
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)
Have a look at utf_decodeChar: dmd/utf.c:92 and dmd/utf.c:183 ;)
While looking through utf.c I noticed that UTF-32 decoding doesn't undergo any checks. I'll write a bunch of test cases for all those encoding issues tomorrow.
Thomas
|
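As a rough idea of what checks on UTF-32 input could look like, here is a minimal sketch that applies the range and surrogate rules from the list discussed earlier to 32-bit code units. The function names are hypothetical and nothing here is taken from utf.c; whether U+FFFE and U+FFFF should also be rejected is left open, as in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A UTF-32 code unit is acceptable if it is at most U+10FFFF and is not
 * a surrogate (U+D800..U+DFFF). Returns 1 if acceptable, 0 otherwise. */
int utf32_is_valid_scalar(uint32_t c)
{
    return c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
}

/* Return the index of the first unacceptable code unit in s[0..len),
 * or len if every unit passes. */
size_t utf32_first_invalid(const uint32_t *s, size_t len)
{
    size_t i;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (!utf32_is_valid_scalar(s[i]))
            return i;
    }
    return len;
}
```

A caller can compare the result against `len` to decide whether to report an error, and use the returned index to point at the offending position.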
January 29, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | > A bug in the current DMD makes it allow almost everything, in comments.
It's a feature rather than a bug.
Preparation for attributed programming I guess. With option to include binary data inline :)
I can imagine properties/methods having their own descriptive GIFs given in the source text as bytes. Le Cauchemar!
BTW: Are there any ports of png/jpeg/gif libs in D?
Andrew Fedoniouk.
http://terrainformatica.com
|
January 30, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Thomas Kühne | Thomas Kühne wrote:
> | can be changed into:
> |
> | s = utf_decodeChar(octet, ndigits, &idx, &c);
> | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
> |
> | Would that make it more reasonable ? (have the DMD patch ready...)
>
> Have a look at utf_decodeChar:
> dmd/utf.c:92 and dmd/utf.c:183 ;)
Yes, the idea is that it will not be valid and
return string "invalid UTF-8 sequence", which
is then ignored because the char is FFFE/F...
(all input is converted to UTF-8 before lexer)
The patch is in the digitalmars.D.bugs group.
--anders
|
January 30, 2005 Re: Non UTF characters in comments | ||||
---|---|---|---|---|
| ||||
Posted in reply to Andrew Fedoniouk | Andrew Fedoniouk wrote:
> Preparation for attributed programming I guess. With option to include binary data inline :)
:-)
No, it's a bug. D source code is supposed to be valid UTF-8/16/32.
Ideally, the HTML used should be made to be valid XHTML as well...
--anders
|