January 29, 2005
Thomas Kühne wrote:

> | It *should* just stop dead when it finds one, e.g.:
> | error("invalid UTF-8 sequence");
> 
> I don't think the compiler should try to check the comment's content.

Why not ? It checks the rest of the file...

> What is an "invalid" UTF-8 sequence?

I just think it should treat comments the
same way it treats identifiers and literals ?

That is, call: utf_decodeChar and follow
whatever error that it returns... (utf.c)

--anders
January 29, 2005
I see. Thanks, Sebastian.

But text with erroneous UTF sequences (not "non UTF character", sic!) will not be compiled anyway.

So for "... other things like text/code editors. What is supposed to happen
when a non UTF character is encountered?..." a (good) editor should mark
them as "bad string literal" or the like.

Andrew Fedoniouk.
http://terrainformatica.com


"Sebastian Beschke" <s.beschke@gmx.de> wrote in message news:ctgkta$48a$1@digitaldaemon.com...
> Andrew Fedoniouk schrieb:
>> What does "non UTF character" mean?
>> UTFs are forms of representing/encoding the full UNICODE table of 21-bit
>> characters (code points).
>>
>> So "non UTF character" sounds to me like "non UNICODE character". And what
>> is that?
>> Some new alphabet?
>
> Invalid sequences *are* possible, either by using code points in the table that aren't defined or by malforming UTF-8 or UTF-16 sequences.
>
> -Sebastian


January 29, 2005
Andrew Fedoniouk wrote:

> But text with erroneous UTF sequences (not "non UTF character", sic!) will not be compiled anyway.

A bug in the current DMD makes it allow almost everything, in comments.

--anders
January 29, 2005
Anders F Björklund wrote:

| Thomas Kühne wrote:
|
|> | It *should* just stop dead when it finds one, e.g.:
|> | error("invalid UTF-8 sequence");
|>
|> I don't think the compiler should try to check the comment's content.
|
|
| Why not ? It checks the rest of the file...
|
|> What is an "invalid" UTF-8 sequence?
|
| I just think it should treat comments the
| same way it treats identifiers and literals ?
|
| That is, call: utf_decodeChar and follow
| whatever error that it returns... (utf.c)

The current checks for identifiers are:
1) shortest possible byte sequence for UTF-8
OK

2) no lone surrogate part
That might clash with pre-1.5 Java output.
This is a Java bug, and thus can be ignored.

3) c <= 0x10FFFF
OK

4) c != 0xFFFE && c != 0xFFFF
That's the only check I reject. Those codepoints can occur if a
non-Unicode document is converted to UTF encoded Unicode. Inside of
comments they shouldn't stop the parsing.

Those checks above are - except for the 4th - reasonable for comments.
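
The four checks listed above can be sketched roughly as follows. This is only an illustration, not the code in dmd/utf.c: the function name `decode_checked` and the `in_comment` flag are hypothetical, and the flag simply models the suggestion that check 4 be skipped inside comments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not DMD's utf_decodeChar.
   Decodes one UTF-8 sequence starting at s[*pidx] (buffer length len).
   Returns NULL on success (and stores the code point and the new index),
   or an error message if any of the checks fails. */
const char *decode_checked(const unsigned char *s, size_t len,
                           size_t *pidx, uint32_t *pc, bool in_comment)
{
    static const char err[] = "invalid UTF-8 sequence";
    size_t i = *pidx;
    if (i >= len)
        return err;

    unsigned char b = s[i];
    uint32_t c;     /* code point being assembled */
    size_t n;       /* total number of bytes in the sequence */
    uint32_t min;   /* smallest code point that needs n bytes */

    if (b < 0x80)                { c = b;        n = 1; min = 0; }
    else if ((b & 0xE0) == 0xC0) { c = b & 0x1F; n = 2; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { c = b & 0x0F; n = 3; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { c = b & 0x07; n = 4; min = 0x10000; }
    else
        return err;              /* stray continuation byte, or 0xF8..0xFF */

    if (len - i < n)
        return err;              /* sequence is truncated */
    for (size_t k = 1; k < n; k++) {
        unsigned char t = s[i + k];
        if ((t & 0xC0) != 0x80)
            return err;          /* not a continuation byte */
        c = (c << 6) | (t & 0x3F);
    }

    if (c < min)
        return err;              /* 1) not the shortest possible form */
    if (c >= 0xD800 && c <= 0xDFFF)
        return err;              /* 2) surrogate code point */
    if (c > 0x10FFFF)
        return err;              /* 3) beyond the Unicode range */
    if (!in_comment && (c == 0xFFFE || c == 0xFFFF))
        return err;              /* 4) applied only outside comments here */

    *pc = c;
    *pidx = i + n;
    return NULL;
}
```

With this sketch, the bytes EF BF BE (U+FFFE) are rejected when `in_comment` is false and accepted when it is true, while overlong forms, surrogates and out-of-range values are rejected either way.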

Thomas


January 29, 2005
Thomas Kühne wrote:

> 4) c != 0xFFFE && c != 0xFFFF
> That's the only check I reject. Those codepoints can occur if a
> non-Unicode document is converted to UTF encoded Unicode. Inside of
> comments they shouldn't stop the parsing.
> 
> Those checks above are - except for the 4th - reasonable for comments.

If needed, that can be hacked around for comments, for those two.

> 	s = utf_decodeChar(octet, ndigits, &idx, &c);
> 	if (s || idx != ndigits)

can be changed into:

	s = utf_decodeChar(octet, ndigits, &idx, &c);
	if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)

Would that make it more reasonable ? (have the DMD patch ready...)

--anders
January 29, 2005
I wrote:

>> Looks like DMD allows that in comments and I don't think it's a good
>> idea.
[...]
> Would that make it more reasonable ? (have the DMD patch ready...)

Hilarious, the new patch made phobos fail:

> ../gcc-3.4.3/gcc/d/phobos/std/loader.d:62: invalid UTF-8 sequence

Due to this little comment line, from GDC:

>    Modified by David Friedman, October 2004 (applied patches from Anders F Björklund.)

(as the ö here was in Latin-1, you see...)
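
For illustration: in Latin-1 the ö is the single byte 0xF6, which in UTF-8 would have to start a multi-byte sequence, so the following plain "r" makes the sequence malformed. One way to repair such a file is to transcode it. A minimal sketch, assuming the whole input really is Latin-1 (the function name `latin1_to_utf8` is made up for this example):

```c
#include <stddef.h>

/* Hypothetical helper: transcode Latin-1 (ISO-8859-1) bytes to UTF-8.
   dst must have room for 2 * n bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[out++] = b;                                  /* ASCII: copied unchanged */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (b >> 6));   /* lead byte: 0xC2 or 0xC3 */
            dst[out++] = (unsigned char)(0x80 | (b & 0x3F)); /* continuation byte */
        }
    }
    return out;
}
```

The Latin-1 byte 0xF6 becomes the two bytes 0xC3 0xB6, which is the UTF-8 encoding of ö.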

--anders
January 29, 2005
Anders F Björklund schrieb:
| Thomas Kühne wrote:
|
|> 4) c != 0xFFFE && c != 0xFFFF That's the only check I reject. Those
|> codepoints can occur if a non-Unicode document is converted to UTF
|> encoded Unicode. Inside of comments they shouldn't stop the
|> parsing.
|>
|> Those checks above are - except for the 4th - reasonable for
|> comments.
|
|
| If needed, that can be hacked around for comments, for those two.
|
|> s = utf_decodeChar(octet, ndigits, &idx, &c);
|> if (s || idx != ndigits)
|
|
| can be changed into:
|
| s = utf_decodeChar(octet, ndigits, &idx, &c);
| if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
|
| Would that make it more reasonable ? (have the DMD patch ready...)

Have a look at utf_decodeChar:
dmd/utf.c:92 and dmd/utf.c:183 ;)

While looking through utf.c I noticed that UTF-32 decoding doesn't
undergo any checks. I'll write a bunch of test cases for all those
encoding issues tomorrow.
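
As a rough idea of what a check on UTF-32 input could look like (a sketch only, not taken from utf.c; the function name is hypothetical): each 32-bit unit is already a code point, so only the range and surrogate checks apply.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: a UTF-32 code unit is valid if it lies within the
   Unicode range and is not a surrogate code point. */
bool utf32_unit_is_valid(uint32_t c)
{
    if (c > 0x10FFFF)
        return false;   /* beyond the last Unicode code point */
    if (c >= 0xD800 && c <= 0xDFFF)
        return false;   /* surrogates are reserved for UTF-16 */
    return true;
}
```

Whether 0xFFFE and 0xFFFF should also be rejected is the same question as check 4 above, so the sketch leaves them alone.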

Thomas

January 29, 2005
> A bug in the current DMD makes it allow almost everything, in comments.

It's a feature rather than a bug.

Preparation for attributed programming I guess. With option to include
binary data inline :)
I can imagine properties/methods having their own descriptive GIFs given in
the source text as bytes. Le Cauchemar!

BTW: Are there any ports of png/jpeg/gif libs in D?

Andrew Fedoniouk.
http://terrainformatica.com


January 30, 2005
Thomas Kühne wrote:

> | can be changed into:
> |
> | s = utf_decodeChar(octet, ndigits, &idx, &c);
> | if ((s && c != 0xFFFE && c != 0xFFFF) || idx != ndigits)
> |
> | Would that make it more reasonable ? (have the DMD patch ready...)
> 
> Have a look at utf_decodeChar:
> dmd/utf.c:92 and dmd/utf.c:183 ;)

Yes, the idea is that it will not be valid and will
return the string "invalid UTF-8 sequence", which
is then ignored because the char is FFFE/F...
(all input is converted to UTF-8 before the lexer)

The patch is in the digitalmars.D.bugs group.

--anders
January 30, 2005
Andrew Fedoniouk wrote:

> Preparation for attributed programming I guess. With option to include binary data inline :)

:-)

No, it's a bug. D source code is supposed to be valid UTF-8/16/32.

Ideally, the HTML used should be made to be valid XHTML as well...

--anders