January 29, 2005
Non UTF characters in comments
I'm not sure if allowing non UTF characters in comments is such a good  
idea. It seems to be complicating my parser, and it will probably  
complicate other things like text/code editors. What is supposed to happen  
when a non UTF character is encountered? Should it display a question  
mark, display nothing, use the current code page? What if the editor  
doesn't know about D's comments?
I might not have mentioned this, but since D is supposed to be easily
parsed, this might be an issue; a special case.
- Chris
January 29, 2005
Re: Non UTF characters in comments
Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen when a non UTF character is encountered? Should it display a
| question mark, display nothing, use the current code page? What if the
| editor doesn't know about D's comments?

Maybe I am misreading your post.
Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for
all characters that can't be displayed by other means - e.g. a generic
glyph displaying the codepoint or the code range. (Depending on your
situation you might also use U+FFFC).
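
As a rough illustration of the U+FFFD convention (a Python sketch, not something from the thread; the sample bytes are made up), a lenient decoder can substitute the replacement character for a byte it cannot decode:

```python
# Illustration only: the sample bytes are an assumption, not from the thread.
# b"caf\xe9" is "café" in Latin-1; the lone 0xE9 byte is not valid UTF-8.
raw = b"// caf\xe9 au lait"

# errors="replace" substitutes U+FFFD for input that cannot be decoded.
text = raw.decode("utf-8", errors="replace")

REPLACEMENT = "\ufffd"
print(REPLACEMENT in text)       # True
print(text.count(REPLACEMENT))   # 1
```

An editor that follows this convention would draw whatever glyph its font provides for U+FFFD at that position.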

http://www.unicode.org

Thomas
January 29, 2005
Re: Non UTF characters in comments
What does "non UTF character" mean?
UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
characters (code points).

So "non UTF character" sounds to me like "non UNICODE character". And what is
that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com



"Vathix" <vathix@dprogramming.com> wrote in message 
news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good 
> idea. It seems to be complicating my parser, and it will probably 
> complicate other things like text/code editors. What is supposed to happen 
> when a non UTF character is encountered? Should it display a question 
> mark, display nothing, use the current code page? What if the editor 
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris
January 29, 2005
Re: Non UTF characters in comments
> Are you trying to use 2 different encodings in one file?

Looks like DMD allows that in comments and I don't think it's a good idea.
January 29, 2005
Re: Non UTF characters in comments
Andrew Fedoniouk schrieb:
> What does "non UTF character" mean?
> UTFs are the forms of representing/encoding the full UNICODE table - 21-bit
> characters (code points).
>
> So "non UTF character" sounds to me like "non UNICODE character". And what is
> that?
> Some new alphabet?

Invalid sequences *are* possible by using codepoints in the table that 
aren't defined, or by misforming UTF-8 or UTF-16 sequences.
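
A small Python sketch of the malformed-sequence cases (the sample byte strings and the helper name `is_valid` are assumptions chosen for illustration). Note that Python's strict decoders do not reject undefined code points, so only the malformed-sequence half of the point is covered here:

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes cleanly under a strict decoder."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical samples, one per kind of malformation.
samples = {
    "lone continuation byte": (b"\x80", "utf-8"),
    "truncated 2-byte sequence": (b"\xc3", "utf-8"),
    "overlong encoding of '/'": (b"\xc0\xaf", "utf-8"),
    "unpaired high surrogate": (b"\x00\xd8\x41\x00", "utf-16-le"),
    "well-formed text": ("Björklund".encode("utf-8"), "utf-8"),
}

for name, (data, enc) in samples.items():
    print(f"{name}: {'valid' if is_valid(data, enc) else 'invalid'}")
```

Only the last sample is reported as valid.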

-Sebastian
January 29, 2005
Re: Non UTF characters in comments
> So "non UTF character" sounds to me like "non UNICODE character". And what is
> that?
> Some new alphabet?
>

A value in the file that causes std.utf functions to throw an exception  
because it's invalid. I'm not good at this stuff and I don't know all the  
proper terminology.
January 29, 2005
Re: Non UTF characters in comments
Sebastian Beschke wrote:

> Invalid sequences *are* possible by using codepoints in the table that 
> aren't defined, or by misforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence", for everything outside ASCII.
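
This is easy to reproduce. The Python sketch below (sample strings are assumptions) encodes a name as Latin-1 and then tries to read the bytes back as UTF-8. It also shows a caveat: a few Latin-1 byte pairs happen to form valid UTF-8, so the error is very likely for real text but not strictly guaranteed.

```python
name = "Björklund"
latin1_bytes = name.encode("latin-1")   # b'Bj\xf6rklund'

try:
    latin1_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    outcome = f"invalid UTF-8 at byte offset {exc.start}"
print(outcome)   # invalid UTF-8 at byte offset 2

# Caveat: the Latin-1 text "Ã©" is the byte pair C3 A9, which is also the
# valid UTF-8 encoding of "é", so this one decodes without any error.
lucky = "Ã©".encode("latin-1").decode("utf-8")
print(lucky == "é")   # True
```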

--anders
January 29, 2005
Re: Non UTF characters in comments
"Vathix" <vathix@dprogramming.com> wrote in message
news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.
January 29, 2005
Re: Non UTF characters in comments
Vathix wrote:

>>> I'm not sure if allowing non UTF characters in comments is such a good idea.

>> Are you trying to use 2 different encodings in one file?
> 
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments,
until it finds the end of the current comment run.

And that's probably not a good idea, but simpler...
(otherwise you would have to check all non-ASCIIs)
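
A minimal sketch of that kind of byte-wise skipping, written in Python for brevity. This is not DMD's code: the function name `skip_block_comment` and the sample input are hypothetical, and only `/* ... */` comments are handled.

```python
def skip_block_comment(src: bytes, pos: int) -> int:
    """Return the index just past the '*/' closing the comment at pos.

    Hypothetical helper: the comment body is never decoded, so any
    bytes at all are accepted inside it.
    """
    if not src.startswith(b"/*", pos):
        raise ValueError("not at the start of a block comment")
    end = src.find(b"*/", pos + 2)
    if end == -1:
        raise ValueError("unterminated comment")
    return end + 2

# 0xE9 is not valid UTF-8 here, yet the scanner passes over it silently.
src = b"/* caf\xe9 */ int x;"
after = skip_block_comment(src, 0)
print(src[after:])   # b' int x;'
```

Searching for the ASCII bytes `*/` is safe in well-formed UTF-8, because ASCII byte values never occur inside a multi-byte sequence; the cost is that malformed input goes unnoticed.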


You still cannot use such invalid UTF sequences for
anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end ?
(i.e. don't abuse this, since it'll be fixed one day)


Says http://www.digitalmars.com/d/lex.html:

>  D source text can be in one of the following formats:
> 
>     * ASCII
>     * UTF-8
>     * UTF-16BE
>     * UTF-16LE
>     * UTF-32BE
>     * UTF-32LE 

This implies that *all* source input should be
valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
error("invalid UTF-8 sequence");
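
A sketch of what "stop dead" could look like, again in Python and under stated assumptions: the names `SourceEncodingError` and `check_utf8_source` and the `file(line)` message format are invented for this example, and only UTF-8 input is considered.

```python
class SourceEncodingError(Exception):
    """Hypothetical error type for this sketch."""


def check_utf8_source(data: bytes, filename: str = "<source>") -> str:
    """Decode the whole source strictly, comments included."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Count newlines before the bad byte to get a 1-based line number.
        line = data.count(b"\n", 0, exc.start) + 1
        raise SourceEncodingError(
            f"{filename}({line}): invalid UTF-8 sequence"
        ) from None


try:
    check_utf8_source(b"int x;\n// caf\xe9\n", "test.d")
except SourceEncodingError as err:
    print(err)   # test.d(2): invalid UTF-8 sequence
```

Validating the file once, before lexing, keeps the comment scanner itself simple.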
		
--anders

PS. A nice feature would be to have the frontend
    convert from other encodings as well, but it
    would just add unneeded complexity since there
    are a *lot* of possible encodings out there (200)
January 29, 2005
Re: Non UTF characters in comments
Anders F Björklund wrote:

| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|
|>> Are you trying to use 2 different encodings in one file?
|>
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.

[snip]

| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence?

How would you e.g. handle Java's pre 1.5 "customised" UTF-8?
(encoding >U-FFFF as UTF-16 surrogates, each encoded as its own UTF-8 sequence)

- Granted, we might agree on overlong sequences, but how about
unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?
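
To make these questions concrete, here is how one particular strict decoder answers some of them. This is a Python sketch: it shows Python's choices only, not what a D compiler does or should do, and the sample byte strings are assumptions. U+0378 is used as an example of a code point that is unassigned in the Unicode versions I know of.

```python
import unicodedata


def strict_utf8_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


cases = {
    "overlong NUL (C0 80), as in Java's modified UTF-8": b"\xc0\x80",
    "U+10000 as two encoded surrogates": b"\xed\xa0\x80\xed\xb0\x80",
    "value above U+10FFFF (F4 90 80 80)": b"\xf4\x90\x80\x80",
    "unassigned code point U+0378": "\u0378".encode("utf-8"),
    "private use U+E000": "\ue000".encode("utf-8"),
    "replacement character U+FFFD": "\ufffd".encode("utf-8"),
}

for label, data in cases.items():
    print(f"{label}: {'accepted' if strict_utf8_ok(data) else 'rejected'}")

# Normalization is a separate layer: both spellings of "é" are valid UTF-8.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")   # True
print(unicodedata.category("\ue000"))                        # Co
```

The first three cases are rejected and the last three accepted: this decoder checks well-formedness only and takes no position on assignment, private use, or normalization.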

Thomas