Non UTF characters in comments
January 29, 2005
I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments?
I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.
- Chris
January 29, 2005
Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen when a non UTF character is encountered? Should it display a
| question mark, display nothing, use the current code page? What if the
| editor doesn't know about D's comments?

Maybe I am misreading your post.
Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for
all characters that can't be displayed by other means - e.g. a generic
glyph displaying the codepoint or the code range. (Depending on your
situation you might also use U+FFFC).
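
As a small illustration of the U+FFFD convention, here is a sketch in Python (chosen only for brevity; the thread itself is about D). It assumes the input is meant to be UTF-8 and relies on the standard `errors="replace"` decoding mode, which substitutes U+FFFD for byte sequences it cannot decode. The helper name `decode_for_display` is made up for this example.

```python
def decode_for_display(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for anything undecodable."""
    return raw.decode("utf-8", errors="replace")


# "café" encoded as Latin-1: the trailing byte 0xE9 is not valid UTF-8 here.
shown = decode_for_display(b"caf\xe9")
print(repr(shown))
```

An editor could then render the replacement character with whatever glyph its font provides, instead of guessing at a code page.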

http://www.unicode.org

Thomas
January 29, 2005
What does "non UTF character" mean?
UTFs are forms of representing/encoding the full UNICODE table - 21-bit
characters (code points).

So "non UTF character" sounds to me like "non UNICODE character". And what is
that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com



"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris


January 29, 2005
> Are you trying to use 2 different encodings in one file?

Looks like DMD allows that in comments and I don't think it's a good idea.
January 29, 2005
Andrew Fedoniouk schrieb:
> What does "non UTF character" mean?
> UTFs are forms of representing/encoding the full UNICODE table - 21-bit characters (code points).
> 
> So "non UTF character" sounds to me like "non UNICODE character". And what is that?
> Some new alphabet?

Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by misforming UTF-8 or UTF-16 sequences.

-Sebastian
January 29, 2005
> So "non UTF character" sounds to me like "non UNICODE character". And what is
> that?
> Some new alphabet?
>

A value in the file that causes std.utf functions to throw an exception because it's invalid. I'm not good at this stuff and I don't know all the proper terminology.
January 29, 2005
Sebastian Beschke wrote:

> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by misforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence", for everything outside ASCII.
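
A quick way to see this, sketched in Python rather than D: encode some text as Latin-1 and hand the bytes to a strict UTF-8 decoder. The helper `utf8_error` is a hypothetical name for this example. One caveat, added here as an assumption-free observation about the two encodings: a few Latin-1 byte runs do happen to form well-formed UTF-8, so the failure is very likely rather than guaranteed.

```python
def utf8_error(raw: bytes):
    """Return None if raw is well-formed UTF-8, else the decoder's exception."""
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as exc:
        return exc


# "Bjorklund" with o-umlaut (U+00F6); in Latin-1 that letter is the byte 0xF6.
latin1_bytes = "Bj\u00f6rklund".encode("latin-1")
err = utf8_error(latin1_bytes)
print(err)  # the decoder reports the offending byte and its position

# Caveat: U+00C3 U+00B6 in Latin-1 is C3 B6, which is also UTF-8 for U+00F6.
lucky = "\u00c3\u00b6".encode("latin-1")
print(utf8_error(lucky))
```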

--anders
January 29, 2005
"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.


January 29, 2005
Vathix wrote:

>>> I'm not sure if allowing non UTF characters in comments is such a good idea.

>> Are you trying to use 2 different encodings in one file?
> 
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments,
until it finds the end of the current comment run.

And that's probably not a good idea, but simpler...
(otherwise you would have to check all non-ASCIIs)


You still cannot use such invalid UTF sequences for
anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end?
(i.e. don't abuse this, since it'll be fixed one day)


Says http://www.digitalmars.com/d/lex.html:

>  D source text can be in one of the following formats:
> 
>     * ASCII
>     * UTF-8
>     * UTF-16BE
>     * UTF-16LE
>     * UTF-32BE
>     * UTF-32LE 

This implies that *all* source input should be
valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
error("invalid UTF-8 sequence");
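
To make the difference concrete, here is a rough sketch (in Python, not the actual DMD lexer code) of a scanner that validates comment bodies instead of skipping them. It is deliberately simplified: it only looks at `/* ... */` comments, ignores string literals, `//` comments and nesting `/+ +/` comments, and assumes the source is UTF-8. The function name `check_block_comments` is invented for this example.

```python
def check_block_comments(source: bytes) -> list:
    """Return byte offsets of the first invalid UTF-8 byte in each bad comment."""
    errors = []
    pos = 0
    while True:
        start = source.find(b"/*", pos)
        if start == -1:
            break
        end = source.find(b"*/", start + 2)
        if end == -1:
            end = len(source)  # unterminated comment: scan to end of input
        body = source[start + 2:end]
        try:
            body.decode("utf-8")  # strict decode; raises on malformed input
        except UnicodeDecodeError as exc:
            errors.append(start + 2 + exc.start)
        pos = end + 2
    return errors


source = b"int x; /* caf\xe9 */ int y; /* ok */"
bad_offsets = check_block_comments(source)
print(bad_offsets)
```

Searching for the ASCII bytes `*/` is safe in UTF-8 because ASCII values never occur inside a multi-byte sequence; the extra cost is the strict decode of each comment body.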
		
--anders

PS. A nice feature would be to have the frontend
    convert from other encodings as well, but it
    would just add unneeded complexity since there
    are a *lot* of possible encodings out there (200)
January 29, 2005
Anders F Björklund wrote:

| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|
|>> Are you trying to use 2 different encodings in one file?
|>
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.

[snip]

| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence?

How would you e.g. handle Java's pre-1.5 "customised" UTF-8?
(encoding code points above U+FFFF as UTF-16 surrogate pairs, with each
surrogate encoded as its own 3-byte UTF-8 sequence)

- Granted, we might agree on overlong sequences, but how about
  unassigned codepoints?
- Does the input have to be normalized? Which normalization?
- Are you going to enforce the full Unicode spec? Which spec version?
- How about the PUA?
- How about >U+11FFFD?
- Is U+FFFD/U+FFFC allowed?
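
For reference, here is how one strict decoder answers some of these questions. This is a Python sketch showing only the behaviour of Python's built-in `utf-8` codec, not a statement of what DMD does or should do; `is_strict_utf8` and the sample labels are made up for illustration. Normalization and whether a code point is assigned are separate questions that a byte-level decoder does not address.

```python
def is_strict_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts the byte string."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "overlong '/' (C0 AF)": b"\xc0\xaf",
    "surrogate pair as two 3-byte sequences": b"\xed\xa0\xbd\xed\xb8\x80",
    "beyond U+10FFFF (U+110000)": b"\xf4\x90\x80\x80",
    "private use U+E000": b"\xee\x80\x80",
    "replacement character U+FFFD": b"\xef\xbf\xbd",
}
results = {label: is_strict_utf8(raw) for label, raw in samples.items()}
for label, ok in results.items():
    print(label, "accepted" if ok else "rejected")
```

With this codec the first three samples are rejected and the last two are accepted, so "invalid" here means ill-formed at the byte level, not "unassigned" or "unnormalized".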

Thomas

