Non UTF characters in comments

I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments?

I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.

- Chris

Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen when a non UTF character is encountered? Should it display a
| question mark, display nothing, use the current code page? What if the
| editor doesn't know about D's comments?

Maybe I am misreading your post. Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for all characters that can't be displayed by other means, e.g. a generic glyph displaying the codepoint or the code range. (Depending on your situation you might also use U+FFFC.)

http://www.unicode.org

Thomas
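As a rough illustration of the U+FFFD convention (in Python rather than D, and with a made-up input), a decoder can be asked to substitute the replacement character instead of failing. This is only a sketch of what an editor might do, not a description of any particular editor:

```python
# Hypothetical input: the name "Björklund" stored as Latin-1 bytes,
# which is not a valid UTF-8 sequence (0xF6 cannot start a UTF-8 character).
raw = b"Bj\xf6rklund"

# errors="replace" makes the decoder emit U+FFFD for the undecodable byte
# instead of raising an exception.
shown = raw.decode("utf-8", errors="replace")

print(repr(shown))  # the 'ö' byte shows up as '\ufffd'
```

Whether a tool displays U+FFFD, a question mark, or falls back to a code page is a policy choice; the snippet only shows the first option.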

What does "non UTF character" mean?
UTFs are forms of representing/encoding the full Unicode table of 21-bit characters (code points).

So "non UTF character" sounds to me like "non Unicode character". And what is that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com

"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Andrew Fedoniouk wrote:
> What does "non UTF character" mean?
> UTFs are forms of representing/encoding the full Unicode table of 21-bit characters (code points).
>
> So "non UTF character" sounds to me like "non Unicode character". And what is that?
> Some new alphabet?

Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by misforming UTF-8 or UTF-16 sequences.

-Sebastian

> So "non UTF character" sounds to me like "non Unicode character". And what is
> that?
> Some new alphabet?
>

A value in the file that causes std.utf functions to throw an exception because it's invalid. I'm not good at this stuff and I don't know all the proper terminology.

Sebastian Beschke wrote:
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by misforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence" for everything outside ASCII.

--anders
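The Latin-1 experiment is easy to reproduce. The sketch below uses Python instead of D, and `is_valid_utf8` is a helper name invented for this example:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "Björklund"
as_latin1 = text.encode("latin-1")  # 'ö' becomes the single byte 0xF6
as_utf8 = text.encode("utf-8")      # 'ö' becomes the two bytes 0xC3 0xB6

print(is_valid_utf8(as_latin1))     # the Latin-1 bytes are rejected
print(is_valid_utf8(as_utf8))       # the UTF-8 bytes are accepted
print(is_valid_utf8(b"plain ASCII"))
```

One caveat: a few Latin-1 byte pairs happen to form valid UTF-8 (the Latin-1 bytes for "Ã©" are 0xC3 0xA9, which is "é" in UTF-8), so the check catches typical Latin-1 text rather than literally every non-ASCII byte.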

"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.

January 29, 2005

Re: Non UTF characters in comments

Posted by Anders F Björklund
in reply to Vathix

Vathix wrote:

>>> I'm not sure if allowing non UTF characters in comments is such a good idea.

>> Are you trying to use 2 different encodings in one file?
> 
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments,
until it finds the end of the current comment run.

And that's probably not a good idea, but simpler...
(otherwise you would have to check all non-ASCIIs)

You still cannot use such invalid UTF sequences for
anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end ?
(i.e. don't abuse this, since it'll be fixed one day)
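To make the lenient behaviour concrete, here is a toy version in Python (not the actual DMD lexer). It assumes UTF-8 input, handles only `//` and `/* */` comments, and ignores string literals and D's nesting `/+ +/` comments; `strip_comments` and `lenient_check` are names invented for this sketch:

```python
def strip_comments(src: bytes) -> bytes:
    """Drop // and /* */ comments byte by byte, never decoding their contents."""
    out = bytearray()
    i, n = 0, len(src)
    while i < n:
        if src.startswith(b"//", i):
            end = src.find(b"\n", i)
            i = n if end == -1 else end        # leave the newline in place
        elif src.startswith(b"/*", i):
            end = src.find(b"*/", i + 2)
            i = n if end == -1 else end + 2    # skip past the closing */
        else:
            out.append(src[i])
            i += 1
    return bytes(out)

def lenient_check(src: bytes) -> bool:
    """Validate only the bytes outside comments, as a skipping lexer would."""
    try:
        strip_comments(src).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(lenient_check(b"int x; // Bj\xf6rklund\n"))        # bad byte hidden in a comment
print(lenient_check(b'char[] s = "Bj\xf6rklund";\n'))    # bad byte in a string literal
```

Searching for the raw bytes of `*/` or a newline is safe in UTF-8 because ASCII bytes never occur inside a multi-byte sequence, which is part of why skipping is the simpler option.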

Says http://www.digitalmars.com/d/lex.html:

>  D source text can be in one of the following formats:
> 
>     * ASCII
>     * UTF-8
>     * UTF-16BE
>     * UTF-16LE
>     * UTF-32BE
>     * UTF-32LE 

This implies that *all* source input should be
valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds an invalid sequence, e.g.:
error("invalid UTF-8 sequence");
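A strict front end along those lines could validate the whole buffer before lexing. The following is only a Python sketch of that idea; `InvalidSourceError` and `check_source` are hypothetical names, not part of DMD or any library:

```python
class InvalidSourceError(ValueError):
    """Raised when the source buffer is not valid UTF-8 (hypothetical type)."""

def check_source(data: bytes) -> str:
    """Decode the whole source as UTF-8 or stop with the offending byte offset."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise InvalidSourceError(
            f"invalid UTF-8 sequence at byte offset {exc.start}"
        ) from exc

try:
    check_source(b"int x; // Bj\xf6rklund\n")
except InvalidSourceError as err:
    print(err)  # reports where the first bad byte was found
```

Unlike a lexer that skips comment bytes, this rejects the file even though the bad byte sits inside a comment.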

--anders

PS. A nice feature would be to have the frontend
    convert from other encodings as well, but it
    would just add unneeded complexity since there
    are a *lot* of possible encodings out there (200)

Anders F Björklund wrote:
| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|>> Are you trying to use 2 different encodings in one file?
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.
[snip]
| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence? How would you e.g. handle Java's pre 1.5 "customised" UTF-8? (encoding >U-FFFF as UTF-16 surrogates encoded in 2 UTF-8 codepoints)

- Granted, we might agree on overlong sequences, but how about unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?

Thomas
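For comparison, here is how one existing strict decoder (Python's built-in UTF-8 codec) draws the line on several of these cases. This is an observation about that codec only, not a claim about what D should do; `strict_ok` is a helper invented for the example, and U+0378 is used on the assumption that it is still an unassigned code point:

```python
def strict_ok(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

cases = {
    "overlong '/' (C0 AF)": b"\xc0\xaf",
    "surrogate pair encoded as UTF-8, Java style": b"\xed\xa0\x80\xed\xb0\x80",
    "beyond U+10FFFF (F4 90 80 80)": b"\xf4\x90\x80\x80",
    "unassigned code point U+0378": "\u0378".encode("utf-8"),
    "private use area U+E000": "\ue000".encode("utf-8"),
    "replacement character U+FFFD": "\ufffd".encode("utf-8"),
}

results = {name: strict_ok(data) for name, data in cases.items()}
for name, ok in results.items():
    print(f"{name}: {'accepted' if ok else 'rejected'}")
```

The codec rejects the three structurally malformed inputs and accepts the other three, so it checks the encoding form but not assignment, normalization, or which Unicode version a code point belongs to.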
