January 29, 2005 Non UTF characters in comments

I'm not sure if allowing non UTF characters in comments is such a good idea. It seems to be complicating my parser, and it will probably complicate other things like text/code editors. What is supposed to happen when a non UTF character is encountered? Should it display a question mark, display nothing, use the current code page? What if the editor doesn't know about D's comments?

I might not have mentioned this, but since D is supposed to be easily parsed, this might be an issue; a special case.

- Chris
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Vathix

Vathix wrote:
| I'm not sure if allowing non UTF characters in comments is such a good
| idea. It seems to be complicating my parser, and it will probably
| complicate other things like text/code editors. What is supposed to
| happen when a non UTF character is encountered? Should it display a
| question mark, display nothing, use the current code page? What if the
| editor doesn't know about D's comments?

Maybe I am misreading your post. Are you trying to use 2 different encodings in one file?

Concerning Unicode: you are supposed to display the glyph of U+FFFD for all characters that can't be displayed by other means - e.g. a generic glyph displaying the codepoint or the code range. (Depending on your situation you might also use U+FFFC.)

http://www.unicode.org

Thomas
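A minimal sketch of the U+FFFD convention mentioned in this reply, written in Python purely for illustration (the thread itself concerns D). It assumes an editor that reads the file as UTF-8 and substitutes the replacement character for bytes it cannot decode; the helper name `display_text` is hypothetical.

```python
def display_text(raw: bytes) -> str:
    """Decode raw bytes as UTF-8 for display (hypothetical helper).

    Bytes that do not form valid UTF-8 are shown as U+FFFD
    (REPLACEMENT CHARACTER) instead of raising an error.
    """
    return raw.decode("utf-8", errors="replace")


# A comment saved as Latin-1: the byte 0xE9 ("é" in Latin-1) is not valid UTF-8 here.
latin1_comment = "// café".encode("latin-1")
shown = display_text(latin1_comment)
print(shown)
```

Whether a real editor should replace, guess a code page, or refuse to open the file is exactly the open question in the original post; this only shows the replacement option.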
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Vathix

What does it mean "non UTF character" ?
UTFs are the forms of representing/encoding full UNICODE table - 21-bit characters (code points).

So "non UTF character" sounds for me as "non UNICODE character". And what is that?
Some new alphabet?

Andrew Fedoniouk.
http://terrainformatica.com

"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Thomas Kühne

> Are you trying to use 2 different encodings in one file?

Looks like DMD allows that in comments and I don't think it's a good idea.
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Andrew Fedoniouk

Andrew Fedoniouk wrote:
> What does it mean "non UTF character" ?
> UTFs are the forms of representing/encoding full UNICODE table - 21-bit characters (code points).
>
> So "non UTF character" sounds for me as "non UNICODE character". And what is that?
> Some new alphabet?

Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by malforming UTF-8 or UTF-16 sequences.

-Sebastian
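To make "malformed sequences" concrete, here is a small Python sketch (Python is used only for illustration). It relies on Python's strict decoders to flag malformed input; `is_well_formed` is a hypothetical helper name. It covers only the malformed-sequence case, not undefined code points.

```python
def is_well_formed(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes strictly in the given encoding (hypothetical helper)."""
    try:
        raw.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


samples = [
    (b"\x80", "utf-8"),                  # continuation byte with no lead byte
    (b"\xc3", "utf-8"),                  # lead byte whose continuation byte is missing
    (b"\x00\xd8\x41\x00", "utf-16-le"),  # high surrogate D800 followed by 'A', not a low surrogate
    (b"\xc3\xa9", "utf-8"),              # well-formed encoding of U+00E9
]
for raw, enc in samples:
    print(raw, enc, is_well_formed(raw, enc))
```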
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Andrew Fedoniouk

> So "non UTF character" sounds for me as "non UNICODE character". And what is
> that?
> Some new alphabet?
>

A value in the file that causes std.utf functions to throw an exception because it's invalid. I'm not good at this stuff and I don't know all the proper terminology.
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Sebastian Beschke

Sebastian Beschke wrote:
> Invalid sequences *are* possible by using codepoints in the table that aren't defined, or by malforming UTF-8 or UTF-16 sequences.

A simple way to do it is to try to interpret a file in Latin-1 as UTF-8.
That'll give you "invalid UTF-8 sequence", for everything outside ASCII.

--anders
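The Latin-1 experiment can be reproduced in a few lines of Python (used here only as an illustration; the error text mirrors the message quoted in the post, not any particular compiler's output).

```python
text = "Björklund"

latin1_bytes = text.encode("latin-1")   # "ö" becomes the single byte 0xF6
try:
    latin1_bytes.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.start is the byte offset where strict UTF-8 decoding failed
    outcome = f"invalid UTF-8 sequence at byte {exc.start}"
print(outcome)

# Pure ASCII text is byte-for-byte identical in Latin-1 and UTF-8, so it passes.
ascii_ok = "anders".encode("latin-1").decode("utf-8")
```

One caveat: a run of Latin-1 bytes can occasionally happen to form valid UTF-8 (for example 0xC3 0xA9), so a failed decode proves the file is not UTF-8, but a successful decode does not prove the opposite.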
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Vathix

"Vathix" <vathix@dprogramming.com> wrote in message news:opsldc5xwrkcck4r@esi...
> I'm not sure if allowing non UTF characters in comments is such a good
> idea. It seems to be complicating my parser, and it will probably
> complicate other things like text/code editors. What is supposed to happen
> when a non UTF character is encountered? Should it display a question
> mark, display nothing, use the current code page? What if the editor
> doesn't know about D's comments?
> I might not have mentioned this, but since D is supposed to be easily
> parsed, this might be an issue; a special case.
> - Chris

Technically it's an error to have non-UTF characters anywhere in the source.
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Vathix

Vathix wrote:

>>> I'm not sure if allowing non UTF characters in comments is such a good idea.
>> Are you trying to use 2 different encodings in one file?
>
> Looks like DMD allows that in comments and I don't think it's a good idea.

The current lexer just skips all bytes in comments, until it finds the end of the current comment run. And that's probably not a good idea, but simpler... (otherwise you would have to check all non-ASCIIs)

You still cannot use such invalid UTF sequences for anything such as identifiers or strings, though...

Just consider it a bug in the current DMD front-end? (i.e. don't abuse this, since it'll be fixed one day)

Says http://www.digitalmars.com/d/lex.html:

> D source text can be in one of the following formats:
>
> * ASCII
> * UTF-8
> * UTF-16BE
> * UTF-16LE
> * UTF-32BE
> * UTF-32LE

This implies that *all* source input should be valid UTF (since ASCII is also valid as UTF-8)

It *should* just stop dead when it finds one, e.g.:
    error("invalid UTF-8 sequence");

--anders

PS. A nice feature would be to have the frontend convert from other encodings as well, but it would just add unneeded complexity since there are a *lot* of possible encodings out there (200)
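The "stop dead" behaviour proposed here could look roughly like the following Python sketch. This is not DMD's lexer code; `first_invalid_utf8` and `check_source` are hypothetical names, and the sketch assumes the source file is meant to be UTF-8 and is validated as a whole, comments included.

```python
def first_invalid_utf8(source: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        source.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_source(source: bytes) -> None:
    """Reject the whole file, comments included (hypothetical front-end check)."""
    offset = first_invalid_utf8(source)
    if offset is not None:
        raise ValueError(f"invalid UTF-8 sequence at byte offset {offset}")


good = "int x; // naïve\n".encode("utf-8")
bad = "int x; // naïve\n".encode("latin-1")   # same text, but the comment is Latin-1

check_source(good)        # passes silently
try:
    check_source(bad)
except ValueError as err:
    print(err)
```

Validating the raw bytes before lexing keeps the check independent of comment syntax, which sidesteps the special case the original post worried about.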
January 29, 2005 Re: Non UTF characters in comments

Posted in reply to Anders F Björklund

Anders F Björklund wrote:
| Vathix wrote:
|
|>>> I'm not sure if allowing non UTF characters in comments is such a
|>>> good idea.
|
|>> Are you trying to use 2 different encodings in one file?
|>
|> Looks like DMD allows that in comments and I don't think it's a good
|> idea.
|
| The current lexer just skips all bytes in comments,
| until it finds the end of the current comment run.
[snip]
| It *should* just stop dead when it finds one, e.g.:
| error("invalid UTF-8 sequence");

I don't think the compiler should try to check the comment's content.

What is an "invalid" UTF-8 sequence?
How would you e.g. handle Java's pre 1.5 "customised" UTF-8? (encoding >U-FFFF as UTF-16 surrogates encoded in 2 UTF-8 codepoints)

- Granted, we might agree on overlong sequences, but how about unassigned codepoints?
- Has the input to be normalized? What normalization?
- Are you going to enforce the full Unicode spec? What spec version?
- How about the PUA?
- How about >U-11FFFD?
- Is U-FFFD/U-FFFC allowed?

Thomas
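As one data point for these questions, the sketch below shows where a single existing strict decoder draws the line. It uses Python's built-in UTF-8 codec only as an example of one possible policy, not as the answer for D; `decodes_as_utf8` is a hypothetical helper, and normalization is not checked at all.

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts raw (hypothetical helper)."""
    try:
        raw.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


cases = {
    "overlong encoding of U+0000": b"\xc0\x80",
    "U+10000 as two UTF-8-encoded surrogates": b"\xed\xa0\x80\xed\xb0\x80",
    "value above U+10FFFF": b"\xf4\x90\x80\x80",
    # U+0378 is assumed here to be unassigned; the decoder does not consult
    # the character database either way.
    "U+0378 (believed unassigned)": "\u0378".encode("utf-8"),
    "private-use U+E000": "\ue000".encode("utf-8"),
    "U+FFFD": "\ufffd".encode("utf-8"),
    "U+FFFC": "\ufffc".encode("utf-8"),
}
for label, raw in cases.items():
    print(f"{label}: {'accepted' if decodes_as_utf8(raw) else 'rejected'}")
```

Under this policy only the byte-level structure is checked: overlong forms, encoded surrogates and out-of-range values are rejected, while unassigned, private-use and replacement code points pass because they are well-formed.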
Copyright © 1999-2021 by the D Language Foundation