August 18, 2004 INVALID UTF-8 SEQUENCE!

I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This happens when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change?

Most of the C compilers accept them, why not D?

August 18, 2004 Re: INVALID UTF-8 SEQUENCE!

Posted in reply to Martin

In article <cfvh55$2d5s$1@digitaldaemon.com>, Martin says...
>
>I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This happens when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change?
>
>Most of the C compilers accept them, why not D?

What format is your file saved in?

From http://www.digitalmars.com/d/lex.html:

Source Text

D source text can be in one of the following formats:

* ASCII
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE

jcc7

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Martin

In article <cfvh55$2d5s$1@digitaldaemon.com>, Martin says...
>
>I just downloaded the new dmd compiler and it tells me INVALID UTF-8 SEQUENCE when I compile. This happens when I use non-English characters in my strings (like ÜÄÖ). But I need to use them. The old version was just fine, why this change?

I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it (Save As...). So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).

>Most of the C compilers accept them, why not D?

Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - which just happens to appear to work whenever the source file encoding is the same as the run-time encoding.

Arcane Jill

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Arcane Jill

Thank you for your answer!

>I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it (Save As...).

I am using the GNU Midnight Commander text editor; it only saves ASCII.

>So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).

I think I upgraded from version 0.93 to 0.98. In 0.93 my files compiled fine; in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

So how do I tell dmd that the source is an ASCII file?

Thank you!

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Martin

In article <cfvlcp$2eob$1@digitaldaemon.com>, Martin says...
>
>Thank you for your answer!

Err... Don't thank me yet. Save that until the problem's actually solved!

>>I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it (Save As...).
>
>I am using the GNU Midnight Commander text editor; it only saves ASCII.

That's not possible. In your original post you said "I use non-English characters in my strings (like ÜÄÖ)". If that statement is true, you /cannot/ be using ASCII, since these characters do not even /exist/ in ASCII. If your text contains any of the characters 'Ü', 'Ä' or 'Ö' then you are /not/ using ASCII. Period.

Unfortunately, I am not familiar with this text editor, so I don't know how to determine the encoding it uses, or how to change it. I may have a fix for you even so, however. (Read on...)

>>So far as I know, the D compiler has not changed in this regard (except that it can now auto-detect UTF-16 and UTF-32).
>
>I think I upgraded from version 0.93 to 0.98. In 0.93 my files compiled fine; in 0.98 I get an error. I changed back to 0.93, because I need to use these characters.

Okay. Now, first off, the following compiles fine for me:

# void main()
# {
#     printf("hello ÜÄÖ\n");
# }

using DMD v0.98, with the file saved as UTF-8. However, when I resaved the file as ISO-8859-1 (which is an invalid thing to do) then the compiler (correctly) gave me the compile-time error message: "Invalid UTF-8 sequence".

I believe that in the earlier version to which you refer (0.93) there was a bug, which was fixed in 0.96 - according to the change log: "Invalid UTF characters in string literals now diagnosed." In other words, DMD 0.93 failed to diagnose the invalid UTF-8 characters in your source file, and so the file compiled -- but it compiled incorrectly. The error would not have been detected until runtime - and even then only IF you passed your string to a UTF conversion routine. If you passed your invalid string straight to printf(), for example, the v0.93 compiler wouldn't even have noticed. But you can bet your life that even if your program had appeared to run correctly on your machine, it would not necessarily have worked on anyone else's.

So tell me - what operating system are you using? The word "GNU" makes me suspect Linux, in which case I believe you need to set the environment variable CHARSET to the value UTF-8. (But I'm a Windows user, so I could be wrong - I'm hoping someone will leap in here and correct me if so.) Anyway, once you've set your environment variable, everything should work with the latest DMD - and this time, it will work for everyone, not just for you.

>So how do I tell dmd that the source is an ASCII file?

There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII.

Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII.

Arcane Jill

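The difference between the ISO-8859-1 and UTF-8 encodings of the same character can be checked from a shell. What follows is a minimal sketch and not something posted in the thread: it assumes a POSIX shell with `od` and an `iconv` (such as the glibc one) that exits with a non-zero status on input it cannot decode, and `hexbytes` and `is_utf8` are hypothetical helper names. It writes Ü once as the single ISO-8859-1 byte and once as the two-byte UTF-8 sequence, then asks whether each file decodes as UTF-8.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of DMD or the thread.

# hexbytes: print stdin as a run of lowercase hex digits, one pair per byte.
hexbytes() {
    od -An -v -tx1 | tr -d ' \n'
}

# is_utf8 FILE: succeed only if FILE decodes cleanly as UTF-8.
# Assumes an iconv that returns a non-zero status on invalid input.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

tmpdir=$(mktemp -d)

# "Ü" as ISO-8859-1: the single byte 0xDC (octal 334).
printf '\334' > "$tmpdir/latin1.txt"
# "Ü" as UTF-8: the two bytes 0xC3 0x9C (octal 303 234).
printf '\303\234' > "$tmpdir/utf8.txt"

# Report the bytes of each file and whether they form valid UTF-8.
for f in "$tmpdir/latin1.txt" "$tmpdir/utf8.txt"; do
    if is_utf8 "$f"; then
        verdict="valid UTF-8"
    else
        verdict="not valid UTF-8"
    fi
    echo "$(basename "$f"): $(hexbytes < "$f") is $verdict"
done

rm -rf "$tmpdir"
```

The same `is_utf8` check could be pointed at a `.d` file to see whether an editor really saved it as UTF-8 before handing it to the compiler.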
August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Martin

In article <cfvlcp$2eob$1@digitaldaemon.com>, Martin says...
>
>Thank you for your answer!
>>I have a sneaking suspicion you might find it will work just fine if you save your source file in UTF-8 before trying to compile it (Save As...).
>
>I am using the GNU Midnight Commander text editor; it only saves ASCII.

If you are on Linux you can convert from latin1 to utf8 with the commands

    iconv -f latin1 -t utf8 file.d > newfile.d
    dmd newfile.d

You will probably be doing that a lot, so it's best if you can put it in a script or something.

Hope this helps :)

Nick

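The "put it in a script" suggestion might look something like the following. It is only a sketch under stated assumptions: the function names `to_utf8` and `build_latin1` and the `_utf8.d` suffix are made up for illustration, the sources are assumed to be ISO-8859-1, and `dmd` is assumed to be on the PATH.

```shell
#!/bin/sh
# Hypothetical wrapper; the names and the "_utf8.d" suffix are illustrative only.

# to_utf8 SRC DST: re-encode SRC from ISO-8859-1 to UTF-8, writing DST.
to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}

# build_latin1 FILE.d: convert a copy of FILE.d, then compile the copy.
# The original file is left untouched. Assumes dmd is on the PATH.
build_latin1() {
    src=$1
    converted="${src%.d}_utf8.d"
    to_utf8 "$src" "$converted" && dmd "$converted"
}

# Convert and compile every file named on the command line.
for f in "$@"; do
    build_latin1 "$f" || exit 1
done
```

Saved as, say, `build.sh`, it would be run as `sh build.sh file.d`. Keep in mind that the string literals in the converted copy are UTF-8, so the compiled program emits UTF-8 bytes rather than ISO-8859-1 ones.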
August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Arcane Jill

Yes, you are probably right, it is some kind of extended ASCII; in this case I think that yes, it is ISO-8859-1.

My problem is that the webserver I am writing this software for uses the same encoding. With the old version everything worked fine. Everyone that used the server saw the characters right.

So can I tell dmd to use ISO-8859-1, or just not to check the things it shouldn't be checking?

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfvjdr$2dr2$1@digitaldaemon.com...
> >Most of the C compilers accept them, why not D?
>
> Actually, I think most C compilers simply allow a string to consist of an arbitrary sequence of bytes without any interpretation whatsoever - which just happens to appear to work whenever the source file encoding is the same as the run-time encoding.

It doesn't always work: some of the code pages include multibyte sequences where " can be the second byte :-(. That's why DMC has special switches for such. This is just the sort of thing I want to move away from.

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfvqq8$2jhu$1@digitaldaemon.com...
> There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII.
>
> Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII.

You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.

August 18, 2004 Re: Invalid UTF-8 sequence!

Posted in reply to Martin

"Martin" <Martin_member@pathlink.com> wrote in message news:cg0ggt$16f3$1@digitaldaemon.com...
> Yes, you are probably right, it is some kind of extended ASCII; in this case I think that yes, it is ISO-8859-1.
> My problem is that the webserver I am writing this software for uses the same encoding.
> With the old version everything worked fine. Everyone that used the server saw the characters right.
>
> So can I tell dmd to use ISO-8859-1, or just not to check the things it shouldn't be checking?

There's no way to do that right now. One of the problems with using such charsets in source code is that the source code is then non-portable. Someone can just change a seemingly unrelated system setting, and poof, your builds fail.

You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

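For the \xXX route, the escapes have to carry the ISO-8859-1 byte values, for example \xDC for Ü. The sketch below derives them mechanically. It is an illustration only: `latin1_escapes` is a hypothetical helper name, and it assumes the script and its argument are UTF-8 and that `iconv`, `od`, `tr` and `sed` are available. It escapes every byte, plain ASCII included, which a hand-written literal would not need to do.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of DMD.

# latin1_escapes TEXT: print the ISO-8859-1 bytes of TEXT (given as UTF-8)
# as a run of \xXX escapes. Every byte is escaped, ASCII included.
latin1_escapes() {
    printf '%s' "$1" |
        iconv -f UTF-8 -t ISO-8859-1 |
        od -An -v -tx1 |
        tr -s ' ' '\n' |
        sed -e '/^$/d' -e 's/^/\\x/' |
        tr -d '\n'
    echo
}

# Print the escapes for the three characters from the thread.
latin1_escapes "ÜÄÖ"
```

With those values, a literal along the lines of `"hello \xDC\xC4\xD6\n"` can sit in a pure-ASCII source file while still carrying the ISO-8859-1 bytes. Those bytes are not valid UTF-8, so such a string is presumably only safe to hand to byte-oriented output such as printf(), not to UTF conversion routines.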
Copyright © 1999-2021 by the D Language Foundation