Invalid UTF-8 sequence! (page 2) - D Programming Language Discussion Forum


August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Arcane Jill
in reply to Walter

In article <cg0gsg$16u8$2@digitaldaemon.com>, Walter says...
>
>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfvqq8$2jhu$1@digitaldaemon.com...
>> There's a problem here - which is that you and I are not speaking the same language. An ASCII file is a file which DOES NOT CONTAIN any characters having codepoints outside the range 0x00 to 0x7F. DMD is perfectly happy with ASCII files, but your files are not ASCII.
>>
>> Sorry to be pedantic. Your file is /probably/ ISO-8859-1 (aka "Latin 1"). But it's not ASCII.
>
>You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.

Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

Jill

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Martin
in reply to Walter

I think I will use the \xXX. My workaround solution was much uglier, so I am quite happy with this one.

Thanks!

In article <cg0n3l$1ln6$1@digitaldaemon.com>, Walter says...
>
>"Martin" <Martin_member@pathlink.com> wrote in message news:cg0ggt$16f3$1@digitaldaemon.com...
>> Yes, you are probably right, it is some kind of extended ASCII; in this case I think that yes, it is ISO-8859-1.
>> My problem is that the webserver that I am writing this software for uses the same encoding.
>> With the old version everything worked fine. Everyone that used the server saw the characters right.
>>
>> So can I tell dmd to use ISO-8859-1, or just not to check the things it shouldn't be checking?
>
>There's no way to do that right now. One of the problems with using such charsets in source code is that the source code is then non-portable. Someone can just change a seemingly unrelated system setting, and poof, your builds fail. You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Arcane Jill
in reply to Martin

In article <cg0ggt$16f3$1@digitaldaemon.com>, Martin says...

>My problem is that the webserver that I am writing this software for uses the same encoding.

I think that your statement might need some clarifying. Web servers by definition need to do transcoding. Most programs need a concept of a "run-time encoding" (so they can do printf(), etc.), but the run-time encoding of a web server is no longer limited to that of one particular machine - a web server has to deal with machines all over the internet, each possibly with its own local encoding. The "Accept" field in an HTTP request can act as a request from the browser to the server that the web content be delivered in a particular encoding. For example:

#    Accept: text/plain, text/html; charset=UTF-8

When the page is delivered, a web server sends back:

#    Content-type: text/html; charset=UTF-8

If the encoding is not specified then HTML is supposed to default to ISO-8859-1, but XML (including XHTML) is supposed to default to UTF-8. A web server which doesn't do UTF-8, or which doesn't do transcoding, is all but useless. That said, you may still be able to get away with it. If you send all your web content in a particular encoding, then, as long as it is marked as such, the user's browser /may/ be able to reinterpret the page (the Accept request header is supposed to advise you of what the browser can or can't deal with).

So, when you say "the webserver ... uses the same encoding [ISO-8859-1]", I'm still not clear what it uses that encoding /for/. It's the default for HTML, but are you saying your server emits no other encoding? Not even UTF-8? That would be weird. Any chance you could clarify?




>With the old version everything worked fine. Everyone that used the server saw the characters right.

Providing your server emitted "Content-type: text/html; charset=ISO-8859-1" in its response headers, (or just "Content-type: text/html" since ISO-8859-1 is the default for HTML - but that's dangerous, since not all browsers obey the W3C spec), that is likely to be true. But still, you're relying on a parochial character set, and it /is/ possible that some viewers of your server simply won't have that encoding in their browser.


>So can I tell the dmd to use  ISO-8859-1, or just not to check the things it shouldn't be checking?

No. You *MUST* save your DMD source files in either ASCII or UTF-8 before attempting to compile them. If you wish to emit output in ISO-8859-1 then you must ISO-8859-1-encode the output at runtime (which is easy - I can show you how to do that).
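
As a rough sketch of what "ISO-8859-1-encode the output at runtime" could look like in present-day D (not the DMD 0.96 of this thread): the helper name `toLatin1` is made up for illustration, and the approach assumes that every character you emit falls in the range U+0000 to U+00FF, since anything beyond that has no Latin-1 byte.

```d
import std.exception : enforce;
import std.stdio : writeln;

/// Hypothetical helper (not from the thread): encodes a UTF-8 string
/// as ISO-8859-1 bytes, rejecting anything Latin-1 cannot represent.
ubyte[] toLatin1(string utf8)
{
    ubyte[] result;
    foreach (dchar c; utf8) // iterating by dchar decodes the UTF-8
    {
        enforce(c <= 0xFF, "character not representable in ISO-8859-1");
        result ~= cast(ubyte) c; // code point value == Latin-1 byte value
    }
    return result;
}

void main()
{
    ubyte[] expected = [0xDC, 0xC4, 0xD6];
    ubyte[] bytes = toLatin1("\u00DC\u00C4\u00D6"); // Ü Ä Ö
    assert(bytes == expected);
    writeln(bytes.length); // 3: one byte per character in Latin-1
}
```

The source file itself stays pure ASCII here because the literal uses \u escapes; the conversion to the server's encoding happens only at the point of output.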

But why is saving your source file as UTF-8 hard? I've never heard of a modern text editor which can't do it, but if you've discovered one, why not just change to a different text editor?

Nonetheless - if you really can't figure out how to save in UTF-8 (which would be surprising for someone writing a web server, with all the transcoding understanding required thereby), then your only remaining choice is to save as ASCII. You can do this by replacing your non-ASCII characters either by Unicode escape sequences (if you want DMD to interpret them) or HTML entities (if you want the users' browsers to interpret them). So replace as follows:

#    Character     Escape sequence     HTML entity
#    ~~~~~~~~~     ~~~~~~~~~~~~~~~     ~~~~~~~~~~~
#    Ü             \u00DC              &#x00DC;
#    Ä             \u00C4              &#x00C4;
#    Ö             \u00D6              &#x00D6;
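
If doing that replacement by hand is tedious, the HTML-entity column of the table can also be produced at runtime. The following is only a sketch in present-day D; `toHtmlEntities` is a hypothetical name, and it assumes the input string is valid UTF-8.

```d
import std.format : format;
import std.stdio : writeln;

/// Hypothetical helper (not from the thread): replaces every non-ASCII
/// character with a numeric HTML entity, leaving ASCII untouched.
string toHtmlEntities(string s)
{
    string result;
    foreach (dchar c; s) // iterating by dchar decodes the UTF-8
    {
        if (c < 0x80)
            result ~= cast(char) c;
        else
            result ~= format("&#x%04X;", cast(uint) c);
    }
    return result;
}

void main()
{
    writeln(toHtmlEntities("\u00DCber")); // prints "&#x00DC;ber"
}
```

The output is plain ASCII, so it is safe to send under any of the encodings discussed in this thread.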

Hope that helps.

Arcane Jill

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Jonathan Leffler
in reply to Arcane Jill

Arcane Jill wrote:

> In article <cg0gsg$16u8$2@digitaldaemon.com>, Walter:
>>You write well and understand the issues involved. Can I suggest that you
>>write an article about this for, say, CUJ or DDJ? Such an article exploring
>>this topic is sorely needed.
> 
> 
> Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?

CUJ = C User's Journal (or possibly Users'?)
	http://www.cuj.com/ (where there's no apostrophe in sight)
DDJ = Dr Dobb's Journal
	http://www.ddj.com/


-- 
Jonathan Leffler                   #include <disclaimer.h>
Email: jleffler@earthlink.net, jleffler@us.ibm.com
Guardian of DBD::Informix v2003.04 -- http://dbi.perl.org/

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Arcane Jill
in reply to Walter

In article <cg0n3l$1ln6$1@digitaldaemon.com>, Walter says...

>You can also use \xXX to specify the characters, though that is ugly enough to be unusable.

Sorry, Walter - that's not right! You should not be encouraging the use of \xXX in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX. Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).

Sticking \x's into a string literal is just another way to create an invalid UTF-8 sequence. See this code:

#    void main()
#    {
#        char[] s1 = "\xD6";      // one raw byte, 0xD6
#        char[] s2 = "\u00D6";    // the character U+00D6, stored as UTF-8 bytes 0xC3 0x96
#
#        printf("s1.length = %d\n", s1.length);
#        printf("s2.length = %d\n", s2.length);
#    }

This will output:

#    s1.length = 1
#    s2.length = 2

thereby proving that s1 contains an Invalid UTF-8 sequence! (But s2 is correct).

Remember - \x is used to insert literal bytes. \u inserts characters. All you've done is provided a way to get pre DMD-0.96 behavior out of a DMD-0.96+ compiler.
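
The length difference only hints at the problem; the validity itself can be checked directly. Here is a sketch in present-day D, assuming `std.utf.validate` from Phobos (which throws a `UTFException` on malformed input). The wrapper `isValidUtf8` is a made-up name for illustration.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper (not from the thread): true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s1 = "\xD6";   // raw byte 0xD6: a lead byte with no continuation byte
    string s2 = "\u00D6"; // code point U+00D6, stored as the bytes 0xC3 0x96

    writeln(s1.length, " ", isValidUtf8(s1)); // 1 false
    writeln(s2.length, " ", isValidUtf8(s2)); // 2 true
}
```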

Arcane Jill

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Martin
in reply to Arcane Jill

I think I will move to UTF-8 with the next version of my program. I can't do it right now, because it would need some rewriting.
The UTF-8 output is not the problem, it's more like UTF-8 input. I need to read the POST data from the user's browser, to process it.
The problem with UTF-8 is that a character can be 1, 2, 3 or even 4 bytes long. I do a lot of text processing and I would need to rewrite, or at least look over, all these functions. But I have a deadline coming...

I wrote my last web with C++, didn't use UTF-8, and it works fine. I am only writing applications for Estonian people.

But you are probably right: I need to move to UTF-8, but not before my next version.

Martin



In article <cg1of6$18ss$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cg0n3l$1ln6$1@digitaldaemon.com>, Walter says...
>
>>You can also use \xXX to specify the characters, though that is ugly enough to be unusable.
>
>Sorry, Walter - that's not right! You should not be encouraging the use of \xXX in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX. Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).
>
>Sticking \x's into a string literal is just another way to create an invalid UTF-8 sequence. See this code:
>
>#    void main()
>#    {
>#        char[] s1 = "\xD6";
>#        char[] s2 = "\u00D6";
>#
>#        printf("s1.length = %d\n", s1.length);
>#        printf("s2.length = %d\n", s2.length);
>#    }
>
>This will output:
>
>#    s1.length = 1
>#    s2.length = 2
>
>thereby proving that s1 contains an Invalid UTF-8 sequence! (But s2 is correct).
>
>Remember - \x is used to insert literal bytes. \u inserts characters. All you've done is provided a way to get pre DMD-0.96 behavior out of a DMD-0.96+ compiler.
>
>Arcane Jill
>
>

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Arcane Jill
in reply to Martin

In article <cg1q02$1c1h$1@digitaldaemon.com>, Martin says...

>The UTF-8 output is not the problem, it's more like UTF-8 input. I need to read the POST data from the user's browser, to process it.

But I don't think you can make demands about which encoding the POST data is going to be presented in, can you? You simply have to recognize it, and decode it. If the data is in ISO-whatever, you must decode that; if the data is in MAC-ROMAN, you must decode that; if the data is in UTF-8, you must decode that. And so on.
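
For the ISO-8859-1 case the decoding step is short, because each Latin-1 byte value is also its Unicode code point. A sketch in present-day D, assuming `std.utf.encode` from Phobos; `latin1ToUtf8` is a hypothetical name, and recognising which encoding the browser actually sent is left out entirely.

```d
import std.utf : encode;

/// Hypothetical helper (not from the thread): converts ISO-8859-1 bytes,
/// such as raw POST data, into a UTF-8 string.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    char[4] buf;
    foreach (b; input)
    {
        // Each Latin-1 byte value equals its Unicode code point.
        immutable n = encode(buf, cast(dchar) b);
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    ubyte[] post = [0xD6, 0x6F]; // "Öo" as it would arrive in ISO-8859-1
    string s = latin1ToUtf8(post);
    assert(s == "\u00D6o");
    assert(s.length == 3); // Ö takes two bytes in UTF-8, o takes one
}
```

Once the input is in UTF-8, the rest of the program only ever has to deal with one encoding.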


>The problem with UTF-8 is that a character can be 1,2,3 or even 4 bytes long.

Indeed, but D has lots of handy functions to convert them. And the problem with ISO-8859-1 (Latin-1) is that characters beyond \u00FF are completely unrepresentable. Like, AT ALL. If someone wants to use a lowercase c with an acute accent ('\u0107'), you're completely screwed. UTF-8 is the solution.
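
To illustrate the kind of help the language gives with variable-length characters, here is a small sketch in present-day D. It assumes `std.utf.count` from Phobos and relies on `foreach` with a `dchar` loop variable, which decodes the UTF-8 for you.

```d
import std.stdio : writefln;
import std.utf : count;

void main()
{
    string s = "\u0107\u00D6x"; // ć, Ö and x

    // .length counts bytes (UTF-8 code units); count() counts code points.
    assert(s.length == 5);
    assert(count(s) == 3);

    // One iteration per character, however many bytes each one occupies.
    foreach (dchar c; s)
        writefln("U+%04X", cast(uint) c);
}
```

Text-processing functions written against `dchar` in this way do not need to care whether a character took one byte or four.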


>I wrote my last web with C++, didn't use UTF-8, and it works fine.

But only if /you/ compile it. If someone else, with a different default encoding, were to compile the same source code, it may fail badly.

But it's nice to see you're writing for a non-English audience. I'm sure this trend will continue.

Arcane Jill

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Walter
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cg1of6$18ss$1@digitaldaemon.com...
> In article <cg0n3l$1ln6$1@digitaldaemon.com>, Walter says...
>
> >You can also use \xXX to specify the characters, though that is ugly enough to be unusable.
>
> Sorry, Walter - that's not right! You should not be encouraging the use of \xXX in this context. This is wrong. Martin needs to be using \uXXXX, not \xXX. Instead of \xD6, he needs to use \u00D6. (Martin, I hope you're listening).
>
> Sticking \x's into a string literal is just another way to create an invalid UTF-8 sequence. See this code:

True, but if they're used to create a ubyte[] sequence (not a char[]
sequence) it should work.
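
A sketch of that idea in present-day D, assuming `std.string.representation` from Phobos to view the literal as bytes: the \x escapes then describe ISO-8859-1 data that is never treated as UTF-8 text.

```d
import std.stdio : writeln;
import std.string : representation;

void main()
{
    // Ü, Ä, Ö as ISO-8859-1 bytes, held as bytes rather than as text.
    immutable(ubyte)[] latin1 = "\xDC\xC4\xD6".representation;

    assert(latin1.length == 3);
    assert(latin1[2] == 0xD6);
    writeln(latin1.length); // 3
}
```

The catch is that nothing which expects a char[] (string functions, foreach over dchar, and so on) can safely be handed these bytes.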

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Walter
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cg1rba$1fnh$1@digitaldaemon.com...
> But it's nice to see you're writing for a non-English audience. I'm sure this trend will continue.

And that's great, because it helps us identify and shake out the problems with the internationalization support.

August 19, 2004

Re: Invalid UTF-8 sequence!

Posted by Walter
in reply to Jonathan Leffler

"Jonathan Leffler" <jleffler@earthlink.net> wrote in message news:cg1n1p$13qg$1@digitaldaemon.com...
> Arcane Jill wrote:
>
> > In article <cg0gsg$16u8$2@digitaldaemon.com>, Walter:
> >>You write well and understand the issues involved. Can I suggest that you write an article about this for, say, CUJ or DDJ? Such an article exploring this topic is sorely needed.
> >
> >
> > Could be fun. So what are CUJ and DDJ? Could someone give me some URLs?
>
> CUJ = C User's Journal (or possibly Users'?)
> http://www.cuj.com/ (where there's no apostrophe in sight)
> DDJ = Dr Dobb's Journal
> http://www.ddj.com/

Yes, they're the two main print publications that C/C++ programmers read. The D articles published by them have been well received, and the publisher (CMP Media) has indicated they want more. And besides, they even pay for articles! Getting published in CUJ or DDJ is fairly prestigious, and will look good on any resume. Many of the top highly paid C++ professionals built their reputation early on by writing articles. Many companies also have a policy of giving a bonus to engineering employees who get published in a magazine; that's worth checking out.

So it's really an everybody wins kind of situation.
