Thread overview: writefln and ASCII
  Sep 12, 2006 - Serg Kovrov
  Sep 12, 2006 - Oskar Linde
  Sep 12, 2006 - Marcin Kuszczak
  Sep 12, 2006 - Serg Kovrov
  Sep 13, 2006 - Steve Horne
  Sep 13, 2006 - nobody
  Sep 13, 2006 - Steve Horne
  Sep 13, 2006 - nobody
  Sep 14, 2006 - Steve Horne
  Sep 14, 2006 - John Reimer
  Sep 14, 2006 - Steve Horne
  Sep 14, 2006 - nobody
  Sep 14, 2006 - John Reimer
  Sep 15, 2006 - John Reimer
  Sep 17, 2006 - Steve Horne
  Sep 14, 2006 - Don Clugston
  Sep 15, 2006 - Steve Horne
  Sep 15, 2006 - John Reimer
  Sep 15, 2006 - Georg Wrede
  Sep 14, 2006 - nobody
September 12, 2006
How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

Is there a standard routine to convert such ASCII characters to UTF, or another way to get a valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...

-- 
serg.
September 12, 2006
Serg Kovrov wrote:
> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

0x96 is not valid ASCII. Nothing above 0x7F is valid ASCII (7-bit).

> Is there a standard routine to convert such ASCII characters to UTF, or other way to get valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...

If your raw data is Latin-1 (ISO 8859-1):

import std.utf;

ubyte[] src;  // raw Latin-1 bytes
char[] dst;   // UTF-8 output
foreach (s; src)
    std.utf.encode(dst, cast(dchar) s);
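The loop works because the first 256 Unicode code points coincide with Latin-1, so promoting each byte to a code point and encoding it always yields valid UTF-8. A quick Python sketch of the same idea (illustrative only; the thread's code is D):

```python
# Latin-1 bytes map one-to-one onto the first 256 Unicode code points,
# so byte-by-byte promotion always produces valid UTF-8.
raw = bytes(range(256))               # every possible byte value
text = raw.decode("latin-1")          # byte i becomes code point i
utf8 = text.encode("utf-8")           # bytes >= 0x80 become 2-byte sequences

assert all(ord(c) == i for i, c in enumerate(text))
assert len(utf8) == 128 + 2 * 128     # 0x00-0x7F: 1 byte; 0x80-0xFF: 2 bytes
assert utf8.decode("utf-8") == text   # round-trips cleanly
```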


/Oskar
September 12, 2006
Serg Kovrov wrote:

> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).
> 
> Is there a standard routine to convert such ASCII characters to UTF, or other way to get valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...
> 

First - explanation:
If you have a file with invalid UTF-8 characters, it means that the file is
in a specific local encoding. It's not an ASCII file, as ASCII only covers
character codes 0..127.

Second - how to cope with such files:
I used to convert files in a local encoding using std.windows.charset. There
are two functions which will be useful:

char[] fromMBSz(char* s, int codePage = 0);          // local encoding to UTF-8
char* toMBSz(char[] s, uint codePage = cast(uint)0); // UTF-8 to local encoding

In my case encoding was 1250.
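For what it's worth, the effect of this kind of codepage conversion can be sketched in Python (illustrative, not the D API). In the Windows-125x codepages, the 0x96 byte from the original question really is the en dash:

```python
# 0x96 is not ASCII and not valid UTF-8 on its own, but in the
# Windows-1252 (and 1250) codepages it is the EN DASH character.
dash = b"\x96".decode("cp1252")
assert dash == "\u2013"                        # U+2013 EN DASH
assert b"\x96".decode("cp1250") == dash        # same slot in codepage 1250
assert dash.encode("utf-8") == b"\xe2\x80\x93" # the valid UTF-8 spelling
```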

-- 
Regards
Marcin Kuszczak
(Aarti_pl)
September 12, 2006
Marcin Kuszczak wrote:
> char[] fromMBSz(char* s, int codePage = 0);          // local encoding to UTF-8
> char* toMBSz(char[] s, uint codePage = cast(uint)0); // UTF-8 to local encoding

Thanks Marcin, fromMBSz() is just fine.

In my case I don't care about the correct codepage - I just want to dump the contents to the console as part of a debug trace.

-- 
serg.
September 13, 2006
On Tue, 12 Sep 2006 15:03:20 +0300, Serg Kovrov <kovrov@no.spam> wrote:

>How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

Just to add some angry ranting to what has already been said...

First, there was ASCII. ASCII had character codes 0 to 127.

Then, there was a whole bunch of codepages - different character sets for different countries. These exploited characters 128 to 255, but each codepage defined the characters differently. Some codepages had multi-byte characters.

Then, there was Unicode. Unicode was supposed to make things easier. But instead, it made things harder.

There are over a million possible codes in Unicode. A code does not necessarily represent a character - it may take several codes in sequence to make one (for example, applying diacritics).

In addition, the codes are just numbers. There are several different ways to encode them into streams of bytes or whatever.

In UTF-8, several bytes may be needed to specify a Unicode code, and several codes may be needed to specify a single character. I don't think any codepage has that level of complexity.
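Both layers of multiplicity are easy to observe; a Python sketch (illustrative):

```python
# One on-screen character can need several code points, and each code
# point can need several UTF-8 bytes.
e_acute = "e\u0301"                         # 'e' + COMBINING ACUTE ACCENT
assert len(e_acute) == 2                    # two code points...
assert len(e_acute.encode("utf-8")) == 3    # ...three UTF-8 bytes (1 + 2)
assert len("\u00e9".encode("utf-8")) == 2   # precomposed e-acute: one code point, 2 bytes
```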

Oh, and let's not forget UTF-16, UTF-32 and the other encodings.

And then, of course, understanding over a million codes is a tall order. Especially since the number of assigned codes isn't fixed, even now. Back when there were 40,000 or so codes, people thought the set was almost complete - hah hah hah! So Unicode explicitly allows applications and operating systems to support only a subset of the possible codes. So you have the Windows XP subset, the MacOS subset, the GTK subset etc etc etc.

And then there are byte-order marks to worry about!

So much for easier. So much for one standard.


Anyway, the first 256 Unicode codes match the characters of Latin-1 (ISO 8859-1). Because of the way UTF-8 encoding works, a genuine ASCII file is also a valid UTF-8 file. But a file that uses characters 128 to 255 from any codepage, Latin-1 included, is not a valid UTF-8 file.
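Both halves of that claim can be checked directly; a Python sketch (illustrative):

```python
# Pure 7-bit ASCII is automatically valid UTF-8; a lone byte >= 0x80 is not.
ascii_data = bytes(range(128))
assert ascii_data.decode("utf-8") == ascii_data.decode("ascii")

try:
    b"\x96".decode("utf-8")    # the original question's raw ndash byte
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid               # 0x96 alone is malformed UTF-8
```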

-- 
Remove 'wants' and 'nospam' from e-mail.
September 13, 2006
Steve Horne wrote:
> On Tue, 12 Sep 2006 15:03:20 +0300, Serg Kovrov <kovrov@no.spam>
> wrote:
> 
>> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).
> 
> Just to add some angry ranting to what has already been said...

I can understand your frustration. I felt the same way you did for a while. The thing that changed my mind was realizing that Unicode has some great features.

Unicode threads do have a tendency to be rather long so here is my short contribution up front. UTF-8 is great if you can be fairly sure you will only be using ASCII data. UTF-16 is great for almost every writing system that is currently used on the planet Earth.

> 
> Then, there was a whole bunch of codepages - different character sets
> for different countries. These exploited characters 128 to 255, but
> each codepage defined the characters differently. Some codepages had
> multi-byte characters.

  Unicode is not so bad
  Unicode 不是那么坏
  Unicode δεν είναι τόσο κακό
  Unicode はあまり悪くない
  Unicode не настолько плох

As I write this message I see English, Chinese, Greek, Japanese and Russian characters displayed. My preferred text editor (TextPad) uses codepages and makes me pick only one of Chinese, Greek, Japanese or Russian to display. With Unicode it is possible to read and write all of the above.

If you think Unicode is overly complex then perhaps you should have a go at writing some code to display this message correctly using codepages. You will need to identify codepage boundaries, and then you can probably use frequency tables to identify the codepage used within each boundary. You might want to mitigate the high error rates by also checking dictionaries appropriate for each codepage. Of course dictionaries only go so far, so you might also need to know how each language and its dialects inflect words.
September 13, 2006
On Wed, 13 Sep 2006 10:55:42 -0400, nobody <nobody@mailinator.com> wrote:

>When I wrote this message I see English, Chinese, Greek, Japanese and Russian characters displayed.

Obviously, yes. I just think Unicode could have been simpler. And perhaps it doesn't really need codepoints for characters in languages and dialects that haven't been in use for a couple of thousand years.

>You will need to identify codepage boundaries and then you can probably use frequency tables to identify the codepage used within each boundary.

Metadata. When your document cannot be represented as a simple text file, use something else.

-- 
Remove 'wants' and 'nospam' from e-mail.
September 13, 2006
Steve Horne wrote:
> On Wed, 13 Sep 2006 10:55:42 -0400, nobody <nobody@mailinator.com>
> wrote:
> 
>> When I wrote this message I see English, Chinese, Greek, Japanese and Russian characters displayed.
> 
> Obviously, yes. I just think Unicode could have been simpler.

I think they kept it as simple as was reasonably possible. Once you admit a need to use more than a single byte to represent an entity then any solution is going to have the same complications.

They really did need to remain backwards compatible with ASCII while also allowing the bulk of non-ASCII to be represented as 2 bytes.

UTF-8 is free of endian ambiguity and is fully compatible with ASCII data, but might use as many as 4 bytes to represent a single Unicode code point. UTF-16 represents the bulk of code points actually used in the world with only 2 bytes, but as with any data using more than one byte it has to address endian ambiguity.
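A Python sketch of the UTF-16 endian issue (illustrative):

```python
# UTF-16 covers most living scripts in 2 bytes per code point, but the
# byte order must be agreed on, or signalled with a byte-order mark.
ch = "\u4e2d"                                  # a BMP CJK character
assert len(ch.encode("utf-8")) == 3            # 3 bytes in UTF-8
assert ch.encode("utf-16-le") == b"\x2d\x4e"   # little-endian
assert ch.encode("utf-16-be") == b"\x4e\x2d"   # big-endian: same bytes, swapped
assert ch.encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")  # leading BOM
```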

> 
>> You will need to identify codepage boundaries and then you can probably use frequency tables to identify the codepage used within each boundary.
> 
> Metadata. When your document cannot be represented as a simple text
> file, use something else.
> 

It is my opinion that if you need metadata in addition to textual data then your method of representing textual data is inadequate.

To freely mix data from any codepage you would probably use something like an escape code. If you were really sly you would use ASCII as a default code page and then let the highest bit being set represent an escape code - which is exactly how UTF-8 starts out. How would you imagine filling out the rest?
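That is exactly how UTF-8's lead bytes work; a Python sketch (illustrative):

```python
# A clear high bit means "plain ASCII"; a set high bit starts a
# multi-byte sequence whose lead byte encodes its own length.
seq = "\u2013".encode("utf-8")                 # EN DASH
assert seq == b"\xe2\x80\x93"
assert seq[0] >> 5 == 0b111                    # lead byte 1110xxxx: 3-byte sequence
assert all(b >> 6 == 0b10 for b in seq[1:])    # continuations are 10xxxxxx
assert all(b < 0x80 for b in "plain ASCII".encode("utf-8"))  # ASCII passes through
```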
September 14, 2006
On Wed, 13 Sep 2006 14:17:13 -0400, nobody <nobody@mailinator.com> wrote:

>> Metadata. When your document cannot be represented as a simple text file, use something else.
>> 
>
>It is my opinion that if you need metadata in addition to textual data then your method of representing textual data is inadequate.

Ah. So you believe that HTML and XML are garbage, then. Along with all binary word-processor document files.

But then, Unicode is inadequate also. You need additional metadata for anything beyond the simplest text. Unicode gives you a huge selection of characters, but it can't specify paragraphs styles etc.

>I am certain that to freely mix data from any codepage you would probably use something like an escape code.

That would be the most cryptically compressed form of metadata, I suppose. But why compress the metadata at the expense of the character data?

Switching languages and codepages is a relatively rare thing. Most documents don't do it at all. Even those that do are hardly likely to switch every other character.

By the Huffman compression principle of representing the most frequent things with the smallest codes, the logical thing to do is to have single-byte characters as much as possible and use a multibyte sequence - a tag - to select codepages.
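That tradeoff is exactly the one UTF-8 and UTF-16 make, and it is measurable; a Python sketch (illustrative):

```python
# UTF-8 favours ASCII-heavy text; UTF-16 favours CJK-heavy text.
english = "hello world"
chinese = "\u4f60\u597d\u4e16\u754c"           # four BMP CJK characters

assert len(english.encode("utf-8")) == 11      # 1 byte per character
assert len(english.encode("utf-16-le")) == 22  # 2 bytes per character
assert len(chinese.encode("utf-8")) == 12      # 3 bytes per character
assert len(chinese.encode("utf-16-le")) == 8   # 2 bytes per character
```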


I'm getting the feeling that I've given the wrong impression. Just for the record, I posted some ranting because I have too much time on my hands. And I really do believe that unicode could have been simpler. That doesn't mean I'm saying it's useless, and you shouldn't take it all too seriously.

I can say things in a way that causes offense sometimes. I can be too strong in defending opinions when I really don't care that much, for instance. Picking silly holes in arguments, out of a pure love of absurdity. And it doesn't help that I have an overformal way of saying things that I've been told is like being lectured at all the time.

I mentioned the word Aspie in another post. That's as in Asperger's Syndrome. For info on how and why we end up unintentionally upsetting people, try...

http://www.mugsy.org/asa_faq/

and in particular...

http://www.mugsy.org/asa_faq/getting_along/index.shtml

I dare say someone else here has Aspergers, or at least knows someone. Everyone does these days. It's not always a big deal. I'm having problems, but I really don't want to go on about them here.

Just wanted to make the point that any apparent tone you may pick up from what I write is usually random noise. Sure I criticise things, but it's not that serious. Almost all humour is based around some kind of criticism, directed either inward or outward. I just can't get the tone right is all.

-- 
Remove 'wants' and 'nospam' from e-mail.
September 14, 2006
On Wed, 13 Sep 2006 18:35:32 -0700, Steve Horne <stephenwantshornenospam100@aol.com> wrote:

> I can say things in a way that causes offense sometimes. I can be too
> strong in defending opinions when I really don't care that much, for
> instance. Picking silly holes in arguments, out of a pure love of
> absurdity. And it doesn't help that I have an overformal way of saying
> things that I've been told is like being lectured at all the time.
>
> I mentioned the word Aspie in another post. That's as in Aspergers
> Syndrome. For info on how and why we end up unintentionally upsetting
> people, try...
>
> http://www.mugsy.org/asa_faq/
>
> and in particular...
>
> http://www.mugsy.org/asa_faq/getting_along/index.shtml
>
> I dare say someone else here has Aspergers, or at least knows someone.
> Everyone does these days. It's not always a big deal. I'm having
> problems, but I really don't want to go on about them here.
>
> Just wanted to make the point that any apparent tone you may pick up
> from what I write is usually random noise. Sure I criticise things,
> but it's not that serious. Almost all humour is based around some kind
> of criticism, directed either inward or outward. I just can't get the
> tone right is all.
>


Strange.  I didn't find your tone offensive.  It sounded exactly as your prior warning promised - a little ranting and frustration... no big deal at all.  No need to blame it on Asperger's.  I've heard much worse here from others (which would indicate that this list is full of people plagued with something much more ominous than Asperger's). ;)

Your post was well within the toleration margin of this group.  Don't worry about it.  And if that's the worst Asperger's can do to a person... well, let's just say you aren't so badly off after all. There are a whole lot of people NOT "diagnosed" with Asperger's who have a knack for offending people.  The test of good character is perhaps not whether offense happens, but whether one cares enough to make amends once it is discovered.  I guess you are merely saying that it's difficult for you to tell when you've "crossed the line"?  If so... welcome to the reality of most humans. :D

-JJR