Thread overview: writefln and ASCII
  Sep 12, 2006 - Serg Kovrov
  Sep 12, 2006 - Oskar Linde
  Sep 12, 2006 - Marcin Kuszczak
  Sep 12, 2006 - Serg Kovrov
  Sep 13, 2006 - Steve Horne
  Sep 13, 2006 - nobody
  Sep 13, 2006 - Steve Horne
  Sep 13, 2006 - nobody
  Sep 14, 2006 - Steve Horne
  Sep 14, 2006 - John Reimer
  Sep 14, 2006 - Steve Horne
  Sep 14, 2006 - nobody
  Sep 14, 2006 - John Reimer
  Sep 15, 2006 - John Reimer
  Sep 17, 2006 - Steve Horne
  Sep 14, 2006 - Don Clugston
  Sep 15, 2006 - Steve Horne
  Sep 15, 2006 - John Reimer
  Sep 15, 2006 - Georg Wrede
  Sep 14, 2006 - nobody
September 12, 2006
How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

Is there a standard routine to convert such ASCII characters to UTF, or another way to get a valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...

-- 
serg.
September 12, 2006
Serg Kovrov wrote:
> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

0x96 is not valid ASCII. Nothing above 0x7F is valid ASCII (7-bit).

> Is there a standard routine to convert such ASCII characters to UTF, or other way to get valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...

If your raw data is Latin-1 (ISO 8859-1):

import std.utf;

ubyte[] src;  // raw Latin-1 bytes
char[] dst;   // UTF-8 output
foreach (s; src)
    std.utf.encode(dst, cast(dchar) s);
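The loop works because the first 256 Unicode code points coincide with Latin-1, so promoting each byte to a code point and encoding it always yields valid UTF-8. A quick Python sketch of the same idea (illustrative only; the thread's code is D):

```python
# Latin-1 bytes map one-to-one onto the first 256 Unicode code points,
# so byte-by-byte promotion always produces valid UTF-8.
raw = bytes(range(256))               # every possible byte value
text = raw.decode("latin-1")          # byte i becomes code point i
utf8 = text.encode("utf-8")           # bytes >= 0x80 become 2-byte sequences

assert all(ord(c) == i for i, c in enumerate(text))
assert len(utf8) == 128 + 2 * 128     # 0x00-0x7F: 1 byte; 0x80-0xFF: 2 bytes
assert utf8.decode("utf-8") == text   # round-trips cleanly
```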


/Oskar
September 12, 2006
Serg Kovrov wrote:

> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).
> 
> Is there a standard routine to convert such ASCII characters to UTF, or other way to get valid UTF string from arbitrary raw data? Filter or substitute bad characters, etc...
> 

First - explanation:
If you have a file with invalid UTF-8 characters, it means that the file is
in a specific local encoding. It's not an ASCII file, as ASCII only covers
character codes 0..127.

Second - how to cope with such files:
I used to convert files in a local encoding using std.windows.charset. There
are two functions which will be useful:

char[] fromMBSz(char* s, int codePage = 0);          // local encoding to UTF-8
char* toMBSz(char[] s, uint codePage = cast(uint)0); // UTF-8 to local encoding

In my case encoding was 1250.
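For what it's worth, the effect of this kind of codepage conversion can be sketched in Python (illustrative, not the D API). In the Windows-125x codepages, the 0x96 byte from the original question really is the en dash:

```python
# 0x96 is not ASCII and not valid UTF-8 on its own, but in the
# Windows-1252 (and 1250) codepages it is the EN DASH character.
dash = b"\x96".decode("cp1252")
assert dash == "\u2013"                        # U+2013 EN DASH
assert b"\x96".decode("cp1250") == dash        # same slot in codepage 1250
assert dash.encode("utf-8") == b"\xe2\x80\x93" # the valid UTF-8 spelling
```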

-- 
Regards
Marcin Kuszczak
(Aarti_pl)
September 12, 2006
Marcin Kuszczak wrote:
> char[] fromMBSz(char* s, int codePage = 0);          // local encoding to UTF-8
> char* toMBSz(char[] s, uint codePage = cast(uint)0); // UTF-8 to local encoding

Thanks Marcin, fromMBSz() is just fine.

In my case I don't care about the correct codepage - I just want to dump the contents to the console as part of a debug trace.

-- 
serg.
September 13, 2006
On Tue, 12 Sep 2006 15:03:20 +0300, Serg Kovrov <kovrov@no.spam> wrote:

>How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).

Just to add some angry ranting to what has already been said...

First, there was ASCII. ASCII had character codes 0 to 127.

Then, there was a whole bunch of codepages - different character sets for different countries. These exploited characters 128 to 255, but each codepage defined the characters differently. Some codepages had multi-byte characters.

Then, there was Unicode. Unicode was supposed to make things easier. But instead, it made things harder.

There are over a million possible codes in Unicode. A code does not necessarily represent a character - it may take several codes in sequence to make one (for example, applying diacritics).

In addition, the codes are just numbers. There are several different ways to encode them into streams of bytes or whatever.

In UTF-8, several bytes may be needed to specify a Unicode code, and several codes may be needed to specify a single character. I don't think any codepage has that level of complexity.
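Both layers of multiplicity are easy to observe; a Python sketch (illustrative):

```python
# One on-screen character can need several code points, and each code
# point can need several UTF-8 bytes.
e_acute = "e\u0301"                         # 'e' + COMBINING ACUTE ACCENT
assert len(e_acute) == 2                    # two code points...
assert len(e_acute.encode("utf-8")) == 3    # ...three UTF-8 bytes (1 + 2)
assert len("\u00e9".encode("utf-8")) == 2   # precomposed e-acute: one code point, 2 bytes
```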

Oh, and let's not forget UTF-16, UTF-32 and the other encodings.

And then, of course, understanding over a million codes is a tall order. Especially since the number of assigned codes isn't fixed, even now. Back when there were 40,000 or so codes, people thought the set was almost complete - hah hah hah! So Unicode explicitly allows applications and operating systems to support only a subset of the possible codes. So you have the Windows XP subset, the MacOS subset, the GTK subset etc etc etc.

And then there are byte-order marks to worry about!

So much for easier. So much for one standard.


Anyway, the first 256 Unicode codes match the characters of Latin-1 (ISO 8859-1). Because of the way UTF-8 encoding works, a genuine ASCII file is also a valid UTF-8 file. But a file that uses characters 128 to 255 from any codepage, Latin-1 included, is not a valid UTF-8 file.
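Both halves of that claim can be checked directly; a Python sketch (illustrative):

```python
# Pure 7-bit ASCII is automatically valid UTF-8; a lone byte >= 0x80 is not.
ascii_data = bytes(range(128))
assert ascii_data.decode("utf-8") == ascii_data.decode("ascii")

try:
    b"\x96".decode("utf-8")    # the original question's raw ndash byte
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid               # 0x96 alone is malformed UTF-8
```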

-- 
Remove 'wants' and 'nospam' from e-mail.
September 13, 2006
Steve Horne wrote:
> On Tue, 12 Sep 2006 15:03:20 +0300, Serg Kovrov <kovrov@no.spam>
> wrote:
> 
>> How do I writefln a string from an ASCII file containing characters that are illegal UTF-8, but legal as ASCII? For example the ndash symbol (ASCII 0x96).
> 
> Just to add some angry ranting to what has already been said...

I can understand your frustration. I felt the same way you did for a while. The thing that changed my mind was realizing that Unicode has some great features.

Unicode threads do have a tendency to be rather long so here is my short contribution up front. UTF-8 is great if you can be fairly sure you will only be using ASCII data. UTF-16 is great for almost every writing system that is currently used on the planet Earth.

> 
> Then, there was a whole bunch of codepages - different character sets
> for different countries. These exploited characters 128 to 255, but
> each codepage defined the characters differently. Some codepages had
> multi-byte characters.

  Unicode is not so bad
  Unicode 不是那么坏
  Unicode δεν είναι τόσο κακό
  Unicode はあまり悪くない
  Unicode не настолько плох

As I write this message I see English, Chinese, Greek, Japanese and Russian characters displayed. My preferred text editor (TextPad) uses codepages and makes me pick only one of Chinese, Greek, Japanese or Russian to display. With Unicode it is possible to read and write all of the above.

If you think Unicode is overly complex then perhaps you should have a go at writing some code to display this message correctly using codepages. You will need to identify codepage boundaries, and then you can probably use frequency tables to identify the codepage used within each boundary. You might want to mitigate the high error rates by also checking dictionaries appropriate for each codepage. Of course dictionaries only go so far, so you might also need to know how each language and its dialects inflect words.
September 13, 2006
On Wed, 13 Sep 2006 10:55:42 -0400, nobody <nobody@mailinator.com> wrote:

>When I wrote this message I see English, Chinese, Greek, Japanese and Russian characters displayed.

Obviously, yes. I just think Unicode could have been simpler. And perhaps it doesn't really need codepoints for characters in languages and dialects that haven't been in use for a couple of thousand years.

>You will need to identify codepage boundaries and then you can probably use frequency tables to identify the codepage used within each boundary.

Metadata. When your document cannot be represented as a simple text file, use something else.

-- 
Remove 'wants' and 'nospam' from e-mail.
September 13, 2006
Steve Horne wrote:
> On Wed, 13 Sep 2006 10:55:42 -0400, nobody <nobody@mailinator.com>
> wrote:
> 
>> When I wrote this message I see English, Chinese, Greek, Japanese and Russian characters displayed.
> 
> Obviously, yes. I just think Unicode could have been simpler.

I think they kept it as simple as was reasonably possible. Once you admit a need to use more than a single byte to represent an entity then any solution is going to have the same complications.

They really did need to remain backwards compatible with ASCII while also allowing the bulk of non-ASCII to be represented as 2 bytes.

UTF-8 is free of endian ambiguity and is fully compatible with ASCII data, but might use as many as 4 bytes to represent a single Unicode code point. UTF-16 represents the bulk of code points actually used in the world with only 2 bytes, but as with any data using more than one byte it has to address endian ambiguity.
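A Python sketch of the UTF-16 endian issue (illustrative):

```python
# UTF-16 covers most living scripts in 2 bytes per code point, but the
# byte order must be agreed on, or signalled with a byte-order mark.
ch = "\u4e2d"                                  # a BMP CJK character
assert len(ch.encode("utf-8")) == 3            # 3 bytes in UTF-8
assert ch.encode("utf-16-le") == b"\x2d\x4e"   # little-endian
assert ch.encode("utf-16-be") == b"\x4e\x2d"   # big-endian: same bytes, swapped
assert ch.encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff")  # leading BOM
```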

> 
>> You will need to identify codepage boundaries and then you can probably use frequency tables to identify the codepage used within each boundary.
> 
> Metadata. When your document cannot be represented as a simple text
> file, use something else.
> 

It is my opinion that if you need metadata in addition to textual data then your method of representing textual data is inadequate.

To freely mix data from any codepage you would probably use something like an escape code. If you were really sly you would use ASCII as a default code page and then let the highest bit being set represent an escape code - which is exactly how UTF-8 starts out. How would you imagine filling out the rest?
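That is exactly how UTF-8's lead bytes work; a Python sketch (illustrative):

```python
# A clear high bit means "plain ASCII"; a set high bit starts a
# multi-byte sequence whose lead byte encodes its own length.
seq = "\u2013".encode("utf-8")                 # EN DASH
assert seq == b"\xe2\x80\x93"
assert seq[0] >> 5 == 0b111                    # lead byte 1110xxxx: 3-byte sequence
assert all(b >> 6 == 0b10 for b in seq[1:])    # continuations are 10xxxxxx
assert all(b < 0x80 for b in "plain ASCII".encode("utf-8"))  # ASCII passes through
```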
September 14, 2006
On Wed, 13 Sep 2006 14:17:13 -0400, nobody <nobody@mailinator.com> wrote:

>> Metadata. When your document cannot be represented as a simple text file, use something else.
>> 
>
>It is my opinion that if you need metadata in addition to textual data then your method of representing textual data is inadequate.

Ah. So you believe that HTML and XML are garbage, then. Along with all binary word-processor document files.

But then, Unicode is inadequate also. You need additional metadata for anything beyond the simplest text. Unicode gives you a huge selection of characters, but it can't specify paragraphs styles etc.

>I am certain that to freely mix data from any codepage you would probably use something like an escape code.

That would be the most cryptically compressed form of metadata, I suppose. But why compress the metadata at the expense of the character data?

Switching languages and codepages is a relatively rare thing. Most documents don't do it at all. Even those that do are hardly likely to switch every other character.

By the Huffman compression principle of representing the most frequent things with the smallest codes, the logical thing to do is to have single-byte characters as much as possible and use a multibyte sequence - a tag - to select codepages.
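That tradeoff is exactly the one UTF-8 and UTF-16 make, and it is measurable; a Python sketch (illustrative):

```python
# UTF-8 favours ASCII-heavy text; UTF-16 favours CJK-heavy text.
english = "hello world"
chinese = "\u4f60\u597d\u4e16\u754c"           # four BMP CJK characters

assert len(english.encode("utf-8")) == 11      # 1 byte per character
assert len(english.encode("utf-16-le")) == 22  # 2 bytes per character
assert len(chinese.encode("utf-8")) == 12      # 3 bytes per character
assert len(chinese.encode("utf-16-le")) == 8   # 2 bytes per character
```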


I'm getting the feeling that I've given the wrong impression. Just for the record, I posted some ranting because I have too much time on my hands. And I really do believe that unicode could have been simpler. That doesn't mean I'm saying it's useless, and you shouldn't take it all too seriously.

I can say things in a way that causes offense sometimes. I can be too strong in defending opinions when I really don't care that much, for instance. Picking silly holes in arguments, out of a pure love of absurdity. And it doesn't help that I have an overformal way of saying things that I've been told is like being lectured at all the time.

I mentioned the word Aspie in another post. That's as in Asperger's Syndrome. For info on how and why we end up unintentionally upsetting people, try...

http://www.mugsy.org/asa_faq/

and in particular...

http://www.mugsy.org/asa_faq/getting_along/index.shtml

I dare say someone else here has Aspergers, or at least knows someone. Everyone does these days. It's not always a big deal. I'm having problems, but I really don't want to go on about them here.

Just wanted to make the point that any apparent tone you may pick up from what I write is usually random noise. Sure I criticise things, but it's not that serious. Almost all humour is based around some kind of criticism, directed either inward or outward. I just can't get the tone right is all.

-- 
Remove 'wants' and 'nospam' from e-mail.
September 14, 2006
On Wed, 13 Sep 2006 18:35:32 -0700, Steve Horne <stephenwantshornenospam100@aol.com> wrote:

> I can say things in a way that causes offense sometimes. I can be too
> strong in defending opinions when I really don't care that much, for
> instance. Picking silly holes in arguments, out of a pure love of
> absurdity. And it doesn't help that I have an overformal way of saying
> things that I've been told is like being lectured at all the time.
>
> I mentioned the word Aspie in another post. That's as in Aspergers
> Syndrome. For info on how and why we end up unintentionally upsetting
> people, try...
>
> http://www.mugsy.org/asa_faq/
>
> and in particular...
>
> http://www.mugsy.org/asa_faq/getting_along/index.shtml
>
> I dare say someone else here has Aspergers, or at least knows someone.
> Everyone does these days. It's not always a big deal. I'm having
> problems, but I really don't want to go on about them here.
>
> Just wanted to make the point that any apparent tone you may pick up
> from what I write is usually random noise. Sure I criticise things,
> but it's not that serious. Almost all humour is based around some kind
> of criticism, directed either inward or outward. I just can't get the
> tone right is all.
>


Strange.  I didn't find your tone offensive.  It sounded exactly as your prior warning promised - a little ranting and frustration... no big deal at all.  No need to blame it on Asperger's.  I've heard much worse here from others (which would indicate that this list is full of people plagued with something much more ominous than Asperger's). ;)

Your post was well within the toleration margin of this group.  Don't worry about it.  And if that's the worst Asperger's can do to a person... well, let's just say you aren't so badly off after all. There are a whole lot of people NOT "diagnosed" with Asperger's who have a knack for offending people.  The test of good character is perhaps not whether offense happens, but whether one cares enough to make amends once it is discovered.  I guess you are merely saying that it's difficult for you to tell when you've "crossed the line"?  If so... welcome to the reality of most humans. :D

-JJR