print non-ASCII/UTF-8 string

Dec 22, 2006

Egor Starostin

Dec 22, 2006

Pragma

Dec 23, 2006

Let's say that file q.txt contains some characters bigger than 0x7f (for example, from windows-1252 encoding).
In such a case the following snippet:
***
import std.stream;
void main() {
  Stream f = new BufferedFile("q.txt");
  foreach (char[] l; f) {
    writefln(l);
  }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in UTF-8, right?

My question is: is there any way to print out non-UTF-8 data exactly in the same encoding (which may be unknown) as in the original file?

December 22, 2006

Re: print non-ASCII/UTF-8 string

Posted by Pragma
in reply to Egor Starostin

Egor Starostin wrote:
> Let's say that file q.txt contains some characters bigger than 0x7f (for
> example, from windows-1252 encoding).
> In such case the following snippet:
> ***
> import std.stream;
> void main() {
>   Stream f = new BufferedFile("q.txt");
>   foreach (char[] l; f) {
>     writefln(l);
>   }
> }
> ***
> will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
> UTF-8, right?
> 
> My question is: is there any way to print out non-UTF-8 data exactly in the
> same encoding (which may be unknown) as in original file?

It's funny that you should bring this up now.  I had a thread over in d.D.learn regarding this very thing.  The following should help you get started:

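// Convert a Latin-1 (ISO-8859-1) byte string to UTF-8.
char[] Latin1ToUTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            // ASCII range: identical in Latin-1 and UTF-8
            result ~= ch;
        }
        else{
            // 0x80..0xFF: encode as a two-byte UTF-8 sequence
            result ~= cast(char)(0xC0 | (ch >> 6));
            result ~= cast(char)(0x80 | (ch & 0x3F));
        }
    }
    return result;
}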

(this could be optimized to use fewer concatenations, but I think it gets the point across)
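
As a rough illustration of the "fewer concatenations" remark, here is a minimal sketch that allocates the output once and fills it in place. It assumes a current D compiler rather than the 2006-era one used in this thread, and the name `latin1ToUTF8` and its `const(ubyte)[]` parameter are illustrative choices, not code from the thread. It is only correct for true Latin-1 (ISO-8859-1); windows-1252 assigns different characters to 0x80..0x9F, so those bytes would need a lookup table instead.

```d
import std.stdio;

/// Converts ISO-8859-1 (Latin-1) bytes to a UTF-8 string.
string latin1ToUTF8(const(ubyte)[] input)
{
    // Worst case: every input byte becomes two output bytes.
    auto buf = new char[](input.length * 2);
    size_t n = 0;
    foreach (b; input)
    {
        if (b < 0x80)
        {
            // ASCII range is unchanged.
            buf[n++] = cast(char) b;
        }
        else
        {
            // 0x80..0xFF map to the two-byte form 110000xx 10xxxxxx.
            buf[n++] = cast(char)(0xC0 | (b >> 6));
            buf[n++] = cast(char)(0x80 | (b & 0x3F));
        }
    }
    // Trim to the bytes actually written.
    return buf[0 .. n].idup;
}

void main()
{
    ubyte[] raw = [0x63, 0x61, 0x66, 0xE9]; // "café" encoded as Latin-1
    writeln(latin1ToUTF8(raw));
}
```

The single allocation trades a little memory (up to twice the input length) for avoiding repeated reallocation while appending.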

I have no clue how to work from other code pages; I gather the transform would be far less straightforward than for Latin-1.

Also, I have no idea how to *detect* what code page is being used based on the input set. I don't even know if that's possible. Like you, I'd love to hear about it should someone else know of an algorithm.

-- 
- EricAnderton at yahoo

> > My question is: is there any way to print out non-UTF-8 data exactly in the
> > same encoding (which may be unknown) as in original file?
> It's funny that you should bring this up now. I had a thread over in
> d.D.learn regarding this very thing. The following should help you get
> started:
> char[] Latin1ToUTF8(char[] value){

It's not my case, I think. I don't need to convert to UTF-8. I just need to raw print exactly the same string as in original file.

"Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
> I don't need to convert to UTF-8. I just need to raw print exactly the
> same string
> as in original file.

Hm. This might be one case where printf is actually useful:

foreach(l; f)
    printf("%s\n", toStringz(l));

Jarrett Billingsley wrote on 2006-12-22:
> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
>
>> I don't need to convert to UTF-8. I just need to raw print exactly the
>> same string
>> as in original file.
>
> Hm. This might be one case where printf is actually useful:
>
> foreach(l; f)
>     printf("%s\n", toStringz(l));

This should work more reliably and consume fewer resources:

printf("%.*s\n", l.length, l.ptr);

Thomas

Thomas Kuehne wrote:
> Jarrett Billingsley wrote on 2006-12-22:
>> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
>>
>>> I don't need to convert to UTF-8. I just need to raw print exactly the same string
>>> as in original file.
>> Hm. This might be one case where printf is actually useful:
>>
>> foreach(l; f)
>>     printf("%s\n", toStringz(l));
>
> This should work more reliably and consume fewer resources:
> printf("%.*s\n", l.length, l.ptr);
>
> Thomas

This works as well, but only because the parts of the array (length, then pointer) happen to be in the correct order to begin with:

printf("%.*s\n", l);
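
For comparison, a minimal sketch of the explicit form of the `%.*s` call, assuming a current D compiler where `printf` comes from `core.stdc.stdio`. The helper name `printRaw` is hypothetical. The cast to `int` is there because C's `printf` expects an `int` for the `*` precision, which is not the same size as `size_t` on 64-bit targets. Note that `%.*s` still stops at an embedded NUL byte, so this is not a fully binary-safe way to print.

```d
import core.stdc.stdio : printf;

/// Prints the bytes of `line` unchanged, followed by a newline.
/// Returns printf's result (the number of bytes written).
int printRaw(const(ubyte)[] line)
{
    // Pass length and pointer explicitly instead of relying on array layout.
    return printf("%.*s\n", cast(int) line.length,
                  cast(const(char)*) line.ptr);
}

void main()
{
    // 0xE9 is 'é' in Latin-1 / windows-1252 and is not valid UTF-8 on its own.
    ubyte[] line = [0x63, 0x61, 0x66, 0xE9];
    printRaw(line);
}
```

Passing the two parts separately keeps the call independent of how the compiler lays out a dynamic array.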

Jarrett Billingsley wrote:
> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
>
>> I don't need to convert to UTF-8. I just need to raw print exactly the same string
>> as in original file.
>
> Hm. This might be one case where printf is actually useful:
>
> foreach(l; f)
>     printf("%s\n", toStringz(l));
>

Or rather:

dout.write(cast(ubyte[]) line);

?

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D
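
The suggestions in this thread all come down to treating the file contents as bytes rather than as UTF-8 text. A minimal sketch of that idea follows, assuming a current D compiler and `std.stdio.File` in place of the `std.stream` classes used above. The helper name `copyRaw` and the 4096-byte chunk size are illustrative assumptions; `q.txt` is the file name from the original post.

```d
import std.stdio;

/// Copies every byte of `input` to `output` without decoding or validation.
void copyRaw(File input, File output)
{
    // byChunk yields ubyte[] buffers, so no UTF-8 checks are involved.
    foreach (ubyte[] chunk; input.byChunk(4096))
        output.rawWrite(chunk);
}

void main()
{
    // Open in binary mode so the bytes are passed through unchanged.
    copyRaw(File("q.txt", "rb"), stdout);
}
```

Because nothing is decoded, the output has whatever encoding the input had, even when that encoding is unknown. The trade-off is that the data is no longer split into lines.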
