Thread overview
print non-ASCII/UTF-8 string
Dec 22, 2006
Egor Starostin
Dec 22, 2006
Pragma
Dec 22, 2006
Egor Starostin
Dec 22, 2006
Thomas Kuehne
Dec 22, 2006
BCS
Dec 23, 2006
Bruno Medeiros
December 22, 2006
Let's say that file q.txt contains some bytes greater than 0x7f (for
example, text in the windows-1252 encoding).
In that case the following snippet:
***
import std.stdio;
import std.stream;
void main() {
  Stream f = new BufferedFile("q.txt");
  foreach (char[] l; f) {
    writefln(l);
  }
}
***
will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
UTF-8, right?

My question is: is there any way to print out non-UTF-8 data in exactly the same encoding (which may be unknown) as in the original file?
December 22, 2006
Egor Starostin wrote:
> Let's say that file q.txt contains some characters bigger than 0x7f (for
> example, from windows-1252 encoding).
> In such case the following snippet:
> ***
> import std.stdio;
> import std.stream;
> void main() {
>   Stream f = new BufferedFile("q.txt");
>   foreach (char[] l; f) {
>     writefln(l);
>   }
> }
> ***
> will fail with 'Error: 4invalid UTF-8 sequence' because D's strings are in
> UTF-8, right?
> 
> My question is: is there any way to print out non-UTF-8 data exactly in the
> same encoding (which may be unknown) as in original file?

It's funny that you should bring this up now.  I had a thread over in d.D.learn regarding this very thing.  The following should help you get started:

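// Converts ISO-8859-1 (Latin-1) bytes to UTF-8. Latin-1 byte values equal
// their Unicode code points, so every byte >= 0x80 becomes a two-byte sequence.
char[] Latin1ToUTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            result ~= ch;                              // ASCII: copy unchanged
        }
        else{
            result ~= cast(char)(0xC0 | (ch >> 6));    // lead byte
            result ~= cast(char)(0x80 | (ch & 0x3F));  // continuation byte
        }
    }
    return result;
}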

(this could be optimized to use fewer concatenations, but I think it gets the point across)
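
For what it's worth, here is a rough sketch of the "fewer concatenations" idea: allocate the worst-case size once and slice at the end. The name latin1ToUtf8Prealloc is made up for illustration, the code assumes a current D2 compiler rather than the 2006 one, and it still only handles true Latin-1 input (windows-1252 differs from Latin-1 in the 0x80-0x9F range).

```d
// Sketch only: latin1ToUtf8Prealloc is a hypothetical name, not a Phobos function.
// Assumes the input bytes are ISO-8859-1 (Latin-1), where byte value == code point.
char[] latin1ToUtf8Prealloc(const(ubyte)[] value)
{
    // Worst case: every input byte expands to two UTF-8 bytes.
    auto result = new char[](value.length * 2);
    size_t n = 0;
    foreach (b; value)
    {
        if (b < 0x80)
        {
            result[n++] = cast(char) b;                   // ASCII passes through
        }
        else
        {
            result[n++] = cast(char)(0xC0 | (b >> 6));    // lead byte
            result[n++] = cast(char)(0x80 | (b & 0x3F));  // continuation byte
        }
    }
    return result[0 .. n];  // trim the unused tail
}
```

The single allocation trades some memory (up to twice the input size) for not reallocating inside the loop.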

I have no clue how to work from other code pages, as I gather the transform would be far less straightforward than it is for Latin-1.

Also, I have no idea how to *detect* which code page is being used based on the input alone. I don't even know if that's possible; like you, I'd love to hear about it should someone else know of an algorithm.
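
One partial check that is possible: you can at least test whether the bytes are well-formed UTF-8. That says nothing about *which* single-byte code page a non-UTF-8 file uses, so treat this as a sketch of a first step only. looksLikeUtf8 is a made-up helper name, and the code assumes a current D2 Phobos, where std.utf.validate throws UTFException on malformed input.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: returns true if the bytes form well-formed UTF-8.
// It cannot tell Latin-1 from windows-1252 or any other 8-bit code page.
bool looksLikeUtf8(const(ubyte)[] data)
{
    try
    {
        // validate() throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}
```

Pure ASCII input passes this check too, since ASCII is a subset of UTF-8.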

-- 
- EricAnderton at yahoo
December 22, 2006
> > My question is: is there any way to print out non-UTF-8 data exactly in the same encoding (which may be unknown) as in original file?
> It's funny that you should bring this up now.  I had a thread over in
> d.D.learn regarding this very thing.  The following should help you get
> started:
> char[] Latin1ToUTF8(char[] value){
It's not my case, I think.

I don't need to convert to UTF-8. I just need to print the raw string exactly as it appears in the original file.
December 22, 2006
"Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...

> I don't need to convert to UTF-8. I just need to raw print exactly the
> same string
> as in original file.

Hm.  This might be one case where printf is actually useful:

foreach(char[] l; f)
    printf("%s\n", toStringz(l));


December 22, 2006
Jarrett Billingsley wrote on 2006-12-22:
> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
>
>> I don't need to convert to UTF-8. I just need to raw print exactly the
>> same string
>> as in original file.
>
> Hm.  This might be one case where printf is actually useful:
>
> foreach(l; f)
>     printf("%s\n", toStringz(l));

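This should work more reliably and consume fewer resources, since it needs no zero-terminated copy of the line (the precision argument of %.*s must be an int, hence the cast):
      printf("%.*s\n", cast(int) l.length, l.ptr);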

Thomas


December 22, 2006
Thomas Kuehne wrote:
> Jarrett Billingsley wrote on 2006-12-22:
>> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
>>
>>> I don't need to convert to UTF-8. I just need to raw print exactly the same string
>>> as in original file.
>> Hm.  This might be one case where printf is actually useful:
>>
>> foreach(l; f)
>>     printf("%s\n", toStringz(l)); 
> 
> This should work more reliable and consume less resources:
>       printf("%.*s\n", l.length, l.ptr);
> 
> Thomas

This works as well, but only because the array's length and pointer happen to be stored in the order that printf expects for %.*s:

printf("%.*s\n", l);
December 23, 2006
Jarrett Billingsley wrote:
> "Egor Starostin" <egorst@gmail.com> wrote in message news:emgvkj$1ll6$1@digitaldaemon.com...
> 
>> I don't need to convert to UTF-8. I just need to raw print exactly the same string
>> as in original file.
> 
> Hm.  This might be one case where printf is actually useful:
> 
> foreach(l; f)
>     printf("%s\n", toStringz(l)); 
> 
> 

Or rather:
  dout.write(cast(ubyte[]) line);
?
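
For anyone reading this with a newer compiler: a rough sketch of the same "just copy the bytes" idea, assuming current D2 Phobos (std.stdio) instead of std.stream. copyRaw is a made-up helper name; File.byChunk and File.rawWrite are the Phobos calls it relies on. Nothing here decodes the data, so the output is in whatever encoding the input was.

```d
import std.stdio : File, stdout;

// Hypothetical helper: copies every byte of `path` to `sink` unchanged.
// No decoding happens, so invalid UTF-8 in the input is not an error.
void copyRaw(string path, File sink)
{
    // Read fixed-size binary chunks and write them back out as-is.
    foreach (ubyte[] chunk; File(path, "rb").byChunk(4096))
        sink.rawWrite(chunk);
}
```

Calling copyRaw("q.txt", stdout) from main would then print the file byte for byte.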

-- 
Bruno Medeiros - MSc in CS/E student
http://www.prowiki.org/wiki4d/wiki.cgi?BrunoMedeiros#D