Thread overview
[phobos] Transcoded text stdio
Aug 06, 2010
Shin Fujishiro
Aug 06, 2010
SHOO
Sep 17, 2010
Shin Fujishiro
Sep 17, 2010
Shin Fujishiro
August 06, 2010
Hello,

I'm trying to integrate a codeset conversion facility into std.stdio, but how should it be done?

Mixing transcoded and non-transcoded (UTF-8) I/O in the same File structure will mess up the source.  I think separating UTF-8 based I/O and transcoded I/O is necessary.

I can think of the following four ways.

1.  Integrate everything in the File anyway.

2.  Make File always perform conversion.

3.  Create a distinct type for transcoded I/O.
----------
    shared TranscodedFile stdout;
    stdout.writeln("Hallå, Värld!");
----------
# http://github.com/sinfu/misc/blob/master/stdio/test01.d

4.  Simplify File and define upper-layer structures on top of it.
----------
    // File itself doesn't provide byLine etc.
    shared File stdout;

    // these 'ports' perform actual I/O for specific purposes
    shared UTF8TextIOPort stdoutUTF8;
    shared NativeTextIOPort stdoutText;
    shared BinaryIOPort stdoutBin;

    // wrap stdout with various 'I/O ports'
    stdoutUTF8 = UTF8TextIOPort(stdout);
    stdoutText = NativeTextIOPort(stdout);
    stdoutBin = BinaryIOPort(stdout);

    // write text in UTF-8
    stdoutUTF8.writeln("Hallå, Värld!");

    // write text in console encoding
    stdoutText.writeln("Hallå, Värld!");

    // free functions use stdoutText
    writeln("Hallå, Värld!");
----------
# http://github.com/sinfu/misc/blob/master/stdio/test02.d

...

I'm not sure which is best.  Perhaps there are more reasonable ways.  What do you think?  Any ideas?


Thanks,
Shin
August 06, 2010
I like #4.

I think we should start developing a specification for the interface of D's new I/O.


> _______________________________________________
> phobos mailing list
> phobos at puremagic.com
> http://lists.puremagic.com/mailman/listinfo/phobos
September 17, 2010
Hi Shin and everyone,


Regarding transcoding output, please let me know whether I understand the problem correctly: under Windows (and possibly under other OSs in certain configurations) the console is not UTF and cannot reasonably be forced to be UTF.

I think for such situations, the classic Decorator-based design with stacked interfaces works well: you have a TranscodingStream wrapping a NativeStream or a UTFStream or whatever.

The streaming interface question comes again, i.e. what is the interface that allows such stacking with minimal cost in efficiency?

File was not designed for transcoding, but as long as it supports raw reads and writes, I think writing a wrapper over it should be possible. I'm talking about something like this:

auto nativeStdout = nativeTranscoder(stdout);
nativeStdout.writeln("yah");

The native transcoder would only use rawWrite and flush for stdout - not the higher level text functions.
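
A minimal sketch of such a wrapper, under stated assumptions: Latin-1 stands in for the console's native encoding, and the names latin1Byte, Latin1Transcoder and latin1Transcoder are hypothetical, not existing Phobos API. The wrapper touches the wrapped File only through rawWrite and flush.

```d
import std.stdio : File, stdout;

/// Maps one code point to a Latin-1 byte; anything outside Latin-1 becomes '?'.
/// (Latin-1 is an assumption standing in for the real native codeset.)
ubyte latin1Byte(dchar c)
{
    return c <= 0xFF ? cast(ubyte) c : cast(ubyte) '?';
}

/// Hypothetical decorator over File: accepts UTF-8 text, emits Latin-1 bytes.
struct Latin1Transcoder
{
    File file;

    void write(const(char)[] text)
    {
        ubyte[] bytes;
        bytes.reserve(text.length);
        foreach (dchar c; text)      // foreach decodes the UTF-8 input
            bytes ~= latin1Byte(c);
        if (bytes.length != 0)
            file.rawWrite(bytes);    // raw bytes only, no text-level functions
    }

    void writeln(const(char)[] text)
    {
        write(text);
        write("\n");
        file.flush();
    }
}

/// Hypothetical factory, shaped like the nativeTranscoder(stdout) call above.
auto latin1Transcoder(File f)
{
    return Latin1Transcoder(f);
}

void main()
{
    auto nativeStdout = latin1Transcoder(stdout);
    nativeStdout.writeln("yah");
}
```

A real native transcoder would replace latin1Byte with a conversion to whatever code page the console reports; the wrapping itself would stay the same.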

Then we can define more sophisticated transcoders, e.g. one that transcodes from UTF to some Eurasian codepages etc.

Works?


Andrei

September 18, 2010
Thank you for picking up the topic!

Andrei Alexandrescu <andrei at erdani.com> wrote:
> Regarding transcoding output, please let me know whether I understand the problem correctly: under Windows (and possibly under other OSs in certain configurations) the console is not UTF and cannot reasonably be forced to be UTF.

Yes.  Neither input nor output is UTF under Windows.

> I think for such situations, the classic Decorator-based design with stacked interfaces works well: you have a TranscodingStream wrapping a NativeStream or a UTFStream or whatever.
> 
> The streaming interface question comes again, i.e. what is the interface that allows such stacking with minimal cost in efficiency?

The cost is minimal when the transcoder has direct access to both the I/O device and the buffer.  I mean, there would be no redundant copy involved:

    ubyte[N] tmp = void;
    convert(buffer, tmp);
    device_write(tmp);

So, the best layer for doing converted (or filtered) I/O is the stream buffer.  But I feel that's not quite right... the buffering layer might be too 'low level' for character code conversion.
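
The three-line fragment above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: the buffer holds decoded code points, Latin-1 stands in for the native codeset, the scratch size N is arbitrary, and convert and flushConverted are hypothetical names. The point is that converted bytes go from the buffer through one stack scratch array straight to the device, with no intermediate heap copy.

```d
import std.stdio : File, stdout;

enum N = 64; // scratch size in bytes; an arbitrary choice for this sketch

/// Hypothetical conversion step: converts as many code points from `src` as
/// fit into `dst` (code points outside Latin-1 become '?') and returns how
/// many were converted.
size_t convert(const(dchar)[] src, ubyte[] dst)
{
    immutable n = src.length < dst.length ? src.length : dst.length;
    foreach (size_t i; 0 .. n)
        dst[i] = src[i] <= 0xFF ? cast(ubyte) src[i] : cast(ubyte) '?';
    return n;
}

/// Drains the text buffer to the device one scratch-sized chunk at a time.
void flushConverted(const(dchar)[] buffer, File device)
{
    ubyte[N] tmp = void; // left uninitialized; convert() fills what gets written
    while (buffer.length != 0)
    {
        immutable n = convert(buffer, tmp[]);
        device.rawWrite(tmp[0 .. n]);
        buffer = buffer[n .. $];
    }
}

void main()
{
    flushConverted("naïve café\n"d, stdout);
}
```

A variable-width target encoding would need convert to report consumed input and produced output separately, which this fixed-width stand-in avoids.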


Shin
September 18, 2010
By the way:  For now, how about working around the Windows console problem by putting the following code in LockingTextWriter?

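// workaround
if (fps == core.stdc.stdio.stdout && orientation <= 0)
{
    // query the console code page once, not once per character
    immutable cp = GetConsoleOutputCP();
    foreach (dchar c; writeme)
    {
        wchar[2] wc;    // one code point is at most two UTF-16 units
        char[16] mb;    // multibyte output in the console code page
        immutable wcLen = encode(wc, c);
        immutable mbLen = WideCharToMultiByte(
                cp, 0, wc.ptr, cast(int) wcLen,
                mb.ptr, cast(int) mb.length, null, null);
        // named 'b' because D rejects shadowing the outer loop variable 'c'
        foreach (char b; mb[0 .. mbLen])
        {
            FPUTC(b, handle);
        }
    }
}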

Although the long-term solution is a conversion-aware I/O system, we should make sure that the following works under Windows:

import std.stdio;
void main()
{
    writeln("Hallå, Värld!");
}


Shin
January 02, 2011
We definitely need to keep an eye on this issue while the new stream design is in flux. Transcoding Windows I/O is a major application.

Andrei

January 02, 2011
Shin, I think you may put this into Phobos. Please make sure you do so only on version(Windows).

Andrei
