Streams and encoding - D Programming Language Discussion Forum

August 03, 2004

Streams and encoding

Posted by Sean Kelly

I finally got back on my stream mods today and had a question:  how should the wrapper class know the encoding scheme of the low-level data?

For example, say all of the formatted IO code is in a mixin or base class (assume base class for the sake of discussion) that calls a read(void*, size_t) or write(void*, size_t) method in the derived class.  Now say I want to read a char, wchar, or dchar from the stream.  How many bytes should I read and how do I know what the encoding format is?  C++ streams handle this fairly simply by making the char type a template parameter:

# class Stream(CharT) {
#     Stream get(CharT) {}
#     Stream put(CharT) {}
# }

This has the obvious limitation that the programmer must instantiate the proper type of stream for the data format he is trying to read (as there is only one get/put method for any char type: CharT).  But it makes things pretty explicit: Stream!(char) means "this is a stream formatted in UTF8."
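As a rough illustration of the template approach in present-day D, here is a minimal sketch. The class name MemoryStream, the in-memory buffer, and the unitSize constant are assumptions made for the example, not part of any existing library. It shows how the element type answers the "how many bytes" question at compile time:

```d
/// Hypothetical read-only stream over an in-memory buffer of code units.
/// CharT = char means UTF-8, wchar means UTF-16, dchar means UTF-32.
class MemoryStream(CharT)
{
    private const(CharT)[] data;
    private size_t pos;

    /// Number of bytes consumed by each successful get() call.
    enum unitSize = CharT.sizeof;

    this(const(CharT)[] data) { this.data = data; }

    /// Reads one code unit; returns false once the data is exhausted.
    bool get(out CharT c)
    {
        if (pos >= data.length)
            return false;
        c = data[pos++];
        return true;
    }
}

unittest
{
    // The instantiation fixes both the encoding and the unit size.
    static assert(MemoryStream!char.unitSize == 1);
    static assert(MemoryStream!wchar.unitSize == 2);
    static assert(MemoryStream!dchar.unitSize == 4);

    auto s16 = new MemoryStream!wchar("hi"w);
    wchar w;
    assert(s16.get(w) && w == 'h');
    assert(s16.get(w) && w == 'i');
    assert(!s16.get(w));
}
```

The sketch reads code units, not whole characters; decoding multi-unit sequences would sit on top of it.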

The other option I can think of offhand would be to have a class member that the derived class could set which specifies the encoding format:

# class Stream {
#     enum Encoding{ UTF8, UTF16, UTF32 }
#     Encoding encoding;
#     this() { encoding = Encoding.UTF8; }
#     Stream get(char) {}
#     Stream get(wchar) {}
#     Stream get(dchar) {}
#     ...
# }
#
# class File: Stream {
#     void open(wchar[] filename) { encoding = Encoding.UTF16; }
# }

This has the benefit of allowing the user to read and write any char type with a single instantiation, but requires greater complexity in the Stream class and in the derived class.  And I wonder if such flexibility is truly necessary.

Any other design possibilities?  Preferences?  I'm really trying more to establish a good formatted IO design than to work out the perfect stream API.  Any other weird issues would be welcome also.


Sean

August 03, 2004

Re: Streams and encoding

Posted by Arcane Jill
in reply to Sean Kelly

In article <ceopfj$1hcl$1@digitaldaemon.com>, Sean Kelly says...
>
>I finally got back on my stream mods today and had a question:  how should the wrapper class know the encoding scheme of the low-level data?

Simple answer - it shouldn't have to.

I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.:

# Stream s = new ZipStream(new BufferedStream(new FilterStream(new Windows1252Reader(stdin))));

(or something similar). You can have factory methods to create transcoders where the encoding is not known until runtime.

Jill

August 03, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to Arcane Jill

In article <ceor8d$1ihu$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ceopfj$1hcl$1@digitaldaemon.com>, Sean Kelly says...
>>
>>I finally got back on my stream mods today and had a question:  how should the wrapper class know the encoding scheme of the low-level data?
>
>Simple answer - it shouldn't have to.

Works for me.  So how does a formatted read/write routine know which format it's targeting?

>I suggest using a specialized transcoding filter for such things. That's what Java does (Java calls them Readers and Writers), and Java's streams have been hailed as a shining example of how to do things correctly. Then your streams just connect together naturally, as others have shown in other recent threads. e.g.:
>
># Stream s = new ZipStream(new BufferedStream(new FilterStream(new Windows1252Reader(stdin))));

Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format?  i.e. there would be a UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?


Sean

August 03, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Sean Kelly

Sean Kelly wrote:

> I finally got back on my stream mods today and had a question:  how should the
> wrapper class know the encoding scheme of the low-level data?
> 

I have been wondering who was working on a Stream library. I have many thoughts, many of which are covered in OT - scanf in Java. Here are some notes:

In C (and C++ by extension I would imagine) the char type is the smallest addressable cell in memory. In D the char is a UTF-8 8-bit code unit which is quite a different thing. I would suggest you seriously consider defining basic IO using either the ubyte (which represents a general 8-bit value) or possibly the data type that is the native cell size used in memory (something like size_t I believe).

Also I have noticed the tendency for people to not make the distinction between Input and Output streams. This leads to some problems. Say I want to write a class to handle CRC32 on stream data. It is far simpler and less error prone to compute such a digest on a stream in which data flows in only one direction especially in a multi-threaded environment.

Also the Input and Output distinction allows for stream pumps that automatically pull data from one stream and push it into another. This is especially useful with bifurcating streams that also do logging.

As for the templatization of streams I believe a pair of generic data input/output stream classes can be written using templates which will do impedance matching from the 8-bit streams to the n-bit data type you want to read. So you have to write 8, 16, 32 and possibly 64 and 128 bit functions.

Here is the foundation of the stream library I imagine:
================================================================
interface DataSink {
    uint write( ubyte[] data, uint off = 0, uint len = 0);
}

interface DataSource {
    uint read( inout ubyte[] data, uint off = 0, uint len = 0);
    ulong seek( ulong size );
}
================================================================

The data being read/written by native interface classes:
================================================================
FileInputStream : DataSource
FileOutputStream : DataSink
SocketInputStream : DataSource
SocketOutputStream : DataSink
MMapInputStream : DataSource
MMapOutputStream : DataSink
================================================================

The data is then manipulated providing buffering, digesting, en/decryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
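
To make the source/sink split and the pump idea concrete, here is a minimal sketch in present-day D. It is only an illustration under stated assumptions: the signatures are simplified to plain slices and size_t (no off, len or seek), and MemorySource, MemorySink and pump are hypothetical names invented for the example.

```d
import std.algorithm.comparison : min;

interface DataSink
{
    /// Accepts data and returns the number of bytes written.
    size_t write(const(ubyte)[] data);
}

interface DataSource
{
    /// Fills as much of buf as it can; returns bytes read, 0 at end of data.
    size_t read(ubyte[] buf);
}

/// Hypothetical source backed by an in-memory array.
class MemorySource : DataSource
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] buf)
    {
        immutable n = min(buf.length, data.length);
        buf[0 .. n] = data[0 .. n];   // copy into the caller's buffer
        data = data[n .. $];          // advance past what was consumed
        return n;
    }
}

/// Hypothetical sink that simply collects everything written to it.
class MemorySink : DataSink
{
    ubyte[] collected;

    size_t write(const(ubyte)[] data)
    {
        collected ~= data;
        return data.length;
    }
}

/// Pulls from source and pushes into sink until the source is exhausted.
/// Returns the total number of bytes the sink accepted.
size_t pump(DataSource source, DataSink sink, size_t chunk = 4096)
{
    auto buf = new ubyte[chunk];
    size_t total;
    for (;;)
    {
        immutable n = source.read(buf);
        if (n == 0)
            break;
        total += sink.write(buf[0 .. n]);
    }
    return total;
}

unittest
{
    ubyte[] input = [1, 2, 3, 4, 5];
    auto sink = new MemorySink;
    assert(pump(new MemorySource(input), sink, 2) == 5);
    assert(sink.collected == input);
}
```

Because data only flows one way through each interface, a digest or logging stage could be written as a DataSink that forwards to another DataSink.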

August 03, 2004

Re: Streams and encoding

Posted by Arcane Jill
in reply to Sean Kelly

In article <ceorv6$1iti$1@digitaldaemon.com>, Sean Kelly says...

>Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format?  ie. there would be an UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?

Got it in one. Plus, you can have a factory function like createReader(char[]),
so you can do Reader r = createReader("UTF-16LE"); etc. (for when the type is
known at run time, not compile time, which is usually). The implementation of
createReader() is just a big switch statement, with each case returning a new
instance of the relevant class.
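
A minimal sketch of such a factory in present-day D might look like the following. The Reader interface, the two reader classes, and the extra source parameter on createReader are all assumptions made for the illustration; the UTF-16LE reader combines surrogate pairs but does not validate malformed input.

```d
/// Hypothetical interface: decodes bytes in some encoding into code points.
interface Reader
{
    /// Decodes the next code point; returns false when input is exhausted.
    bool read(out dchar c);
}

/// ISO-8859-1: every byte maps directly to the code point of the same value.
class Latin1Reader : Reader
{
    private const(ubyte)[] src;
    this(const(ubyte)[] src) { this.src = src; }

    bool read(out dchar c)
    {
        if (src.length == 0)
            return false;
        c = cast(dchar) src[0];
        src = src[1 .. $];
        return true;
    }
}

/// UTF-16, little-endian byte order.
class Utf16LeReader : Reader
{
    private const(ubyte)[] src;
    this(const(ubyte)[] src) { this.src = src; }

    /// Reads one 16-bit code unit, low byte first.
    private bool unit(out uint u)
    {
        if (src.length < 2)
            return false;
        u = src[0] | (src[1] << 8);
        src = src[2 .. $];
        return true;
    }

    bool read(out dchar c)
    {
        uint hi, lo;
        if (!unit(hi))
            return false;
        // A high surrogate is followed by a low surrogate; combine the pair.
        if (hi >= 0xD800 && hi <= 0xDBFF && unit(lo))
            c = cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
        else
            c = cast(dchar) hi;
        return true;
    }
}

/// The "big switch": picks a reader class from a run-time encoding name.
Reader createReader(string encoding, const(ubyte)[] src)
{
    switch (encoding)
    {
        case "ISO-8859-1": return new Latin1Reader(src);
        case "UTF-16LE":   return new Utf16LeReader(src);
        default: throw new Exception("unsupported encoding: " ~ encoding);
    }
}

unittest
{
    // 'A' followed by U+1F600 encoded as the surrogate pair D83D DE00.
    ubyte[] bytes = [0x41, 0x00, 0x3D, 0xD8, 0x00, 0xDE];
    auto r = createReader("UTF-16LE", bytes);
    dchar c;
    assert(r.read(c) && c == 'A');
    assert(r.read(c) && c == 0x1F600);
    assert(!r.read(c));
}
```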

(I swapped your questions around. Here's the first one).

>Works for me.  So how does a formatted read/write routine know which format it's targeting?

You got me there. I think the question's too vague, and the answer application-specific. Generally speaking, at some level, the encoding is known, somehow. Maybe it's specified in the text file itself (XML and HTTP pull this trick - for it to work the very start of the file must comprise only ASCII characters (although they can be encoded in a UTF)); maybe it's specified in a configuration file; maybe it's deduced using some heuristic test; maybe the OS default is assumed. At the level where the encoding is known, decode it (into UTF-8), and then you can use byte streams from then on. As parabolis said, a stream, in the abstract, deals in ubytes, not chars (because that's what you write to files, sockets, etc.). Classes which implement read() or write() in units other than ubyte shouldn't really be called "streams", which of course is why Java calls them Readers and Writers. (Maybe "filters" for the general case).

Arcane Jill

August 03, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to Arcane Jill

In article <ceou9t$1kbq$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ceorv6$1iti$1@digitaldaemon.com>, Sean Kelly says...
>
>>Okay, so all the formatted IO routines go in a Reader class and the type of the reader class determines the format?  ie. there would be an UTF8Writer, UTF8Reader, UTF16Writer, UTF16Reader, etc?
>
>Got it in one. Plus, you can have a factory function like createReader(char[]),
>so you can do Reader r = createReader("UTF-16LE"); etc. (for when the type is
>known at run time, not compile time, which is usually). The implementation of
>createReader() is just a big switch statement, with each case returning a new
>instance of the relevant class.
>
>(I swapped your questions around. Here's the first one).
>
>>Works for me.  So how does a formatted read/write routine know which format it's targeting?
>
>You got me there.

No worries.  If we've got a class per format then it knows implicitly what format to convert to/from.


Sean

August 03, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis@softhome.net> wrote:

<snip>

> Here is the foundation of the stream library I imagine:
> ================================================================
> interface DataSink {
>      uint write( ubyte[] data, uint off = 0, uint len = 0);
> }
>
> interface DataSource {
>      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
>      ulong seek( ulong size );
> }
> ================================================================

I think you need functions in the form:

  ulong write(void* data, ulong len = 0, ulong off = 0);

Notice I have changed ubyte[] to void*, swapped the order of the last two parameters and changed uint into ulong.

If you use ubyte[] you don't need len or off as you can call with:
  ubyte[] big = cast(ubyte[]) "regan was here".dup;
  write(big[6..9]);
to achieve both.

The void* allows easy specialised write functions, eg.
  bool write(int x) { return write(&x, x.sizeof) == x.sizeof; }

I'm not sure whether uint or ulong should be used, anyone got opinions/reasons for one or the other?

> The data being read/written by native interface classes:
> ================================================================
> FileInputStream : DataSource
> FileOutputStream : DataSink
> SocketInputStream : DataSource
> SocketOutputStream : DataSink
> MMapInputStream : DataSource
> MMapOutputStream : DataSink
> ================================================================
>
> The data is then manipulated providing buffering, digesting, en/decryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...

I think using template bolt-ins for this step is a great idea. For example, you simply write a File, Socket, or MMap class which implements the methods in the two interfaces above, then bolt it into your stream class, which defines all the other stream operations.

See my earlier post (with source) on how this works. Note there was a problem with it which I have since fixed, changing 'super.' to 'this.' in the stream template class.

Regan.


August 03, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to Sean Kelly

For another perspective/idea have a look at my thread entitled "My stream concept".

I use template bolt-ins.

There was a little problem with it, which was actually trivial to fix, I simply replaced the 'super.' calls with 'this.' calls.

It should also be noted that my idea was strictly for creating the base level stream classes from the various devices, i.e. File, Socket, Memory etc. The next step is to add filters (as described by Arcane Jill). I am hoping an idea will come to me as to how I can do that, without needing:
  new MemoryMap(new UTF16Filter(new Stream()));

Regan


August 03, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to Regan Heath

In article <opsb6d5aw85a2sq9@digitalmars.com>, Regan Heath says...
>
>For another perspective/idea have a look at my thread entitled "My stream concept".
>
>I use template bolt-ins.
>
>There was a little problem with it, which was actually trivial to fix, I simply replaced the 'super.' calls with 'this.' calls.
>
>It should also be noted that my idea was strictly for creating the base level stream classes from the various devices i.e. File, Socket, Memory etc. The next step is to add filters (as described by Arcane Jill) I am hoping an idea will come to me as to how I can do that, without needing:
>   new MemoryMap(new UTF16Filter(new Stream()));

My design really set out to extend the original stream approach, and it seemed the logical extension was pretty C++ like.  I ended up creating a basic set of interfaces--Stream, InputStream, and OutputStream--and putting all the implementation in templates meant to be mixins.  This was somewhat necessary to support the multiple inheritance type model.  So the input file stream looks something like this:

# class InFile : InputStream {
# mixin StreamDefs SD;
# mixin InputStreamDefs!(readFile) ISD;
# private:
#     uint readFile(void* buf, size_t size) {}
# }

Works quite well but it's very different from the Java approach.  I'm still not sure which I like better, though I'll grant that the Java version is more flexible (at the expense of verbosity).  The other potential issue is the top-heaviness of the design.  I am warming up to the idea of separate reader/writer adaptor classes.
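
For readers unfamiliar with the mixin approach, here is a minimal sketch of the pattern in present-day D. It is an illustration only and not the actual implementation discussed above: the InMemory class, the readMem primitive and the getUint helper are hypothetical names, and the value is read in host byte order.

```d
import core.stdc.string : memcpy;
import std.exception : enforce;

/// Formatted-read helpers, parameterized on a raw read primitive that the
/// host class supplies (the role readFile plays in the outline above).
mixin template InputStreamDefs(alias rawRead)
{
    /// Reads a uint as raw bytes in host byte order.
    uint getUint()
    {
        uint value;
        immutable n = rawRead(&value, uint.sizeof);
        enforce(n == uint.sizeof, "short read");
        return value;
    }
}

/// Hypothetical stream that reads from an in-memory buffer.
class InMemory
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    /// Raw primitive: copies up to size bytes into buf, returns bytes copied.
    private size_t readMem(void* buf, size_t size)
    {
        immutable n = size < data.length ? size : data.length;
        memcpy(buf, data.ptr, n);
        data = data[n .. $];
        return n;
    }

    // Bolt the formatted layer onto the raw primitive.
    mixin InputStreamDefs!readMem;
}

unittest
{
    uint original = 0xDEADBEEF;
    // View the integer's own bytes so the check is endianness-neutral.
    auto bytes = (cast(const(ubyte)*) &original)[0 .. uint.sizeof];
    auto s = new InMemory(bytes);
    assert(s.getUint() == original);
}
```

The formatted layer never learns where the bytes come from, which is the same separation the reader/writer adaptor classes provide, only fixed at compile time.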


Sean

August 03, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Regan Heath

Regan Heath wrote:
> On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis@softhome.net> wrote:
> 
> <snip>
> 
>> Here is the foundation of the stream library I imagine:
>> ================================================================
>> interface DataSink {
>>      uint write( ubyte[] data, uint off = 0, uint len = 0);
>> }
>>
>> interface DataSource {
>>      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
>>      ulong seek( ulong size );
>> }
>> ================================================================
> 
> 
> I think you need functions in the form:
> 
>   ulong write(void* data, ulong len = 0, ulong off = 0);
> 
> notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong.
> 
> If you use ubyte[] you don't need len or off as you can call with:
>   ubyte[] big = cast(ubyte[]) "regan was here".dup;
>   write(big[6..9]);
> to achieve both.

I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed. The len and off parameters allow a caller to take either approach.
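
The cost of slicing can be checked directly. The sketch below, written for present-day D, treats a slice as a pointer/length pair that refers to the original memory; the helper name sub is an assumption for the example. Compiling the helper under @nogc shows that taking a slice performs no garbage-collected allocation.

```d
/// Returns the sub-range a[lo .. hi]. The @nogc attribute makes the compiler
/// reject any GC allocation in the body, so slicing itself allocates nothing.
ubyte[] sub(ubyte[] a, size_t lo, size_t hi) @nogc nothrow
{
    return a[lo .. hi];
}

unittest
{
    ubyte[] big = cast(ubyte[]) "regan was here".dup;
    ubyte[] part = sub(big, 6, 9);            // the bytes of "was"

    assert(part.length == 3);
    assert(part.ptr == big.ptr + 6);          // same memory, no copy

    part[0] = cast(ubyte) 'W';
    assert(big[6] == 'W');                    // writes through the slice are visible

    // A slice is just a length and a pointer.
    static assert((ubyte[]).sizeof == 2 * size_t.sizeof);
}
```

Offset and length parameters remain a valid design choice; the sketch only shows that the slice form does not add work for the collector.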

> 
> The void* allows easy specialised write functions, eg.
>   bool write(int x) { return write(&x, x.sizeof) == x.sizeof; }

The void* is a pointer with no associated type. The arrays in D are infinitely better than void* pointers because arrays have extra information. As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.

>> The data being read/written by native interface classes:
>> ================================================================
>> FileInputStream : DataSource
>> FileOutputStream : DataSink
>> SocketInputStream : DataSource
>> SocketOutputStream : DataSink
>> MMapInputStream : DataSource
>> MMapOutputStream : DataSink
>> ================================================================
>>
>> The data is then manipulated providing buffering, digesting, en/decryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
> 
> 
> I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.

I argued that input and output should be clearly separated, which is my answer to why nothing should implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.
