Streams and encoding (page 2) - D Programming Language Discussion Forum

August 03, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Sean Kelly

Sean Kelly wrote:

> Works quite well but it's very different from the Java approach.  I'm still not
> sure which I like better, though I'll grant that the Java version is more
> flexible (at the expense of verbosity).  The other potential issue is the
> top-heaviness of the design.  I am warming up to the idea of separate
> reader/writer adaptor classes.
> 

I probably should have made the argument explicit, but I do believe that handling incoming and outgoing data in the same object invites multi-threading issues. If your code is MT-safe, then you probably did much more work than you had to, with little apparent benefit.

August 03, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Sean Kelly

Sean Kelly wrote:

> Works quite well but it's very different from the Java approach.  I'm still not
> sure which I like better, though I'll grant that the Java version is more
> flexible (at the expense of verbosity).  The other potential issue is the

I also meant to say that I much prefer less verbose class names, such as FileIS and FileOS...

August 03, 2004

Re: Streams and encoding

Posted by antiAlias
in reply to Sean Kelly

What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).

As a bonus, there's a ton of functionality already built on top of mango.io, including http-server, servlet-engine, clustering, logging, local & remote object caching; even tossing remote D executable objects around a local network. The DSP project is also targeting Mango as a delivery mechanism. Check them out over at dsource.org.

I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?


"Sean Kelly" <sean@f4.ca> wrote in message news:ceopfj$1hcl$1@digitaldaemon.com...
> I finally got back on my stream mods today and had a question:  how should the
> wrapper class know the encoding scheme of the low-level data?
>
> For example, say all of the formatted IO code is in a mixin or base class
> (assume base class for the sake of discussion) that calls a read(void*, size_t)
> or write(void*, size_t) method in the derived class.  Now say I want to read a
> char, wchar, or dchar from the stream.  How many bytes should I read and how do
> I know what the encoding format is?  C++ streams handle this fairly simply by
> making the char type a template parameter:
>
> # class Stream(CharT) {
> #     Stream get(CharT) {}
> #     Stream put(CharT) {}
> # }
>
> This has the obvious limitation that the programmer must instantiate the proper
> type of stream for the data format he is trying to read (as there is only one
> get/put method for any char type: CharT).  But it makes things pretty explicit:
> Stream!(char) means "this is a stream formatted in UTF8."
>
> The other option I can think of offhand would be to have a class member that
> the derived class could set which specifies the encoding format:
>
> # class Stream {
> #     enum Encoding{ UTF8, UTF16, UTF32 }
> #     Encoding encoding;
> #     this() { encoding = Encoding.UTF8; }
> #     Stream get(char) {}
> #     Stream get(wchar) {}
> #     Stream get(dchar) {}
> #     ...
> # }
> #
> # class File: Stream {
> #     void open(wchar[] filename) { encoding = Encoding.UTF16; }
> # }
>
> This has the benefit of allowing the user to read and write any char type with a
> single instantiation, but requires greater complexity in the Stream class and in
> the Derived class.  And I wonder if such flexibility is truly necessary.
>
> Any other design possibilities?  Preferences?  I'm really trying to establish a
> good formatted IO design rather than work out the perfect stream API.  Any other
> weird issues would be welcome also.
>
>
> Sean
>
>
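
For illustration, the template option from the quoted question could look roughly like this in present-day D. This is only a sketch: `CharStream`, `units`, and `byteLength` are hypothetical names that do not come from Mango, Phobos, or any code posted in this thread. It shows how the code-unit type answers the "how many bytes should I read" question through `CharT.sizeof`.

```d
/// Hypothetical stream parameterised on its code-unit type (assumption:
/// char means UTF-8, wchar means UTF-16, dchar means UTF-32).
struct CharStream(CharT)
{
    CharT[] units; // code units written so far

    /// Appends one code unit.
    void put(CharT c)
    {
        units ~= c;
    }

    /// Number of bytes the stored code units occupy.
    size_t byteLength() const
    {
        return units.length * CharT.sizeof;
    }
}
```

With this shape, `CharStream!char` stores one byte per code unit, `CharStream!wchar` two, and `CharStream!dchar` four, at the cost of one instantiation per encoding.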

August 03, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to antiAlias

In article <cep4dd$1nde$1@digitaldaemon.com>, antiAlias says...
>
>What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).

Yup.  I've played around with Mango and kind of like it.  One of the reasons I started these stream mods was to have an alternate design to compare to Mango for the sake of discussion.  ie. I don't want folks to settle on Mango simply because the other choices are missing features.

>I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?

Definitely.


Sean

August 03, 2004

Re: Streams and encoding

Posted by Walter
in reply to Sean Kelly

"Sean Kelly" <sean@f4.ca> wrote in message news:ceopfj$1hcl$1@digitaldaemon.com...
> This has the benefit of allowing the user to read and write any char type with a
> single instantiation, but requires greater complexity in the Stream class and in
> the Derived class.  And I wonder if such flexibility is truly necessary.
>
> Any other design possibilities?  Preferences?  I'm really trying to establish a
> good formatted IO design rather than work out the perfect stream API.  Any other
> weird issues would be welcome also.


I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions.

Also, formats like UTF-16 have two variants, big-endian and little-endian.

It should also be able to read data in other formats, such as code pages, and convert them to UTF. These cannot be auto-detected.
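
One common way to auto-detect the UTF variants is to look for a byte order mark at the start of the data. The sketch below assumes that approach; it is not taken from the D compiler or any library in this thread, and `Encoding`, `Detected`, `hasPrefix`, and `detectBom` are hypothetical names. Data without a mark (including code-page text) comes back as `unknown`, so the caller would still have to set the format explicitly.

```d
/// Encodings this sketch can report (illustrative names).
enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

/// Result of sniffing: the encoding and how many mark bytes to skip.
struct Detected
{
    Encoding encoding;
    size_t bomLength;
}

/// True if `data` begins with the bytes in `prefix`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

/// Checks the start of `data` for a Unicode byte order mark.
Detected detectBom(const(ubyte)[] data)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    // The UTF-32 LE mark begins with the UTF-16 LE mark, so test it first.
    // (UTF-16 LE text starting with a NUL character is therefore ambiguous.)
    if (hasPrefix(data, utf32le)) return Detected(Encoding.utf32le, 4);
    if (hasPrefix(data, utf32be)) return Detected(Encoding.utf32be, 4);
    if (hasPrefix(data, utf8))    return Detected(Encoding.utf8, 3);
    if (hasPrefix(data, utf16le)) return Detected(Encoding.utf16le, 2);
    if (hasPrefix(data, utf16be)) return Detected(Encoding.utf16be, 2);
    return Detected(Encoding.unknown, 0);
}
```

The endianness variants fall out naturally here as separate enum members, which fits the suggestion of an enum that is either set by hand or filled in by detection.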

August 04, 2004

Re: Streams and encoding

Posted by parabolis
in reply to antiAlias

antiAlias wrote:

> What you good folks seem to be describing is pretty much how mango.io
> operates. All the questions raised so far are quietly handled by that
> library (even the separate input & output buffers, if you want that), so it
> might be worthwhile checking it out. It's also house-trained, documented,
> and has a raft of additional features that you selectively apply where
> appropriate (it's not all tragically intertwined).

I can't help but ask how it manages to do both input and output and still avoid multi-threading issues?

> As a bonus, there's a ton of functionality already built on top of mango.io,
> including http-server, servlet-engine, clustering, logging, local & remote
> object caching; even tossing remote D executable objects around a local
> network. The DSP project is also targeting Mango as a delivery mechanism.
> Check them out over at dsource.org.

I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the pdf format. Anyway I am impressed at the surface. I will take a look deeper within.

> I think it's great to have "competing" libraries under way, but at some
> point is it worth considering funneling efforts instead? Perhaps not?

On the note of competing libraries I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.

August 04, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis@softhome.net> wrote:
> Regan Heath wrote:
>> On Tue, 03 Aug 2004 16:21:05 -0400, parabolis <parabolis@softhome.net> wrote:
>>
>> <snip>
>>
>>> Here is the foundation of the stream library I imagine:
>>> ================================================================
>>> interface DataSink {
>>>      uint write( ubyte[] data, uint off = 0, uint len = 0);
>>> }
>>>
>>> interface DataSource {
>>>      uint read( inout ubyte[] data, uint off = 0, uint len = 0);
>>>      ulong seek( ulong size );
>>> }
>>> ================================================================
>>
>>
>> I think you need functions in the form:
>>
>>   ulong write(void* data, ulong len = 0, ulong off = 0);
>>
>> notice I have changed ubyte[] to void*, changed the order of the last two parameters and changed uint into ulong.
>>
>> If you use ubyte[] you don't need len or off as you can call with:
>>   ubyte[] big = "regan was here";
>>   write(big[6..9]);
>> to achieve both.
>
> I will concede the order was wrong. However I believe the slicing will need to create another array wrapper in memory which is then going to have to be GCed.

So.. ?

> The len and off parameters allow a caller to take either approach.

Yeah.. we have default parameters, we can provide both options at no cost, so why not.

>> The void* allows easy specialised write functions, eg.
>>   bool write(int x) { return write(&x, x.sizeof) == x.sizeof; }
>
> The void* is a pointer with no associated type.

Correct.

> The arrays in D are infinitely better than void* pointers because arrays have extra information.

Incorrect. D arrays are better for some things, those that need/want the extra information.

Let's ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?

> As I said earlier in my post the behavior of providing data in a particular non-byte format should be done elsewhere in a single DataXXStream.

Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?

>>> The data being read/written by native interface classes:
>>> ================================================================
>>> FileInputStream : DataSource
>>> FileOutputStream : DataSink
>>> SocketInputStream : DataSource
>>> SocketOutputStream : DataSink
>>> MMapInputStream : DataSource
>>> MMapOutputStream : DataSink
>>> ================================================================
>>>
>>> The data is then manipulated providing buffering, digesting, en/de-cryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
>>
>>
>> I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.
>
> I made an argument that I believe input and output should be clearly separated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.

Sure, wanting to do this does not stop you using bolt-ins.

I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.
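
A minimal sketch of that split, in present-day D syntax and under stated assumptions: the `DataSink` and `DataSource` names come from the quoted proposal, but the simplified signatures (a plain slice, no `off`/`len`, no `seek`) and the `MemorySink`/`MemorySource` classes are hypothetical and only serve to show that neither side needs the other's state.

```d
/// Output side only: accepts bytes and reports how many were taken.
interface DataSink
{
    size_t write(const(ubyte)[] data);
}

/// Input side only: fills `buffer` and reports how many bytes were read.
interface DataSource
{
    size_t read(ubyte[] buffer);
}

/// Hypothetical in-memory sink; holds no input-related state.
final class MemorySink : DataSink
{
    ubyte[] stored;

    size_t write(const(ubyte)[] data)
    {
        stored ~= data;
        return data.length;
    }
}

/// Hypothetical in-memory source; holds no output-related state.
final class MemorySource : DataSource
{
    private const(ubyte)[] remaining;

    this(const(ubyte)[] data)
    {
        remaining = data;
    }

    size_t read(ubyte[] buffer)
    {
        // Copy as much as fits, then advance past what was consumed.
        immutable n = buffer.length < remaining.length
            ? buffer.length : remaining.length;
        buffer[0 .. n] = remaining[0 .. n];
        remaining = remaining[n .. $];
        return n;
    }
}
```

Because the caller passes a slice, `off` and `len` reduce to `data[off .. off + len]` at the call site.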

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

August 04, 2004

Re: Streams and encoding

Posted by antiAlias
in reply to parabolis

The primes.d thing is now a distant and foggy memory :-)

Can I hook you up with a copy of the latest (much better, with annotated source) documentation? You'll see Primes.d is gone, along with some other warts: http://svn.dsource.org/svn/projects/mango/downloads/mango_beta_9-2_doc.zip


"parabolis" <parabolis@softhome.net> wrote in message news:cep9ee$1ov1$1@digitaldaemon.com...
> antiAlias wrote:
>
> > What you good folks seem to be describing is pretty much how mango.io operates. All the questions raised so far are quietly handled by that library (even the separate input & output buffers, if you want that), so it might be worthwhile checking it out. It's also house-trained, documented, and has a raft of additional features that you selectively apply where appropriate (it's not all tragically intertwined).
>
> I can't help but ask how it manages to do both input and output and still avoid multi-threading issues?
>
> > As a bonus, there's a ton of functionality already built on top of mango.io, including http-server, servlet-engine, clustering, logging, local & remote object caching; even tossing remote D executable objects around a local network. The DSP project is also targeting Mango as a delivery mechanism. Check them out over at dsource.org.
>
> I have only started looking over the library. It is rather extensive. The source is well documented and organized. Both are rare to see. I am not fond of the pdf format. Anyway I am impressed at the surface. I will take a look deeper within.
>
> > I think it's great to have "competing" libraries under way, but at some point is it worth considering funneling efforts instead? Perhaps not?
>
> On the note of competing libraries I could not help but notice your primes.d implementation. You might want to look at the primes.d on Deimos and consider using that instead. It is rather cleverly designed and could be tuned to do no worse than your bsearch for all ushort values.

August 04, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Regan Heath

Regan Heath wrote:

> On Tue, 03 Aug 2004 18:02:55 -0400, parabolis <parabolis@softhome.net> 
> 
>> The arrays in D are infinitely better than void* pointers because arrays have extra information.
> 
> 
> Incorrect. D arrays are better for some things, those that need/want the extra information.

Here I must argue that where C went really wrong was with char*, which allows buffer overruns because you do not know how long the buffer is...

I also do not see how you could have used slicing and a void*. How would you know when to stop reading before you had off and len?

> Lets ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?
> 

I will do both at the same time... (read on)

> Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?

I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P

One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info built in. Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
================================================================
int readInt( ubyte buf, uint off = 0 ) {
    if( buf.length < off+4 )  // exactly off+4 bytes is enough for one int
        throw new Error( "Buffer overrun" );
    uint result = buf[off+0];
    result |= (cast(int)(buf[off+1])) << 8;
    result |= (cast(int)(buf[off+2])) << 16;
    result |= (cast(int)(buf[off+3])) << 24;
    return result;
}
================================================================

> 
>>>> The data being read/written by native interface classes:
>>>> ================================================================
>>>> FileInputStream : DataSource
>>>> FileOutputStream : DataSink
>>>> SocketInputStream : DataSource
>>>> SocketOutputStream : DataSink
>>>> MMapInputStream : DataSource
>>>> MMapOutputStream : DataSink
>>>> ================================================================
>>>>
>>>> The data is then manipulated providing buffering, digesting, en/de-cryption and [de]compression, etc. Finally it is possible to write interpreters for the data such as TGA, JPEG, etc...
>>>
>>>
>>>
>>> I think using template bolt-ins for this step is a great idea, for example you simply write a File, Socket, MMap class which implements the methods in the two interfaces above, then bolt them into your stream class which defines all the other stream operations.
>>
>>
>> I made an argument that I believe input and output should be clearly separated which is my answer to why anything should not implement both. Until someone convinces me otherwise I do not see how a single class can implement both and be thread friendly without internally keeping all input related variables separate from output related variables. If it is not possible to share input and output variables then the class can be factored into two smaller classes that are less prone to bugs.
> 
> 
> Sure, wanting to do this does not stop you using bolt-ins.
> 
> I just have to split my Stream bolt-in into InputStream and OutputStream, in fact, I think I will, as I agree with your reasoning.

I am glad to hear you decided to split them. I think you will find it makes life simpler.

I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)

August 04, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Tue, 03 Aug 2004 21:41:51 -0400, parabolis <parabolis@softhome.net> wrote:
> Regan Heath wrote:
>
>> Incorrect. D arrays are better for some things, those that need/want the extra information.
>
> Here I must argue that where C went really wrong was with char*, which allows buffer overruns because you do not know how long the buffer is...

> I also do not see how you could have used slicing and a void*.

I didn't/don't use slicing. I think you may be confusing two different points I made.

My first point was that off and len were not required because you can slice into a ubyte[]. So _if_ you use ubyte[] you don't _need_ off and len.

My second point was that instead of ubyte[] you should use void* for convenience. If you use void* you definitely need len.

> How would you know when to stop reading before you had off and len?

I have always had len, my fn prototype is:
  ulong write(void* address, ulong length);

which simply writes length bytes starting at address.

>> Lets ignore our opinions on the use of void* for now, can you write the write(int x) function above as easily if you do not use void* but use ubyte[] instead?
>>
>
> I will do both at the same time... (read on)

both? .. on I read ..

>> Sure, and when/where you provide it, what will it look like if the underlying write operation takes a ubyte[] and not a void*? is it possible? is it worse than simply using a void*?
>
> I am more concerned with the fact that a ubyte[] should help guard against the char* buffer overruns that created a huge security industry. In fact I suspect that you might be somebody from NAV or McAfee and are here only to ensure security holes remain rampant... :P
>
> One of the biggest breakthroughs Java made was in the area of security. Part of this breakthrough was a result of their eliminating that nasty char* and using arrays with length info built in. Having said that... Of course it is possible to read an int/long/real/whatever from a byte buffer. Moreover you can test to see if something went wrong in the buffer because you know how long it is...
> ================================================================
> int readInt( ubyte buf, uint off = 0 ) {

Typo, you missed the [], I have added them below.

> int readInt( ubyte[] buf, uint off = 0 ) {
>      if( buf.length < off+4 )
>          throw new Error( "Buffer overrun" );
>      uint result = buf[off+0];
>      result |= (cast(int)(buf[off+1])) << 8;
>      result |= (cast(int)(buf[off+2])) << 16;
>      result |= (cast(int)(buf[off+3])) << 24;
>      return result;
> }
> ================================================================

And this is supposed to be nicer/easier/more efficient than..

bool readInt(out int x) {
  if (read(&x,x.sizeof) != x.sizeof)
    throw new Exception("Out of data");
  return true;
}

As you can see using void* allows very convenient and totally buffer overrun safe code.
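
For comparison, the two positions are not mutually exclusive in D: a pointer to a value can be turned into a bounded byte slice, so the primitive can take a ubyte[] while the typed helpers stay one-liners. The sketch below assumes an in-memory buffer and host byte order; `ByteBuffer`, `writeInt`, and this `readInt` are hypothetical names, not code from either poster's library.

```d
/// Hypothetical in-memory buffer whose primitives take slices, not void*.
struct ByteBuffer
{
    ubyte[] data; // everything written so far
    size_t pos;   // read position

    size_t write(const(ubyte)[] bytes)
    {
        data ~= bytes;
        return bytes.length;
    }

    size_t read(ubyte[] dest)
    {
        immutable avail = data.length - pos;
        immutable n = dest.length < avail ? dest.length : avail;
        dest[0 .. n] = data[pos .. pos + n];
        pos += n;
        return n;
    }

    /// Writes the int's bytes in host order.
    void writeInt(int x)
    {
        // View the int's own storage as a slice of exactly x.sizeof bytes.
        write((cast(const(ubyte)*) &x)[0 .. x.sizeof]);
    }

    /// Reads one int, or throws if fewer than int.sizeof bytes remain.
    int readInt()
    {
        int x;
        if (read((cast(ubyte*) &x)[0 .. x.sizeof]) != x.sizeof)
            throw new Exception("Out of data");
        return x;
    }
}
```

The cast still has to be written by hand once per helper, but the length travels with the slice into `read` and `write`, so the primitive itself never sees an unbounded pointer.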

<snip>

> I am glad to hear you decided to split them. I think you will find it makes life simpler.
>
> I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)

You mean the problem you see with threads and shared buffers?

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
