August 04, 2004
Sean Kelly wrote:

> In article <cepsao$1vbo$1@digitaldaemon.com>, Andy Friesen says...
> 
>>
>>D arrays are the same way.  Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)
> 
> 
> Not sure I agree in this case.
> 
> # void read( void* addr, size_t size );
> # void read( ubyte[] val );
> #
> # int x;
> # read( &x, x.sizeof );
> # read( cast(ubyte[]) &x[0..x.sizeof] );
> 
> Both instances of the above code require the programmer to be a bit evil about
> how they specify access to a range of memory.  To me, the void* call just looks
> cleaner and less confusing while being no more prone to user error (in fact
> possibly less, as the calling syntax is simpler).

I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to the function.
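For instance (a minimal sketch, reusing the read(ubyte[]) declaration from the example), slicing the pointer directly yields a well-formed array:

# void read( ubyte[] val ) { /* ...fill val from the stream... */ }
#
# void main()
# {
#     int x;
#     // slicing the pointer gives a ubyte[] whose .ptr is &x and whose
#     // .length is x.sizeof -- a valid array over the bytes of x
#     read( (cast(ubyte*) &x)[0 .. x.sizeof] );
# }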
August 04, 2004
Sean Kelly wrote:

> In article <cepsao$1vbo$1@digitaldaemon.com>, Andy Friesen says...
> 
>>
>>D arrays are the same way.  Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)
> 
> 
> Not sure I agree in this case.
> 
> # void read( void* addr, size_t size );
> # void read( ubyte[] val );
> #
> # int x;
> # read( &x, x.sizeof );
> # read( cast(ubyte[]) &x[0..x.sizeof] );
> 
> Both instances of the above code require the programmer to be a bit evil about
> how they specify access to a range of memory.  To me, the void* call just looks
> cleaner and less confusing while being no more prone to user error (in fact
> possibly less, as the calling syntax is simpler).

I changed my mind.  You're right. :)

Getting an invalid array is hard, except when you start slicing pointers, at which point it becomes a bit too easy.
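To make that concrete (a sketch; note that nothing below is checked against the actual size of x):

# void main()
# {
#     int x;
#     ubyte* p = cast(ubyte*) &x;
#
#     ubyte[] ok  = p[0 .. x.sizeof];  // valid: exactly the bytes of x
#     ubyte[] bad = p[0 .. 1000];      // compiles happily, runs far past x
# }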

 -- andy
August 04, 2004
In article <ceqv9a$15b$1@digitaldaemon.com>, Sean Kelly says...
>
>That reminds me.  Which format does the code in utf.d use?

To be honest, I don't understand the question.


>I'm thinking I may
>do something like this for encoding for now:
>
>enum Format {
>    UTF8    = 0,
>    UTF16   = 1,
>    UTF16LE = 1,
>    UTF16BE = 2
>}
>
>So "UTF-16" would actually default to one of the two methods.

Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map a name to a number? (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)


Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order? Is there a big endian version? Is there a network byte order version?

Should there be?

Jill


August 04, 2004
In article <cer4k8$7jj$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <ceqv9a$15b$1@digitaldaemon.com>, Sean Kelly says...
>>
>>That reminds me.  Which format does the code in utf.d use?
>
>To be honest, I don't understand the question.

std.utf has methods like toUTF16.  But does this target the big or little endian encoding scheme?  I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.

>>I'm thinking I may
>>do something like this for encoding for now:
>>
>>enum Format {
>>    UTF8    = 0,
>>    UTF16   = 1,
>>    UTF16LE = 1,
>>    UTF16BE = 2
>>}
>>
>>So "UTF-16" would actually default to one of the two methods.
>
>Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map a name to a number? (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)

This raises an interesting question.  Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class.  I can't imagine coding a base lib to support "Joe's custom encoding scheme."  For the moment though, I think I'll leave stream.d as-is.  This seems like a design issue that will take a bit of talk to get right.

>Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order? Is there a big endian version? Is there a network byte order version?

Not currently.  This corresponds to the C++ design: unformatted IO is assumed to be in the byte order of the host platform.

>Should there be?

Probably.  Or at least one that converts to/from network byte order.  I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc.


Sean


August 04, 2004
> >Whatever works, works. But I'd make the enum private. Encodings should be universally known by their IANA registered name, otherwise how can you map a name to a number? (For example, you encounter an XML file which declares its own encoding to be "X-ARCANE-JILLS-CUSTOM-ENCODING" - how do you turn that into an enum?)
>
> This raises an interesting question.  Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class.  I can't imagine coding a base lib to support "Joe's custom encoding scheme."  For the moment though, I think I'll leave stream.d as-is.  This seems like a design issue that will take a bit of talk to get right.

I wonder if delegates could help out here. Instead of subclasses or wrapping a stream in another stream the primary Stream class could have a delegate to sort out big/little endian or encoding issues. I'm not exactly sure how it would work but it's worth investigating. There might be issues with sharing data between the stream and the encoder/decoder delegate.
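Something along these lines, perhaps (a sketch only -- all the names are invented, and Stream stands in for whatever the base stream class ends up being):

# alias void delegate(void[] block) Filter;  // post-read fixup hook
#
# class FilteredStream
# {
#     Stream host;    // the underlying stream
#     Filter fixup;   // e.g. a byteswapper, or a decoding step
#
#     this(Stream host, Filter fixup) { this.host = host; this.fixup = fixup; }
#
#     size_t read(void* addr, size_t len)
#     {
#         size_t got = host.read(addr, len);
#         if (fixup)
#             fixup((cast(ubyte*) addr)[0 .. got]);  // delegate rewrites the bytes just read
#         return got;
#     }
# }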

> >Got an unrelated question for you. In the stream function void read(out int), there is an assumption that the bytes will be embedded in the stream in little-endian order. Should applications assume (a) it's always little endian, regardless of host architecture, or (b) it's always host-byte order? Is there a big endian version? Is there a network byte order version?
>
> Not currently.  This corresponds to the C++ design: unformatted IO is assumed to be in the byte order of the host platform.
>
> >Should there be?
>
> Probably.  Or at least one that converts to/from network byte order.  I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc.
>
> Sean


August 04, 2004
In article <cer7fh$9t5$1@digitaldaemon.com>, Sean Kelly says...
>
>std.utf has methods like toUTF16.  But does this target the big or little endian encoding scheme?  I suppose I could assume it corresponds to the byte order of the target machine, but this would imply different behavior on different platforms.

Neither, really. toUTF16 returns an array of wchars, not an array of chars, so (conceptually) there is no byte-order issue involved. A wchar is (conceptually) a sixteen bit wide value, with bit 0 being the low order bit, and bit 15 being the high order bit. Byte ordering doesn't come into it.

Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.).

If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertently) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?
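For example (just a sketch, and the function name is made up), such a function could commit to a byte order explicitly instead of inheriting whatever the host uses:

# // Serialize a wchar[] as UTF-16LE bytes, identically on any host.
# ubyte[] toUTF16LEBytes(wchar[] s)
# {
#     ubyte[] buf = new ubyte[s.length * 2];
#     foreach (i, c; s)
#     {
#         buf[i * 2]     = cast(ubyte)(c & 0xFF);  // low byte first
#         buf[i * 2 + 1] = cast(ubyte)(c >> 8);    // then high byte
#     }
#     return buf;
# }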



>This raises an interesting question.  Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class.  I can't imagine coding a base lib to support "Joe's custom encoding scheme."  For the moment though, I think I'll leave stream.d as-is.  This seems like a design issue that will take a bit of talk to get right.

Right. Someone writing an application ought to be able to make their own transcoder (extending a library-defined base class; implementing a library-defined interface; whatever). Let's say that (in an application, not a library) I define classes JoesCustomReader and JoesCustomWriter. Now, I should still be able to do:

#    Stream s = new Reader(underlyingCharFilter, "X-JOES-CUSTOM-ENCODING");

and read the file. If a reader needs to be identified by a globally unique enum, then I can't do that without the possibility of an enum value clash. But if, on the other hand, they are identified by a string, then the possibility of a clash becomes vanishingly small.

I do agree with you that registration of readers/writers and the dispatching mechanism is something best left until later, however.


Jill


August 04, 2004
In article <ceraen$c48$1@digitaldaemon.com>, Arcane Jill says...
>
>Problems occur, however, when a wchar or a dchar leaves the nice safe environment of D and heads out into a stream. Only then does byte ordering become an issue (as it does also with arrays of ints, etc.).

Bah.  Of course.  So the two UTF schemes just depend on the byte order when serialized.  Makes sense.

>If you cast a wchar[] (or an int[], etc.) to a void[], then the bytes of data don't change, only the reference has a different type. In practice, this means you have (inadvertantly) applied a host-byte-order encoding to the array. There doesn't seem to be much that a stream can do about this, so, I reckon the problem here lies not with the stream, but with the cast. In short, a cast is not the most architecture-independent way to convert an arbitrary array into a void[]. Maybe some new functions could be written to implement this?

I think byte order should be specified, perhaps as a quality of the stream.  It could default to native and perhaps be switchable?  The only other catch I see is that a console stream should probably ignore this setting and always leave everything in native format.  In any case, this byte order would affect encoding schemes using > 1 byte characters and perhaps a new set of unformatted IO methods as well.  Again something I'm going to ignore for now as it's more complexity than we need quite yet.
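Roughly what I have in mind (a sketch only; the names are provisional, and it assumes std.intrinsic's bswap is available):

# import std.intrinsic;  // for bswap
#
# enum Endian { Native, Little, Big }
#
# class MyStream
# {
#     Endian byteOrder = Endian.Native;  // default: host order, as now
#
#     void write(int v)
#     {
#         version (LittleEndian)
#             bool swap = (byteOrder == Endian.Big);
#         else
#             bool swap = (byteOrder == Endian.Little);
#
#         if (swap)
#             v = bswap(v);  // reverse the four bytes before output
#         writeRaw(&v, v.sizeof);
#     }
#
#     void writeRaw(void* p, size_t n) { /* unformatted output goes here */ }
# }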


Sean


August 04, 2004
In article <cerbsa$d02$1@digitaldaemon.com>, Sean Kelly says...

>>In short, a cast is
>>not the most architecture-independent way to convert an arbitrary array into a
>>void[]. Maybe some new functions could be written to implement this?

>I think byte order should be specified, perhaps as a quality of the stream.  It could default to native and perhaps be switchable?

Well, from one point of view, the problem we've got here is serialization. How do you serialize an array of primitive types having sizeof > 1? This boils down to a simpler question: how do you serialize a single primitive with sizeof > 1? Let's cut to a clear example - how do you serialize an int?

std.stream.Stream.write(int) serializes in little-endian order. But the specs say "Outside of byte, ubyte, and char, the format is implementation-specific and should only be used in conjunction with read." I think this is scary. Perhaps it would be better for a stream to /mandate/ the order. As you suggest, it could be a property of the stream, but there are disadvantages to that - if you chain a whole bunch of streams together, each with different endianness, you could end up with a lot of byteswapping going on. Another possibility might be to ditch the function write(int), and replace it with two functions, writeBE(int) and writeLE(int), (and similarly with all other primitive types). That would be absolutely guaranteed to be platform independent.
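Sketched out (with writeExact standing in for the raw byte-output primitive; built from shifts, so the same bytes come out on any host):

# import std.stream;
#
# void writeLE(Stream s, int v)
# {
#     ubyte[4] b;
#     b[0] = cast(ubyte) v;           // least significant byte first
#     b[1] = cast(ubyte)(v >> 8);
#     b[2] = cast(ubyte)(v >> 16);
#     b[3] = cast(ubyte)(v >> 24);
#     s.writeExact(b.ptr, b.length);
# }
#
# void writeBE(Stream s, int v)
# {
#     ubyte[4] b;
#     b[0] = cast(ubyte)(v >> 24);    // most significant byte first
#     b[1] = cast(ubyte)(v >> 16);
#     b[2] = cast(ubyte)(v >> 8);
#     b[3] = cast(ubyte) v;
#     s.writeExact(b.ptr, b.length);
# }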

Of course that applies to wchar and dchar too, but the whole point of encodings (well, /one/ of the points of encodings anyway) is that you never have to spit out anything other than a stream of /bytes/. The encoding itself determines the byte order. There really is no such encoding as "UTF-16" (although calling wchar[]s UTF-16 does make sense). As far as actual encodings are concerned, the name "UTF-16" is just a shorthand way of saying "either UTF-16LE or UTF-16BE". When reading, you have to auto-detect between them, but once you've /established/ the encoding, then you rewind the stream and start reading it again with the now known encoding. When writing, you get to choose, arbitrarily (so you would probably choose native byte order), but you can make it easier for subsequent readers to auto-detect by writing a BOM at the start of the stream.

How does this affect users' code? Well, you simply don't allow anyone to write

#    Reader s = new UTF16Reader(underlyingStream)

(i.e. you define no such class). Instead, give them a factory method. Make them write:

#    Reader s = createUTF16Reader(underlyingStream)

or even

#    Reader s = createReader(underlyingStream, "UTF-16");

(but we said we wouldn't talk about dispatching yet, so let's stick with createUTF16Reader() to keep things simple)

The function createUTF16Reader() reads the underlying stream, auto-detects between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a UTF16BEReader, and returns it. Somehow it needs a method of pushing back the characters it's already read into the stream. Then, when the caller calls s.read(), the exact encoding is known, and the stream is (re)read from the start.
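In code, something like this (UTF16LEReader/UTF16BEReader as above; assuming a seekable underlying stream, so the BOM bytes can be pushed back with a seek):

# Reader createUTF16Reader(Stream underlying)
# {
#     ubyte[2] bom;
#     underlying.readExact(bom.ptr, 2);
#
#     if (bom[0] == 0xFF && bom[1] == 0xFE)
#         return new UTF16LEReader(underlying);  // BOM says little-endian
#     if (bom[0] == 0xFE && bom[1] == 0xFF)
#         return new UTF16BEReader(underlying);  // BOM says big-endian
#
#     // No BOM: RFC 2781 says interpret as big-endian.  Push the two
#     // bytes back so the reader sees them as ordinary data.
#     underlying.seek(-2, SeekPos.Current);
#     return new UTF16BEReader(underlying);
# }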



>The only other catch I see
>is that a console stream should probably ignore this setting and always leave
>everything in native format.

Maybe writeLE() and writeBE() could be supplemented by writeNative(), with the warning that it's no longer cross-platform? (Of course, the function write() does that right now, but calling it writeNative() would give you a clue that you were doing something a bit parochial).


>In any case, this byte order would affect encoding
>schemes using > 1 byte characters and perhaps a new set of unformatted IO
>methods as well.

I don't think it would affect encodings at all, only the serialization of primitive types other than byte, ubyte and char. Transcoders, as I said, read or write /bytes/ to or from an underlying stream (but have dchar read() and/or void write(dchar) methods for callers to use).


>Again something I'm going to ignore for now as it's more complexity than we need quite yet.

Righty ho. I vaguely remember Hauke saying he was working on a class to do something about transcoding issues, but I don't know the specifics.

Arcane Jill



August 04, 2004
On Wed, 4 Aug 2004 16:58:48 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:

<snip>

> Got an unrelated question for you. In the stream function void read(out int),
> there is an assumption that the bytes will be embedded in the stream in
> little-endian order. Should applications assume (a) it's always little endian,
> regardless of host architecture, or (b) it's always host-byte order? Is there a
> big endian version? Is there a network byte order version?
>
> Should there be?

I think we go with (b).

I think it is best handled with a filter. eg.

Stream s = new BigEndian(new FileStream("test.dat",FileMode.READ));

so BigEndian looks like:

#class BigEndian : Stream {
#  private Stream host;   // the stream being filtered
#
#  this(Stream host) { this.host = host; }
#
#  ulong read(void* address, ulong length) {
#    ulong got = host.read(address, length);
#    version(LittleEndian) {
#      //on a little endian system we convert (byteswap the bytes just read).
#    }
#    else {
#      //no conversion is required.
#    }
#    return got;
#  }
#}

You'll need a LittleEndian one too.
Using the filter you can guarantee the endian-ness of the data.


Of course, if you're sending binary data from a LE to a BE system via sockets, you need to know what you're doing, and you need to decide what endian-ness will be used for the transmission; in this case, one end of the socket will need a toBigEndian/toLittleEndian filter.
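e.g. (SocketStream is hypothetical here; the point is just that both ends wrap the socket in the same filter):

# // receiving end; both peers agreed on big endian for the wire
# Stream s = new BigEndian(new SocketStream(sock));
# int x;
# s.read(&x, x.sizeof);  // x arrives in host order on LE and BE machines alike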

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 04, 2004
On Wed, 04 Aug 2004 11:55:43 -0400, parabolis <parabolis@softhome.net> wrote:

> Sean Kelly wrote:
>
>> In article <cepsao$1vbo$1@digitaldaemon.com>, Andy Friesen says...
>>
>>>
>>> D arrays are the same way.  Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)
>>
>>
>> Not sure I agree in this case.
>>
>> # void read( void* addr, size_t size );
>> # void read( ubyte[] val );
>> #
>> # int x;
>> # read( &x, x.sizeof );
>> # read( cast(ubyte[]) &x[0..x.sizeof] );
>>
>> Both instances of the above code require the programmer to be a bit evil about
>> how they specify access to a range of memory.  To me, the void* call just looks
>> cleaner and less confusing while being no more prone to user error (in fact
>> possibly less, as the calling syntax is simpler).
>
> I am pretty sure the second read in your example parses by treating the address of x as a ubyte array and then slicing into it, which creates a valid ubyte[] array to pass to the function.

It's not guaranteed to be valid. Replace x.sizeof with 1000 and it's an invalid ubyte[] array.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/