Streams and encoding (page 6) - D Programming Language Discussion Forum

August 04, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Wed, 04 Aug 2004 11:37:05 -0400, parabolis <parabolis@softhome.net> wrote:
> Regan Heath wrote:
>
>> On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis@softhome.net> wrote:
>>
>>> Regan Heath wrote:
>>>
>>>> On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis@softhome.net> wrote:
>>>>
>>>> So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...
>>>>
>>>> void badBuggyRead(out char x)
>>>> {
>>>>     read(cast(ubyte[])(&x)[0..1000]);
>>>> }
>>>>
>>>> so even tho read uses a ubyte[] it can still overrun.
>>>
>>>
>>> You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.
>>
>>
>> But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:
>
> Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure then it is expected they will not do it in a way that causes a buffer overrun.
>
> Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]).

I don't think so.

> You fail to do that with void*.

I don't try. Because it's impossible.

<snip>

>> I am not going to alias all x possible combinations right now :)
>>
>
> So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:

Nope.

alias ImageStream!(CRCStream!(DecompressStream!(File))) CompressedImageCRC;
// my 'File' is buffered.

CompressedImageCRC f = new CompressedImageCRC();

or more likely 'CompressedImageCRC' will be replaced by a name that has context where I use it, if for example it was an image resource for a game it might be simply 'Image'
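
A minimal sketch of this kind of layering in present-day D, under stated assumptions: MemorySource, SumStream and CountStream are made-up stand-ins (not the File, CRCStream or ImageStream discussed here), each filter is assumed to inherit from the layer it wraps, and the alias uses the newer `alias Name = Type;` spelling.

```d
import std.stdio : writeln;

// Hypothetical bottom layer: hands out bytes from an in-memory array.
class MemorySource
{
    private const(ubyte)[] data;

    void open(const(ubyte)[] bytes) { data = bytes; }

    // Copies as many bytes as fit into `buf`; returns the number copied.
    size_t read(ubyte[] buf)
    {
        immutable n = buf.length < data.length ? buf.length : data.length;
        buf[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// Hypothetical filter: inherits from the layer it wraps and keeps a byte sum
// (a stand-in for a real CRC).
class SumStream(Base) : Base
{
    uint sum;

    override size_t read(ubyte[] buf)
    {
        immutable n = super.read(buf);
        foreach (b; buf[0 .. n])
            sum += b;
        return n;
    }
}

// Hypothetical second filter: counts the bytes that pass through it.
class CountStream(Base) : Base
{
    size_t count;

    override size_t read(ubyte[] buf)
    {
        immutable n = super.read(buf);
        count += n;
        return n;
    }
}

// The whole stack is named once; every later use is just `CountedSum`.
alias CountedSum = CountStream!(SumStream!MemorySource);

void main()
{
    ubyte[] input = [1, 2, 3, 4];
    auto s = new CountedSum();
    s.open(input);

    ubyte[8] buf;
    immutable n = s.read(buf[]);
    assert(n == 4 && s.count == 4 && s.sum == 10);
    writeln("read ", n, " bytes");
}
```

The alias is written once; each further instantiation costs a single short name, which is the trade-off being argued here.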

> ================================================================
> alias BufferedInputStream!(FileInputStream)
>      BufferedFileInputStream;
> alias DecompressionInputStream!(BufferedFileInputStream)
>      DecompressionBufferedFileInputStream;
> alias CRCInputStream!(DecompressionBufferedFileInputStream)
>      CRCDecompressionBufferedFileInputStream;
> alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
>      ImageCRCDecompressionBufferedFileInputStream;
>
> CRCInputStream crc_in = new
>      CRCDecompressionBufferedFileInputStream(filename);
> ImageInputStream iin = new
>      ImageCRCDecompressionBufferedFileInputStream(crc_in);
> ================================================================
> File - 10 times
> Buffered - 10 times
> Decompression - 8 times
> CRC - 7 times
> Image - 4 times
> ================================
>
> I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal:
> ================================================================
> CRCInputStream crc_in = new CRCInputStream
> (   new DecompressionInputStream
>      (   new BufferedInputStream
>          (  new FileInputStream( filename )
>          )
>      )
> );
> ImageInputStream iin = new ImageInputStream( crc_in );
> ================================================================
> File - 1 time
> Buffered - 1 time
> Decompression - 1 time
> CRC - 2 times
> Image - 2 times
> ================

Now instantiate it 10 times and give me a tally.

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

August 04, 2004

Re: Streams and encoding

Posted by Walter
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:ceq0mg$20d8$1@digitaldaemon.com...
> In article <cep6nb$1o72$1@digitaldaemon.com>, Walter says...
>
> >I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions.
>
> With all due respect, Walter, that's not really feasible. It is very hard, for example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not all encodings make life that easy for us. You can't use an enum, because there are an unlimited number of possible encodings.

I understand there are limits to this. I think it should be done where possible, and that it should not be precluded by design.

> Besides, if you're parsing an HTTP header, and if, within that header, you read "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you know what the encoding of the following document is going to be. Other formats have different indicators (HTML meta tags; Python source file comments; the list is endless). Only at the application level can you /really/ sort this out, because the application presumably knows what it's looking at.

Yes. And this argues for a capability to switch horses midstream, so to speak.
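
A minimal sketch of "an enum that can be specifically set or can be auto-detected", assuming detection is limited to byte order marks. The names Encoding, detectBom and TextSource are invented for illustration and are not part of any existing library.

```d
import std.stdio : writeln;

// Hypothetical encoding tag; `unknown` means "not set, try to detect".
enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

static immutable ubyte[] bomUtf32le = [0xFF, 0xFE, 0x00, 0x00];
static immutable ubyte[] bomUtf32be = [0x00, 0x00, 0xFE, 0xFF];
static immutable ubyte[] bomUtf8    = [0xEF, 0xBB, 0xBF];
static immutable ubyte[] bomUtf16le = [0xFF, 0xFE];
static immutable ubyte[] bomUtf16be = [0xFE, 0xFF];

bool hasPrefix(const(ubyte)[] data, const(ubyte)[] prefix)
{
    return data.length >= prefix.length && data[0 .. prefix.length] == prefix;
}

// Looks for a byte order mark at the start of `data`.
// Returns Encoding.unknown when none is found.
Encoding detectBom(const(ubyte)[] data)
{
    // UTF-32LE is tested before UTF-16LE because both begin with FF FE.
    if (hasPrefix(data, bomUtf32le)) return Encoding.utf32le;
    if (hasPrefix(data, bomUtf32be)) return Encoding.utf32be;
    if (hasPrefix(data, bomUtf8))    return Encoding.utf8;
    if (hasPrefix(data, bomUtf16le)) return Encoding.utf16le;
    if (hasPrefix(data, bomUtf16be)) return Encoding.utf16be;
    return Encoding.unknown;
}

struct TextSource
{
    const(ubyte)[] data;
    Encoding encoding = Encoding.unknown;  // may be set at any time by the application

    // An explicitly set encoding wins; otherwise fall back to detection.
    Encoding effectiveEncoding() const
    {
        return encoding != Encoding.unknown ? encoding : detectBom(data);
    }
}

void main()
{
    ubyte[] bytes = [0xFF, 0xFE, 0x68, 0x00];  // BOM followed by 'h' in UTF-16LE
    auto src = TextSource(bytes);
    assert(src.effectiveEncoding == Encoding.utf16le);

    // "Switching horses midstream": the application overrides the guess.
    src.encoding = Encoding.utf8;
    assert(src.effectiveEncoding == Encoding.utf8);
    writeln("ok");
}
```

Encodings without a signature (the ISO-8859 family, for instance) fall through to `unknown` in this sketch, which is where application-level knowledge would have to take over.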

August 05, 2004

Re: Streams and encoding

Posted by Carlos Santander B.
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cerk3u$i4f$1@digitaldaemon.com
| (so you would probably choose native byte order), but you can make it easier for
| subsequent readers to auto-detect by writing a BOM at the start of the stream.
|
| ...
|
| between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a
| UTF16BEReader, and returns it. Somehow it needs a method of pushing back the
| characters it's already read into the stream. Then, when the caller calls
| s.read(), the exact encoding is known, and the stream is (re)read from the
| start.
|


In the former case (the stream includes a BOM), would re-reading from the start include the BOM? If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?)

-----------------------
Carlos Santander Bernal

August 05, 2004

Re: Streams and encoding

Posted by Arcane Jill
in reply to Carlos Santander B.

In article <ces5mu$r8p$1@digitaldaemon.com>, Carlos Santander B. says...

>In the former case (the stream includes a BOM), would re-reading from the start
>include the BOM?

Good question. I guess probably not. If the encoding is known, then it's known. Since a BOM serves only to identify the encoding, you don't need to re-read it in this instance.

That said, it's still best that readers be prepared to ignore it. That is, if a reader reads U+FEFF as the first character, it would be harmless to throw that character away and return instead the second one.

Pretty much all BOM related questions are answered here: http://www.unicode.org/faq/utf_bom.html#BOM.



>If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?)

If you fail to discard a BOM, and accidentally treat it as a character, it will appear to your application as the character U+FEFF (ZERO WIDTH NO-BREAK SPACE). It will display as a zero-width space. It has a general category of Cf (which actually makes it a formatting control, not a space!). Basically, it tries as hard as it can to do nothing at all.

So it's useless to the "user who just wants to read the file" - useless, but harmless, most especially if you can recognise it and throw it away.
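
Recognising it and throwing it away might look like the following sketch, which assumes the text has already been decoded to UTF-32. dropLeadingBom is a name made up for the example.

```d
import std.stdio : writeln;

// Returns `text` without a leading U+FEFF, if there is one.
// Only the first code point is examined; a U+FEFF later in the text is kept.
dstring dropLeadingBom(dstring text)
{
    if (text.length > 0 && text[0] == '\uFEFF')
        return text[1 .. $];
    return text;
}

void main()
{
    dstring withBom = "\uFEFFhello"d;

    assert(dropLeadingBom(withBom) == "hello"d);   // BOM removed
    assert(dropLeadingBom("hello"d) == "hello"d);  // nothing to remove
    assert(dropLeadingBom(""d) == ""d);            // empty input is safe

    writeln(dropLeadingBom(withBom));
}
```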

Arcane Jill

August 06, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Andy Friesen

Andy Friesen wrote:

> Might I suggest that DataSources and DataSinks use void[]?
> 
> void[] knows how many bytes it points to and is sliceable.  Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[]. This suggests that at least some people using/writing functions with void[] parameters will do strange things. I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.

August 06, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis@softhome.net> wrote:
> Andy Friesen wrote:
>
>> Might I suggest that DataSources and DataSinks use void[]?
>>
>> void[] knows how many bytes it points to and is sliceable.  Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.
>
> This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

> However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

> The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].

But then you cannot address each of the 4 bytes of each int.

> This suggests that at least some people using/writing functions with void[] parameters will do strange things.

Have you used 'void' as a type before? I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.

> I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.
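
A rough sketch of the ease-of-use difference at the call site. writeAny, writeBytes and Header are hypothetical names for illustration; the functions only report how many bytes they were handed.

```d
import std.stdio : writeln;

// Hypothetical sink taking untyped data.
size_t writeAny(const(void)[] data) { return data.length; }

// The same sink declared with bytes.
size_t writeBytes(const(ubyte)[] data) { return data.length; }

// Hypothetical record, 8 bytes with this field layout.
struct Header { uint magic; ushort ver; ushort flags; }

void main()
{
    int[] values = [1, 2, 3, 4];
    Header h;

    // Any array converts to const(void)[] implicitly, so no cast is needed.
    assert(writeAny(values) == 16);
    assert(writeAny((&h)[0 .. 1]) == Header.sizeof);

    // With const(ubyte)[] every non-byte caller has to cast explicitly.
    assert(writeBytes(cast(const(ubyte)[]) values) == 16);
    assert(writeBytes(cast(const(ubyte)[]) (&h)[0 .. 1]) == Header.sizeof);

    writeln("ok");
}
```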

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

August 06, 2004

Re: Streams and encoding

Posted by Arcane Jill
in reply to parabolis

In article <ceulss$2fj6$1@digitaldaemon.com>, parabolis says...

>void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1.
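
The rule can be checked directly. A small sketch (the variable names are arbitrary):

```d
void main()
{
    int[] ints = [10, 20, 30, 40];

    // Implicit conversion: same memory, but .length is now counted in bytes.
    void[] raw = ints;

    static assert(void.sizeof == 1);
    assert(ints.length == 4);
    assert(raw.length == ints.length * int.sizeof);   // 4 * 4 = 16
    assert(raw.ptr == cast(void*) ints.ptr);          // no copy was made

    // Viewing the same bytes as ubyte[] keeps the byte count.
    ubyte[] bytes = cast(ubyte[]) raw;
    assert(bytes.length == 16);
}
```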

Jill

August 06, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Arcane Jill

Arcane Jill wrote:

> In article <ceulss$2fj6$1@digitaldaemon.com>, parabolis says...
> 
> 
>>void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].
> 
> 
> For all D types, the number of bytes occupied by a T[] of length N is (N *
> T.sizeof). This should have been your default assumption. void.sizeof is 1.

Sorry I meant from the docs

http://www.digitalmars.com/d/type.html:

    void  no type
     bit  single bit
    byte  signed 8 bits
   ubyte  unsigned 8 bits
   ....

August 06, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to parabolis

In article <cf0e4q$mqi$1@digitaldaemon.com>, parabolis says...
>
>
>Sorry I meant from the docs
>
>http://www.digitalmars.com/d/type.html:
>
>     void  no type
>      bit  single bit
>     byte  signed 8 bits
>    ubyte  unsigned 8 bits
>    ....

Probably an esoteric question, but I assume that the byte size guarantee is only for machines with the proper architecture?  Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...


Sean

August 06, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Regan Heath

Regan Heath wrote:

> On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis@softhome.net> wrote:
> 
>> Andy Friesen wrote:
>>
>>> Might I suggest that DataSources and DataSinks use void[]?
>>>
>>> void[] knows how many bytes it points to and is sliceable.  Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.
>>
>>
>> This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.
> 
> 
> I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught. Your argument is that there does not exist a program such that a bug will be caught (or that for all programs there is no program such that a bug is caught).

Assuming we have the function:
     read_bad(void*,uint len)
    read_good(ubyte[],uint len)

An excerpt from program P in which a bug is caught is as follows:
============================== P ==============================
    ubyte[256] ex;
     read_bad(ex,0xFFFF_FFFF); // memory overwritten
    read_good(ex,0xFFFF_FFFF); // exception thrown
================================================================

P contains a bug that is caught using an array parameter. The existence of P simultaneously proves my argument and disproves yours.
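
A minimal sketch of how the array version can catch the bug, assuming the check is done explicitly with std.exception.enforce. readGood is a hypothetical stand-in for read_good and fills the buffer with a dummy value instead of reading from a stream.

```d
import std.exception : assertThrown, enforce;

// Hypothetical stand-in for read_good: fills the first `len` bytes of `buf`.
// The array carries its own length, so the request can be validated.
void readGood(ubyte[] buf, size_t len)
{
    enforce(len <= buf.length, "requested length exceeds buffer");
    buf[0 .. len] = 0xAB;   // stand-in for data coming from the stream
}

void main()
{
    ubyte[256] ex;

    // A request that fits is honoured and touches only the requested bytes.
    readGood(ex[], 16);
    assert(ex[15] == 0xAB && ex[16] == 0);

    // An oversized request throws instead of overwriting memory.
    assertThrown(readGood(ex[], size_t.max));
}
```

A function that receives only a void* and a length has nothing to compare the length against, so the equivalent check cannot be written there.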

Yet we have had this discussion before and you seem to insist that since you can find examples where a bug is not caught my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using to me so I can explain my claim properly or you will have to use the one I am using. Here are some links to mine:

http://en.wikipedia.org/wiki/Logic
http://en.wikipedia.org/wiki/Predicate_logic
http://en.wikipedia.org/wiki/Universal_quantifier
http://en.wikipedia.org/wiki/Existential_quantifier

> 
>> However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length.
> 
> 
> That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundaries and knowing how the length and sizeof will work.

>> The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].
> 
> 
> But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.

> 
>> This suggests that at least some people using/writing functions with void[] parameters will do strange things.
> 
> 
> Have you used 'void' as a type before? I suspect only people who have

No I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++ but the size of any void* variables is of course the size of any pointer.

> not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me and provide some assurance that people who did not read the docs will have a chance of getting it right from the start.

>> I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
> 
> 
> I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but, void[] or void* give you ease of use that ubyte[] lacks.

No actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into a byte forces an arbitrary partition of general data that would not happen conceptually with void.

I just feel that using void[] lacks the ease of use you get with ubyte[].

Copyright © 1999-2021 by the D Language Foundation