August 04, 2004 Re: Streams and encoding

Posted in reply to parabolis

On Wed, 04 Aug 2004 11:37:05 -0400, parabolis <parabolis@softhome.net> wrote:
> Regan Heath wrote:
>
>> On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis@softhome.net> wrote:
>>
>>> Regan Heath wrote:
>>>
>>>> On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis@softhome.net> wrote:
>>>>
>>>> So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...
>>>>
>>>> void badBuggyRead(out char x)
>>>> {
>>>>     read(cast(ubyte[])(&x)[0..1000]);
>>>> }
>>>>
>>>> so even tho read uses a ubyte[] it can still overrun.
>>>
>>> You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.
>>
>> But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:
>
> Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure then it is expected they will not do it in a way that causes a buffer overrun.
>
> Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]).

I don't think so.

> You fail to do that with void*.

I don't try. Because it's impossible.

<snip>

>> I am not going to alias all x possible combinations right now :)
>
> So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:

Nope.

alias ImageStream!(CRCStream!(DecompressStream!(File))) CompressedImageCRC; // my 'File' is buffered.
CompressedImageCRC f = new CompressedImageCRC();

or more likely 'CompressedImageCRC' will be replaced by a name that has context where I use it. If, for example, it was an image resource for a game it might simply be 'Image'.

> ================================================================
> alias BufferedInputStream!(FileInputStream)
>     BufferedFileInputStream;
> alias DecompressionInputStream!(BufferedFileInputStream)
>     DecompressionBufferedFileInputStream;
> alias CRCInputStream!(DecompressionBufferedFileInputStream)
>     CRCDecompressionBufferedFileInputStream;
> alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
>     ImageCRCDecompressionBufferedFileInputStream;
>
> CRCInputStream crc_in = new
>     CRCDecompressionBufferedFileInputStream(filename);
> ImageInputStream iin = new
>     ImageCRCDecompressionBufferedFileInputStream(crc_in);
> ================================================================
> File - 10 times
> Buffered - 10 times
> Decompression - 8 times
> CRC - 7 times
> Image - 4 times
> ================================
>
> I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal:
> ================================================================
> CRCInputStream crc_in = new CRCInputStream
>     ( new DecompressionInputStream
>         ( new BufferedInputStream
>             ( new FileInputStream( filename )
>             )
>         )
>     );
> ImageInputStream iin = new ImageInputStream( crc_in );
> ================================================================
> File - 1 time
> Buffered - 1 time
> Decompression - 1 time
> CRC - 2 times
> Image - 2 times
> ================

Now instantiate it 10 times and give me a tally.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 04, 2004 Re: Streams and encoding

Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:ceq0mg$20d8$1@digitaldaemon.com...
> In article <cep6nb$1o72$1@digitaldaemon.com>, Walter says...
>
> >I'm one of those folks who is very much in favor of a file reader being able
> >to automatically detect the encoding in it. Hence, D can auto-detect the UTF
> >formatting. So, I'd recommend that the format be an enum that can be
> >specifically set or can be auto-detected. Different resulting behaviors can
> >be handled with virtual functions.
>
> With all due respect, Walter, that's not really feasible. It is very hard, for
> example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention
> ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not
> all encodings make life that easy for us. You can't use an enum, because there
> are an unlimited number of possible encodings.

I understand there are limits to this. I think it should be done where possible, and that it should not be precluded by design.

> Besides, if you're parsing an HTTP header, and if, within that header, you read
> "Content-Type: text/plain; encoding=MAC-ROMAN", then you can be pretty sure you
> know what the encoding of the following document is going to be. Other formats
> have different indicators (HTML meta tags; Python source file comments; -the
> list is endless). Only at the application level can you /really/ sort this out,
> because the application presumably knows what it's looking at.

Yes. And this argues for a capability to switch horses midstream, so to speak.
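For readers following the auto-detection argument, here is a minimal sketch of the "straightforward" case both posters agree on: telling the UTFs apart by their byte order mark. It is written for present-day D rather than the 2004 compiler, and the `Encoding` enum and `detectBom` function are hypothetical names, not part of any library discussed in the thread. The `unknown` result stands for the cases Jill describes, where only the application (an HTTP header, a meta tag) can say what the encoding is.

```d
/// Encodings that a byte order mark can identify, plus a fallback.
/// (Hypothetical names, chosen for this sketch only.)
enum Encoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

/// Guess the encoding from a leading BOM, if there is one.
Encoding detectBom(const(ubyte)[] data)
{
    // Test UTF-32 first: the UTF-32LE mark FF FE 00 00 begins with
    // the UTF-16LE mark FF FE.
    if (data.length >= 4)
    {
        if (data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return Encoding.utf32le;
        if (data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return Encoding.utf32be;
    }
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding.utf8;
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE) return Encoding.utf16le;
        if (data[0] == 0xFE && data[1] == 0xFF) return Encoding.utf16be;
    }
    // No BOM: the caller has to decide by other means.
    return Encoding.unknown;
}

void main()
{
    ubyte[] sample = [0xFE, 0xFF, 0x00, 0x41]; // UTF-16BE BOM followed by 'A'
    assert(detectBom(sample) == Encoding.utf16be);
}
```

Returning `unknown` instead of guessing keeps the door open for the caller to set the encoding explicitly, which is the "switch horses midstream" capability asked for above.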
August 05, 2004 Re: Streams and encoding

Posted in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cerk3u$i4f$1@digitaldaemon.com
| (so you would probably choose native byte order), but you can make it easier for
| subsequent readers to auto-detect by writing a BOM at the start of the stream.
|
| ...
|
| between UTF-16LE and UTF-16BE, and then constructs either a UTF16LEReader or a
| UTF16BEReader, and returns it. Somehow it needs a method of pushing back the
| characters it's already read into the stream. Then, when the caller calls
| s.read(), the exact encoding is known, and the stream is (re)read from the
| start.
|

In the former case (the stream includes a BOM), would re-reading from the start include the BOM? If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?)

-----------------------
Carlos Santander Bernal
August 05, 2004 Re: Streams and encoding

Posted in reply to Carlos Santander B.

In article <ces5mu$r8p$1@digitaldaemon.com>, Carlos Santander B. says...

>In the former case (the stream includes a BOM), would re-reading from the start
>include the BOM?

Good question. I guess probably not. If the encoding is known, then it's known. Since a BOM serves only to identify the encoding, you don't need to re-read it in this instance.

That said, it's still best that readers be prepared to ignore it. That is, if a reader reads U+FEFF as the first character, it would be harmless to throw that character away and return the second one instead.

Pretty much all BOM related questions are answered here: http://www.unicode.org/faq/utf_bom.html#BOM.

>If so, what good would it be for a user who just wants to read the file, independent of the encoding? (did I make myself clear?)

If you fail to discard a BOM, and accidentally treat it as a character, it will appear to your application as the character U+FEFF (ZERO WIDTH NO-BREAK SPACE). It will display as a zero-width space. It has a general category of Cf (which actually makes it a formatting control, not a space!). Basically, it tries as hard as it can to do nothing at all.

So it's useless to the "user who just wants to read the file" - useless, but harmless, most especially if you can recognise it and throw it away.

Arcane Jill
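The "throw that character away" advice can be sketched in a few lines of present-day D. This is only an illustration under stated assumptions: `skipBom` is a hypothetical helper, and it works on text that has already been decoded to a `dstring`, not on the raw stream the thread is designing.

```d
/// Drop a leading U+FEFF from already-decoded text and leave everything
/// else untouched. (Hypothetical helper, not a library function.)
dstring skipBom(dstring text)
{
    if (text.length > 0 && text[0] == '\uFEFF')
        return text[1 .. $];
    return text;
}

void main()
{
    assert(skipBom("\uFEFFhello"d) == "hello"d); // BOM removed
    assert(skipBom("hello"d) == "hello"d);       // nothing to remove
    assert(skipBom(""d) == ""d);                 // empty input is fine
}
```

Only the first character is examined, so a U+FEFF that appears later in the text is passed through unchanged.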
August 06, 2004 Re: Streams and encoding

Posted in reply to Andy Friesen

Andy Friesen wrote:
> Might I suggest that DataSources and DataSinks use void[]?
>
> void[] knows how many bytes it points to and is sliceable. Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.

This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[]. This suggests that at least some people using/writing functions with void[] parameters will do strange things. I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
August 06, 2004 Re: Streams and encoding

Posted in reply to parabolis

On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis@softhome.net> wrote:
> Andy Friesen wrote:
>
>> Might I suggest that DataSources and DataSinks use void[]?
>>
>> void[] knows how many bytes it points to and is sliceable. Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.
>
> This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.

I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

> However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length.

That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

> The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].

But then you cannot address each of the 4 bytes of each int.

> This suggests that at least some people using/writing functions with void[] parameters will do strange things.

Have you used 'void' as a type before? I suspect only people who have not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.

> I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.

I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but void[] or void* give you ease of use that ubyte[] lacks.

Regan

--
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
August 06, 2004 Re: Streams and encoding

Posted in reply to parabolis

In article <ceulss$2fj6$1@digitaldaemon.com>, parabolis says...

>void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].

For all D types, the number of bytes occupied by a T[] of length N is (N * T.sizeof). This should have been your default assumption. void.sizeof is 1.

Jill
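The N * T.sizeof rule is easy to check directly. The following is a small sketch in present-day D; `byteLength` is a hypothetical helper introduced only to make the conversion visible. Viewing an `int[]` as `void[]` does not copy anything, it only changes what `.length` counts.

```d
/// Number of bytes occupied by the elements of a slice, obtained by
/// viewing the slice as untyped memory. (Hypothetical helper.)
size_t byteLength(T)(const(T)[] arr)
{
    const(void)[] raw = arr; // implicit conversion, no copy
    return raw.length;       // counted in bytes, because void.sizeof == 1
}

void main()
{
    int[] numbers = [1, 2, 3, 4];

    static assert(void.sizeof == 1);
    assert(numbers.length == 4);                              // four ints
    assert(byteLength(numbers) == numbers.length * int.sizeof); // sixteen bytes

    // Converting back restores the element count.
    void[] raw = numbers;
    int[] again = cast(int[]) raw;
    assert(again.length == 4);
}
```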
August 06, 2004 Re: Streams and encoding

Posted in reply to Arcane Jill

Arcane Jill wrote:
> In article <ceulss$2fj6$1@digitaldaemon.com>, parabolis says...
>
>>void[] does a rather unexpected thing when it gives you a byte count in .length. The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].
>
> For all D types, the number of bytes occupied by a T[] of length N is (N *
> T.sizeof). This should have been your default assumption. void.sizeof is 1.

Sorry, I meant from the docs

http://www.digitalmars.com/d/type.html:

    void     no type
    bit      single bit
    byte     signed 8 bits
    ubyte    unsigned 8 bits
    ....
August 06, 2004 Re: Streams and encoding

Posted in reply to parabolis

In article <cf0e4q$mqi$1@digitaldaemon.com>, parabolis says...
>
>Sorry, I meant from the docs
>
>http://www.digitalmars.com/d/type.html:
>
>    void     no type
>    bit      single bit
>    byte     signed 8 bits
>    ubyte    unsigned 8 bits
>    ....

Probably an esoteric question, but I assume that the byte size guarantee is only for machines with the proper architecture? Not that I expect to see a D compiler for the very few machines that support strange byte sizes, just wondering...

Sean
August 06, 2004 Re: Streams and encoding

Posted in reply to Regan Heath

Regan Heath wrote:
> On Thu, 05 Aug 2004 21:11:58 -0400, parabolis <parabolis@softhome.net> wrote:
>
>> Andy Friesen wrote:
>>
>>> Might I suggest that DataSources and DataSinks use void[]?
>>>
>>> void[] knows how many bytes it points to and is sliceable. Whether or not void[] was created for this exact scenario is uncertain, but they are exceptionally well suited to the task regardless.
>>
>> This is a good suggestion because void /is/ a much better conceptual match for general data coming from or going to someplace than byte or int. It is also a good suggestion because using void[] gives you some assurance against buffer overruns.
>
> I still don't agree with the last bit, void[] gives no _assurance_ at all, neither does ubyte[] or any other [].

My argument is that there exists a program in which a bug will be caught. Your argument is that there does not exist a program such that a bug will be caught (or that for all programs there is no program such that a bug is caught).

Assuming we have the functions:

    read_bad(void*,uint len)
    read_good(ubyte[],uint len)

An excerpt from program P in which a bug is caught is as follows:

============================== P ==============================
ubyte ex[256];
read_bad(ex,0xFFFF_FFFF); // memory overwritten
read_good(ex,0xFFFF_FFFF); // exception thrown
================================================================

P contains a bug that is caught using an array parameter. The existence of P simultaneously proves my argument and disproves yours.

Yet we have had this discussion before and you seem to insist that since you can find examples where a bug is not caught my argument must be wrong somehow. I am not familiar with any logic in which such claims are expected. Either you will have to explain the logic system you are using to me so I can explain my claim properly or you will have to use the one I am using.

Here are some links to mine:
http://en.wikipedia.org/wiki/Logic
http://en.wikipedia.org/wiki/Predicate_logic
http://en.wikipedia.org/wiki/Universal_quantifier
http://en.wikipedia.org/wiki/Existential_quantifier

>> However I think the conceptual problems void[] introduces outweigh the benefits. void[] does a rather unexpected thing when it gives you a byte count in .length.
>
> That is what I assumed it would do. A void* is a pointer to 'something', the smallest addressable unit is a byte. As you do not know what 'something' is, you have to provide the ability to address the smallest addressable unit, i.e. a byte.

Wonderful guess. It is entirely more complicated than a ubyte[] being a partition of memory on 8-bit boundaries and knowing how the length and sizeof will work.

>> The default assumption would be (or at least my default assumption was) that the .length would be the same for an int[] being treated as a void[].
>
> But then you cannot address each of the 4 bytes of each int.

Yes that was exactly my point.

>> This suggests that at least some people using/writing functions with void[] parameters will do strange things.
>
> Have you used 'void' as a type before? I suspect only people who have

No, I have never used void as a type before. I have always been under the impression that "void varX;" is not a legal declaration/definition in C or C++. I have used void* frequently in C/C++ but the size of any void* variable is of course the size of any pointer.

> not used the concept before will get this wrong, and a simple line of documentation describing void[] will put them right.

Or using ubyte[] will write the documentation for me and provide some assurance that people who did not read the docs will have a chance of getting it right from the start.

>> I believe the ensuing confusion warrants using a ubyte[], which has behaviour that people will already understand.
>
> I agree ubyte[] is the 'right' type, the data itself is a bunch of unsigned bytes, but void[] or void* give you ease of use that ubyte[] lacks.

No, actually I have been saying void is 'right' because streaming data is only partitioned according to the semantics of the interpretation of the data. Partitioning data into bytes forces an arbitrary partition of general data that would not happen conceptually with void. I just feel that using void[] lacks the ease of use you get with ubyte[].
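The difference between `read_bad` and `read_good` in program P comes down to whether the callee can see how large the destination is. Below is a rough sketch of the array-taking variant in present-day D, not the 2004 compiler. `readGood` and its `source` parameter are hypothetical stand-ins for a real stream, and the explicit `enforce` checks are an assumption made so that the failure is the same in every build; D's built-in slice bounds checking would normally catch the same mistake unless it has been switched off.

```d
import std.exception : assertThrown, enforce;

/// Copy `len` bytes from `source` into `dest`. Because both arguments are
/// slices, their lengths travel with them and an oversized request can be
/// rejected before anything is written. (Hypothetical stand-in for a
/// stream read; not an API from the thread.)
void readGood(ubyte[] dest, const(ubyte)[] source, size_t len)
{
    enforce(len <= dest.length, "read would overrun the destination buffer");
    enforce(len <= source.length, "not enough data in the source");
    dest[0 .. len] = source[0 .. len];
}

void main()
{
    ubyte[256] ex;
    ubyte[512] data;

    readGood(ex[], data[], 256);               // fits exactly
    assertThrown(readGood(ex[], data[], 512)); // too large: rejected, no write
}
```

A `void*` version has no length to compare against, which is the point made above. As the earlier `badBuggyRead` example shows, a caller can still defeat the check by casting a pointer into an oversized slice.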
Copyright © 1999-2021 by the D Language Foundation