Streams and encoding (page 4) - D Programming Language Discussion Forum

August 04, 2004

Re: Streams and encoding

Posted by Andy Friesen
in reply to Regan Heath

Regan Heath wrote:

>> Slicing does not create garbage.
> 
> Really? doesn't slicing create another array structure (the one you have described below) exactly the same as if/when you pass one to a function, so..
> 
> void foo(char[] a)
> {
> }
> void main()
> {
>   char[] a = "12345";
>   foo(a[1..3]);
> }
> 
> the above code creates 3 arrays:
>  1- 'a' at the start of main
>  2- one for the slice
>  3- one for the function call.
> 
> leaving out the slice creates one less copy of the array (not the data)
> 
> I think that is what parabolis meant.

Sure, but the second two can probably be optimized into one and the same.

Besides, it's stack space.  Nothing is faster than stack allocation. (sub esp, ...)

>> The whole idea behind DataSources and DataSinks is that they just pull bytes in and out of some other place without ever having any concern for their meaning.
>>
>> This is a textbook case of the right place to use void*. :)  (or void[])
> 
> 
> I agree void* or void[] should be used.
> 
> Parabolis's other concern was a buffer overrun, but as I see it neither void[], void* nor ubyte[] is any more buffer safe (see my other post for a detailed explanation)

References are to be preferred over pointers in C++ because constructing a null reference isn't easily possible to do by accident.  It's easy to do on purpose, but if you do, Santa will put you on his Naughty list and give you coal.  Also, your programs might crash or something.

D arrays are the same way.  Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)
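
To make the "slicing does not create garbage" point concrete, here is a minimal sketch, assuming a current D compiler (string literals are immutable these days, hence the `.dup`). `isSliceOf` is a made-up helper for illustration, not a library function:

```d
import std.stdio;

// Returns true when `part` lies entirely inside the memory of `whole`.
bool isSliceOf(const(char)[] part, const(char)[] whole)
{
    return part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length;
}

void main()
{
    char[] a = "12345".dup;       // the only allocation of character data
    char[] s = a[1 .. 3];         // a new (pointer, length) pair, no copy

    assert(s.length == 2);
    assert(isSliceOf(s, a));      // s points into a's storage

    s[0] = 'X';                   // writing through the slice...
    assert(a == "1X345");         // ...changes the original array

    writeln(a);                   // prints 1X345
}
```

The slice itself is just two words on the stack (or in registers), so there is nothing for the collector to clean up.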

 -- andy

August 04, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Andy Friesen

Andy Friesen wrote:

> Besides, it's stack space.  Nothing is faster than stack allocation. (sub esp, ...)

Sure there is. Not allocating is infinitely faster. :)

August 04, 2004

Re: Streams and encoding

Posted by Andy Friesen
in reply to parabolis

parabolis wrote:
> Andy Friesen wrote:
> 
>> Besides, it's stack space.  Nothing is faster than stack allocation. (sub esp, ...)
> 
> 
> Sure there is. Not allocating is infinitely faster. :)

Passing an array slice as an argument is exactly the same as passing a pointer to its contents and a size. (the exact same code should be emitted)

This is why the %.*s trick works with printf.  The length gets pushed first, then the pointer, which just so happens to be the same format as expected by %.*s.

printf("%.*s\n", str); <===> printf("%.*s\n", str.length, &str[0]);
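
The one-argument form leans on how the array happened to be pushed for a variadic call; as far as I know that only lined up on the 32-bit targets of the time. A hedged sketch of the explicit form, which spells out the int precision and the pointer, follows. `viaPrintfFormat` is a hypothetical helper, and the 256-byte buffer is an arbitrary choice for the example:

```d
import core.stdc.stdio : printf, snprintf;

// Formats `s` through the C "%.*s" specifier into a D string
// (this sketch truncates anything past 255 bytes).
string viaPrintfFormat(const(char)[] s)
{
    char[256] buf;
    // "%.*s" consumes an int precision followed by a char pointer,
    // so the slice is split into those two arguments explicitly.
    int n = snprintf(buf.ptr, buf.length, "%.*s", cast(int) s.length, s.ptr);
    if (n < 0) return null;
    if (n >= buf.length) n = cast(int) buf.length - 1;
    return buf[0 .. n].idup;
}

void main()
{
    string str = "hello, world";
    const(char)[] part = str[0 .. 5];   // not NUL-terminated at index 5

    printf("%.*s\n", cast(int) part.length, part.ptr);   // prints hello
    assert(viaPrintfFormat(part) == "hello");
}
```

The precision is what keeps printf from running past the end of the slice, since D slices carry no terminating NUL.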

While we're on the topic of speed hacking, though, might I suggest the following for improving application performance:

    int main() { return 0; }

(it reduces memory consumption too!)

;)

 -- andy

August 04, 2004

Re: Streams and encoding

Posted by Arcane Jill
in reply to Walter

In article <cep6nb$1o72$1@digitaldaemon.com>, Walter says...

>I'm one of those folks who is very much in favor of a file reader being able to automatically detect the encoding in it. Hence, D can auto-detect the UTF formatting. So, I'd recommend that the format be an enum that can be specifically set or can be auto-detected. Different resulting behaviors can be handled with virtual functions.

With all due respect, Walter, that's not really feasible. It is very hard, for example, to distinguish between ISO-8859-1 and ISO-8859-2 (not to mention ISO-8859-3, etc.). Yes, distinguishing between UTFs is straightforward, but not all encodings make life that easy for us. You can't use an enum, because there are an unlimited number of possible encodings.

Besides, if you're parsing an HTTP header, and if, within that header, you read "Content-Type: text/plain; charset=MAC-ROMAN", then you can be pretty sure you know what the encoding of the following document is going to be. Other formats have different indicators (HTML meta tags, Python source file comments; the list is endless). Only at the application level can you /really/ sort this out, because the application presumably knows what it's looking at.
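
As a rough illustration of the application-level case: HTTP carries the label in the `charset` parameter of Content-Type, and pulling it out is a few lines. This is only a sketch; `charsetOf` is a made-up name, and it ignores quoted parameter values and other corner cases a real header parser would handle:

```d
import std.string : indexOf, strip;
import std.uni : toLower;

// Returns the charset parameter of a Content-Type value, or null if absent.
string charsetOf(string contentType)
{
    enum key = "charset=";
    // Parameter names are case-insensitive, so search a lowercased copy.
    immutable pos = contentType.toLower.indexOf(key);
    if (pos < 0) return null;
    string rest = contentType[pos + key.length .. $];
    immutable end = rest.indexOf(';');       // stop at the next parameter
    if (end >= 0) rest = rest[0 .. end];
    return rest.strip;
}

void main()
{
    assert(charsetOf("text/plain; charset=ISO-8859-2") == "ISO-8859-2");
    assert(charsetOf("text/plain") is null);
}
```

The returned label is just a string, which fits the point that the set of encodings is open-ended and cannot be captured by a fixed enum.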


>Also, formats like UTF-16 have two variants, big end and little end.

Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.
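
For the UTF case the auto-detection usually means looking for a byte order mark. A minimal sketch, assuming only UTF-8 and the two UTF-16 variants matter (UTF-32 is deliberately left out, and the `Bom` enum and `detectBom` are names invented for this example):

```d
import std.stdio;

enum Bom { none, utf8, utf16le, utf16be }

// Inspects the first bytes of `data` for a Unicode byte order mark.
Bom detectBom(const(ubyte)[] data)
{
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom.utf8;
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom.utf16le;
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom.utf16be;
    return Bom.none;   // no BOM: the caller must decide by other means
}

void main()
{
    ubyte[] le = [0xFF, 0xFE, 0x41, 0x00];   // BOM + 'A' in UTF-16LE
    ubyte[] be = [0xFE, 0xFF, 0x00, 0x41];   // BOM + 'A' in UTF-16BE
    assert(detectBom(le) == Bom.utf16le);
    assert(detectBom(be) == Bom.utf16be);
    assert(detectBom(cast(const(ubyte)[]) "plain") == Bom.none);
    writeln(detectBom(le));   // prints utf16le
}
```

Nothing comparable exists for ISO-8859-1 versus ISO-8859-2, which is why those need an external label.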



>It should also be able to read data in other formats, such as code pages, and convert them to utf. These cannot be auto-detected.

I think that's the whole point. Windows code pages /are/ encodings. WINDOWS-1252 is an encoding, same as UTF-8. I think people here are talking about encodings generally, not just UTFs.

Jill

August 04, 2004

Re: Streams and encoding

Posted by antiAlias
in reply to parabolis

"parabolis" <parabolis@softhome.net> wrote in message news:cepppv$1ugt$1@digitaldaemon.com...
> antiAlias wrote:
>
> > "parabolis" wrote..
> >
> >
> >>Quote:
> >>================================================================
> >>Note that these Tokenizers do not maintain any state of their own. Thus they are all thread-safe.
> >>================================================================
> >>This is always good to know from documentation. :)
> >>
> >>
> >>However I am curious about IPickle's design. Would it not be possible to serialize objects based on the data in ClassInfo?
> >
> >
> > Doing it the introspection way (à la Java) has a bunch of issues all of its own, and D doesn't have the power to expose all the requisite data as yet (I could be wrong on the latter though).
>
> I think I was premature to suppose D could do that. I just gave the issue some thought and there is just enough introspection to make a shallow copy which is obviously not sufficient.
>
> > IPickle was a nice and simple way to approach it; there's no monkey business anywhere (like Java has), it's explicit, and it's very fast. While not an overriding design factor, throughput is one of the main things all the Mango branches/packages keep a watchful eye upon. Frankly, I'd like to see a decent introspection approach emerge along the way; perhaps as a complement rather than a replacement: within Mango there's no obvious reason why the two approaches could not produce an equivalent serialized stream, and therefore be interchangeable at the endpoints.
>
> Any automated serializing algorithm would have to either allow IPickles to [de-]serialize themselves or ignore read/write. However given one of those holds then the serialization ought to be compatible.
>
> > This is one area where I think getting other people involved in the project would help tremendously.
>
> I think I am probably sold on being willing to help. It is more an issue of whether I can provide anything that will further mango. :)
>
There's lots to do <g>

Here's some things that have been noted:
http://www.dsource.org/forums/viewtopic.php?t=174&sid=f5f234d101f0405ebaf9cbdf728af44a
And here's some more:
http://www.dsource.org/forums/viewtopic.php?t=157&sid=f5f234d101f0405ebaf9cbdf728af44a

That's just the tip of the iceberg though. For example, there's no Unicode support as yet since we decided to wait until Hauke & AJ released all the requisite pieces (better to do it properly); IO filters/decorators such as companders have not actually been implemented yet, although there's a solid placeholder for them; there's some annoying things that are currently unimplemented on Unix (noted in the documentation todo list); etc. etc.

Plenty of room for improvement all over the place, and that's before you hit the upper decks :-)

The project is very open to other packages hooking in at any level: as a peer, as part of the Mango Tree itself, or as a package user. For example, there's currently a bit-sliced XML/SAX engine in the works (okay; "byte-sliced" then), plus the DSP project mentioned earlier (which looks to be really uber cool ... everyone should check that one out).

Having real-world user-code drive the design and functionality is of truly immense value: the bad stuff is typically identified and removed/replaced rather quickly. Anyone who would like to get involved, please jump on the dsource.org forums!

August 04, 2004

Re: Streams and encoding

Posted by Regan Heath
in reply to parabolis

On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis@softhome.net> wrote:

> Regan Heath wrote:
>
>> On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis@softhome.net> wrote:
>>
>>> Show me a safe function that takes void* as a parameter. That was really more the point I was making. There is no way to guarantee in read(void*,uint len) that len is not actually longer than the array someone passes in. When that happens your read function will overwrite the end of the array and eventually write over executable code. Somebody will find that bug and send a specially formatted overly long string that has machine code in it and hijack the program.
>>
>>
>> I agree this is a problem, I have been dealing with it for years at work (we work with C only).
>>
>> The solution in this case is that nobody outside the Stream template class actually calls the read/write functions that take void*; instead they call the ones provided for int, float, ubyte[], and so on.
>>
>> However, someone might want the void* ones in order to read/write a struct..
>
> That is a good point.
>
>>
>> ...
>>
>> I have just discovered you can use ubyte[] and get the same sort of function as my void* one, check out...
>>
>> class Stream
>> {
>>     ulong read(ubyte[] buffer, ulong length = 0, ulong offset = 0)
>>     {
>>         if (length == 0) length = buffer.length;
>>         buffer[offset..length] = 65;
>
> Now that is pretty neat.

Just that bit.. or the whole thing? That bit above was a little hack; it sets the whole buffer to 65, or ASCII 'A'.
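
Since the quoted snippet got cut off mid-function, here is a complete sketch of the same idea. It is an assumption-laden variant rather than the original code: it uses size_t instead of ulong so the slice indices compile on every target, treats `length` as a count from `offset`, and clamps the request to what fits in the buffer:

```d
class Stream
{
    // Fills buffer[offset .. offset + length] with 'A' as a stand-in for real
    // input and returns the number of bytes "read". A length of 0 means
    // "to the end of the buffer".
    size_t read(ubyte[] buffer, size_t length = 0, size_t offset = 0)
    {
        if (offset > buffer.length) offset = buffer.length;
        immutable room = buffer.length - offset;
        if (length == 0 || length > room) length = room;
        buffer[offset .. offset + length] = 65;   // ASCII 'A'
        return length;
    }
}

void main()
{
    auto s = new Stream;
    auto buf = new ubyte[8];

    assert(s.read(buf) == 8);            // whole buffer
    assert(buf[7] == 'A');

    buf[] = 0;
    assert(s.read(buf, 100, 6) == 2);    // request clamped to what fits
    assert(buf[5] == 0 && buf[6] == 'A' && buf[7] == 'A');
}
```

Because the function only ever sees the slice, it cannot write outside it no matter what length the caller asks for.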

>> So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...
>>
>> void badBuggyRead(out char x)
>> {
>>     read(cast(ubyte[])(&x)[0..1000]);
>> }
>>
>> so even tho read uses a ubyte[] it can still overrun.
>
> You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.

But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

  read(void* address, ulong length);

you might call that wrong too. I cannot see a difference, and void* is easier to use and smaller than void[].

>>>> <snip>
>>>>
>>>>> I am glad to hear you decided to split them. I think you will find it makes life simpler.
>>>>>
>>>>> I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)
>>>>
>>>>
>>>>
>>>> You mean the problem you see with threads and shared buffers?
>>>>
>>>
>>> Sorry I meant the problem with threads and shared buffers should be easier now.
>>
>>
>> :)
>>
>>> The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...
>>
>>
>> Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg.
>>
>> alias OutputStream!(InputStream!(RawFile)) File;
>>
>> or something, I have not tried splitting them yet, then..
>>
>> alias CRCReader!(File) CRCFileReader;
>> alias CRCWriter!(File) CRCFileWriter;
>>
>> alias ZIPReader!(File) ZIPFileReader;
>> alias ZIPWriter!(File) ZIPFileWriter;
>>
>> now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used...
>>
>> Regan
>>
>
> Consider the number of combinations of just Readers that are possible:
>
>     File,Net,Mem - choose 1 of 3
>
>     Compression
>     CRC           } - choose any number and in any order
>     Buffering
>
>     Image,Audio,Video - choose 1 of 3
>
> If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.
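
For what it's worth, the estimate can be checked with a few lines. This sketch assumes "any number and in any order" means every ordered selection of the three filters, including none at all; `orderedSelections` is a helper written just for this count:

```d
// Number of ordered selections of any size from n distinct items:
// the sum over k of n! / (n - k)!.
ulong orderedSelections(uint n)
{
    ulong total = 0;
    ulong perms = 1;            // arrangements of size k, starting at k = 0
    foreach (k; 0 .. n + 1)
    {
        total += perms;
        perms *= (n - k);       // extend each arrangement by one more item
    }
    return total;
}

void main()
{
    immutable sources = 3;                       // File, Net, Mem
    immutable filters = orderedSelections(3);    // Compression, CRC, Buffering
    immutable formats = 3;                       // Image, Audio, Video
    assert(filters == 16);                       // 1 + 3 + 6 + 6
    assert(sources * filters * formats == 144);
}
```

Under that reading the total is 3 * 16 * 3 = 144, so "roughly 100" is the right order of magnitude.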

Yeah.. so? when I need one I make an alias and use it.. when I need another I make an alias and use it, it's no different to simply typing
 new A(new B(new C))

when you use it, _except_, if you re-use it in several places then my alias is neater.

I am not going to alias all x possible combinations right now :)
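
A small sketch of the alias-per-combination style, written against a current D compiler (so `alias Name = Type;` rather than the older `alias Type Name;`). All three layer types here are hypothetical stand-ins, not the real stream classes under discussion:

```d
// Hypothetical layers: each wraps an inner source chosen at compile time.
struct MemSource
{
    const(ubyte)[] data;
    size_t read(ubyte[] dst)
    {
        immutable n = dst.length < data.length ? dst.length : data.length;
        dst[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// Counts bytes passing through; a stand-in for a CRC or buffering layer.
struct Counting(Inner)
{
    Inner inner;
    size_t total;
    size_t read(ubyte[] dst)
    {
        immutable n = inner.read(dst);
        total += n;
        return n;
    }
}

// Flips every bit; a stand-in for a decompression layer.
struct Inverting(Inner)
{
    Inner inner;
    size_t read(ubyte[] dst)
    {
        immutable n = inner.read(dst);
        foreach (ref b; dst[0 .. n]) b ^= 0xFF;
        return n;
    }
}

// One alias per combination that is actually used.
alias CountingMem = Counting!MemSource;
alias InvertingCountingMem = Inverting!CountingMem;

void main()
{
    ubyte[] raw = [0x00, 0xFF, 0x0F];
    auto src = InvertingCountingMem(CountingMem(MemSource(raw)));

    ubyte[4] buf;
    ubyte[3] expected = [0xFF, 0x00, 0xF0];
    assert(src.read(buf[]) == 3);
    assert(buf[0 .. 3] == expected[]);
    assert(src.inner.total == 3);
}
```

The layering is fixed when the alias is written, which is exactly why a runtime choice still needs something like a factory.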

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/

August 04, 2004

Re: Streams and encoding

Posted by Bent Rasmussen
in reply to Andy Friesen

> While we're on the topic of speed hacking, though, might I suggest the following for improving application performance:
>
>      int main() { return 0; }
>
> (it reduces memory consumption too!)

That is post-mature optimization. You should never have created application.d in the first place! :-)

August 04, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to Andy Friesen

In article <cepsao$1vbo$1@digitaldaemon.com>, Andy Friesen says...
>
>
>D arrays are the same way.  Accidentally constructing an invalid array is much less likely to occur than using an explicit pointer/length pair. :)

Not sure I agree in this case.

# void read( void* addr, size_t size );
# void read( ubyte[] val );
#
# int x;
# read( &x, x.sizeof );
# read( (cast(ubyte*) &x)[0..x.sizeof] );

Both instances of the above code require the programmer to be a bit evil about how they specify access to a range of memory.  To me, the void* call just looks cleaner and less confusing while being no more prone to user error (in fact possibly less, as the calling syntax is simpler).

I had actually added wrapper functions to unformatted read/write all primitive types but recently removed them because they seemed redundant.  I suppose if there's enough of a demand I'll add them back.
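
One way such wrappers might look, as a sketch only: the byte reinterpretation lives in a single template, so callers never write the cast themselves. `ByteSource` and `readValue` are names invented for this example and are not taken from any existing stream library:

```d
// Hypothetical stand-in for a stream: hands out bytes from a fixed array.
struct ByteSource
{
    const(ubyte)[] data;

    // Copies up to dst.length bytes into dst; returns the count copied.
    size_t read(ubyte[] dst)
    {
        immutable n = dst.length < data.length ? dst.length : data.length;
        dst[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }

    // Typed wrapper: the only place that reinterprets memory as bytes.
    bool readValue(T)(out T value) if (__traits(isScalar, T))
    {
        auto bytes = (cast(ubyte*) &value)[0 .. T.sizeof];
        return read(bytes) == T.sizeof;
    }
}

void main()
{
    ubyte[] raw = [0x2A, 0, 0, 0, 0x07];
    auto src = ByteSource(raw);

    int x;
    assert(src.readValue(x));            // consumes 4 bytes
    version (LittleEndian) assert(x == 42);

    ubyte b;
    assert(src.readValue(b) && b == 7);

    long tooBig;
    assert(!src.readValue(tooBig));      // not enough bytes left
}
```

The slice length comes from `T.sizeof`, so the size can never disagree with the type being read.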


Sean

August 04, 2004

Re: Streams and encoding

Posted by Sean Kelly
in reply to Arcane Jill

In article <ceq0mg$20d8$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cep6nb$1o72$1@digitaldaemon.com>, Walter says...
>
>>Also, formats like UTF-16 have two variants, big end and little end.
>
>Best to treat those as two separate encodings, although if the encoding is specified as "UTF-16" you may still need to auto-detect which variant is being used. Once you know for sure, stick with it.

That reminds me.  Which format does the code in utf.d use?  I'm thinking I may do something like this for encoding for now:

enum Format {
    UTF8 = 0,
    UTF16 = 1,
    UTF16LE = 1,
    UTF16BE = 2
}

So "UTF-16" would actually default to one of the two methods.
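
A sketch of how that enum might be used, assuming encoding labels arrive as strings; `parseFormat` is a hypothetical helper. D allows two members to share a value, so `UTF16` and `UTF16LE` compare equal:

```d
enum Format
{
    UTF8    = 0,
    UTF16   = 1,   // plain "UTF-16" shares a value with the default variant
    UTF16LE = 1,
    UTF16BE = 2
}

// Maps an encoding label to a Format; unknown labels are reported via `ok`.
Format parseFormat(string label, out bool ok)
{
    ok = true;
    switch (label)
    {
        case "UTF-8":    return Format.UTF8;
        case "UTF-16":   return Format.UTF16;
        case "UTF-16LE": return Format.UTF16LE;
        case "UTF-16BE": return Format.UTF16BE;
        default:         ok = false; return Format.UTF8;
    }
}

void main()
{
    bool ok;
    assert(parseFormat("UTF-16", ok) == Format.UTF16LE && ok);
    assert(parseFormat("UTF-16BE", ok) == Format.UTF16BE && ok);
    parseFormat("KOI8-R", ok);
    assert(!ok);                 // not representable in this enum
}
```

The shared value means a bare "UTF-16" label silently becomes little-endian here, so a BOM check would still be needed before trusting it.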


Sean

August 04, 2004

Re: Streams and encoding

Posted by parabolis
in reply to Regan Heath

Regan Heath wrote:

> On Wed, 04 Aug 2004 01:24:46 -0400, parabolis <parabolis@softhome.net> wrote:
> 
>> Regan Heath wrote:
>>
>>> On Tue, 03 Aug 2004 23:30:03 -0400, parabolis <parabolis@softhome.net> wrote:
>>>
>>> So now the read function takes a ubyte[] and is itself buffer safe.. however this does not mean buffer overruns are not possible, consider...
>>>
>>> void badBuggyRead(out char x)
>>> {
>>>     read(cast(ubyte[])(&x)[0..1000]);
>>> }
>>>
>>> so even tho read uses a ubyte[] it can still overrun.
>>
>>
>> You can always circumvent a security measure. The point is that with the measure there you *have* to go out of your way to get around it.
> 
> 
> But people will. Assume you're trying to read/write a struct, int, float, whatever, you _have_ to write code like that above and you might get it wrong, it's exactly the same as if you were using:

Not really. My DataXXXStream would handle reading all cases where you want to read a primitive. The struct thing is a special case that I will say should be handled by library read/write functions. So it is expected that people who want a primitive/struct will use a library function. Should somebody have the need for something strange and defeat the security measure then it is expected they will not do it in a way that causes a buffer overrun.

Most buffer overruns are a result of the fact that dealing with char* on a regular basis leads to small bugs. I eliminate those with ubyte[] (or possibly void[]). You fail to do that with void*.

> 
>   read(void* address, ulong length);
> 
> you might call that wrong too. I cannot see a difference and void* is easier to use and smaller than void[].
> 
>>>>> <snip>
>>>>>
>>>>>> I am glad to hear you decided to split them. I think you will find it makes life simpler.
>>>>>>
>>>>>> I am not much of a generic programmer. So I am waiting to see how you deal with the combinatorial problem before I am sold on the idea. If you can pull it off then you might be onto something. :)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> You mean the problem you see with threads and shared buffers?
>>>>>
>>>>
>>>> Sorry I meant the problem with threads and shared buffers should be easier now.
>>>
>>>
>>>
>>> :)
>>>
>>>> The bit about the combinatorial problem goes back to the other thread in which I wanted to see how you combine multiple streams...
>>>
>>>
>>>
>>> Ahh yes.. I am waiting for an idea to come to me.. my first idea is that I combine them in the same way as I combine the ones I currently have eg.
>>>
>>> alias OutputStream!(InputStream!(RawFile)) File;
>>>
>>> or something, I have not tried splitting them yet, then..
>>>
>>> alias CRCReader!(File) CRCFileReader;
>>> alias CRCWriter!(File) CRCFileWriter;
>>>
>>> alias ZIPReader!(File) ZIPFileReader;
>>> alias ZIPWriter!(File) ZIPFileWriter;
>>>
>>> now, this is fine for types we know about at compile time, however we may need to choose at runtime, so some sort of factory approach will have to be used...
>>>
>>> Regan
>>>
>>
>> Consider the number of combinations of just Readers that are possible:
>>
>>     File,Net,Mem - choose 1 of 3
>>
>>     Compression
>>     CRC           } - choose any number and in any order
>>     Buffering
>>
>>     Image,Audio,Video - choose 1 of 3
>>
>> If I am not too sleepy to be thinking straight then there are roughly 100 combinations of readers with just these 9 classes.
> 
> 
> Yeah.. so? when I need one I make an alias and use it.. when I need another I make an alias and use it, it's no different to simply typing
>  new A(new B(new C))
> 
> when you use it, _except_, if you re-use it in several places then my alias is neater.
> 
> I am not going to alias all x possible combinations right now :)
> 

So for something that reads from a file then does buffering then decompression then computes a CRC check of the input stream and reads image data you would use something like this:
================================================================
alias BufferedInputStream!(FileInputStream)
    BufferedFileInputStream;
alias DecompressionInputStream!(BufferedFileInputStream)
    DecompressionBufferedFileInputStream;
alias CRCInputStream!(DecompressionBufferedFileInputStream)
    CRCDecompressionBufferedFileInputStream;
alias ImageInputStream!(CRCDecompressionBufferedFileInputStream)
    ImageCRCDecompressionBufferedFileInputStream;

CRCInputStream crc_in = new
    CRCDecompressionBufferedFileInputStream(filename);
ImageInputStream iin = new
    ImageCRCDecompressionBufferedFileInputStream(crc_in);
================================================================
File - 10 times
Buffered - 10 times
Decompression - 8 times
CRC - 7 times
Image - 4 times
================================

I cannot imagine why you would like having all that alias clutter up your file instead of just using the minimal:
================================================================
CRCInputStream crc_in = new CRCInputStream
(   new DecompressionInputStream
    (   new BufferedInputStream
        (  new FileInputStream( filename )
        )
    )
);
ImageInputStream iin = new ImageInputStream( crc_in );
================================================================
File - 1 time
Buffered - 1 time
Decompression - 1 time
CRC - 2 times
Image - 2 times
================
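
For comparison, a runnable sketch of the nested-constructor style with the layers picked at run time. The class names are simplified stand-ins (a memory-backed stream instead of a file, a byte sum instead of a real CRC), and `open` is a hypothetical factory function:

```d
interface InputStream
{
    // Reads up to dst.length bytes; returns the number actually read.
    size_t read(ubyte[] dst);
}

// Hypothetical leaf stream backed by memory instead of a file.
class MemoryInputStream : InputStream
{
    private const(ubyte)[] data;
    this(const(ubyte)[] data) { this.data = data; }

    size_t read(ubyte[] dst)
    {
        immutable n = dst.length < data.length ? dst.length : data.length;
        dst[0 .. n] = data[0 .. n];
        data = data[n .. $];
        return n;
    }
}

// Decorator: keeps a running sum of the bytes seen (a toy stand-in for a CRC).
class ChecksumInputStream : InputStream
{
    private InputStream inner;
    uint sum;
    this(InputStream inner) { this.inner = inner; }

    size_t read(ubyte[] dst)
    {
        immutable n = inner.read(dst);
        foreach (b; dst[0 .. n]) sum += b;
        return n;
    }
}

// Layers are chosen at run time, one wrapper per requested feature.
InputStream open(const(ubyte)[] data, bool withChecksum)
{
    InputStream s = new MemoryInputStream(data);
    if (withChecksum) s = new ChecksumInputStream(s);
    return s;
}

void main()
{
    ubyte[] raw = [1, 2, 3];
    auto s = open(raw, true);
    ubyte[8] buf;
    assert(s.read(buf[]) == 3);
    auto c = cast(ChecksumInputStream) s;
    assert(c !is null && c.sum == 6);
}
```

Each class name appears once per use, at the cost of a virtual call per layer on every read.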
