August 17, 2004
Heinz Saathoff wrote:
> That fgetc() has much overhead is true, but I wasn't sure why it's nearly a factor of 10 slower than my primitive buffering approach. Walter told me that fgetc has to be aware of multithreading. There will be some error handling too. All this is overhead.
> When I find some time I will have a look at the sources and see what happens.

Well, you could purchase the CD and look at the code. :-)

fgetc() for 32-bit is handcoded assembly (see src/CORE32/FPUTC.ASM.) It's about as fast as you're going to get it to run. It does deal with multithreaded locking of the file descriptor, which is the place where the slowdown occurs. If you look at the code for LockSemaphoreNested, you'll see a LOCK-prefixed instruction -- this is **really** slow because it forces a lot of synchrony in the Pentium pipeline. It's the most conservative way of doing locking if you can't do self-modifying code or don't offer processor-specific versions of the RTL.

Of course, you can create your own CPU-specific version of the RTL because the build system and the code are available from the CD. For example, you can call CMPXCHG or XADD instead of LOCK INC (because the lock semantics are implied.) If you're not in an SMP environment, you can simply use MOV so long as it's an aligned MOV (guaranteed atomic.)
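
As a rough illustration of the two kinds of atomic operation mentioned above (fetch-and-add and compare-and-exchange), here is a minimal sketch in portable C11. It is not the RTL's LockSemaphoreNested code, and it assumes a compiler that provides <stdatomic.h>. The names NestCounter, nest_enter, nest_leave, try_acquire and release are made up for this example, and which instructions the compiler actually emits depends on the compiler and the target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical nesting counter; NOT the RTL's LockSemaphoreNested. */
typedef struct {
    atomic_int count;
} NestCounter;

/* Atomically add 1 and return the new nesting depth (fetch-and-add). */
static int nest_enter(NestCounter *c)
{
    assert(c != NULL);
    return atomic_fetch_add(&c->count, 1) + 1;
}

/* Atomically subtract 1 and return the new nesting depth. */
static int nest_leave(NestCounter *c)
{
    assert(c != NULL);
    return atomic_fetch_sub(&c->count, 1) - 1;
}

/* Try to flip the flag from 0 to 1 with compare-and-exchange.
   Returns 1 if this caller took the flag, 0 if it was already taken. */
static int try_acquire(atomic_int *flag)
{
    int expected = 0;
    return atomic_compare_exchange_strong(flag, &expected, 1) ? 1 : 0;
}

/* Give the flag back so another caller can take it. */
static void release(atomic_int *flag)
{
    atomic_store(flag, 0);
}
```

A caller would initialize the counter or flag to 0 with atomic_init(), then pair each nest_enter() with a nest_leave(), or each successful try_acquire() with a release().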

> If fgetc() were implemented the way I did it in my simple buffering file wrapper, it would be as fast as my version. As you said, fgetc() does more than just pick a char from a buffer and increment a pointer. I didn't expect this overhead in the first place, but now I know not to use fgetc() in time-critical applications.

Clearly, your own optimizations work better than the generic version, which has to make few and conservative assumptions. If you're looking to be maximally portable, stick with fgetc() or use fread() if the size of the object is known. fread() will amortize the penalty of calling stdio over a larger number of bytes.
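
To make the amortization point concrete, here is a minimal sketch of a buffering wrapper of the kind discussed above: one fread() call refills a block, and each character after that is just a buffer read and an index increment. This is not Heinz's actual wrapper, which isn't shown in the thread. The names BufReader, br_init, br_getc and br_count_bytes and the 4096-byte buffer size are assumptions for illustration, and the wrapper does no locking, so it is only safe when a single thread uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { BUF_SIZE = 4096 };  /* illustrative size, not tuned */

/* Hypothetical single-threaded buffered reader on top of stdio. */
typedef struct {
    FILE *fp;
    unsigned char buf[BUF_SIZE];
    size_t pos;  /* index of the next unread byte in buf */
    size_t len;  /* number of valid bytes currently in buf */
} BufReader;

static void br_init(BufReader *r, FILE *fp)
{
    assert(r != NULL && fp != NULL);
    r->fp = fp;
    r->pos = 0;
    r->len = 0;
}

/* Return the next byte (0..255), or EOF at end of file or on error. */
static int br_getc(BufReader *r)
{
    if (r->pos == r->len) {
        /* Buffer exhausted: one stdio call refills a whole block. */
        r->len = fread(r->buf, 1, BUF_SIZE, r->fp);
        r->pos = 0;
        if (r->len == 0)
            return EOF;
    }
    return r->buf[r->pos++];
}

/* Example use: count the bytes remaining in a stream. */
static long br_count_bytes(FILE *fp)
{
    BufReader r;
    long n = 0;
    br_init(&r, fp);
    while (br_getc(&r) != EOF)
        n++;
    return n;
}
```

With this shape, the per-call cost of stdio (including any locking it does) is paid once per 4096 bytes instead of once per character. Like EOF from fgetc(), an EOF from br_getc() does not distinguish end of file from a read error; ferror() on the underlying FILE would be needed for that.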
August 18, 2004
Hello Scott,

Scott Michel wrote...
> Heinz Saathoff wrote:
> > That fgetc() has much overhead is true, but I wasn't sure why it's
> > nearly a factor of 10 slower than my primitive buffering approach. Walter
> > told me that fgetc has to be aware of multithreading. There will be some
> > error handling too. All this is overhead.
> > When I find some time I will have a look at the sources and see what
> > happens.
> 
> Well, you could purchase the CD and look at the code. :-)

I already have. That's why I have the sources.


> fgetc() for 32-bit is handcoded assembly (see src/CORE32/FPUTC.ASM.) It's about as fast as you're going to get it to run. It does deal with multithreaded locking of the file descriptor, which is the place where the slowdown occurs. If you look at the code for LockSemaphoreNested, you'll see a LOCK-prefixed instruction -- this is **really** slow because it forces a lot of synchrony in the Pentium pipeline. It's the most conservative way of doing locking if you can't do self-modifying code or don't offer processor-specific versions of the RTL.

Thanks for the hint.


> Of course, you can create your own CPU-specific version of the RTL because the build system and the code are available from the CD. For example, you can call CMPXCHG or XADD instead of LOCK INC (because the lock semantics are implied.) If you're not in an SMP environment, you can simply use MOV so long as it's an aligned MOV (guaranteed atomic.)

It's not necessary at the moment. The app I wrote is only used by myself. Now that I'm aware of this bottleneck I can avoid it.


> > If fgetc() were implemented the way I did it in my simple buffering file wrapper, it would be as fast as my version. As you said, fgetc() does more than just pick a char from a buffer and increment a pointer. I didn't expect this overhead in the first place, but now I know not to use fgetc() in time-critical applications.
> 
> Clearly, your own optimizations work better than the generic version, which has to make few and conservative assumptions. If you're looking to be maximally portable, stick with fgetc() or use fread() if the size of the object is known. fread() will amortize the penalty of calling stdio over a larger number of bytes.

Yes, it's always good to know where the time is spent and how small changes in code can result in great performance gains.


- Heinz
August 18, 2004
Heinz Saathoff wrote:
> Yes, it's always good to know where the time is spent and how small changes in code can result in great performance gains. 

Google for the Linux kernel patch that modifies the kernel at run-time to select the "right" atomic instructions depending on whether the machine is SMP and on the processor model/rev. Pretty cool looking stuff, but with the XP patches that prevent modifying executable pages, I doubt this could be easily implemented in the RTL.


-scooter