new streams - D Programming Language Discussion Forum

May 10, 2002

Posted by Pavel Minayev

You can find the new stream module at my site, http://int19h.tamb.ru.

By far the most interesting addition is scanf(). Even better,
it can read D strings!

    char[] s;
    stdin.scanf("%.*s", &s);

Yes, this really works! Of course, you can still read C strings (%s), but who needs them anymore? Note, however, that scanf hasn't been tested much, so it might contain bugs. Be careful!

Going further, readLine() has learnt to read lines terminated with a single CR (aka It Came From Mac). And writeLine() now follows Windows conventions, and writes CR/LF terminated lines. On Linux, it should write a single LF, and a CR on Mac, whenever D gets there - a bit of underlying platform transparency.

Unicode strings now work - no, really! =) readStringW(),
writeStringW(), readLineW(), and writeLineW() do their job
no worse than their ANSI counterparts.

Generic read() and write() can now handle strings as well. Unlike
readString() and writeString(), these also store the length in
the stream:

    char[] s;
    ...
    file.write(s);   // writes s.length, followed by s
    ...
    file.read(s);    // reads length, then string of that length
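
A minimal sketch of the counted-string layout described above, written for a current D compiler and using a plain byte array in place of a stream. The 4-byte little-endian length and the names writeCounted and readCounted are assumptions made for illustration; the post does not say how wide the stored length is or in which byte order it is written.

```d
import std.stdio;

/// Appends a 32-bit little-endian length followed by the string's bytes.
/// Width and byte order of the length are assumptions for this sketch.
void writeCounted(ref ubyte[] buffer, string s)
{
    uint len = cast(uint) s.length;
    foreach (i; 0 .. 4)
        buffer ~= cast(ubyte)(len >> (8 * i));
    buffer ~= cast(const(ubyte)[]) s;
}

/// Reads one counted string from the front of data and advances past it.
string readCounted(ref const(ubyte)[] data)
{
    if (data.length < 4)
        throw new Exception("truncated length");
    uint len = 0;
    foreach (i; 0 .. 4)
        len |= cast(uint) data[i] << (8 * i);
    data = data[4 .. $];
    if (data.length < len)
        throw new Exception("truncated string");
    string s = cast(string) data[0 .. len].idup;
    data = data[len .. $];
    return s;
}

void main()
{
    ubyte[] buffer;
    writeCounted(buffer, "hello");
    writeCounted(buffer, "streams");

    const(ubyte)[] view = buffer;
    writeln(readCounted(view));
    writeln(readCounted(view));
}
```

Because the length travels with the data, the reader needs no terminator and can hand back strings that contain any byte value.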

Two new functions: getc() and ungetc(). I guess you know what
these are for. =) They also have Unicode versions, getcw() and
ungetcw().

Enumerations changed names again:

    enum SeekPos
    {
        Set,
        Current,
        End
    }

    enum FileMode
    {
        In,
        Out
    }

Now these have proper case, and should be more consistent with other Phobos modules.

And finally, the module is NO LONGER DEPENDENT on my windows.d import, and can be used with the one that comes with D. Thus, it can easily replace the old and outdated stream module in Phobos.

By the way, Walter, could you pleeease replace the old version that
you put into Phobos with this new one? It's much better, has
fewer bugs, and since it is now self-sufficient (no need for my crappy
win32 import module), it should be easy to do...

May 10, 2002

Re: new streams

Posted by Walter
in reply to Pavel Minayev


"Pavel Minayev" <evilone@omen.ru> wrote in message news:abglum$2kij$1@digitaldaemon.com...
> You can find the new stream module at my site, http://int19h.tamb.ru.

Cool!

> Going further, readLine() has learnt to read lines terminated with a single CR (aka It Came From Mac). And writeLine() now follows Windows conventions, and writes CR/LF terminated lines. On Linux, it should write a single LF, and a CR on Mac, whenever D gets there - a bit of underlying platform transparency.

What I do is treat as "newline" any of the following:

1) CR
2) CR LF
3) LF

It requires a bit of lookahead to distinguish case 1 from case 2, but it works with files generated by Windows, linux, and Mac.
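
A minimal sketch of the rule above, written for a current D compiler and working on an in-memory string instead of a stream, so the lookahead is only an index check. The name universalLines is made up for illustration and is not part of the stream module.

```d
import std.stdio;

/// Splits text into lines, treating CR, CR LF, and LF each as one newline.
/// Hypothetical helper for illustration only.
string[] universalLines(string text)
{
    string[] lines;
    size_t start = 0;
    size_t i = 0;
    while (i < text.length)
    {
        char c = text[i];
        if (c == '\r' || c == '\n')
        {
            lines ~= text[start .. i];
            // One character of lookahead: CR followed by LF is a single newline.
            if (c == '\r' && i + 1 < text.length && text[i + 1] == '\n')
                i++;
            i++;
            start = i;
        }
        else
            i++;
    }
    if (start < text.length)
        lines ~= text[start .. $];   // last line without a terminator
    return lines;
}

void main()
{
    // Windows, Mac, and Unix terminators mixed in one input.
    writeln(universalLines("one\r\ntwo\rthree\nfour"));
}
```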

> By the way, Walter, could you pleeease replace the old version, that you'd put into Phobos, with this new one?

Sure!

May 10, 2002

Re: new streams

Posted by Pavel Minayev
in reply to Walter


"Walter" <walter@digitalmars.com> wrote in message news:abgshp$2qbp$1@digitaldaemon.com...

> What I do is treat as "newline" any of the following:
>
> 1) CR
> 2) CR LF
> 3) LF
>
> It requires a bit of lookahead to distinguish case 1 from case 2, but it works with files generated by Windows, linux, and Mac.

That's exactly what I did in the new version. It requires ungetc()
(for the case when CR is not followed by LF) though, so I had to
add it as well.

May 10, 2002

Re: new streams

Posted by Russ Lewis
in reply to Walter


Walter wrote:

> What I do is treat as "newline" any of the following:
>
> 1) CR
> 2) CR LF
> 3) LF
>
> It requires a bit of lookahead to distinguish case 1 from case 2, but it works with files generated by Windows, linux, and Mac.

This has caused me some HUGE headaches doing streaming on UNIX boxes.  At least some of the tools do "lookahead", so they don't echo a line out until you have printed 1 character AFTER the newline...in some cases, it has caused my programs to hang for minutes or hours (while, say, a long find command runs) until either another (unnecessary) line is printed, or the stream runs into EOF.

IMHO, you should immediately interpret CR as a newline, but put a marker on the stream such that if another character is read and that character is a LF, then it will be consumed LATER.  DON'T lookahead for it :(
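
A rough sketch of that marker idea for a current D compiler, under the assumption that characters are pushed in one at a time as they arrive. A CR ends the line at once and only sets a flag; a following LF is dropped whenever it eventually shows up, so nothing ever waits on the next character. LineAssembler, put, and assembleAll are hypothetical names, not anything from the stream module.

```d
import std.stdio;

/// Hypothetical push-style line assembler: feed it characters as they arrive.
struct LineAssembler
{
    private char[] current;
    private bool pendingLF;   // the marker: the last newline seen was a CR

    /// Returns true when c completed a line; the line is then in `line`.
    bool put(char c, out string line)
    {
        if (pendingLF)
        {
            pendingLF = false;
            if (c == '\n')
                return false;   // second half of CR LF, consumed late
        }
        if (c == '\r' || c == '\n')
        {
            pendingLF = (c == '\r');
            line = current.idup;
            current = null;
            return true;
        }
        current ~= c;
        return false;
    }
}

/// Feeds a whole string through the assembler and collects finished lines.
/// Text after the last terminator is not reported in this sketch.
string[] assembleAll(string input)
{
    LineAssembler assembler;
    string[] lines;
    foreach (char c; input)
    {
        string line;
        if (assembler.put(c, line))
            lines ~= line;
    }
    return lines;
}

void main()
{
    writeln(assembleAll("a\r\nb\rc\n"));
}
```

The line "a" is reported as soon as the CR is seen; whether an LF follows only matters when the next character is delivered.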

--
The Villagers are Online! villagersonline.com

.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
.[ (a version.of(English).(precise.more)) is(possible) ]
?[ you want.to(help(develop(it))) ]

May 10, 2002

Re: new streams

Posted by Pavel Minayev
in reply to Russ Lewis


"Russ Lewis" <spamhole-2001-07-16@deming-os.org> wrote in message news:3CDBF812.D68ED4F6@deming-os.org...

> IMHO, you should immediately interpret CR as a newline, but put a marker on the stream such that if another character is read and that character is a LF, then it will be consumed LATER.  DON'T lookahead for it :(

I do a lookahead, but I have ungetc() implemented and working...

May 10, 2002

Re: new streams

Posted by Russ Lewis
in reply to Pavel Minayev


Pavel Minayev wrote:

> "Russ Lewis" <spamhole-2001-07-16@deming-os.org> wrote in message news:3CDBF812.D68ED4F6@deming-os.org...
>
> > IMHO, you should immediately interpret CR as a newline, but put a marker on the stream such that if another character is read and that character is a LF, then it will be consumed LATER.  DON'T lookahead for it :(
>
> I do a lookahead, but I have ungetc() implemented and working...

Ungetc doesn't help the problem I was talking about.  If you do lookahead but there is not a character available, then your library will block until one more character is available to read (or you detect EOF)...which could be a LONG time from now.

--
The Villagers are Online! villagersonline.com

.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
.[ (a version.of(English).(precise.more)) is(possible) ]
?[ you want.to(help(develop(it))) ]

May 10, 2002

Re: new streams

Posted by Burton Radons
in reply to Pavel Minayev


On Fri, 10 May 2002 18:42:01 +0400, "Pavel Minayev" <evilone@omen.ru> wrote:

>You can find the new stream module at my site, http://int19h.tamb.ru.
>
>By far the most interesting addition is scanf(). Even better,
>it can read D strings!
>
>    char[] s;
>    stdin.scanf("%.*s", &s);
>
>Yes, this really works! Of course, you can still read C strings (%s), but who needs them anymore? Note, however, that scanf hasn't been tested much, so it might contain bugs. Be careful!

I think we should get the scanf and fmt format codes aligned.  My method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and "%+S" for wchar*.  Different semantics for what looks like the same thing is bad city.

[snip]
>Generic read() and write() can now handle strings as well. Unlike
>readString() and writeString(), these also store the length in
>the stream:
>
>    char[] s;
>    ...
>    file.write(s);   // writes s.length, followed by s
>    ...
>    file.read(s);    // reads length, then string of that length

Since this format is our own (that is to say, there's no standard for counted strings -- some are 32-bit, some are 16-bit, some are 8-bit, with varying rules on NUL termination and alignment), we may as well use dynamic-sized integers for this.  For each byte we take the first seven bits and read another byte if the eighth bit is set, like:

    /* Write an unsigned long using the minimum number of bytes:
       seven data bits per byte, high bit set when more bytes follow */
    void dwrite(ulong value)
    {
        do
        {
            // cast so that exactly one byte is written per iteration
            write (cast(ubyte) ((value & 127) | (value > 127 ? 128 : 0)));
            value = value >> 7;
        }
        while (value);
    }

    /* Read an unsigned long using the minimum number of bytes */
    void dread(out ulong value)
    {
        uint shift = 0;
        ubyte buffer;

        value = 0;

        do
        {
            if (shift >= 64)
                throw new ReadError("integer overflow on reading value");
            read (buffer);
            value |= cast(ulong) (buffer & 127) << shift;
            shift += 7;
        }
        while (buffer & 128);
        // value is an out parameter, so there is nothing to return
    }

When writing a uint you'll usually save two or three bytes, which really adds up when writing meshes; you also have your future covered, and it's endian neutral.
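
To put numbers on the savings, here is a self-contained sketch of the same seven-bits-per-byte scheme for a current D compiler, working on a byte array instead of a stream. The names encodeVarint and decodeVarint are illustrative and not part of any proposed API.

```d
import std.stdio;

/// Encodes value with seven data bits per byte, least significant group first.
ubyte[] encodeVarint(ulong value)
{
    ubyte[] bytes;
    do
    {
        ubyte b = cast(ubyte)(value & 127);
        value >>= 7;
        if (value != 0)
            b |= 128;           // continuation bit: more bytes follow
        bytes ~= b;
    }
    while (value != 0);
    return bytes;
}

/// Decodes one value from the front of data and advances data past it.
ulong decodeVarint(ref const(ubyte)[] data)
{
    ulong value = 0;
    uint shift = 0;
    while (true)
    {
        if (data.length == 0)
            throw new Exception("unexpected end of data");
        if (shift >= 64)
            throw new Exception("integer overflow on reading value");
        ubyte b = data[0];
        data = data[1 .. $];
        value |= cast(ulong)(b & 127) << shift;
        if ((b & 128) == 0)
            return value;
        shift += 7;
    }
}

void main()
{
    foreach (ulong v; [0UL, 127, 128, 300, uint.max, ulong.max])
    {
        ubyte[] bytes = encodeVarint(v);
        const(ubyte)[] view = bytes;
        assert(decodeVarint(view) == v);
        writefln("%s takes %s byte(s)", v, bytes.length);
    }
}
```

Values below 128 take one byte and values below 16384 take two, which is where the saving over a fixed four-byte uint comes from; a full 32-bit value costs five bytes and a full 64-bit value ten.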

Signed values can be written by preprocessing them for writing:

    if (value < 0)
        ovalue = (-value << 1) | 1;
    else
        ovalue = value << 1;

and postprocessing them after reading:

    ovalue = (value >> 1);
    if (value & 1)
        ovalue = -ovalue;

Uh, except that you can't write the minimum value of long then.  Think of the byte case - you start with a range of -128 to 127 and end with a range of -127 to 127 if you kept just to byte.  If they existed I'd cast to a bignum and save that, although real bignums should be saved like counted strings.
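
The limitation is easy to reproduce. The sketch below, for a current D compiler, implements the sign-in-the-low-bit mapping from the snippets above and shows long.min failing to round-trip. It also includes the zigzag mapping, which is a different technique and not the one described in this post, as one way to cover the whole range. All four function names are hypothetical.

```d
import std.stdio;

/// Mapping from the post: magnitude shifted left, sign in the low bit.
ulong packSigned(long value)
{
    if (value < 0)
        return (cast(ulong)(-value) << 1) | 1;
    return cast(ulong) value << 1;
}

long unpackSigned(ulong packed)
{
    long magnitude = cast(long)(packed >> 1);
    return (packed & 1) ? -magnitude : magnitude;
}

/// Zigzag mapping (a swapped-in alternative): 0, -1, 1, -2, ... become 0, 1, 2, 3, ...
ulong zigzagEncode(long value)
{
    return (cast(ulong) value << 1) ^ cast(ulong)(value >> 63);
}

long zigzagDecode(ulong packed)
{
    return cast(long)(packed >> 1) ^ -cast(long)(packed & 1);
}

void main()
{
    // Ordinary values survive both mappings.
    assert(unpackSigned(packSigned(-5)) == -5);
    assert(zigzagDecode(zigzagEncode(-5)) == -5);

    // -long.min wraps back to long.min, so the magnitude is lost.
    writeln("sign bit mapping: ", unpackSigned(packSigned(long.min)));
    writeln("zigzag mapping:   ", zigzagDecode(zigzagEncode(long.min)));
}
```

Either result can then be written with the variable-length unsigned encoding, since small magnitudes of either sign map to small unsigned values.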

For my code I won't be able to use the class if it doesn't handle endian properly - I'm just too ethically opposed to blindly writing values.  It's just one step down from writing structs, IMO.  Standard read/write could use little endian, with bread/bwrite for big endian.
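
For illustration, this is what the two byte orders look like for one 32-bit value. The sketch uses std.bitmanip from present-day Phobos, which is an assumption about the reader's toolchain and not something the stream module under discussion provides; bread and bwrite themselves are not implemented here.

```d
import std.bitmanip : bigEndianToNative, littleEndianToNative,
                      nativeToBigEndian, nativeToLittleEndian;
import std.stdio;

void main()
{
    uint value = 0x11223344;

    // Fixed byte order regardless of the machine the program runs on.
    ubyte[4] little = nativeToLittleEndian(value);   // least significant byte first
    ubyte[4] big = nativeToBigEndian(value);         // most significant byte first

    writefln("little endian: %(%02x %)", little[]);
    writefln("big endian:    %(%02x %)", big[]);

    // Reading back with the matching conversion restores the value.
    assert(littleEndianToNative!uint(little) == value);
    assert(bigEndianToNative!uint(big) == value);
}
```

A file written this way reads back the same on any host, which is the property that writing values in the machine's native order cannot guarantee.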

[snip]
>Enumerations changed names again:
>
>    enum SeekPos
>    {
>        Set,
>        Current,
>        End
>    }

Why not Cur?  "Set" is already nonsensical; Start or Beginning would be more appropriate, so we may as well use the convenient nonsense we're used to.

Hm.  I don't like writing the name of the enumeration when there's only one type that can fit in the argument.  How about we have this:

    file.seek (x, .Current);
    file.seek (x, .Set);
    file.seek (x, .End);

Minimise namespace pollution and too much writingitis at the same time.  Of course, it means that you can't find the enumeration value until after the function has been decided upon, but it shouldn't be ambiguous; it's clearly an enumeration of some sort.

[snip]

May 10, 2002

Re: new streams

Posted by Pavel Minayev
in reply to Burton Radons


"Burton Radons" <loth@users.sourceforge.net> wrote in message news:sjunduovt39c2tcntmkv6rp23cn8thmk9g@4ax.com...

> I think we should get the scanf and fmt format codes aligned.  My method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and "%+S" for wchar*.  Different semantics for what looks like the same thing is bad city.

Agreed, but I think we should first get Walter to agree to this, so it becomes "official". Once it is, I will be happy to standardize the streams accordingly.

> Since this format is our own (that is to say, there's no standard for counted strings -- some are 32-bit, some are 16-bit, some are 8-bit, with varying rules on NUL termination and alignment), we may as well use dynamic-sized integers for this.  For each byte we take the first seven bits and read another byte if the eighth bit is set, like:
...
> When writing a uint you'll usually save two or three bytes, which really adds up when writing meshes; you also have your future covered, and it's endian neutral.

But at a cost of speed... and I wonder if it is really needed? Is file size so important?

> For my code I won't be able to use the class if it doesn't handle endian properly - I'm just too ethically opposed to blindly writing values.  It's just one step down from writing structs, IMO.  Standard read/write could use little endian, with bread/bwrite for big endian.

I would prefer read() and write() to operate in "current endianness"
(because often you just don't care - all you want is for your
program to be able to read data it previously wrote on that
computer, savegames etc.). If you really care about endianness, you'll
have to use functions like bread() and lread().

> Why not Cur?  "Set" is already nonsensical; Start or Beginning would be more appropriate, so we may as well use the convenient nonsense we're used to.

Because "Set" is a word, and so is "Current", but not "Cur". But if you really think that "Start" looks better, I'll probably change it...

May 10, 2002

Re: new streams

Posted by Andrew Feldstein
in reply to Russ Lewis


I agree that Russ's way is better, but it is still not ideal.  The user should be able to set some sort of library flag to determine how to handle end of lines *correctly* given the needs of the program.  This flag could control writing as well as reading, for example how \n is handled.  Under *nix, for instance, it is incorrect to treat CR as part of a newline, and on the Mac, I believe, the same goes for LF.  Of course, any implementation should default to the text model used by the underlying operating system and should handle the oddball cases cleanly.  And reading and writing don't *have* to behave the same....

Pavel, how would your new function read, say, a file containing nothing but three <CR>'s followed by two <LF>'s?  Under various text models this could be interpreted as any of 1, 2, 3, 4, or 5 blank lines.
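
To make the question concrete, here is a small sketch for a current D compiler that counts line terminators in that five-character input under five text models. The function countNewlines and the terminator lists are assumptions chosen for illustration; the sketch says nothing about what the new stream module actually does.

```d
import std.stdio;

/// Counts terminators in text. At each position the candidates are tried in
/// the order given, so "\r\n" must come before "\r" to be seen as one unit.
size_t countNewlines(string text, string[] terminators)
{
    size_t count = 0;
    size_t i = 0;
    while (i < text.length)
    {
        bool matched = false;
        foreach (t; terminators)
        {
            if (t.length > 0 && text.length - i >= t.length
                && text[i .. i + t.length] == t)
            {
                count++;
                i += t.length;
                matched = true;
                break;
            }
        }
        if (!matched)
            i++;
    }
    return count;
}

void main()
{
    string text = "\r\r\r\n\n";   // three CRs followed by two LFs

    writeln("CR LF only:          ", countNewlines(text, ["\r\n"]));
    writeln("LF only:             ", countNewlines(text, ["\n"]));
    writeln("CR only:             ", countNewlines(text, ["\r"]));
    writeln("CR, CR LF, or LF:    ", countNewlines(text, ["\r\n", "\r", "\n"]));
    writeln("every CR and LF:     ", countNewlines(text, ["\r", "\n"]));
}
```

The five models find 1, 2, 3, 4, and 5 terminators respectively, so the same bytes really can mean anything from one to five line ends depending on the flag.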

In article <3CDBFA09.5F8DD76D@deming-os.org>, Russ Lewis says...
>
>Pavel Minayev wrote:
>
>> "Russ Lewis" <spamhole-2001-07-16@deming-os.org> wrote in message news:3CDBF812.D68ED4F6@deming-os.org...
>>
>> > IMHO, you should immediately interpret CR as a newline, but put a marker on the stream such that if another character is read and that character is a LF, then it will be consumed LATER.  DON'T lookahead for it :(
>>
>> I do a lookahead, but I have ungetc() implemented and working...
>
>Ungetc doesn't help the problem I was talking about.  If you do lookahead but there is not a character available, then your library will block until one more character is available to read (or you detect EOF)...which could be a LONG time from now.
>
>--
>The Villagers are Online! villagersonline.com
>
>.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
>.[ (a version.of(English).(precise.more)) is(possible) ]
>?[ you want.to(help(develop(it))) ]
>
>

May 10, 2002

Re: new streams

Posted by Burton Radons
in reply to Pavel Minayev


On Fri, 10 May 2002 22:20:55 +0400, "Pavel Minayev" <evilone@omen.ru> wrote:

>"Burton Radons" <loth@users.sourceforge.net> wrote in message news:sjunduovt39c2tcntmkv6rp23cn8thmk9g@4ax.com...
>
>> I think we should get the scanf and fmt format codes aligned.  My method is "%s" for char[], "%S" for wchar[], "%+s" for char*, and "%+S" for wchar*.  Different semantics for what looks like the same thing is bad city.
>
>Agreed, but I think we should first get Walter to agree to this, so it becomes "official". Once it is, I will be happy to standardize the streams accordingly.

Sure, but I'm leaving the option open to kick his ass if he decides to go with "%format-a-string;".  ;-)

>> Since this format is our own (that is to say, there's no standard for counted strings -- some are 32-bit, some are 16-bit, some are 8-bit, with varying rules on NUL termination and alignment), we may as well use dynamic-sized integers for this.  For each byte we take the first seven bits and read another byte if the eighth bit is set, like:
>...
>> When writing a uint you'll usually save two or three bytes, which really adds up when writing meshes; you also have your future covered, and it's endian neutral.
>
>But at a cost of speed... and I wonder if it is really needed? Is file size so important?

It should be a little faster on a competent compiler.  We have to buffer the data anyway; flushing the buffer takes a long time; loops can be unrolled; dynamic-sized integers lower the incidence of flushing; dynamic-sized integers are faster.  But this is splitting hairs in any case.  Endian independence and a much smaller normal case are far more important.

>> For my code I won't be able to use the class if it doesn't handle endian properly - I'm just too ethically opposed to blindly writing values.  It's just one step down from writing structs, IMO.  Standard read/write could use little endian, with bread/bwrite for big endian.
>
>I would prefer read() and write() to operate in "current endianness"
>(because often you just don't care - all you want is for your
>program to be able to read data it previously wrote on that
>computer, savegames etc.). If you really care about endianness, you'll
>have to use functions like bread() and lread().

Uh, if you don't care, then it can default to little endian.  :-)

>> Why not Cur?  "Set" is already nonsensical; Start or Beginning would be more appropriate, so we may as well use the convenient nonsense we're used to.
>
>Because "Set" is a word, and so is "Current", but not "Cur". But if you really think that "Start" looks better, I'll probably change it...

It's a word, but so is "Catholicity", and it's as appropriate as "Set".  My dictionary gives 125 meanings for set.  The only thing that could be related is in the context of "setting sun", which is quite the opposite.

Besides which, cur is a word.  Uh, perhaps not in your part of the world.  It means a worthless dog, or contemptible scoundrel.
