July 19, 2005
> I've now got it installed and running.  Just a few issues drop it short
> of perfect.  For example, it's a step behind std.stream in that it only detects EOF after trying to read past the end.

Hmm, I simply used feof() to determine the end of the file. If it does not support detecting EOF before actually hitting the end, it would have to be replaced by another function, which in turn would lead to replacing all the file engine functions (fopen, fread, fwrite, etc.). I am not sure whether that improvement is worth it. As stated in the File documentation, I intentionally used the C streaming API because it is portable (look at the amount of code required to get stat() working on three platforms to get an idea of this advantage), and it already includes buffering and all that jazz, which is needed for higher-level devices like DataStream and TextStream (which often read only a few bytes at a time). The alternative is to use the open, read, write family of functions and implement buffering ourselves. That would be another 1000 lines of code, minimum.
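Just to show what I mean: even without leaving the C API, one could peek a single character with fgetc() and push it back with ungetc(). A rough sketch of the idea, not the code that is actually in the File class:

    import std.c.stdio;    // fgetc(), ungetc(); core.stdc.stdio in later D

    // Peek one character: fgetc() returns EOF once the end has been reached,
    // and ungetc() is guaranteed to be able to push one character back.
    bool atEnd(FILE* fp)
    {
        int c = fgetc(fp);
        if (c == EOF)
            return true;       // real end of file (or a read error)
        ungetc(c, fp);         // make the character available to the next read
        return false;
    }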

> My thought is that an I/O library should be able to detect EOF when it gets there, and that one should also be able to rely on exceptions to catch a premature end of file.

Well, the IODevices are not meant to be used directly. I know, in my helper program imupdate and others I use them directly, but that is only because there is no TextStream yet. Look at how DataStream handles premature EOF. This is the interface the user will see: simple shifting of values into / out of the stream, and if something fails, an exception is thrown. TextStream should be implemented in a similar manner. IODevice/File should only be used directly if the user really wants to; normally he will create a stream for every kind of IODevice (perhaps there will be a socket implementation in the future, or some kind of memory-based file).
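For example, the intended usage looks roughly like this (only a sketch of the pattern; how the File and the DataStream are constructed here is my shorthand, not the documented API):

    File device = new File("settings.dat");     // assumed constructor
    DataStream input = new DataStream(device);  // assumed constructor

    int count;
    char[] title;
    try
    {
        input >> count >> title;    // shift values out of the stream
    }
    catch (Exception error)
    {
        // premature EOF or corrupt data lands here
    }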

> But I believe a TextStream is a feasible addition to this library, with the help of a few more IODevice members to help with UTF detection and the like.

UTF detection is also something that should be taken care of in a higher-level protocol, i.e. TextStream. IODevices encapsulate reading/writing raw data to/from an abstract kind of device, like a file, a socket, or memory. Interpretation of this data is done by DataStream (look into its implementation: it takes care of the number of bytes for every data type, the storage format used for arrays and the like, and the byte order of the machine) or TextStream.

By the way, I am currently writing Unicode character properties and message formatting. The character properties are already finished. I am not sure if they are needed for the TextStream implementation (isSpace() perhaps?). Anyways, I uploaded the current Indigo 0.94 to my homepage, so you can look at the changes.

Thanks
uwe
July 19, 2005
Uwe Salomon wrote:
>> I've now got it installed and running.  Just a few issues drop it short of perfect.  For example, it's a step behind std.stream in that it only detects EOF after trying to read past the end.
> 
> Hmm, I simply used feof() to determine the end of the file. If it does not support detecting EOF before actually hitting the end, it would have to be replaced by another function, which in turn would lead to replacing all the file engine functions (fopen, fread, fwrite, etc.). I am not sure whether that improvement is worth it.

Rewriting one function doesn't have to mean changing which API the whole library wraps.  Especially when this single rewrite doesn't rely on access to another API at all.

The solution for files is very simple: just test pos() == size().

For sequential streams such as stdin, process I/O streams and (?) sockets, it isn't quite so simple, but it isn't complicated either.  To test for EOF, read in the next byte.  If it returns EOF, return true. Otherwise, add it to the unread buffer or something.  (Is the unread buffer supposed to be internal?  Or are the wrappers in DataStream and/or TextStream waiting to be written?)
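Something along these lines, perhaps (only a sketch; the shape of IODevice here is my guess, not the actual class):

    abstract class IODevice
    {
        private int peeked = -1;        // one byte of "unread" lookahead

        abstract bool isSeekable();
        abstract long pos();
        abstract long size();           // only meaningful for seekable devices
        abstract int readByte();        // returns -1 at the end of the stream

        bool atEnd()
        {
            if (isSeekable())
                return pos() == size();    // files: the trivial case

            if (peeked >= 0)
                return false;              // a byte is already buffered

            int c = readByte();            // sequential: try to read one ahead
            if (c < 0)
                return true;

            peeked = c;                    // keep it for the next real read
            return false;
        }
    }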

<snip>
>> My thought is that an I/O library should be able to detect EOF when it gets there, and that one should also be able to rely on exceptions to catch a premature end of file.
> 
> Well, the IODevices are not meant to be used directly.

But testing for EOF should certainly be part of the API that one is meant to use.  I refer you back to the concept of expected versus unexpected EOF.

> I know, in my helper program imupdate and others I use them directly, but that is only because there is no TextStream yet. Look at how DataStream handles premature EOF. This is the interface the user will see: simple shifting of values into / out of the stream, and if something fails, an exception is thrown.

It didn't throw an exception when I tried it.  At least as far as an infinite loop of DataStream >> char is indeed infinite.  Has this changed?

<snip>
> UTF detection is also something that should be taken care of in a higher-level protocol, i.e. TextStream. IODevices encapsulate reading/writing raw data to/from an abstract kind of device, like a file, a socket, or memory. Interpretation of this data is done by DataStream (look into its implementation: it takes care of the number of bytes for every data type, the storage format used for arrays and the like, and the byte order of the machine) or TextStream.

The way encoding detection would work depends on the kind of IODevice. So you're saying TextStream should enumerate the possibilities and take appropriate action?

(For that matter, what is OS API support like for detecting what encoding the console is using, for stdin/out/err stuff?)

> By the way, I am currently writing Unicode character properties and message formatting. The character properties are already finished. I am not sure if they are needed for the TextStream implementation (isSpace() perhaps?). Anyways, I uploaded the current Indigo 0.94 to my homepage, so you can look at the changes.

I'll check it out.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB@ P+ L E@ W++@ N+++ o K- w++@ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on the 'group where everyone may benefit.
July 19, 2005
>> I know, in my helper program imupdate and others I use them directly, but that is only because there is no TextStream yet. Look at how DataStream handles premature EOF. This is the interface the user will see: simple shifting of values into / out of the stream, and if something fails, an exception is thrown.
>
> It didn't throw an exception when I tried it.  At least as far as an infinite loop of DataStream >> char is indeed infinite.  Has this changed?

Huh? I could not believe that, but I tried it myself and it is indeed infinite. %)  It is funny that the message catalogues work perfectly, despite their use of DataStream. Always interesting how bugs can hide. Well, this is not intended behaviour, and I'll fix it as fast as possible.

>> UTF detection is also something that should be taken care of in a higher-level protocol, i.e. TextStream. IODevices encapsulate reading/writing raw data to/from an abstract kind of device, like a file, a socket, or memory. Interpretation of this data is done by DataStream (look into its implementation: it takes care of the number of bytes for every data type, the storage format used for arrays and the like, and the byte order of the machine) or TextStream.
>
> The way encoding detection would work depends on the kind of IODevice. So you're saying TextStream should enumerate the possibilities and take appropriate action?
>
> (For that matter, what is OS API support like for detecting what encoding the console is using, for stdin/out/err stuff?)

Hm, perhaps you're right. I thought the TextStream could read the first bytes and determine whether they are a byte order mark. But that only makes sense if it is plugged directly into a File. So the encoding information should be moved into the IODevice, yes. I'll take a look at that.

Ciao
uwe
July 19, 2005
> It didn't throw an exception when I tried it.  At least as far as an infinite loop of DataStream >> char is indeed infinite.  Has this changed?

This has been fixed, and I uploaded the changed version. Interestingly, it only happened for 1-byte data types: char, byte, ubyte.

Thanks
uwe
July 20, 2005
>>> UTF detection is also something that should be taken care of in a higher-level protocol, i.e. TextStream. IODevices encapsulate reading/writing raw data to/from an abstract kind of device, like a file, a socket, or memory. Interpretation of this data is done by DataStream (look into its implementation: it takes care of the number of bytes for every data type, the storage format used for arrays and the like, and the byte order of the machine) or TextStream.
>>
>> The way encoding detection would work depends on the kind of IODevice. So you're saying TextStream should enumerate the possibilities and take appropriate action?
>>
>> (For that matter, what is OS API support like for detecting what encoding the console is using, for stdin/out/err stuff?)
>
> Hm, perhaps you're right. I thought the TextStream could read the first bytes and determine whether they are a byte order mark. But that only makes sense if it is plugged directly into a File. So the encoding information should be moved into the IODevice, yes. I'll take a look at that.

Well, in Qt the TextStream assumes that the device is in the local 8-bit encoding, but autodetects UTF encodings if the first thing it reads is a BOM. There are functions to turn autodetection off, and there are functions to set the encoding. Problems are:

(1) As you pointed out, this is not always the best solution. But it does its job, as terminals and stuff are local 8-bit encoded, and sockets should be set by the user anyways.

(2) We have no codecs currently. Just some (pretty fast) functions in indigo.i18n.conversion for the different UTF flavours. I wanted to add a toAscii function there. But codecs for more local 8-bit encodings? Some Chinese multibyte encodings??? This is a little bit too ambitious, I think. Anyways, the question remains whether we provide the "Codec" base class, and some derived classes for the UTFs, ASCII and ISO-8859-1, to be prepared for later expansion.
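To make that concrete, I picture the base class roughly like this (only a sketch; the names and signatures are assumptions, nothing is decided):

    abstract class Codec
    {
        abstract char[] name();                     // e.g. "UTF-8", "ISO-8859-1"
        abstract dchar[] toUnicode(void[] data);    // decode raw bytes to text
        abstract void[] fromUnicode(dchar[] text);  // encode text back to bytes
    }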

By the way, should I talk with Brad about a forum at dsource? If you really plan to contribute to Indigo, that would perhaps be a little more convenient.

Ciao
uwe
July 20, 2005
Uwe Salomon wrote:
<snip>
> Well, in Qt the TextStream assumes that the device is in the local 8-bit encoding, but autodetects UTF encodings if the first thing it reads is a BOM.

I see.  But what about heuristic detection?

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/7102

though it's practically useless on certain kinds of sequential streams.  Maybe throw in the heuristic fallback only if the stream is seekable?

> There are functions to turn autodetection off, and there are functions to set the encoding. Problems are:
> 
> (1) As you pointed out, this is not always the best solution. But it does its job, as terminals and stuff are local 8-bit encoded, and sockets should be set by the user anyways.

But how can we determine the local 8-bit encoding, in the cases where it isn't a constant of the platform?

> (2) We have no codecs currently. Just some (pretty fast) functions in indigo.i18n.conversion for the different UTF flavours. I wanted to add a toAscii function there. But codecs for more local 8-bit encodings? Some Chinese multibyte encodings??? This is a little bit too ambitious, I think. Anyways, the question remains whether we provide the "Codec" base class, and some derived classes for the UTFs, ASCII and ISO-8859-1, to be prepared for later expansion.

I think that would be a good idea.

> By the way, should I talk with Brad about a forum at dsource? If you really plan to contribute to Indigo, that would perhaps be a little more convenient.

The only trouble is that for some strange reason the dsource svn server seems to be incompatible with my Internet connection.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB@ P+ L E@ W++@ N+++ o K- w++@ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on the 'group where everyone may benefit.
July 20, 2005
>> Well, in Qt the TextStream assumes that the device is in the local 8-bit encoding, but autodetects UTF encodings if the first thing it reads is a BOM.
>
> I see.  But what about heuristic detection?

Nice thing. We'll add that later if we feel like it. Maybe as a member function of the codecs, for example like this:

int Codec.suitability(void[] someData);

The higher the return value, the more appropriate the codec is for the contents of someData. Then we register all existing codecs in a list and test them. But only if the user requests that (IODevice.determineEncoding or similar).
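Roughly like this (a sketch only; the class shape and the free determineEncoding() function are assumptions for illustration):

    abstract class Codec
    {
        abstract int suitability(void[] someData);
        // ... plus name() and the conversion functions
    }

    Codec[] registeredCodecs;      // filled once with all known codecs

    Codec determineEncoding(void[] someData)
    {
        Codec best;                // stays null if nothing scores above zero
        int bestScore = 0;
        foreach (codec; registeredCodecs)
        {
            int score = codec.suitability(someData);
            if (score > bestScore)     // keep the highest-scoring codec
            {
                best = codec;
                bestScore = score;
            }
        }
        return best;
    }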

And you are right, I think the best solution is to code this into the IODevice, i.e. the different devices have functions for reporting their encoding, and TextStream only calls them after creation. The IODevice base class implements common functionality like this heuristic or BOM detection.

> But how can we determine the local 8-bit encoding, in the cases where it isn't a constant of the platform?

Huh, I'm not even sure how to detect the local 8-bit encoding on the different platforms. On Linux it's a suffix of the LANG environment variable, and Windows will have some function, I guess. Mac OS too. And if there's nothing, we fall back to ASCII :)
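Something like this for the Linux side, I guess. Only a sketch of the LANG parsing described above; nl_langinfo(CODESET) would be the more thorough way on POSIX, and Windows has GetACP():

    import std.c.stdlib;    // getenv()
    import std.c.string;    // strchr(), strlen()

    char[] localEncodingName()
    {
        // LANG looks like "de_DE.ISO-8859-1" or "en_US.UTF-8@euro";
        // the encoding name is the part after the dot.
        char* lang = getenv("LANG");
        if (lang is null)
            return "ASCII".dup;                    // nothing set: fall back

        char* dot = strchr(lang, '.');
        if (dot is null)
            return "ASCII".dup;                    // no encoding suffix

        char* enc = dot + 1;
        char* at  = strchr(enc, '@');              // strip a modifier like "@euro"
        size_t len = (at is null) ? strlen(enc) : cast(size_t)(at - enc);
        return enc[0 .. len].dup;                  // copy into a D array
    }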

>> (2) We have no codecs currently. Just some (pretty fast) functions in indigo.i18n.conversion for the different UTF flavours. I wanted to add a toAscii function there. But codecs for more local 8-bit encodings? Some Chinese multibyte encodings??? This is a little bit too ambitious, I think. Anyways, the question remains whether we provide the "Codec" base class, and some derived classes for the UTFs, ASCII and ISO-8859-1, to be prepared for later expansion.
>
> I think that would be a good idea.

All of them should be easy to write, because the functions already exist or are trivial (ASCII and ISO-8859-1). Well, the class should be called "TextCodec", and it should work roughly like http://doc.trolltech.com/4.0/qtextcodec.html
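For instance, the ASCII conversion really is only a few lines. A sketch as free functions; how they would plug into the TextCodec class is left open, and the replacement characters are my choice:

    dchar[] asciiToUnicode(ubyte[] data)
    {
        auto result = new dchar[data.length];
        foreach (i, b; data)
            result[i] = (b < 0x80) ? cast(dchar) b : cast(dchar) 0xFFFD;  // U+FFFD for non-ASCII bytes
        return result;
    }

    ubyte[] unicodeToAscii(dchar[] text)
    {
        auto result = new ubyte[text.length];
        foreach (i, c; text)
            result[i] = (c < 0x80) ? cast(ubyte) c : cast(ubyte) '?';     // '?' for unmappable characters
        return result;
    }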

>> By the way, should I talk with Brad about a forum at dsource? If you really plan to contribute to Indigo, that would perhaps be a little more convenient.
>
> The only trouble is that for some strange reason the dsource svn server seems to be incompatible with my Internet connection.

I thought of it mostly because of the chattering here in the announce NG...

Ciao
uwe
July 22, 2005
Uwe Salomon wrote:
>>> Well, in Qt the TextStream assumes that the device is in the local  8-bit encoding, but autodetects UTF encodings if the first thing it  reads is a BOM.
>>
>> I see.  But what about heuristic detection?
> 
> Nice thing. We'll add that later if we feel like it. Maybe as a member  function of the codecs, for example like this:
> 
> int Codec.suitability(void[] someData);

Good idea.

> The higher the return value, the more appropriate the codec is for the contents of someData. Then we register all existing codecs in a list and test them. But only if the user requests that (IODevice.determineEncoding or similar).
> 
> And you are right, I think the best solution is to code this into the IODevice, i.e. the different devices have functions for reporting their encoding, and TextStream only calls them after creation. The IODevice base class implements common functionality like this heuristic or BOM detection.

Not sure.  Now that I come to think about it, the only thing that's
really dependent on the kind of IODevice is whether it makes sense to
apply heuristics.  And if we're going to rely on the class user to request heuristic detection anyway, I guess we can go with the first quoted paragraph and don't really need to add this stuff to IODevice.

>> But how can we determine the local 8-bit encoding, in the cases where it  isn't a constant of the platform?
> 
> Huh, I'm not even sure how to detect the local 8-bit encoding on the different platforms. On Linux it's a suffix of the LANG environment variable, and Windows will have some function, I guess. Mac OS too. And if there's nothing, we fall back to ASCII :)
<snip>

Here's how I'm thinking of implementing it now.

The only thing that needs to be added to IODevice is a read-only codec
property.  This would return null by default; the devices for
stdin/out/err would override it with the platform-dependent logic to
detect the local codec.  TextStream would retrieve this from the
IODevice on construction, or on setting the device if no codec is
already set.

When the time comes to read some data from the TextStream, if the codec
is null then it'll do the BOM detection, and if no BOM is present, fall
back to UTF-8.  We could also have a detectBOM method, which will look
for a BOM at the current point and set the codec as appropriate if one
is present, otherwise leave the codec unchanged.  This'll enable stuff like

    stream.codec = ISO_8859_1;  /* or whatever naming convention we
                                   decide on/copy */
    stream.detectBOM();

meaning "if there's a BOM, honour it, otherwise treat it as ISO-8859-1".  Except that QTextStream has an autoDetectUnicode property - where would this fit into the equation?

I'll try and get somewhere with coding up TextStream over the weekend.
Probably supporting only UTF-8 at first, and then improving it to
support codecs once these are implemented.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB@ P+ L E@ W++@ N+++ o K- w++@ O? M V? PS- PE- Y?
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on
the 'group where everyone may benefit.
July 22, 2005
We have a forum now :)  thanks to Brad. Just go to

http://www.dsource.org/forums/viewforum.php?f=67

There are the posts we've made so far, and my answer to your post.

Ciao
uwe