Thread overview
BOMs and std.stream
Nov 21, 2004
J C Calvarese
Nov 21, 2004
Ben Hinkle
Nov 21, 2004
Kris
Nov 22, 2004
Stewart Gordon
November 21, 2004
Currently std.stream doesn't recognize BOMs (byte order marks), and while it might not be a big
thing, there are times when it matters.
I saved an XML file from .NET with UTF-8 encoding, so it wrote a BOM at the start. When I tried
to read it using Miguel Ferreira Simões' XML library, it complained about the
file not being well-formed. Further testing showed that removing the
BOM solved the problem. So it's a problem, at the moment.
I think std.stream should change somehow, but I just don't know how.

-----------------------
Carlos Santander Bernal


November 21, 2004
Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs (byte order marks), and while it might not be a big thing, there are times when it matters.
> I saved an XML file from .NET with UTF-8 encoding, so it wrote a BOM at the start. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing showed that removing the BOM solved the problem. So it's a problem, at the moment.
> I think std.stream should change somehow, but I just don't know how.

I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).

We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.

That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.
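
To make the enum variant concrete, here is a minimal sketch of what such a check could look like. It works on a plain byte slice rather than on a std.stream object, and the names BOM, BOMInfo and detectBOM are made up for illustration; none of them exist in std.stream. It reports which BOM was found and how many bytes to skip, so "move the current location past the BOM" is left to the caller.

```d
import std.string : representation;

/// Hypothetical enum: the encodings a leading BOM can indicate.
enum BOM { none, utf8, utf16le, utf16be, utf32le, utf32be }

/// Hypothetical result type: which BOM was found and how long it is.
struct BOMInfo
{
    BOM kind;       // BOM.none when no signature is present
    size_t length;  // bytes occupied by the signature (0 when none)
}

/// Inspects the start of `data` and reports the BOM, if any.
/// The input is not modified; the caller skips `length` bytes itself.
BOMInfo detectBOM(const(ubyte)[] data)
{
    static immutable ubyte[] sigUtf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] sigUtf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] sigUtf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] sigUtf16le = [0xFF, 0xFE];
    static immutable ubyte[] sigUtf16be = [0xFE, 0xFF];

    // True when `data` begins with the byte sequence `sig`.
    bool has(const(ubyte)[] sig)
    {
        return data.length >= sig.length && data[0 .. sig.length] == sig;
    }

    // UTF-32 LE is tested before UTF-16 LE because FF FE 00 00 also starts
    // with FF FE. (UTF-16 LE text that begins with U+0000 after its BOM
    // would be misread here; this sketch accepts that ambiguity.)
    if (has(sigUtf32le)) return BOMInfo(BOM.utf32le, 4);
    if (has(sigUtf32be)) return BOMInfo(BOM.utf32be, 4);
    if (has(sigUtf8))    return BOMInfo(BOM.utf8, 3);
    if (has(sigUtf16le)) return BOMInfo(BOM.utf16le, 2);
    if (has(sigUtf16be)) return BOMInfo(BOM.utf16be, 2);
    return BOMInfo(BOM.none, 0);
}

unittest
{
    // A UTF-8 XML file with a BOM in front of the declaration.
    auto bytes = "\uFEFF<?xml version=\"1.0\"?>".representation;
    auto info = detectBOM(bytes);
    assert(info.kind == BOM.utf8 && info.length == 3);

    // Skipping `length` bytes leaves the text an XML parser expects.
    assert(cast(const(char)[]) bytes[info.length .. $] == "<?xml version=\"1.0\"?>");
}
```

Returning the length alongside the enum keeps the function free of side effects; a stream class could wrap it and do the seek itself.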

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/


November 21, 2004
In article <cnp7hv$2tls$1@digitaldaemon.com>, J C Calvarese says...
>
>I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).
>
>We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.
>
>That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.

I like the enum idea. It would be nice if the stream remembered the BOM in the UTF-16 case so that the code that reads strings can swap byte orders if needed. Otherwise the user is hosed if the stream is in the wrong byte-ordering. I sense another std.stream project in the next few days...
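
A rough sketch of the byte-swapping part, assuming the data is already in memory as raw bytes. ByteOrder and decodeUtf16 are hypothetical names, not std.stream members; a real stream would store the detected order as a field instead of working it out on every call.

```d
import std.utf : toUTF8;

/// Hypothetical enum: the byte order remembered from the BOM.
enum ByteOrder { littleEndian, bigEndian }

/// Decodes UTF-16 bytes into a UTF-8 string.
/// A leading BOM selects the byte order and is consumed; when there is
/// no BOM, `fallback` is assumed. A trailing odd byte is ignored.
string decodeUtf16(const(ubyte)[] data,
                   ByteOrder fallback = ByteOrder.littleEndian)
{
    auto order = fallback;
    if (data.length >= 2)
    {
        if (data[0] == 0xFF && data[1] == 0xFE)
        {
            order = ByteOrder.littleEndian;
            data = data[2 .. $];
        }
        else if (data[0] == 0xFE && data[1] == 0xFF)
        {
            order = ByteOrder.bigEndian;
            data = data[2 .. $];
        }
    }

    // Build each code unit from its two bytes explicitly, so the result
    // does not depend on the byte order of the machine running the code.
    auto units = new wchar[data.length / 2];
    foreach (i, ref u; units)
    {
        immutable lo = data[2 * i];
        immutable hi = data[2 * i + 1];
        u = cast(wchar)(order == ByteOrder.littleEndian
                        ? (hi << 8) | lo
                        : (lo << 8) | hi);
    }

    // toUTF8 throws if the code units are not valid UTF-16.
    return toUTF8(units);
}

unittest
{
    // "Hi" in both byte orders, each preceded by its BOM.
    static immutable ubyte[] le = [0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00];
    static immutable ubyte[] be = [0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69];
    assert(decodeUtf16(le) == "Hi");
    assert(decodeUtf16(be) == "Hi");
}
```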

-Ben


November 21, 2004
The ICU project provides this kind of thing (from the documentation):

        static final char[] detectSignature (void[] input)

                Detects Unicode signature byte sequences at the start
                of the byte stream and returns the charset name of the
                indicated Unicode charset. A null is returned where no
                Unicode signature is recognized.

                A caller can create a UConverter using the charset name.
                The first code unit (wchar) from the start of the stream
                will be U+FEFF (the Unicode BOM/signature character)
                and can usually be ignored.

You might take a look at the breadth of that project; you'll find it covers pretty much anything you'll need for regular Unicode processing, and then some ...

http://www.dsource.org/forums/viewtopic.php?t=420


November 22, 2004
Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs (byte order marks), and while it might not be a big thing, there are times when it matters.
> I saved an XML file from .NET with UTF-8 encoding, so it wrote a BOM at the start. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing showed that removing the BOM solved the problem. So it's a problem, at the moment.

The problem is that std.stream seems to be designed to work with binary files, with a few text capabilities thrown in but not to this level.

> I think std.stream should change somehow, but I just don't know how.

My thought is to develop a new set of classes for working with text files.  I posted something on this a while back:

http://www.digitalmars.com/drn-bin?wwwnews?digitalmars.D/6089

Stewart.