November 21, 2004 BOMs and std.stream

Currently std.stream doesn't recognize BOMs, and while it might not be a big thing, there're times where it could be important.

I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing made me discover that removing the BOM solved the problem. So it's a problem, ATM.

I think std.stream should change somehow, but I just don't know how.

-----------------------
Carlos Santander Bernal
November 21, 2004 Re: BOMs and std.stream

Posted in reply to Carlos Santander B.

Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs, and while it might not be a big thing, there're times where it could be important.
> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing made me discover that removing the BOM solved the problem. So it's a problem, ATM.
> I think std.stream should change somehow, but I just don't know how.
>
> -----------------------
> Carlos Santander Bernal

I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).

We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.

That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.

--
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
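
The enum-returning form of the getBOM idea could look roughly like the D sketch below. It is only an illustration under some assumptions: it works on a byte slice instead of a std.stream object, and the names BOM, hasPrefix, and this getBOM signature are hypothetical, not part of std.stream or of any concrete proposal in this thread. The byte signatures themselves are the standard Unicode ones.

```d
/// Hypothetical result type for the "return an enum" variant.
enum BOM { none, utf8, utf16le, utf16be, utf32le, utf32be }

// Standard Unicode byte order mark signatures.
immutable ubyte[] sigUtf8    = [0xEF, 0xBB, 0xBF];
immutable ubyte[] sigUtf16le = [0xFF, 0xFE];
immutable ubyte[] sigUtf16be = [0xFE, 0xFF];
immutable ubyte[] sigUtf32le = [0xFF, 0xFE, 0x00, 0x00];
immutable ubyte[] sigUtf32be = [0x00, 0x00, 0xFE, 0xFF];

/// True if `data` begins with the bytes in `sig`.
bool hasPrefix(const(ubyte)[] data, const(ubyte)[] sig)
{
    return data.length >= sig.length && data[0 .. sig.length] == sig;
}

/// Reports which BOM starts `data` and advances the slice past it.
/// With no BOM, returns BOM.none and leaves `data` unchanged.
BOM getBOM(ref const(ubyte)[] data)
{
    BOM kind = BOM.none;
    size_t skip = 0;

    // UTF-32 LE must be tested before UTF-16 LE: FF FE 00 00 starts with FF FE.
    // (A UTF-16 LE file whose first character is U+0000 would be misread here.)
    if (hasPrefix(data, sigUtf32le))      { kind = BOM.utf32le; skip = sigUtf32le.length; }
    else if (hasPrefix(data, sigUtf32be)) { kind = BOM.utf32be; skip = sigUtf32be.length; }
    else if (hasPrefix(data, sigUtf8))    { kind = BOM.utf8;    skip = sigUtf8.length; }
    else if (hasPrefix(data, sigUtf16le)) { kind = BOM.utf16le; skip = sigUtf16le.length; }
    else if (hasPrefix(data, sigUtf16be)) { kind = BOM.utf16be; skip = sigUtf16be.length; }

    data = data[skip .. $];
    return kind;
}

void main()
{
    ubyte[] file = [0xEF, 0xBB, 0xBF, 0x3C, 0x3F];  // UTF-8 BOM followed by "<?"
    const(ubyte)[] rest = file;

    assert(getBOM(rest) == BOM.utf8);
    assert(rest.length == 2 && rest[0] == 0x3C);    // position is now past the BOM

    // A second call finds no BOM and leaves the position alone.
    assert(getBOM(rest) == BOM.none);
    assert(rest.length == 2);
}
```

An XML parser fed `rest` instead of `file` would never see the three BOM bytes, which matches the workaround described in the original post.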
November 21, 2004 Re: BOMs and std.stream

Posted in reply to J C Calvarese

In article <cnp7hv$2tls$1@digitaldaemon.com>, J C Calvarese says...
>
>Carlos Santander B. wrote:
>> Currently std.stream doesn't recognize BOMs, and while it might not be a big
>> thing, there're times where it could be important.
>> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
>> to read it using Miguel Ferreira Simões' XML library, it complained about the
>> file not being well-formed. Further testing made me discover that removing the
>> BOM solved the problem. So it's a problem, ATM.
>> I think std.stream should change somehow, but I just don't know how.
>>
>> -----------------------
>> Carlos Santander Bernal
>
>I think there is a need for something like this in std.stream. I ran into this challenge a while back, and I didn't really think of a good solution at the time. But I just came up with an idea for a fix (it's not complicated, but I think it'd work).
>
>We could add a function called something like getBOM. If a BOM is present, it will return a string with the BOM and move the current location past the BOM. If there isn't a BOM, an empty string is returned and the current location doesn't change.
>
>That's just one idea for a design. A similar idea is that an enum could be returned instead of a string.
>
>--
>Justin (a/k/a jcc7)
>http://jcc_7.tripod.com/d/

I like the enum idea. It would be nice if the stream remembered the BOM in the UTF-16 case so that the code that reads strings can swap byte orders if needed. Otherwise the user is hosed if the stream is in the wrong byte-ordering. I sense another std.stream project in the next few days...

-Ben
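
To make the byte-swapping point concrete, here is a small D sketch of a reader that remembers the byte order announced by the BOM and swaps UTF-16 code units only when that order differs from the host's. The Utf16Source struct and its toHostOrder method are hypothetical names invented for this illustration; only Endian and endian come from std.system.

```d
import std.system : Endian, endian;

/// Hypothetical reader state: the byte order announced by the stream's BOM.
struct Utf16Source
{
    Endian order;  // remembered when the BOM was read

    /// Takes code units copied verbatim from the stream and returns them in
    /// host order, swapping the two bytes of each unit only if needed.
    wchar[] toHostOrder(const(wchar)[] raw) const
    {
        auto result = raw.dup;
        if (order != endian)
        {
            foreach (ref unit; result)
                unit = cast(wchar)((unit << 8) | (unit >> 8));
        }
        return result;
    }
}

void main()
{
    // 'A' (U+0041) as it looks when its two bytes are read in the wrong order.
    wchar[] raw = [cast(wchar) 0x4100];

    // Pick whichever order the host does not use, to force a swap.
    auto foreign = (endian == Endian.littleEndian)
        ? Endian.bigEndian : Endian.littleEndian;

    assert(Utf16Source(foreign).toHostOrder(raw)[0] == 'A');

    // When the stream already matches the host, nothing is swapped.
    assert(Utf16Source(endian).toHostOrder(raw)[0] == 0x4100);
}
```

Without the remembered order, the first case would hand the caller U+4100 instead of 'A', which is the wrong-byte-order failure described above.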
November 21, 2004 Re: BOMs and std.stream

Posted in reply to J C Calvarese

The ICU project provides this kind of thing: (from the documentation)

    static final char[] detectSignature (void[] input)

    Detects Unicode signature byte sequences at the start of the byte stream and returns the charset name of the indicated Unicode charset. A null is returned where no Unicode signature is recognized.

    A caller can create a UConverter using the charset name. The first code unit (wchar) from the start of the stream will be U+FEFF (the Unicode BOM/signature character) and can usually be ignored.

You might take a look at the breadth of that project; you'll find it covers pretty much anything you'll need for regular Unicode processing, and then some ...

http://www.dsource.org/forums/viewtopic.php?t=420

"J C Calvarese" <jcc7@cox.net> wrote in message news:cnp7hv$2tls$1@digitaldaemon.com...
| Carlos Santander B. wrote:
| > Currently std.stream doesn't recognize BOMs, and while it might not be a big
| > thing, there're times where it could be important.
| > I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried
| > to read it using Miguel Ferreira Simões' XML library, it complained about the
| > file not being well-formed. Further testing made me discover that removing the
| > BOM solved the problem. So it's a problem, ATM.
| > I think std.stream should change somehow, but I just don't know how.
| >
| > -----------------------
| > Carlos Santander Bernal
|
| I think there is a need for something like this in std.stream. I ran
| into this challenge a while back, and I didn't really think of a good
| solution at the time. But I just came up with an idea for a fix (it's
| not complicated, but I think it'd work).
|
| We could add a function called something like getBOM. If a BOM is
| present, it will return a string with the BOM and move the current
| location past the BOM. If there isn't a BOM, an empty string is returned
| and the current location doesn't change.
|
| That's just one idea for a design. A similar idea is that an enum could
| be returned instead of a string.
|
| --
| Justin (a/k/a jcc7)
| http://jcc_7.tripod.com/d/
November 22, 2004 Re: BOMs and std.stream

Posted in reply to Carlos Santander B.

Carlos Santander B. wrote:
> Currently std.stream doesn't recognize BOMs, and while it might not be a big thing, there're times where it could be important.
> I saved an XML file with .NET with encoding UTF-8, so it set a BOM. When I tried to read it using Miguel Ferreira Simões' XML library, it complained about the file not being well-formed. Further testing made me discover that removing the BOM solved the problem. So it's a problem, ATM.

The problem is that std.stream seems to be designed to work with binary files, with a few text capabilities thrown in but not to this level.

> I think std.stream should change somehow, but I just don't know how.

My thought is to develop a new set of classes for working with text files. I posted something on this a while back:

http://www.digitalmars.com/drn-bin?wwwnews?digitalmars.D/6089

Stewart.