Thread overview
std.stream, BOM, and deprecation
Oct 14, 2012
Charles Hixson
Oct 14, 2012
Jonathan M Davis
Oct 14, 2012
Ali Çehreli
Oct 15, 2012
Nick Sabalausky
Oct 16, 2012
Charles Hixson
Oct 16, 2012
Charles Hixson
October 14, 2012
If std.stream is being deprecated, what is the correct way to deal with file BOMs?  I'm particularly concerned about utf8 files, which I understand to be a bit problematic, as there isn't actually a utf8 BOM, merely a convention which isn't part of a standard.  But the std.stdio documentation doesn't so much as mention byte order marks (BOMs).

If this should wait until std.io is released, then I could use std.stream until then, but the documentation already warns against using it.
October 14, 2012
On Saturday, October 13, 2012 18:53:48 Charles Hixson wrote:
> If std.stream is being deprecated, what is the correct way to deal with
> file BOMs.  This is particularly concerning utf8 files, which I
> understand to be a bit problematic, as there isn't, actually, a utf8
> BOM, merely a convention which isn't a part of a standard.  But the
> std.stdio documentation doesn't so much as mention byte order marks (BOMs).
> 
> If this should wait until std.io is released, then I could use std.stream until then, but the documentation is already warning to avoid using it.

std.stream will be around until after std.io has been introduced, because std.io will be its replacement. As for dealing with BOMs, I don't really know anything about that, so I don't really have any suggestions. I know that it's come up before, and you can probably find some discussion on it in the archives, but for the most part, Phobos' I/O assumes UTF-8 or compatible, and if you want something else, you have to deal with it yourself. It's an area where Phobos needs improvement.

You can use std.stream, but just be aware that in the long term, you'll either have to refactor your code so that it uses another solution (presumably std.io) or copy std.stream to your own stuff, because it's going to be removed from Phobos eventually.

- Jonathan M Davis
October 14, 2012
On 10/13/2012 06:53 PM, Charles Hixson wrote:
> If std.stream is being deprecated, what is the correct way to deal with
> file BOMs. This is particularly concerning utf8 files, which I
> understand to be a bit problematic, as there isn't, actually, a utf8
> BOM,

That's correct. There is just one byte order for UTF-8.

> merely a convention which isn't a part of a standard.

I am not sure about that. The Unicode standard describes UTF-8 as code units following each other in the file. There can't be any confusion about their order. According to Wikipedia, the only use of BOM for UTF-8 is to identify the file as having been encoded in UTF-8:

  http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

But that can't have any meaning. The file could have been encoded in any one of the multitude of code pages as well. Treating the first three bytes as BOM would be taking a chance in that case and dropping those three characters.
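
To make the ambiguity concrete, here is a minimal sketch in D. The helper name `latin1ToUtf8` is made up for illustration and is not a Phobos function. It decodes bytes as ISO-8859-1 (Latin-1), where each byte value is the code point of the same number, so the three bytes EF BB BF come out as three ordinary characters in that encoding:

```d
/// Hypothetical helper (not part of Phobos): interprets each byte as the
/// ISO-8859-1 character with the same code point and returns UTF-8 text.
string latin1ToUtf8(const(ubyte)[] bytes)
{
    string result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // appending a dchar encodes it as UTF-8
    return result;
}
```

Passing `[0xEF, 0xBB, 0xBF]` yields the characters U+00EF, U+00BB, and U+00BF, so a reader that unconditionally drops those bytes would lose real text if the file were actually Latin-1.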

> But the
> std.stdio documentation doesn't so much as mention byte order marks (BOMs).
>
> If this should wait until std.io is released, then I could use
> std.stream until then, but the documentation is already warning to avoid
> using it.

As I understand it, it is all down to convention anyway. What is the meaning of the non-ASCII code 166? Only the generator of the file knows. :/

Ali

October 15, 2012
On Sat, 13 Oct 2012 18:53:48 -0700
Charles Hixson <charleshixsn@earthlink.net> wrote:

> If std.stream is being deprecated, what is the correct way to deal with file BOMs.  This is particularly concerning utf8 files, which I understand to be a bit problematic, as there isn't, actually, a utf8 BOM, merely a convention which isn't a part of a standard.  But the std.stdio documentation doesn't so much as mention byte order marks (BOMs).
> 
> If this should wait until std.io is released, then I could use std.stream until then, but the documentation is already warning to avoid using it.

Personally, I think it's kind of cumbersome to deal with in Phobos, so I wrote this wrapper that I use instead, which handles everything:

https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24

And then there's the utfConvert below it if you already have the data in memory instead of on disk.

(Maybe I should add some range capability and make a Phobos pull request. I don't know if it'd fly though. It uses a lot of custom endian- and bom-related code since I found the existing endian/bom stuff in phobos inadequate. So that stuff would have to be accepted, and then this too, and it's usually a bit of a pain to get things approved.)

October 15, 2012
On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson <charleshixsn@earthlink.net> wrote:

> If std.stream is being deprecated, what is the correct way to deal with file BOMs.  This is particularly concerning utf8 files, which I understand to be a bit problematic, as there isn't, actually, a utf8 BOM, merely a convention which isn't a part of a standard.  But the std.stdio documentation doesn't so much as mention byte order marks (BOMs).
>
> If this should wait until std.io is released, then I could use std.stream until then, but the documentation is already warning to avoid using it.

When std.io is released, it will be fully BOM-aware by default (as long as you use the purely D versions).  The plan from my point of view is for std.io to be a replacement backend for std.stdio, with the C version being the default (as it must be for compatibility purposes).

-Steve
October 16, 2012
On 10/15/2012 10:29 AM, Steven Schveighoffer wrote:
> On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson
> <charleshixsn@earthlink.net> wrote:
>
>> If std.stream is being deprecated, what is the correct way to deal
>> with file BOMs. This is particularly concerning utf8 files, which I
>> understand to be a bit problematic, as there isn't, actually, a utf8
>> BOM, merely a convention which isn't a part of a standard. But the
>> std.stdio documentation doesn't so much as mention byte order marks
>> (BOMs).
>>
>> If this should wait until std.io is released, then I could use
>> std.stream until then, but the documentation is already warning to
>> avoid using it.
>
> When std.io is released, it will be fully BOM-aware by default (as long
> as you use the purely D versions). The plan from my point of view is for
> std.io to be a replacement backend for std.stdio, with the C version
> being the default (as it must be for compatibility purposes).
>
> -Steve
That sounds good.  All of the files I'm interested in should have been converted to utf8 (if they weren't already), but many of them have the utf8 BOM (so they won't be confused with other non-unicode files).  It sounds like std.io will handle this in a transparent fashion.
October 16, 2012
On 10/14/2012 10:28 PM, Nick Sabalausky wrote:
> On Sat, 13 Oct 2012 18:53:48 -0700
> Charles Hixson<charleshixsn@earthlink.net>  wrote:
>
>> If std.stream is being deprecated, what is the correct way to deal
>> with file BOMs.  This is particularly concerning utf8 files, which I
>> understand to be a bit problematic, as there isn't, actually, a utf8
>> BOM, merely a convention which isn't a part of a standard.  But the
>> std.stdio documentation doesn't so much as mention byte order marks
>> (BOMs).
>>
>> If this should wait until std.io is released, then I could use
>> std.stream until then, but the documentation is already warning to
>> avoid using it.
>
> Personally, I think it's kind of cumbersome to deal with in Phobos, so
> I wrote this wrapper that I use instead, which handles everything:
>
> https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24
>
> And then there's the utfConvert below it if you already have the data
> in memory instead of on disk.
>
> (Maybe I should add some range capability and make a Phobos pull
> request. I don't know if it'd fly though. It uses a lot of custom
> endian- and bom-related code since I found the existing endian/bom
> stuff in phobos inadequate. So that stuff would have to be accepted,
> and then this too, and it's usually a bit of a pain to get things
> approved.)
>
That wrapper looks very nice, but it's a lot more than what I need.  I want to deal only with utf8 files, many of which have BOMs.  I *can* handle that by detecting the BOM and dropping it.  I don't need anything else.  I was merely wondering what the appropriate way to approach this was now that std.stream is being documented as deprecated, but no replacement specified.  It sounds like the appropriate response is to use std.stdio, and handle the BOM myself.
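
A minimal sketch of that "use std.stdio and drop the BOM yourself" approach might look like the following. It assumes the input is already UTF-8, and the names `stripUtf8Bom` and `readLinesWithoutBom` are hypothetical helpers invented for this example, not Phobos functions. Only `File` and `byLine` come from std.stdio.

```d
import std.stdio : File;

/// Hypothetical helper: returns `text` without a leading UTF-8 BOM
/// (the bytes EF BB BF, i.e. U+FEFF encoded as UTF-8), if one is present.
const(char)[] stripUtf8Bom(const(char)[] text)
{
    enum string bom = "\uFEFF"; // three UTF-8 code units
    if (text.length >= bom.length && text[0 .. bom.length] == bom)
        return text[bom.length .. $];
    return text;
}

/// Hypothetical helper: reads a UTF-8 file line by line with std.stdio,
/// removing the BOM from the first line only.
string[] readLinesWithoutBom(string path)
{
    string[] lines;
    bool firstLine = true;
    foreach (rawLine; File(path).byLine())
    {
        // byLine reuses its buffer, so each line is copied with idup.
        lines ~= (firstLine ? stripUtf8Bom(rawLine) : rawLine).idup;
        firstLine = false;
    }
    return lines;
}
```

The check is applied to the first line only, because U+FEFF appearing later in a file is ordinary content rather than a signature. As noted earlier in the thread, this is only safe when the files are known to be UTF-8.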