October 14, 2012
std.stream, BOM, and deprecation
If std.stream is being deprecated, what is the correct way to deal with 
file BOMs?  This is particularly concerning utf8 files, which I 
understand to be a bit problematic, as there isn't, actually, a utf8 
BOM, merely a convention which isn't a part of a standard.  But the 
std.stdio documentation doesn't so much as mention byte order marks (BOMs).

If this should wait until std.io is released, then I could use 
std.stream until then, but the documentation is already warning to avoid 
using it.
October 14, 2012
Re: std.stream, BOM, and deprecation
On Saturday, October 13, 2012 18:53:48 Charles Hixson wrote:
> If std.stream is being deprecated, what is the correct way to deal with
> file BOMs?  This is particularly concerning utf8 files, which I
> understand to be a bit problematic, as there isn't, actually, a utf8
> BOM, merely a convention which isn't a part of a standard.  But the
> std.stdio documentation doesn't so much as mention byte order marks (BOMs).
> 
> If this should wait until std.io is released, then I could use
> std.stream until then, but the documentation is already warning to avoid
> using it.

std.stream will be around until after std.io has been introduced, because 
std.io will be its replacement. As for dealing with BOMs, I don't really know 
anything about that, so I don't really have any suggestions. I know that it's 
come up before, and you can probably find some discussion on it in the 
archives, but for the most part, Phobos' I/O assumes UTF-8 or compatible, and 
if you want something else, you have to deal with it yourself. It's an area 
where Phobos needs improvement.

You can use std.stream, but just be aware that in the long term, you'll either 
have to refactor your code so that it uses another solution (presumably 
std.io) or copy std.stream to your own stuff, because it's going to be removed 
from Phobos eventually.

- Jonathan M Davis
October 14, 2012
Re: std.stream, BOM, and deprecation
On 10/13/2012 06:53 PM, Charles Hixson wrote:
> If std.stream is being deprecated, what is the correct way to deal with
> file BOMs? This is particularly concerning utf8 files, which I
> understand to be a bit problematic, as there isn't, actually, a utf8
> BOM,

That's correct. There is just one byte order for UTF-8.

> merely a convention which isn't a part of a standard.

I am not sure about that. The Unicode standard describes UTF-8 as code 
units following each other in the file. There can't be any confusion 
about their order. According to Wikipedia, the only use of BOM for UTF-8 
is to identify the file as having been encoded in UTF-8:

  http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

But that can't have any meaning. The file could have been encoded in any 
one of the multitude of code pages as well. Treating the first three 
bytes as BOM would be taking a chance in that case and dropping those 
three characters.

> But the
> std.stdio documentation doesn't so much as mention byte order marks
> (BOMs).
>
> If this should wait until std.io is released, then I could use
> std.stream until then, but the documentation is already warning to avoid
> using it.

As I understand it, it is all down to convention anyway. What is the 
meaning of the non-ASCII code 166? Only the generator of the file knows. :/

Ali
October 15, 2012
Re: std.stream, BOM, and deprecation
On Sat, 13 Oct 2012 18:53:48 -0700
Charles Hixson <charleshixsn@earthlink.net> wrote:

> If std.stream is being deprecated, what is the correct way to deal
> with file BOMs?  This is particularly concerning utf8 files, which I 
> understand to be a bit problematic, as there isn't, actually, a utf8 
> BOM, merely a convention which isn't a part of a standard.  But the 
> std.stdio documentation doesn't so much as mention byte order marks
> (BOMs).
> 
> If this should wait until std.io is released, then I could use 
> std.stream until then, but the documentation is already warning to
> avoid using it.

Personally, I think it's kind of cumbersome to deal with in Phobos, so
I wrote this wrapper that I use instead, which handles everything:

https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24

And then there's the utfConvert below it if you already have the data
in memory instead of on disk.

(Maybe I should add some range capability and make a Phobos pull
request. I don't know if it'd fly though. It uses a lot of custom
endian- and bom-related code since I found the existing endian/bom
stuff in phobos inadequate. So that stuff would have to be accepted,
and then this too, and it's usually a bit of a pain to get things
approved.)
October 15, 2012
Re: std.stream, BOM, and deprecation
On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson  
<charleshixsn@earthlink.net> wrote:

> If std.stream is being deprecated, what is the correct way to deal with  
> file BOMs?  This is particularly concerning utf8 files, which I  
> understand to be a bit problematic, as there isn't, actually, a utf8  
> BOM, merely a convention which isn't a part of a standard.  But the  
> std.stdio documentation doesn't so much as mention byte order marks  
> (BOMs).
>
> If this should wait until std.io is released, then I could use  
> std.stream until then, but the documentation is already warning to avoid  
> using it.

When std.io is released, it will be fully BOM-aware by default (as long as  
you use the purely D versions).  The plan from my point of view is for  
std.io to be a replacement backend for std.stdio, with the C version being  
the default (as it must be for compatibility purposes).

-Steve
October 16, 2012
Re: std.stream, BOM, and deprecation
On 10/15/2012 10:29 AM, Steven Schveighoffer wrote:
> On Sat, 13 Oct 2012 21:53:48 -0400, Charles Hixson
> <charleshixsn@earthlink.net> wrote:
>
>> If std.stream is being deprecated, what is the correct way to deal
>> with file BOMs? This is particularly concerning utf8 files, which I
>> understand to be a bit problematic, as there isn't, actually, a utf8
>> BOM, merely a convention which isn't a part of a standard. But the
>> std.stdio documentation doesn't so much as mention byte order marks
>> (BOMs).
>>
>> If this should wait until std.io is released, then I could use
>> std.stream until then, but the documentation is already warning to
>> avoid using it.
>
> When std.io is released, it will be fully BOM-aware by default (as long
> as you use the purely D versions). The plan from my point of view is for
> std.io to be a replacement backend for std.stdio, with the C version
> being the default (as it must be for compatibility purposes).
>
> -Steve

That sounds good.  All of the files I'm interested in should have been 
converted to utf8 (if they weren't already), but many of them have the 
utf8 BOM (so they won't be confused with other non-unicode files).  It 
sounds like std.io will handle this in a transparent fashion.
October 16, 2012
Re: std.stream, BOM, and deprecation
On 10/14/2012 10:28 PM, Nick Sabalausky wrote:
> On Sat, 13 Oct 2012 18:53:48 -0700
> Charles Hixson<charleshixsn@earthlink.net>  wrote:
>
>> If std.stream is being deprecated, what is the correct way to deal
>> with file BOMs?  This is particularly concerning utf8 files, which I
>> understand to be a bit problematic, as there isn't, actually, a utf8
>> BOM, merely a convention which isn't a part of a standard.  But the
>> std.stdio documentation doesn't so much as mention byte order marks
>> (BOMs).
>>
>> If this should wait until std.io is released, then I could use
>> std.stream until then, but the documentation is already warning to
>> avoid using it.
>
> Personally, I think it's kind of cumbersome to deal with in Phobos, so
> I wrote this wrapper that I use instead, which handles everything:
>
> https://bitbucket.org/Abscissa/semitwistdtools/src/977820d5dcb0/src/semitwist/util/io.d?at=master#cl-24
>
> And then there's the utfConvert below it if you already have the data
> in memory instead of on disk.
>
> (Maybe I should add some range capability and make a Phobos pull
> request. I don't know if it'd fly though. It uses a lot of custom
> endian- and bom-related code since I found the existing endian/bom
> stuff in phobos inadequate. So that stuff would have to be accepted,
> and then this too, and it's usually a bit of a pain to get things
> approved.)
>
That wrapper looks very nice, but it's a lot more than what I need.  I 
want to deal only with utf8 files, many of which have BOMs.  I *can* 
handle that by detecting the BOM and dropping it.  I don't need anything 
else.  I was merely wondering what the appropriate way to approach this 
was now that std.stream is documented as deprecated, but no 
replacement is specified.  It sounds like the appropriate response is to 
use std.stdio, and handle the BOM myself.
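
For reference, a minimal sketch of the "detect the BOM and drop it" approach with std.stdio might look like the following. It assumes the input is already UTF-8 and only checks for the three bytes EF BB BF at the very start of the file. The helper name stripUtf8Bom and the command-line handling are made up for illustration; they are not part of Phobos.

```d
import std.stdio;

/// Hypothetical helper (not a Phobos function): returns `text` without a
/// leading UTF-8 BOM (bytes EF BB BF), or `text` unchanged if there is none.
inout(char)[] stripUtf8Bom(inout(char)[] text)
{
    // "\uFEFF" is stored in a D string as the three bytes EF BB BF.
    if (text.length >= 3 && text[0 .. 3] == "\uFEFF")
        return text[3 .. $];
    return text;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        stderr.writeln("usage: ", args[0], " <utf8-file>");
        return;
    }

    bool firstLine = true;
    foreach (line; File(args[1]).byLine())
    {
        // Only the very first line of the file can carry the BOM.
        auto content = firstLine ? stripUtf8Bom(line) : line;
        firstLine = false;
        writeln(content);
    }
}
```

Slicing the first three bytes avoids decoding the whole line, and the function leaves files without a BOM untouched, so it should be safe to apply to a mix of both kinds of file.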