Thread overview
Unicode BOM and endianness
Aug 04, 2006
Tim Locke
Aug 04, 2006
Derek Parnell
Aug 04, 2006
Hasan Aljudy
Aug 04, 2006
Thomas Kuehne
Aug 04, 2006
Tim Locke
Aug 04, 2006
Derek
August 04, 2006
How do I acquire and determine the BOM and endianness of a file I am reading?

Thanks
August 04, 2006
On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:

> How do I acquire and determine the BOM and endianness of a file I am reading?
> 
> Thanks

You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
"Down with mediocrity!"
4/08/2006 2:14:46 PM
August 04, 2006

Derek Parnell wrote:
> On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
> 
> 
>>How do I acquire and determine the BOM and endianness of a file I am
>>reading?
>>
>>Thanks
> 
> 
> You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark
> 

Are GNU tools really as ignorant of Unicode as that page implies?

[quote]
While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may be used to mark text as UTF-8. Quite a lot of Windows software (including Windows Notepad) adds one to UTF-8 files. However in Unix-like systems (which make heavy use of text files for configuration) this practice is not recommended, as it will interfere with correct processing of important codes such as the hash-bang at the start of an interpreted script. It may also interfere with source for programming languages that don't recognise it. For example, gcc reports stray characters at the beginning of a source file, and in PHP, if output buffering is disabled, it has the subtle effect of causing the page to start being sent to the browser, preventing custom headers from being specified by the PHP script
[/quote]
August 04, 2006
On Fri, 4 Aug 2006 14:15:00 +1000, Derek Parnell <derek@nomail.afraid.org> wrote:

>On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
>
>> How do I acquire and determine the BOM and endianness of a file I am reading?
>> 
>> Thanks
>
>You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark

I'm sorry but I wasn't clear in what I am looking for.

I'm looking to be able to open a file and have D automatically tell me which format it is, e.g. UTF-8, UTF-16LE, UTF-16BE, etc. without my having to code it. Ideally I would like to be able to read any Unicode or ASCII file and have D automatically detect its type and allow me to read it into whatever format I want, such as char, wchar, dchar.
August 04, 2006
On Fri, 04 Aug 2006 08:44:21 -0300, Tim Locke wrote:

> On Fri, 4 Aug 2006 14:15:00 +1000, Derek Parnell <derek@nomail.afraid.org> wrote:
> 
>>On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
>>
>>> How do I acquire and determine the BOM and endianness of a file I am reading?
>>> 
>>> Thanks
>>
>>You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark
> 
> I'm sorry but I wasn't clear in what I am looking for.
> 
> I'm looking to be able to open a file and have D automatically tell me which format it is, e.g. UTF-8, UTF-16LE, UTF-16BE, etc. without my having to code it. Ideally I would like to be able to read any Unicode or ASCII file and have D automatically detect its type and allow me to read it into whatever format I want, such as char, wchar, dchar.

The Phobos library supplied by Walter does not have this functionality. The Mango library, and maybe others, do. I know that I had to code this myself when I needed it.
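
For anyone who ends up coding it by hand, the check itself is small. Below is a rough sketch in Python rather than D, only to illustrate the idea; it is not code from this thread and not from Phobos or Mango, and the function names detect_bom and decode_text are made up for the example. It assumes the whole file is already in memory as bytes and falls back to UTF-8 when no BOM is found, which is a guess, not a detection.

```python
import codecs

# Longer signatures come first: the UTF-32LE BOM (FF FE 00 00) starts with
# the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM if present, else the assumed default."""
    encoding, skip = detect_bom(data)
    # Skip the BOM bytes so they do not show up as U+FEFF in the result.
    return data[skip:].decode(encoding or default)
```

One known ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 has the same leading bytes as a UTF-32LE BOM, so this sketch would misidentify it. A file with no BOM at all cannot be identified this way and needs either a heuristic or outside knowledge of its encoding.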

-- 
Derek Parnell
Melbourne, Australia
"Down with mediocrity!"
August 04, 2006
Hasan Aljudy schrieb am 2006-08-04:
>
>
> Derek Parnell wrote:
>> On Fri, 04 Aug 2006 00:36:21 -0300, Tim Locke wrote:
>> 
>> 
>>>How do I acquire and determine the BOM and endianness of a file I am reading?
>>>
>>>Thanks
>> 
>> 
>> You might check out http://en.wikipedia.org/wiki/Byte_Order_Mark
>> 
>
> Are GNU tools really as ignorant of Unicode as that page implies?
>
> [quote]
> While UTF-8 does not have byte order issues, a BOM encoded in UTF-8 may
> be used to mark text as UTF-8. Quite a lot of Windows software
> (including Windows Notepad) adds one to UTF-8 files. However in
> Unix-like systems (which make heavy use of text files for configuration)
> this practice is not recommended, as it will interfere with correct
> processing of important codes such as the hash-bang at the start of an
> interpreted script.

Take two UTF-8 files with BOMs, A and B:

cat A B > C

A's BOM will remain a BOM, but B's BOM is going to be interpreted as
"zero-width no-break space". Thus using BOMs in combination with streaming,
concatenating, etc. will always cause problems. In contrast to Windows, Linux
(home to the GNU tools) treats both "text" and "binary" files as "binary" files.
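
The effect is easy to reproduce without touching the file system. The following is a small sketch in Python, chosen only for illustration; the byte strings a, b and c stand in for the files A, B and C, and their contents are invented for the example.

```python
import codecs

# Two "files", each starting with a UTF-8 BOM (EF BB BF).
a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation, which is what `cat A B > C` does.
c = a + b

# The "utf-8-sig" codec strips a BOM only at the very start of the data.
text = c.decode("utf-8-sig")

# B's former BOM is still there, now as an ordinary U+FEFF character.
print(repr(text))
```

The leading BOM is consumed by the decoder, while the second one survives in the middle of the text as U+FEFF, which is exactly the stray "zero-width no-break space" described above.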

Thomas