July 22, 2014 How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Greetings! As the subject says, how can I know whether a file is encoded in UTF-8 or ANSI? Thanks in advance for the help. Regards, Sam |
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sam Hu | Sorry, I mean in code. For example, when I read a file's content and print it to a text control in a GUI, or to the console, the code has to proceed differently depending on the file's encoding.
|
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sam Hu | By ANSI, do you mean Windows code pages? Text editors usually use heuristics (statistical analysis, for example) to determine a file's encoding. Note that these methods are not always accurate, so you need to let users choose a different encoding.
|
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
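Posted in reply to Sam Hu | Read the BOM?
module main;
import std.stdio;
// Encodings that can be recognized from a byte order mark (BOM).
enum Encoding
{
    UTF7,
    UTF8,
    UTF32,
    Unicode,          // UTF-16LE
    BigEndianUnicode, // UTF-16BE
    ASCII             // no BOM found; the file may still be ANSI or BOM-less UTF-8
}
Encoding GetFileEncoding(string fileName)
{
    import std.file : read;
    // Read at most 4 bytes; a shorter file yields a shorter array.
    auto bom = cast(ubyte[]) read(fileName, 4);
    immutable n = bom.length;
    if (n >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76)
        return Encoding.UTF7;
    if (n >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
        return Encoding.UTF8;
    // Test UTF-32 before UTF-16: the UTF-32LE BOM begins with the UTF-16LE BOM.
    if (n >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0)
        return Encoding.UTF32; //UTF-32LE
    if (n >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
        return Encoding.UTF32; //UTF-32BE
    if (n >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
        return Encoding.Unicode; //UTF-16LE
    if (n >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
        return Encoding.BigEndianUnicode; //UTF-16BE
    return Encoding.ASCII;
}
void main(string[] args)
{
    if (GetFileEncoding("test.txt") == Encoding.UTF8)
        writeln("The file is UTF8");
    else
        writeln("File is not UTF8 :(");
}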
|
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to FreeSlave | Thanks.
Yes. It is Windows-related again... I found that writefln() can print ANSI-encoded files to the console and show their content correctly in an Asian font environment, but this does not work for UTF-8 files. On the other hand, the Tango D2 branch can print UTF-8 files to the console and show their content correctly in the same environment. I tried to make Tango handle both cases, but failed. So I have a silly idea: when I need to print a file to the console, I choose writefln or Tango's Stdout.formatln depending on the file's encoding.
|
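One alternative to switching between two output routines is to convert every file to a single in-memory encoding first and then print through one path. The sketch below is only an illustration under stated assumptions: `FileEncoding` and `toUtf8` are hypothetical names, and "ANSI" is assumed to mean Windows-1252, which `std.encoding` supports. An Asian code page such as the one described above is not covered by `std.encoding` and would need a platform conversion routine instead.

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical tag for the two cases discussed in the thread.
enum FileEncoding { utf8, ansi }

/// Returns the file bytes as a UTF-8 string, given an already known encoding.
string toUtf8(immutable(ubyte)[] raw, FileEncoding enc)
{
    if (enc == FileEncoding.utf8)
    {
        // Drop a leading UTF-8 BOM (EF BB BF) so it is not printed as text.
        if (raw.length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF)
            raw = raw[3 .. $];
        return cast(string) raw;
    }
    // Assumption: "ANSI" is Windows-1252 in this sketch.
    string result;
    transcode(cast(Windows1252String) raw, result);
    return result;
}

unittest
{
    // "hé" stored as Windows-1252.
    immutable ubyte[] ansiBytes = [0x68, 0xE9];
    assert(toUtf8(ansiBytes, FileEncoding.ansi) == "h\u00E9");

    // BOM followed by "hé" stored as UTF-8.
    immutable ubyte[] utf8Bytes = [0xEF, 0xBB, 0xBF, 0x68, 0xC3, 0xA9];
    assert(toUtf8(utf8Bytes, FileEncoding.utf8) == "h\u00E9");
}
```

The embedded checks can be run with `dmd -main -unittest -run sketch.d`. Whether the console then displays the resulting UTF-8 correctly still depends on the console's own code page and font, which this sketch does not address.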
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Alexandre | Thanks. This is exactly what I want at this moment.
|
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sam Hu | Note that BOMs are optional and may not be present in a Unicode file. Also, the presence of leading bytes that look like a BOM does not necessarily mean that the file is encoded in some kind of Unicode. |
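Both caveats can be shown with a few bytes. In this small sketch, `hasUtf8Bom` and `fromWindows1252` are hypothetical helpers, and Windows-1252 is an assumed stand-in for "ANSI". The first sample is valid UTF-8 without any BOM; the second begins with FF FE (the UTF-16LE BOM bytes) yet also decodes cleanly as Windows-1252 text.

```d
import std.encoding : transcode, Windows1252String;
import std.utf : validate;

/// True if the data starts with the UTF-8 BOM bytes EF BB BF.
bool hasUtf8Bom(const(ubyte)[] raw)
{
    return raw.length >= 3 && raw[0] == 0xEF && raw[1] == 0xBB && raw[2] == 0xBF;
}

/// Decodes the bytes as Windows-1252 (assumed "ANSI" code page) into UTF-8.
string fromWindows1252(immutable(ubyte)[] raw)
{
    string result;
    transcode(cast(Windows1252String) raw, result);
    return result;
}

unittest
{
    // "hé" as UTF-8 with no BOM: a BOM test alone would miss it.
    immutable ubyte[] noBom = [0x68, 0xC3, 0xA9];
    assert(!hasUtf8Bom(noBom));
    validate(cast(string) noBom); // would throw UTFException if malformed

    // FF FE 21 21 could be a UTF-16LE BOM plus one code unit,
    // but it is equally the Windows-1252 text "ÿþ!!".
    immutable ubyte[] lookalike = [0xFF, 0xFE, 0x21, 0x21];
    assert(fromWindows1252(lookalike) == "\u00FF\u00FE!!");
}
```

This is why a BOM check is best treated as a hint rather than proof in either direction.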
July 22, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to FreeSlave | http://www.architectshack.com/TextFileEncodingDetector.ashx There are several difficulties in this case ... |
July 24, 2014 Re: How to know whether a file's encoding is ansi or utf8? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Sam Hu | I first try to load the file as UTF-8 (or just the first 8 KB or so of it) with encoding exceptions turned on. If I catch an exception, I reload it as ANSI; otherwise I assume it's valid UTF-8. |
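The validate-then-fall-back approach might look roughly like this in D. It is a sketch, not the poster's actual code: `looksLikeUtf8`, `dropPartialTail`, and `fileLooksLikeUtf8` are hypothetical names, and the 8192-byte sample size is an assumption taken from the "8 KB" above. Because a fixed-size sample can cut a multi-byte character in half, the sketch trims an incomplete trailing sequence before validating.

```d
import std.file : read;
import std.utf : validate, UTFException;

/// Removes an incomplete multi-byte sequence from the end of a sample.
const(ubyte)[] dropPartialTail(const(ubyte)[] data)
{
    size_t i = data.length;
    size_t back = 0;
    // Step back over at most 3 continuation bytes (bit pattern 10xxxxxx).
    while (i > 0 && back < 3 && (data[i - 1] & 0xC0) == 0x80) { --i; ++back; }
    if (i == 0)
        return data;
    immutable ubyte lead = data[i - 1];
    // Sequence length announced by the lead byte.
    immutable size_t need = lead >= 0xF0 ? 4 : lead >= 0xE0 ? 3 : lead >= 0xC0 ? 2 : 1;
    return need > back + 1 ? data[0 .. i - 1] : data;
}

/// True if the bytes are well-formed UTF-8. Set mayBeTruncated when the
/// data is only a prefix of the file.
bool looksLikeUtf8(const(ubyte)[] data, bool mayBeTruncated = false)
{
    if (mayBeTruncated)
        data = dropPartialTail(data);
    try
    {
        validate(cast(const(char)[]) data);
        return true;
    }
    catch (UTFException)
    {
        return false; // caller falls back to ANSI
    }
}

/// Checks only the first sampleSize bytes of the file.
bool fileLooksLikeUtf8(string path, size_t sampleSize = 8192)
{
    auto sample = cast(const(ubyte)[]) read(path, sampleSize);
    return looksLikeUtf8(sample, sample.length == sampleSize);
}

unittest
{
    immutable ubyte[] utf8 = [0x68, 0xC3, 0xA9, 0x6C]; // "hél" in UTF-8
    assert(looksLikeUtf8(utf8));
    immutable ubyte[] ansi = [0x68, 0xE9, 0x6C];       // "hél" in Windows-1252
    assert(!looksLikeUtf8(ansi));
    immutable ubyte[] cut = [0x68, 0xE2, 0x82];        // "h" plus two of the three bytes of "€"
    assert(looksLikeUtf8(cut, true));
    assert(!looksLikeUtf8(cut));
}
```

Pure ASCII passes this check too, which is harmless since ASCII reads the same either way. A file whose non-ASCII bytes happen to form valid UTF-8 by accident would be misjudged, so an override for the user remains a good idea.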
Copyright © 1999-2021 by the D Language Foundation