Thread overview
How to know whether a file's encoding is ansi or utf8?
Jul 22, 2014
Sam Hu
Jul 22, 2014
Sam Hu
Jul 22, 2014
FreeSlave
Jul 22, 2014
Sam Hu
Jul 22, 2014
Alexandre
Jul 22, 2014
Sam Hu
Jul 22, 2014
FreeSlave
Jul 22, 2014
Alexandre
Jul 24, 2014
Kagamin
July 22, 2014
Greetings!

As the subject says, how can I tell whether a file is encoded in UTF-8 or ANSI?

Thanks for the help in advance.

Regards,
Sam
July 22, 2014
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings!
>
> As the subject says, how can I tell whether a file is encoded in UTF-8 or ANSI?
>
> Thanks for the help in advance.
>
> Regards,
> Sam

Sorry, I mean in code. For example, when I read a file's content and print it to a text control in a GUI, or to the console, I need to proceed differently depending on the file's encoding.
July 22, 2014
On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings!
>
> As the subject says, how can I tell whether a file is encoded in UTF-8 or ANSI?
>
> Thanks for the help in advance.
>
> Regards,
> Sam

By ANSI, do you mean Windows code pages? Text editors usually rely on heuristics (statistical analysis, for example) to determine a file's encoding. These methods are not always accurate, so you should let users choose a different encoding.
July 22, 2014
Read the BOM?

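module main;

import std.stdio;

enum Encoding
{
	UTF7,
	UTF8,
	UTF32,
	Unicode,          // UTF-16LE
	BigEndianUnicode, // UTF-16BE
	ASCII             // no BOM found; the actual encoding is unknown
}

Encoding GetFileEncoding(string fileName)
{
	import std.file : read;

	// Read at most 4 bytes; the file may be shorter, so check the length.
	auto bom = cast(ubyte[]) read(fileName, 4);
	immutable n = bom.length;

	if (n >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76)
		return Encoding.UTF7;
	if (n >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
		return Encoding.UTF8;
	// UTF-32LE (FF FE 00 00) must be tested before UTF-16LE (FF FE).
	if (n >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0)
		return Encoding.UTF32;
	if (n >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
		return Encoding.Unicode; // UTF-16LE
	if (n >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
		return Encoding.BigEndianUnicode; // UTF-16BE
	// UTF-32BE (00 00 FE FF)
	if (n >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
		return Encoding.UTF32;

	// No BOM: this says nothing certain about the encoding.
	return Encoding.ASCII;
}

void main(string[] args)
{
	if (GetFileEncoding("test.txt") == Encoding.UTF8)
		writeln("The file is UTF8");
	else
		writeln("File is not UTF8 :(");
}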



On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
> Greetings!
>
> As the subject says, how can I tell whether a file is encoded in UTF-8 or ANSI?
>
> Thanks for the help in advance.
>
> Regards,
> Sam

July 22, 2014
On Tuesday, 22 July 2014 at 11:09:36 UTC, FreeSlave wrote:
> On Tuesday, 22 July 2014 at 09:50:00 UTC, Sam Hu wrote:
>> Greetings!
>>
>> As the subject says, how can I tell whether a file is encoded in UTF-8 or ANSI?
>>
>> Thanks for the help in advance.
>>
>> Regards,
>> Sam
>
> By ANSI, do you mean Windows code pages? Text editors usually rely on heuristics (statistical analysis, for example) to determine a file's encoding. These methods are not always accurate, so you should let users choose a different encoding.

Thanks.

Yes. It is Windows-related again... I found that writefln() can
print ANSI-encoded files to the console and show their content
correctly in an Asian-language environment, but this does not work
for UTF-8-encoded files. On the other hand, the D2 branch of Tango
can print UTF-8-encoded files to the console and show their content
correctly in the same environment. I tried to handle both cases with
Tango alone but failed. So I have a silly idea: when I need to print
a file to the console, I choose writefln or Tango's
Stdout.formatln depending on the file's encoding.
July 22, 2014
On Tuesday, 22 July 2014 at 11:59:34 UTC, Alexandre wrote:
> Read the BOM?

Thanks. This is exactly what I want at this moment.
July 22, 2014
Note that BOMs are optional and may be absent from a Unicode file. Also, leading bytes that look like a BOM do not necessarily mean the file is encoded in some form of Unicode.
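
To illustrate both caveats, here is a small sketch that works on in-memory bytes rather than a real file. The helper names startsWithUtf16LeBom and startsWithUtf8Bom are made up for this example. It assumes Windows-1252 as the "ANSI" code page, where the bytes FF FE are simply the characters "ÿþ".

```d
import std.stdio : writeln;

/// Hypothetical helper: true if `data` begins with the bytes FF FE.
bool startsWithUtf16LeBom(const(ubyte)[] data)
{
	return data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE;
}

/// Hypothetical helper: true if `data` begins with the UTF-8 BOM EF BB BF.
bool startsWithUtf8Bom(const(ubyte)[] data)
{
	return data.length >= 3
		&& data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}

void main()
{
	// "ÿþ!" in Windows-1252: not Unicode, yet it begins with FF FE.
	ubyte[] cp1252Text = [0xFF, 0xFE, 0x21];

	// "é" in UTF-8, saved without a BOM.
	ubyte[] utf8NoBom = [0xC3, 0xA9];

	writeln(startsWithUtf16LeBom(cp1252Text)); // a false positive
	writeln(startsWithUtf8Bom(utf8NoBom));     // a false negative
}
```

So a BOM check is a useful first hint, but it should probably not be the only test.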
July 22, 2014
http://www.architectshack.com/TextFileEncodingDetector.ashx

On Tuesday, 22 July 2014 at 15:53:23 UTC, FreeSlave wrote:
> Note that BOMs are optional and may be absent from a Unicode file. Also, leading bytes that look like a BOM do not necessarily mean the file is encoded in some form of Unicode.


There are several difficulties in this case ...
July 24, 2014
I first try to load the file as UTF-8 (or the first 8 KB or so of it) with encoding exceptions turned on. If I catch an exception, I reload it as ANSI; otherwise I assume it is valid UTF-8.
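
A minimal sketch of that approach in D, assuming Phobos' std.utf.validate as the "decode with exceptions" step. The names isValidUtf8 and looksLikeUtf8File are invented for this example, and the file helper reads the whole file rather than only the first 8 KB, because cutting the buffer at a fixed size could split a multi-byte sequence and cause a false failure.

```d
import std.file : read;
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `data` is well-formed UTF-8.
/// Pure ASCII also passes, since ASCII is a subset of UTF-8.
bool isValidUtf8(const(ubyte)[] data)
{
	try
	{
		// validate throws UTFException on the first malformed sequence.
		validate(cast(const(char)[]) data);
		return true;
	}
	catch (UTFException)
	{
		return false;
	}
}

/// Hypothetical helper: applies the same check to a file on disk.
bool looksLikeUtf8File(string path)
{
	return isValidUtf8(cast(const(ubyte)[]) read(path));
}

void main()
{
	ubyte[] utf8Bytes = [0x68, 0xC3, 0xA9]; // "hé" encoded as UTF-8
	ubyte[] ansiBytes = [0x68, 0xE9];       // "hé" encoded as Windows-1252

	writeln(isValidUtf8(utf8Bytes)); // well-formed UTF-8
	writeln(isValidUtf8(ansiBytes)); // lone lead byte, not UTF-8
}
```

An ANSI file can happen to be valid UTF-8 by coincidence, so this remains a heuristic, though for non-ASCII text such coincidences tend to be rare.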