encoding ISO-8859-1 to UTF-8 in std.net.curl
August 08, 2016
import std.stdio;
import std.net.curl;

void main()
{

	string url = "www.site.ru/xml/api.asp";

	string data =
	"<?xml version='1.0' encoding='UTF-8'?>
		<request>
		<category>
			<id>59538</id>
		</category>
                ...
		</request>";

	auto http = HTTP();
	http.clearRequestHeaders();
	http.addRequestHeader("Content-Type", "application/xml");
	//Accept-Charset: utf-8
	http.addRequestHeader("Accept-Charset", "utf-8");
	
	//ISO-8859-1
	//http://www.artlebedev.ru/tools/decoder/
	//ISO-8859-1 → UTF-8
	auto content = post(url, data, http); // send the XML in `data`, not the literal string "data"
	// content in ISO-8859-1 to UTF-8 encoding but I lose
        //the Cyrillic "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
	// I get it "<?xml version='1.0' encoding='UTF-8'?>отсутствует или неверно задан параметр"
	// How do I change the encoding to UTF-8 in response


	string s = cast(immutable char[])content;
	auto f = File("output.txt","w");  // output.txt file in UTF-8;
	f.write(s);
	f.close;
}
August 08, 2016
On 08/08/2016 09:57 PM, Alexsej wrote:
>     // content in ISO-8859-1 to UTF-8 encoding but I lose
>         //the Cyrillic "<?xml version='1.0'
> encoding='UTF-8'?>отсутствует или неверно задан параметр"
>     // I get it "<?xml version='1.0'
> encoding='UTF-8'?>отсутствует или неверно
> задан параметр"
>     // How do I change the encoding to UTF-8 in response
>
>
>     string s = cast(immutable char[])content;
>     auto f = File("output.txt","w");  // output.txt file in UTF-8;
>     f.write(s);

The server doesn't include the encoding in the Content-Type header, right? So curl assumes the default, which is ISO 8859-1. It interprets the data as that and transcodes to UTF-8. The result is garbage, of course.

I don't see a way to change the default encoding. Maybe that should be added.

Until then you can reverse the wrong transcoding:

----
import std.encoding: Latin1String, transcode;
Latin1String pseudo_latin1;
transcode(content.idup, pseudo_latin1);
string s = cast(string) pseudo_latin1;
----
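
To see why reversing the transcoding recovers the text, here is a minimal sketch that reproduces the failure on a plain string and then undoes it. The helper names `misdecodeAsLatin1` and `undoLatin1Misdecoding` are made up for illustration; only `Latin1String` and `transcode` come from std.encoding. It assumes the garbled text contains only code points below U+0100, which holds when each original byte was mapped one-to-one.

```d
import std.encoding : Latin1String, transcode;

/// Simulates the failure: UTF-8 bytes are misread as ISO-8859-1
/// and then transcoded to UTF-8 a second time.
string misdecodeAsLatin1(string utf8)
{
    // Reinterpret the bytes as Latin-1 code units; nothing is converted here.
    auto asLatin1 = cast(Latin1String) utf8;
    string garbled;
    transcode(asLatin1, garbled);
    return garbled;
}

/// Reverses the step above: encoding the garbled text back to Latin-1
/// yields the original bytes, which are then viewed as UTF-8 again.
string undoLatin1Misdecoding(string garbled)
{
    Latin1String pseudoLatin1;
    transcode(garbled, pseudoLatin1);
    return cast(string) pseudoLatin1;
}
```

With `content` from the original post, the equivalent call would be `undoLatin1Misdecoding(content.idup)`. As far as the std.net.curl documentation describes it, requesting raw bytes with `post!(HTTP, ubyte)(url, data, http)` should skip the conversion altogether, but that variant was not tried in this thread.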

Tiny rant:

Why on earth does transcode only accept immutable characters for input? Every other post here uncovers some bug/shortcoming :(
August 08, 2016
On Monday, 8 August 2016 at 21:11:26 UTC, ag0aep6g wrote:
> The server doesn't include the encoding in the Content-Type header, right? So curl assumes the default, which is ISO 8859-1. It interprets the data as that and transcodes to UTF-8. The result is garbage, of course.
//header from server
server: nginx
date: Mon, 08 Aug 2016 22:02:15 GMT
content-type: text/xml; Charset=utf-8
content-length: 204
connection: keep-alive
vary: Accept-Encoding
cache-control: private
expires: Mon, 08 Aug 2016 22:02:15 GMT
set-cookie: ASPSESSIONIDSSCCDASA=KIAPMCMDMPEDHPBJNMGFHMEB; path=/
x-powered-by: ASP.NET

August 09, 2016
On 08/08/2016 11:11 PM, ag0aep6g wrote:
> Why on earth does transcode only accept immutable characters for input?

https://github.com/dlang/phobos/pull/4722
August 08, 2016
On Monday, 8 August 2016 at 21:11:26 UTC, ag0aep6g wrote:
> Until then you can reverse the wrong transcoding:
>
> ----
> import std.encoding: Latin1String, transcode;
> Latin1String pseudo_latin1;
> transcode(content.idup, pseudo_latin1);
> string s = cast(string) pseudo_latin1;
> ----

Thanks, it works.
August 09, 2016
On 08/09/2016 12:05 AM, Alexsej wrote:
> //header from server
> server: nginx
> date: Mon, 08 Aug 2016 22:02:15 GMT
> content-type: text/xml; Charset=utf-8
> content-length: 204
> connection: keep-alive
> vary: Accept-Encoding
> cache-control: private
> expires: Mon, 08 Aug 2016 22:02:15 GMT
> set-cookie: ASPSESSIONIDSSCCDASA=KIAPMCMDMPEDHPBJNMGFHMEB; path=/
> x-powered-by: ASP.NET

Looks like std.net.curl doesn't handle "Charset" correctly: it matches the parameter name case-sensitively, so only lowercase "charset" is recognized.

https://github.com/dlang/phobos/pull/4723
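
For illustration, a case-insensitive lookup of the charset parameter could look like the sketch below. This is not the code from the pull request or from std.net.curl; `charsetOf` is a hypothetical helper, and it ignores edge cases such as semicolons inside quoted values.

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : findSplit;
import std.string : strip;
import std.uni : icmp, toLower;

/// Hypothetical helper: returns the lowercased charset named in a
/// Content-Type header value, or null if no charset parameter is present.
string charsetOf(string contentType)
{
    foreach (param; contentType.splitter(';'))
    {
        auto kv = param.findSplit("=");
        if (kv[1].length == 0)
            continue; // no '=' in this part, e.g. the media type itself

        // Compare the parameter name without regard to case,
        // so "Charset" and "CHARSET" match as well as "charset".
        if (icmp(kv[0].strip, "charset") != 0)
            continue;

        auto value = kv[2].strip;
        // Drop surrounding double quotes if the value is quoted.
        if (value.length >= 2 && value[0] == '"' && value[$ - 1] == '"')
            value = value[1 .. $ - 1];
        return value.toLower;
    }
    return null;
}
```

Applied to the header shown earlier in the thread, `charsetOf("text/xml; Charset=utf-8")` yields `"utf-8"`, so a caller would know not to fall back to ISO-8859-1.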