Transparent ANSI to UTF-8 conversion - D Programming Language Discussion Forum

February 27, 2013

Transparent ANSI to UTF-8 conversion

Posted by Lubos Pintes

Hi,
I would like to transparently convert from ANSI to UTF-8 when dealing with text files. For example here in Slovakia, virtually every text file is in Windows-1250.
If someone opens a text file, he or she expects that it will work properly. So I suppose it is not feasible to tell someone "if you want to use my program, please convert every text to UTF-8".

Obtaining the mapping from ANSI to Unicode for a particular code page is trivial. Maybe even MultiByteToWideChar could help with this.

However, I need to know how to do it the "D way". Could I define something like a TextReader class? Or perhaps some support already exists somewhere?
Thanks

February 27, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by monarch_dodra
in reply to Lubos Pintes

I'd say the D way would be to simply exploit the fact that UTF is built into the language, and as such, not worry about encoding, and use raw code points.

You get your "code page to Unicode *code point*" table, and then you simply map each character to a dchar. From there, D will itself convert your raw Unicode (aka UTF-32) to UTF-8 on the fly, when you need it. For example, writing to a file will automatically convert input to UTF-8. You can also simply use std.conv.to!string to convert any UTF scheme to UTF-8 (or to any other UTF, for that matter).

This may not be as efficient as a "true" "codepage to UTF8 table" but:
1) Given you'll most probably be IO bound anyways, who cares?
2) Scalability. D does everything but the code page to code point mapping. Why bother doing any more than that?
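
A minimal sketch of the table approach described above. It is only an illustration, not code from this thread: the names makeSampleTable, sampleTable and decodeToUtf8 are made up, and only a handful of Windows-1250 entries are filled in. Every other byte above 127 maps to U+FFFD as a placeholder until the full code page table is supplied.

```d
import std.conv : to;

// Hypothetical helper: builds a byte -> code point table.
// Only a few Windows-1250 entries are set; the rest are placeholders.
dchar[256] makeSampleTable()
{
    dchar[256] t;
    foreach (i; 0 .. 256)
    {
        if (i < 128)
            t[i] = cast(dchar) i;   // ASCII maps to itself
        else
            t[i] = '\uFFFD';        // placeholder for entries not filled in here
    }
    t[0x80] = '\u20AC'; // EURO SIGN
    t[0x8A] = '\u0160'; // S with caron
    t[0x9A] = '\u0161'; // s with caron
    t[0x8E] = '\u017D'; // Z with caron
    t[0x9E] = '\u017E'; // z with caron
    t[0xC8] = '\u010C'; // C with caron
    t[0xE8] = '\u010D'; // c with caron
    return t;
}

// Initialising a module-level immutable evaluates the helper at compile time.
immutable dchar[256] sampleTable = makeSampleTable();

// Maps every input byte to a code point, then lets D transcode to UTF-8.
string decodeToUtf8(const(ubyte)[] bytes)
{
    auto codePoints = new dchar[bytes.length];
    foreach (i, b; bytes)
        codePoints[i] = sampleTable[b];
    return codePoints.to!string;    // UTF-32 -> UTF-8
}
```

The input bytes could come, for example, from std.file.read, whose void[] result can be cast to ubyte[].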

February 27, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by Dmitry Olshansky
in reply to monarch_dodra

27-Feb-2013 16:20, monarch_dodra wrote:
> I'd say the D way would be to simply exploit the fact that UTF is built
> into the language, and as such, not worry about encoding, and use raw
> code points.
>
> You get your "Codepage to unicode *codepoint*" table, and then you simply
> map each character to a dchar. From there, D will itself convert your
> raw unicode (aka UTF-32) to UTF8 on the fly, when you need it. For
> example, writing to a file will automatically convert input to UTF-8.
> You can also simply use std.conv.to!string to convert any UTF scheme to
> UTF-8 (or any other UTF too for that matter).

A table that translates ANSI directly to UTF-8 can be trivially constructed with CTFE from the static one that maps ANSI -> dchar.
>
> This may not be as efficient as a "true" "codepage to UTF8 table" but:
> 1) Given you'll most probably be IO bound anyways, who cares?

With in-memory transcoding you won't be. Text editors are typically all in-memory or mmap-ed.

> 2) Scalability. D does everything but the code page to code point
> mapping. Why bother doing any more than that?


-- 
Dmitry Olshansky

February 27, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by Lubos Pintes
in reply to Dmitry Olshansky

I don't understand the CTFE usage in this context. I thought about something like
dchar[] windows_1250=[...];
Isn't this enough?
Thanks

February 27, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by Dmitry Olshansky
in reply to Lubos Pintes

28-Feb-2013 00:35, Lubos Pintes wrote:
> I don't understand the CTFE usage in this context. I thought about
> something like
> dchar[] windows_1250=[...];
> Isn't this enough?
> Thanks

It's fine. What I meant is that if all you want to do is convert ANSI -> UTF-8, there is no need to convert to dchar first and then to UTF-8 chars.

So the table becomes more like:

char[][] windows_1250_to_UTF8 = [...];

Or rather (far better memory footprint):

char[2][] windows_1250_to_UTF8 = [ ... ];

I think 2 UTF-8 chars should be enough for your codepage.

Then CTFE is just a tool to create one table from another:
char[2][] windows_1250UTF = createUTF8Table(windows_1250);

The point is that inside createUTF8Table you build the array using new, simple loops and std.utf.encode, just like in normal code, but it will be evaluated at compile time.

The same goes for the reverse direction: you can treat char[2] as a ushort and build the tables. Though that table may now have gaps, because the encoding is not linear but rather has some stride with a certain period.
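
A rough sketch of how such a CTFE-built table might look. Only the name createUTF8Table comes from the post above; its body, the Utf8Entry struct, sampleCodePoints and transcode are hypothetical. One deliberate difference from char[2]: entries here hold up to three bytes plus a length, because Windows-1250 includes a few characters such as the euro sign at 0x80 (U+20AC) that take three bytes in UTF-8. The source table is again a partial stand-in with placeholders.

```d
import std.utf : encode;

// Hypothetical entry type: the UTF-8 bytes of one code point and how many are used.
struct Utf8Entry
{
    char[3] bytes;   // code points up to U+FFFF need at most 3 UTF-8 bytes
    ubyte length;
}

// Partial stand-in for a real Windows-1250 code point table.
dchar[256] sampleCodePoints()
{
    dchar[256] t;
    foreach (i; 0 .. 256)
    {
        if (i < 128)
            t[i] = cast(dchar) i;   // ASCII maps to itself
        else
            t[i] = '\uFFFD';        // placeholder
    }
    t[0x80] = '\u20AC'; // EURO SIGN, 3 bytes in UTF-8
    t[0x8A] = '\u0160'; // S with caron, 2 bytes
    t[0xE8] = '\u010D'; // c with caron, 2 bytes
    return t;
}

// Ordinary loop plus std.utf.encode; nothing here is CTFE-specific.
Utf8Entry[256] createUTF8Table(const dchar[256] codePoints)
{
    Utf8Entry[256] table;
    foreach (i, cp; codePoints)
    {
        char[4] buf;
        immutable n = encode(buf, cp);   // returns the number of bytes written
        assert(n <= 3, "code point does not fit into a 3-byte entry");
        table[i].bytes[0 .. n] = buf[0 .. n];
        table[i].length = cast(ubyte) n;
    }
    return table;
}

// A module-level immutable initialiser is evaluated at compile time.
immutable Utf8Entry[256] sampleUtf8Table = createUTF8Table(sampleCodePoints());

// Transcodes by concatenating the precomputed UTF-8 sequences.
string transcode(const(ubyte)[] bytes)
{
    char[] result;
    result.reserve(bytes.length);
    foreach (b; bytes)
    {
        const e = sampleUtf8Table[b];
        result ~= e.bytes[0 .. e.length];
    }
    return result.idup;
}
```

At run time this does one table lookup and one short append per input byte, with no intermediate dchar array.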


-- 
Dmitry Olshansky

February 28, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by Era Scarecrow
in reply to Lubos Pintes

A while back I wrote a little code that effectively does that. Mind you, it's probably not for the right encoding, but you should be able to find the code points and replace them. I think this was for ISO-8859-1.

See "Reading ASCII file with some codes above 127 (exten ascii)"

http://forum.dlang.org/thread/lehgyzmwewgvkdgraizv@forum.dlang.org

February 28, 2013

Re: Transparent ANSI to UTF-8 conversion

Posted by Lubos Pintes
in reply to Era Scarecrow

Thank you all. Now I believe I will be able to solve this.