How can I convert a file encoded in CP936 to a file with UTF-8 encoding?
July 13, 2022 How can I convert a file encoded in CP936 to a file with UTF-8 encoding?
July 13, 2022 Re: How can I convert a file encoded in CP936 to a file with UTF-8 encoding?
Posted in reply to rocex

On Wednesday, 13 July 2022 at 11:47:56 UTC, rocex wrote:
> How can I convert a file encoded in CP936 to a file with UTF-8 encoding?

My lib doesn't have it included, but the basic idea is to take this table:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT

and do the conversions. So loop through the file's bytes: if a byte is < 128, it stays the same; if it == 128, it is 0x20AC (the euro sign); and if it is greater than that, you need to read the second byte too and look the pair up in that table.

It looks like for many of the bytes, the mapped values increase in sequence, so you might only need part of the actual lookup table, and the rest you can do with some addition. Looks like from lead byte 83 it is a.... almost sequential offset. Probably safest to just copy the whole table.
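The steps described in this reply can be sketched in D roughly as follows. This is only an illustration, not code from the thread: `loadTable` and `cp936ToUtf8` are hypothetical helper names, the table is assumed to be a local copy of CP936.TXT whose data lines begin with two `0x`-prefixed hex columns, and replacing unmapped bytes with U+FFFD is a choice made here rather than something the post specifies.

```d
import std.algorithm.searching : startsWith;
import std.array : appender, split;
import std.conv : to;
import std.stdio : File;

/// Hypothetical helper: builds a double-byte lookup table from a local
/// copy of CP936.TXT. Assumes data lines look like "0x8140 0x4E02 #comment".
dchar[ushort] loadTable(string path)
{
    dchar[ushort] table;
    foreach (line; File(path).byLine)
    {
        auto fields = line.split(); // split on whitespace
        // Skip comments, blank lines, and rows without a Unicode column.
        if (fields.length < 2
            || !fields[0].startsWith("0x") || !fields[1].startsWith("0x"))
            continue;
        const code = fields[0][2 .. $].to!uint(16);
        if (code <= 0xFF)
            continue; // single-byte rows are handled directly in the decoder
        table[cast(ushort) code] = cast(dchar) fields[1][2 .. $].to!uint(16);
    }
    return table;
}

/// Hypothetical helper: converts CP936 bytes to a UTF-8 string.
string cp936ToUtf8(const(ubyte)[] input, const dchar[ushort] table)
{
    auto output = appender!string();
    for (size_t i = 0; i < input.length; ++i)
    {
        const ubyte b = input[i];
        dchar c = '\uFFFD'; // assumption: unmapped input becomes U+FFFD
        if (b < 0x80)
            c = cast(dchar) b; // ASCII range stays the same
        else if (b == 0x80)
            c = '\u20AC'; // byte 128 maps to the euro sign
        else if (i + 1 < input.length)
        {
            // Lead byte: combine with the next byte and look the pair up.
            const ushort key = cast(ushort)((b << 8) | input[i + 1]);
            if (auto p = key in table)
            {
                c = *p;
                ++i; // the second byte has been consumed too
            }
        }
        output.put(c); // the appender encodes the dchar as UTF-8
    }
    return output.data;
}
```

A whole file could then be converted by reading its bytes with `std.file.read`, casting them to `const(ubyte)[]`, passing them through `cp936ToUtf8`, and writing the result with `std.file.write`. Keeping the full table in an associative array follows the "just copy the whole table" advice instead of relying on the almost sequential offsets.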
July 13, 2022 Re: How can I convert a file encoded in CP936 to a file with UTF-8 encoding?
Posted in reply to Adam D Ruppe

On Wednesday, 13 July 2022 at 12:00:43 UTC, Adam D Ruppe wrote:
> On Wednesday, 13 July 2022 at 11:47:56 UTC, rocex wrote:
>> How can I convert a file encoded in CP936 to a file with UTF-8 encoding?
>
> My lib doesn't have it included, but the basic idea is to take this table:
>
> https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP936.TXT
>
> and do the conversions. So loop through the file's bytes: if a byte is < 128, it stays the same; if it == 128, it is 0x20AC (the euro sign); and if it is greater than that, you need to read the second byte too and look the pair up in that table.
>
> It looks like for many of the bytes, the mapped values increase in sequence, so you might only need part of the actual lookup table, and the rest you can do with some addition. Looks like from lead byte 83 it is a.... almost sequential offset. Probably safest to just copy the whole table.

I found this: https://github.com/guotie/gogb2312. The algorithm should be the same.
Copyright © 1999-2021 by the D Language Foundation