Parsing D files with non-unicode characters (page 2)

On Wednesday, 7 November 2018 at 04:43:19 UTC, Arun Chandrasekaran wrote:
> On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger wrote:
>> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>>>
>>> Thanks! Can't we preserve the comments? Comments are invaluable, especially on the headerfiles. We generate documentation using doxygen.
>>
>> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
>
> # convert .h to .d file
> dstep $file

s/dstep $file/dstep */g

On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>>
>> Is this a bug in D to reject non-unicode chars in comments?
>>
>> Arun
>
> So you have code that has characters that are neither ascii nor unicode? What encoding is it using? And what characters does it contain that can't be represented with unicode?

I was not able to find the character encoding. file -i said unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me to determine the charset, it was SHIFT-JIS.
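Where uchardet is not at hand, a rough first check is to see which candidate encodings decode the raw bytes without error. The following is only a sketch using the Python standard library: the function name and the candidate list are assumptions made for illustration, a clean decode does not prove that an encoding is the right one (many byte strings decode under several), and it is not a replacement for a real detector such as uchardet.

```python
def decodable_as(data: bytes, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Return the candidate encodings that decode `data` without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)  # strict error handling is the default
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# Demo on a comment that was saved as Shift-JIS (sample text is made up).
sample = "/* 日本語のコメント */".encode("shift_jis")
print(decodable_as(sample))
```

If "utf-8" is missing from the result, the file contains byte sequences that are not valid UTF-8, which is the situation described in the linked DStep issue.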

November 07, 2018

Re: Parsing D files with non-unicode characters

Posted by Jonathan Marler
in reply to Arun Chandrasekaran

On Wednesday, 7 November 2018 at 05:33:20 UTC, Arun Chandrasekaran wrote:
> On Tuesday, 6 November 2018 at 17:21:13 UTC, Jonathan Marler wrote:
>> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>>> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>>
>>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>>>
>>> Is this a bug in D to reject non-unicode chars in comments?
>>>
>>> Arun
>>
>> So you have code that has characters that are neither ascii nor unicode?  What encoding is it using?  And what characters does it contain that can't be represented with unicode?
>
> I was not able to find the character encoding. file -i said unknown-8bit. Ultimately https://github.com/BYVoid/uchardet helped me to determine the charset, it was SHIFT-JIS.

I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM at the beginning then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.

On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
> I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM in the beginning then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.

It can handle multibyte UTF-8 characters without a byte order mark. Should be straightforward to test this:

    echo '/* ſ™🅻 */' > file.d
    dmd -c file.d

On dmd 2.081.1, the byte order mark changes nothing for me.
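The same two questions (does the file start with a UTF-8 BOM, and are its bytes valid UTF-8?) can also be checked outside the compiler. A minimal sketch in Python follows; the function name is made up for illustration, and the check says nothing about how DMD itself decodes its input.

```python
import codecs


def describe_utf8(data: bytes):
    """Return (has_bom, is_valid_utf8) for raw file contents."""
    has_bom = data.startswith(codecs.BOM_UTF8)  # bytes EF BB BF
    try:
        data.decode("utf-8")
        is_valid = True
    except UnicodeDecodeError:
        is_valid = False
    return has_bom, is_valid


# Demo: the test comment from the post above, with and without a BOM.
comment = "/* ſ™🅻 */\n".encode("utf-8")
print(describe_utf8(comment))
print(describe_utf8(codecs.BOM_UTF8 + comment))
```

Prepending a BOM to bytes that are not valid UTF-8 does not make them valid, so a file that only parses after a BOM was added in an editor has most likely been re-encoded by that editor on save.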

On Wednesday, 7 November 2018 at 17:37:11 UTC, Neia Neutuladh wrote:
> On Wed, 07 Nov 2018 15:51:13 +0000, Jonathan Marler wrote:
>> I hadn't seen that you provided a link to the file. After I found it, I played with it a bit. It looks like if you add a UTF-8 BOM in the beginning then DMD successfully parses it. However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does? Is DMD supposed to allow multi-byte UTF-8 characters if there is no BOM? If so, then this is a bug.
>
> It can handle multibyte UTF8 characters without a byte order mark. Should be straightforward to test this:
>
>     echo '/* ſ™🅻 */' > file.d
>     dmd -c file.d
>
> On dmd 2.081.1, the byte order mark changes nothing for me.

Ok, that matches what I saw in the code. It looks like my editor was actually changing the file. The encoding being used in that file is not valid UTF-8, so the solution is to re-encode the files as UTF-8.
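Re-encoding can be scripted before the headers are handed to dstep. The sketch below uses only the Python standard library and assumes the source encoding is Shift-JIS, as reported earlier in the thread; the function name and its default argument are illustrative, not part of DStep or DMD.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="shift_jis"):
    """Rewrite `path` in place as UTF-8, decoding it from `source_encoding`.

    Returns the decoded text. Raises UnicodeDecodeError (and leaves the file
    untouched) if the bytes are not valid in `source_encoding`.
    """
    p = Path(path)
    raw = p.read_bytes()
    text = raw.decode(source_encoding)
    p.write_bytes(text.encode("utf-8"))  # no BOM is written
    return text
```

A typical use would be to call it once per header, for example over `Path("include").rglob("*.h")`, where `include` is a placeholder directory name. It should be run on a copy or under version control: applying it a second time to a file that is already UTF-8 would either fail or silently garble the comments.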

On Wednesday, 7 November 2018 at 15:51:13 UTC, Jonathan Marler wrote:
> However, from my quick scan of lexer.d, I didn't see anywhere in the code that actually changes how it decodes the file based on the presence of the BOM. Does anyone know if it does?

https://github.com/dlang/dmd/blob/master/src/dmd/dmodule.d#L652
