Parsing D files with non-unicode characters

Nov 05, 2018

Arun Chandrasekaran

I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215

https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

Is this a bug in D to reject non-unicode chars in comments?

Arun

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See:

https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
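For readers who want to try this, a minimal sketch of the stripping approach follows. It assumes an iconv that supports the `-c` flag (discard characters that cannot be converted), as the GNU and libiconv versions do. The function name `strip_invalid_utf8` is made up for illustration. Note that this throws away the offending bytes, so any non-ASCII text in the comments is lost rather than converted.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not from the thread):
# write a copy of the file to stdout with every byte sequence that is
# not valid UTF-8 removed.
#   -f UTF-8 -t UTF-8  treat input and output as UTF-8
#   -c                 silently drop input that cannot be converted
strip_invalid_utf8() {
    # iconv may still exit non-zero when it had to drop something,
    # so callers should not treat the exit status as a hard failure.
    iconv -f UTF-8 -t UTF-8 -c "$1" 2>/dev/null
}
```

Typical use would be something like `strip_invalid_utf8 foo.h > foo.clean.h`, keeping the original file around until the result has been checked.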

On Tuesday, 6 November 2018 at 00:38:03 UTC, Roland Hadinger wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>
> Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See:
>
> https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

Thanks! Can't we preserve the comments? Comments are invaluable, especially on the headerfiles. We generate documentation using doxygen.

On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>
> Thanks! Can't we preserve the comments? Comments are invaluable, especially on the headerfiles. We generate documentation using doxygen.

If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

November 05, 2018

Re: Parsing D files with non-unicode characters

Posted by Jonathan M Davis
in reply to Roland Hadinger

On Monday, November 5, 2018 6:19:17 PM MST Roland Hadinger via Digitalmars-d wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
> > Thanks! Can't we preserve the comments? Comments are invaluable, especially on the headerfiles. We generate documentation using doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

If I understand correctly, non-UTF isn't legal in D source files, so that's just plain not possible, period. The files will have to be converted to Unicode in order to be D source files, even in comments. If the characters are legal in some other encoding, then the encoding will need to be correctly detected and converted to Unicode somehow. If they're just invalid, then arguably, there really isn't anything to preserve anyway.

- Jonathan M Davis
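Before converting anything, it can help to find out which headers actually contain bytes that are not valid UTF-8. A rough sketch, assuming a POSIX shell and iconv on the PATH; the function names and the `*.h` filter are illustrative choices, not something from the thread. Pure ASCII files pass the check, since ASCII is a subset of UTF-8.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the file decodes cleanly
# as UTF-8, fails otherwise. iconv without -c stops with an error at
# the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Hypothetical helper: print every .h file under directory $1 that is
# NOT valid UTF-8, one path per line. Paths containing newlines are
# not handled.
list_non_utf8_headers() {
    find "$1" -type f -name '*.h' | while IFS= read -r f; do
        is_valid_utf8 "$f" || printf '%s\n' "$f"
    done
}
```

Running `list_non_utf8_headers include/` would then list only the files that need attention; everything else can go through DStep unchanged.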

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

I may be missing something, but isn’t it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?

On Tuesday, 6 November 2018 at 07:25:09 UTC, Bastiaan Veelo wrote:
>
> I may be missing something, but isn’t it possible to open these files one for one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?

Yes. Better text editors are capable of automatically inferring (guessing) the source encoding, although not always correctly. Guessing the source encoding is something 'iconv' cannot do.

I forgot to mention that 'iconv' can also convert text between different encodings, but only when the source encoding is known. When it isn't known, or when a file contains a mix of different encodings, iconv can only be used to filter out byte sequences that are invalid in the target encoding.
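When the source encoding is known, the conversion described above keeps the comment text instead of dropping it. A small sketch, assuming iconv is available; `to_utf8` is a hypothetical wrapper name, and ISO-8859-1 in the usage note is only an example encoding, since the thread does not say what the headers actually use at this point.

```shell
#!/bin/sh
# Hypothetical helper: convert a file from a known encoding to UTF-8.
#   $1  source encoding name as iconv spells it (see `iconv -l`)
#   $2  input file
# The converted text goes to stdout; the input file is left untouched.
to_utf8() {
    iconv -f "$1" -t UTF-8 "$2"
}
```

For instance, `to_utf8 ISO-8859-1 foo.h > foo.utf8.h` would turn the single Latin-1 byte 0xE9 into the two-byte UTF-8 sequence for "é". If the stated source encoding is wrong, iconv will usually still produce output, just with the wrong characters, so a spot check of the result is worthwhile.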

On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>
> Is this a bug in D to reject non-unicode chars in comments?
>
> Arun

So you have code that has characters that are neither ascii nor unicode? What encoding is it using? And what characters does it contain that can't be represented with unicode?

On Tue, Nov 06, 2018 at 05:21:13PM +0000, Jonathan Marler via Digitalmars-d wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> > I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
> >
> > https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
[...]

If you're on Linux, you could use the recode utility:

https://github.com/rrthomas/recode/

T

--
The easy way is the wrong way, and the hard way is the stupid way. Pick one.

On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>>
>> Thanks! Can't we preserve the comments? Comments are invaluable, especially on the headerfiles. We generate documentation using doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

This did the trick. It uses https://github.com/BYVoid/uchardet to determine the character set.

    for dir in $(find <DIR> -name include -type d); do
        pushd "$dir"
        for file in *; do
            # detect the source encoding with uchardet, then convert to UTF-8
            iconv -f "$(uchardet "$file")" -t UTF-8 "$file" > t
            /bin/mv t "$file"
            # if the encoding is SHIFT-JIS iconv converts \ to ¥. Restore it back.
            sed -i 's,¥,\\,g' "$file"
            # convert .h to .d file
            dstep "$file"
        done
        popd
    done
