Parsing D files with non-unicode characters
November 05, 2018
I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215

https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

Is this a bug in D to reject non-unicode chars in comments?

Arun
November 06, 2018
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See:

https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
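
For illustration, a minimal sketch of that idea, assuming an `iconv` that supports the `-c` flag (GNU iconv does). The function name `strip_invalid_utf8` is made up for this example. Note that this drops the offending bytes, so any comment text stored in another encoding is lost rather than converted.

```shell
#!/bin/sh
# Hypothetical helper: copy stdin to stdout, dropping byte sequences
# that are not valid UTF-8. The -c flag tells iconv to omit input it
# cannot convert instead of stopping at the first bad sequence.
strip_invalid_utf8() {
    # iconv may exit non-zero when it had to drop something, so the
    # exit status is deliberately ignored here.
    iconv -c -f UTF-8 -t UTF-8 || true
}

# Usage (file names are placeholders):
#   strip_invalid_utf8 < header.h > header.clean.h
```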

November 06, 2018
On Tuesday, 6 November 2018 at 00:38:03 UTC, Roland Hadinger wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
>> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>>
>> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>
> Just an idea: if you have 'iconv' available, you could always strip out non-utf-8 characters beforehand. See:
>
> https://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file

Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using Doxygen.
November 06, 2018
On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>
> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using Doxygen.

If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.
November 05, 2018
On Monday, November 5, 2018 6:19:17 PM MST Roland Hadinger via Digitalmars-d wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
> > Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using Doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

If I understand correctly, non-UTF text isn't legal in D source files, so that's just plain not possible, period. The files will have to be converted to Unicode in order to be D source files, even in comments. If the characters are legal in some other encoding, then that encoding will need to be correctly detected and converted to Unicode somehow. If they're just invalid, then arguably, there really isn't anything to preserve anyway.

- Jonathan M Davis
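
As a rough way to find out which files actually need converting, here is a small sketch that checks whether a file decodes cleanly as UTF-8. It assumes `iconv` is installed; the function name `is_valid_utf8` is hypothetical. It only says whether the bytes are valid UTF-8, not which encoding an invalid file really uses.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if the file decodes
# cleanly as UTF-8. Without -c, iconv stops with a non-zero status
# at the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage: list the headers in the current directory that are not UTF-8.
#   for f in *.h; do is_valid_utf8 "$f" || echo "$f"; done
```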



November 06, 2018
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?

I may be missing something, but isn’t it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?



November 06, 2018
On Tuesday, 6 November 2018 at 07:25:09 UTC, Bastiaan Veelo wrote:
>
> I may be missing something, but isn’t it possible to open these files one by one in their current encoding and save them in UTF-8 encoding using an editor that supports that, e.g., Sublime Text or Kate?

Yes. Better text editors are capable of automatically inferring (guessing) the source encoding, although not always correctly. Guessing the source encoding is something 'iconv' cannot do.

I forgot to mention that 'iconv' can also convert text between different encodings, but only when the source encoding is known. When it isn't known or when a file contains a mix of different encodings, iconv can only be used to filter out byte sequences that are invalid in the target encoding.

November 06, 2018
On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
>
> https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
>
> Is this a bug in D to reject non-unicode chars in comments?
>
> Arun

So you have code that has characters that are neither ASCII nor Unicode? What encoding is it using? And what characters does it contain that can't be represented in Unicode?
November 06, 2018
On Tue, Nov 06, 2018 at 05:21:13PM +0000, Jonathan Marler via Digitalmars-d wrote:
> On Monday, 5 November 2018 at 23:50:46 UTC, Arun Chandrasekaran wrote:
> > I'm converting a large amount of header files from C to D using DStep and I'm stuck at https://github.com/jacob-carlborg/dstep/issues/215
> > 
> > https://dlang.org/spec/intro.html shows that ASCII and UTF char formats are accepted. How do I go about converting a large code base like this?
[...]

If you're on Linux, you could use the recode utility:

	https://github.com/rrthomas/recode/


T

-- 
The easy way is the wrong way, and the hard way is the stupid way. Pick one.
November 07, 2018
On Tuesday, 6 November 2018 at 01:19:17 UTC, Roland Hadinger wrote:
> On Tuesday, 6 November 2018 at 00:48:34 UTC, Arun Chandrasekaran wrote:
>>
>> Thanks! Can't we preserve the comments? Comments are invaluable, especially in the header files. We generate documentation using Doxygen.
>
> If by 'preserve' you mean 'keep the non-UTF-8 encoding as-is', then no, what I suggested wouldn't work.

This did the trick. It uses https://github.com/BYVoid/uchardet to determine the character set.

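# <DIR> is a placeholder for the root of the source tree.
for dir in $(find <DIR> -name include -type d); do
    pushd "$dir"

    for file in *; do
        # Detect the encoding with uchardet and convert the file to UTF-8.
        iconv -f "$(uchardet "$file")" -t UTF-8 "$file" > t
        /bin/mv t "$file"

        # If the encoding is SHIFT-JIS, iconv converts \ to ¥. Restore it.
        sed -i 's,¥,\\,g' "$file"

        # Convert the .h file to a .d file.
        dstep "$file"
    done

    popd
done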
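
A possible alternative to the sed step, offered as an assumption rather than anything confirmed in this thread: if the headers were written on Windows, their real encoding may be CP932 (Microsoft's Shift-JIS variant), in which byte 0x5C is a plain backslash. Decoding as CP932 would then leave backslashes alone, whereas the sed step rewrites every ¥ in the file, intended or not. The function name `cp932_to_utf8` is hypothetical, and this only works if the local iconv knows the CP932 name.

```shell
#!/bin/sh
# Hypothetical helper: decode stdin as CP932 and write UTF-8 to stdout.
# Assumes the input really is CP932; for other encodings the result
# will be wrong or iconv will fail.
cp932_to_utf8() {
    iconv -f CP932 -t UTF-8
}

# Usage (file names are placeholders):
#   cp932_to_utf8 < header.h > header.utf8.h
```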