Parsing a UTF-16LE file line by line?

Hi, I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error: "Invalid UTF-8 sequence (at index 1)" How can I achieve what I want, without loading the entire file into memory? Thanks in advance.

Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.

Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.

Daniel Kozák <kozzi11@gmail.com> wrote on Wed, Jan 4, 2017 at 6:33:
> Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
>> Hi,
>>
>> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
>>
>> "Invalid UTF-8 sequence (at index 1)"
>>
>> How can I achieve what I want, without loading the entire file into memory?
>>
>> Thanks in advance.
>
> Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.

OK, I've done some testing and you are right: byLine is broken, so please file a bug.

On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> OK, I've done some testing and you are right: byLine is broken, so please file a bug.

A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> OK, I've done some testing and you are right: byLine is broken, so please file a bug.
>
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.

    import std.stdio;
    import core.stdc.wchar_ : fwide;
    void main(){
        auto file = File("UTF-16LE encoded file.txt");
        fwide(file.getFP(), 1);
        foreach(line; file.byLine){
            writeln(file.readln);
        }
    }

January 04, 2017

Re: Parsing a UTF-16LE file line by line, BUG?

Posted by Daniel Kozák
in reply to Nestor

Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 8:20:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> OK, I've done some testing and you are right: byLine is broken, so please file a bug.
> 
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

That impression is understandable, but the documentation says nothing
about it, so anyone who reads the docs will expect it to work with any
encoding. The docs also show a way to select the encoding, and even to
select the Terminator and its type, and this does not work, so I
consider it a bug.

Another weird behaviour: when you read a file as wstring, it will try to decode it as UTF-8 and then encode it to UTF-16. Even when that works (for UTF-8 files) and you end up with wstring lines (wstring[]), trying to save them will automatically write them out as UTF-8. WTF, this is really wrong, and if it is intended it should be documented better. Right now it is really hard to work with dlang stdio.

But I hope it will be deprecated someday and replaced with something that supports ranges and async I/O.
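To illustrate the write-side surprise described in this post, here is a minimal sketch of one way to keep wstring lines as UTF-16 on disk: bypass the formatted output functions and write the code units directly with `File.rawWrite`. The helper name `writeUtf16leLines` is hypothetical and not something proposed in this thread, and the sketch assumes a little-endian host, so the in-memory wchar layout is already UTF-16LE.

```d
import std.stdio : File;

/// Hypothetical helper (not part of Phobos): writes each line as raw
/// UTF-16 code units, preceded by a byte order mark.
/// Assumption: little-endian host, so the bytes on disk are UTF-16LE.
void writeUtf16leLines(string path, const(wstring)[] lines)
{
    auto file = File(path, "wb");   // binary mode: no newline translation
    file.rawWrite("\uFEFF"w);       // BOM, stored as bytes FF FE on such a host
    foreach (line; lines)
    {
        file.rawWrite(line);        // code units are written as-is, no transcoding
        file.rawWrite("\n"w);       // plain LF terminator, also two bytes wide
    }
}
```

A big-endian host would need a byte swap before each write; that case is left out of the sketch.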

On 1/4/17 6:03 AM, Nestor wrote:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently
> this function doesn't work with anything other than UTF-8, because I get
> this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.

I have not tested much with UTF-16 and std.stdio, but I don't believe the underlying FILE * being used by Phobos has good support for it.

In my testing, for instance, byLine with a non-ASCII delimiter didn't work at all. On Windows 64-bit, MSVC simply ignores any attempts to change the width of the stream.

I wouldn't hold out much hope for this to be fixed.

-Steve
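For the original question (line-by-line access without loading the whole file), one possible workaround is to skip byLine and decode manually. The sketch below is not from any poster in this thread: it reads fixed-size chunks of UTF-16 code units with `File.rawRead`, splits on `'\n'`, and converts each line to UTF-8. The names `eachUtf16leLine` and `emitLine` are hypothetical, the 4096-unit buffer size is arbitrary, and the code assumes a little-endian host so that raw wchar values match UTF-16LE.

```d
import std.conv : to;
import std.stdio : File;

// Hypothetical helper, not part of Phobos: hands one finished line to `sink`.
private void emitLine(const(wchar)[] line, scope void delegate(string) sink)
{
    // Drop the '\r' left over from a Windows-style "\r\n" terminator.
    if (line.length && line[$ - 1] == '\r')
        line = line[0 .. $ - 1];
    sink(line.to!string);   // UTF-16 -> UTF-8; throws on malformed UTF-16
}

/// Calls `sink` once per line of a UTF-16LE file, holding only one chunk
/// plus the current unfinished line in memory.
/// Assumption: little-endian host (no byte swapping is done).
void eachUtf16leLine(string path, scope void delegate(string) sink)
{
    auto file = File(path, "rb");    // binary mode: no newline translation
    auto chunk = new wchar[4096];    // fixed-size read buffer (size is arbitrary)
    wchar[] pending;                 // code units of the line still being built
    bool atStart = true;

    for (;;)
    {
        wchar[] got = file.rawRead(chunk);
        if (got.length == 0)
            break;                   // end of file
        if (atStart)
        {
            atStart = false;
            if (got[0] == 0xFEFF)    // skip a leading byte order mark
                got = got[1 .. $];
        }
        size_t start = 0;
        foreach (i, wchar c; got)    // iterates code units, no auto-decoding
        {
            if (c != '\n')
                continue;
            pending ~= got[start .. i];
            emitLine(pending, sink);
            pending.length = 0;
            start = i + 1;
        }
        pending ~= got[start .. $];  // keep the unterminated tail for the next chunk
    }
    if (pending.length)
        emitLine(pending, sink);     // last line without a trailing newline
}
```

With `std.stdio.writeln` imported, a call could look like `eachUtf16leLine("input.txt", (string line) { writeln(line); });`, where `input.txt` is a placeholder file name. Because conversion happens per line rather than per chunk, a surrogate pair split across two reads is reassembled before decoding.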

> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>
> import std.stdio;
> import core.stdc.wchar_ : fwide;
> void main(){
>     auto file = File("UTF-16LE encoded file.txt");
>     fwide(file.getFP(), 1);
>     foreach(line; file.byLine){
>         writeln(file.readln);
>     }
> }

fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx

On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>>
>> import std.stdio;
>> import core.stdc.wchar_ : fwide;
>> void main(){
>>     auto file = File("UTF-16LE encoded file.txt");
>>     fwide(file.getFP(), 1);
>>     foreach(line; file.byLine){
>>         writeln(file.readln);
>>     }
>> }
>
> fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx

That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.

On 01/06/2017 11:33 AM, pineapple wrote:
> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>> I'm not sure if this works quite as intended, but I was at least able
>>> to produce a UTF-16 decode error rather than a UTF-8 decode error by
>>> setting the file orientation before reading it.
>>>
>>> import std.stdio;
>>> import core.stdc.wchar_ : fwide;
>>> void main(){
>>>     auto file = File("UTF-16LE encoded file.txt");
>>>     fwide(file.getFP(), 1);
>>>     foreach(line; file.byLine){
>>>         writeln(file.readln);
>>>     }
>>> }
>>
>> fwide is not implemented in Windows:
>> https://msdn.microsoft.com/en-us/library/aa985619.aspx
>
> That's odd. It was on Windows 7 64-bit that I put together and tested
> that example, and calling fwide definitely had an effect on program
> behavior.

Are you compiling a 32-bit binary? In that case you would be using the Digital Mars C runtime, which might have an implementation of fwide.

--
Mike Wey
