Parsing a UTF-16LE file line by line?
January 04, 2017
Hi,

I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:

"Invalid UTF-8 sequence (at index 1)"

How can I achieve what I want, without loading the entire file into memory?

Thanks in advance.
January 04, 2017
Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
> Hi,
> 
> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
> 
> "Invalid UTF-8 sequence (at index 1)"
> 
> How can I achieve what I want, without loading the entire file into memory?
> 
> Thanks in advance.
Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.


January 04, 2017
Daniel Kozák <kozzi11@gmail.com> wrote on Wed, Jan 4, 2017 at 6:33:
> 
> Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
>> Hi,
>> 
>> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
>> 
>> "Invalid UTF-8 sequence (at index 1)"
>> 
>> How can I achieve what I want, without loading the entire file into memory?
>> 
>> Thanks in advance.
> Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.

OK, I've done some testing and you are right: byLine is broken, so please file a bug.




January 04, 2017
On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> OK, I've done some testing and you are right: byLine is broken, so please file a bug.

A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
January 04, 2017
On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> OK, I've done some testing and you are right: byLine is broken, so please file a bug.
>
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.

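    import std.stdio;
    import core.stdc.wchar_ : fwide;
    void main(){
        auto file = File("UTF-16LE encoded file.txt");
        // Request wide orientation before the first read on the stream.
        fwide(file.getFP(), 1);
        foreach(line; file.byLine){
            // Print the line byLine yielded; calling file.readln here would consume an extra line.
            writeln(line);
        }
    }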
January 04, 2017
Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 8:20:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> OK, I've done some testing and you are right: byLine is broken, so please file a bug.
> 
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.

An impression is nice, but the documentation says nothing about it, so anyone who reads the docs will expect it to work with any encoding.
From the docs I also see that there is a way to select the encoding, and even to select the Terminator and its type, and this does not work, so I consider it a bug.

Another weird behaviour: when you read a file as wstring, it tries to decode it as UTF-8 and then encode it to UTF-16. Even when that works (for UTF-8 files) and you end up with wstring lines (wstring[]), saving them automatically writes UTF-8 again. WTF, this is really wrong, and if it is intended it should be documented better. Right now it is really hard to work with D's stdio.

But I hope it will be deprecated someday and replaced with something that supports ranges and async I/O.
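
A minimal sketch of one way to sidestep the automatic UTF-8 output described above, assuming a little-endian host: File.rawWrite copies the in-memory code units of a wstring to disk unchanged, so the bytes on disk are UTF-16LE. The helper name writeLinesUTF16LE and the output file name are hypothetical, not part of Phobos.

```d
import std.stdio : File;

/// Writes each line as raw UTF-16 code units followed by a newline.
/// Assumption: the host is little-endian, so the raw units are UTF-16LE.
void writeLinesUTF16LE(string path, const wstring[] lines)
{
    auto f = File(path, "wb");
    f.rawWrite("\uFEFF"w);      // byte order mark, written as FF FE on little-endian hosts
    foreach (line; lines)
    {
        f.rawWrite(line);       // no transcoding: the code units are copied as-is
        f.rawWrite("\n"w);
    }
}

void main()
{
    // Hypothetical output file name, for illustration only.
    writeLinesUTF16LE("out-utf16le.txt", ["first line"w, "second line"w]);
}
```

On a big-endian host the same code would produce UTF-16BE, so the code units would need to be byte-swapped first.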


January 05, 2017
On 1/4/17 6:03 AM, Nestor wrote:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently
> this function doesn't work with anything other than UTF-8, because I get
> this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.

I have not tested much with UTF16 and std.stdio, but I don't believe the underlying FILE * being used by phobos has good support for it.

In my testing, for instance, byLine with a non-ASCII delimiter didn't work at all.

On Windows 64-bit, MSVC simply ignores any attempts to change the width of the stream.

I wouldn't hold out much hope for this to be fixed.

-Steve
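
If the C stream layer cannot be relied on, one possible workaround for the original question is to bypass it: read raw bytes in fixed-size chunks and decode the UTF-16LE code units by hand, so only one chunk and one line are in memory at a time. The following is a minimal sketch under that assumption; eachLineUTF16LE is a hypothetical helper, not a Phobos function, and it has not been tested against the files discussed in this thread.

```d
import std.stdio : File, writeln;
import std.utf : toUTF8;

/// Calls `sink` once per line of a UTF-16LE file, without loading the whole file.
/// Hypothetical helper for illustration.
void eachLineUTF16LE(string path, void delegate(string line) sink)
{
    wchar[] pending;      // code units of the line currently being assembled
    bool atStart = true;  // used to skip a leading byte order mark (U+FEFF)

    // byChunk fills every chunk except the last, so an even chunk size
    // keeps the 2-byte code units aligned across chunk boundaries.
    foreach (ubyte[] chunk; File(path, "rb").byChunk(4096))
    {
        // A dangling odd byte at the very end of the file is ignored.
        for (size_t i = 0; i + 1 < chunk.length; i += 2)
        {
            // Little-endian: the low byte comes first.
            immutable wchar unit = cast(wchar)(chunk[i] | (chunk[i + 1] << 8));
            if (atStart)
            {
                atStart = false;
                if (unit == 0xFEFF) continue;
            }
            if (unit == '\n')
            {
                // Drop the '\r' of a Windows-style line ending.
                if (pending.length && pending[$ - 1] == '\r')
                    pending = pending[0 .. $ - 1];
                sink(toUTF8(pending));   // toUTF8 returns a fresh string
                pending.length = 0;
                pending.assumeSafeAppend();
            }
            else
                pending ~= unit;
        }
    }
    // Last line without a trailing newline.
    if (pending.length) sink(toUTF8(pending));
}

void main()
{
    // File name taken from the example earlier in the thread.
    eachLineUTF16LE("UTF-16LE encoded file.txt", (line) { writeln(line); });
}
```

Surrogate pairs need no special handling here because '\n' is never part of one; toUTF8 deals with them when the complete line is converted.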
January 06, 2017
>
> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>
>     import std.stdio;
>     import core.stdc.wchar_ : fwide;
>     void main(){
>         auto file = File("UTF-16LE encoded file.txt");
>         fwide(file.getFP(), 1);
>         foreach(line; file.byLine){
>             writeln(line);
>         }
>     }

fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx


January 06, 2017
On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>
>> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>>
>>     import std.stdio;
>>     import core.stdc.wchar_ : fwide;
>>     void main(){
>>         auto file = File("UTF-16LE encoded file.txt");
>>         fwide(file.getFP(), 1);
>>         foreach(line; file.byLine){
>>             writeln(line);
>>         }
>>     }
>
> fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx

That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.
January 06, 2017
On 01/06/2017 11:33 AM, pineapple wrote:
> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>>
>>> I'm not sure if this works quite as intended, but I was at least able
>>> to produce a UTF-16 decode error rather than a UTF-8 decode error by
>>> setting the file orientation before reading it.
>>>
>>>     import std.stdio;
>>>     import core.stdc.wchar_ : fwide;
>>>     void main(){
>>>         auto file = File("UTF-16LE encoded file.txt");
>>>         fwide(file.getFP(), 1);
>>>         foreach(line; file.byLine){
>>>             writeln(line);
>>>         }
>>>     }
>>
>> fwide is not implemented in Windows:
>> https://msdn.microsoft.com/en-us/library/aa985619.aspx
>
> That's odd. It was on Windows 7 64-bit that I put together and tested
> that example, and calling fwide definitely had an effect on program
> behavior.

Are you compiling a 32-bit binary? In that case you would be using the Digital Mars C runtime, which might have an implementation of fwide.

-- 
Mike Wey