January 04, 2017 Parsing a UTF-16LE file line by line?
Hi,

I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:

"Invalid UTF-8 sequence (at index 1)"

How can I achieve what I want, without loading the entire file into memory?

Thanks in advance.
|
January 04, 2017 Re: Parsing a UTF-16LE file line by line?
Posted in reply to Nestor
| Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.
Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.
|
January 04, 2017 Re: Parsing a UTF-16LE file line by line?
| Daniel Kozák <kozzi11@gmail.com> wrote on Wed, Jan 4, 2017 at 6:33:
>
> Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 12:03:
>> Hi,
>>
>> I was just trying to parse a UTF-16LE file using byLine, but apparently this function doesn't work with anything other than UTF-8, because I get this error:
>>
>> "Invalid UTF-8 sequence (at index 1)"
>>
>> How can I achieve what I want, without loading the entire file into memory?
>>
>> Thanks in advance.
> Can you show your code? byLine should work OK. Please also post an example of a UTF-16LE file that does not work.
OK, I've done some testing and you are right: byLine is broken, so please file a bug report.
|
January 04, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to Daniel Kozák | On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
> Ok, I've done some testing and you are right byLine is broken, so please file a bug
A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
|
January 04, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to Nestor | On Wednesday, 4 January 2017 at 19:20:31 UTC, Nestor wrote:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> Ok, I've done some testing and you are right byLine is broken, so please file a bug
>
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
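import std.stdio;
import core.stdc.wchar_ : fwide;

void main(){
    auto file = File("UTF-16LE encoded file.txt");
    // A positive mode asks the C runtime to make the stream wide-oriented.
    fwide(file.getFP(), 1);
    foreach(line; file.byLine){
        // Note: readln consumes the line after the one byLine just read,
        // so this prints every other line; writeln(line) would print them all.
        writeln(file.readln);
    }
}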
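For reference, one way to avoid depending on the C stream orientation at all is to read the raw bytes in chunks and pair them into UTF-16LE code units by hand. The following is only a minimal sketch under stated assumptions: `eachLineUtf16LE` is a hypothetical helper name (not part of Phobos), the file name is a placeholder, lines are assumed to end in `\n` or `\r\n`, and a leading byte order mark is skipped.

```d
import std.stdio : File, writeln;
import std.utf : toUTF8;
static import std.file;

/// Hypothetical helper (not part of Phobos): calls `sink` once per line of a
/// UTF-16LE file, reading it in fixed-size chunks instead of all at once.
void eachLineUtf16LE(string path, void delegate(string line) sink)
{
    auto file = File(path, "rb"); // binary mode, so bytes arrive untouched
    ubyte[] pending;              // bytes not yet paired into code units
    wchar[] current;              // code units of the line being assembled
    bool atStart = true;

    void flush()
    {
        // Drop a trailing '\r' so CRLF and LF files yield the same lines.
        if (current.length && current[$ - 1] == '\r')
            current = current[0 .. $ - 1];
        sink(toUTF8(current));
        current = null;
    }

    foreach (ubyte[] chunk; file.byChunk(4096))
    {
        pending ~= chunk; // copies, because byChunk reuses its buffer
        size_t i = 0;
        for (; i + 1 < pending.length; i += 2)
        {
            // Little-endian: the low byte comes first.
            immutable wchar unit = cast(wchar)(pending[i] | (pending[i + 1] << 8));
            if (atStart)
            {
                atStart = false;
                if (unit == 0xFEFF) continue; // skip the byte order mark
            }
            if (unit == '\n') flush();
            else current ~= unit;
        }
        pending = pending[i .. $].dup; // keep an unpaired trailing byte
    }
    if (current.length) flush();
}

void main()
{
    // Demo data: BOM, then "one\r\ntwo\n" as UTF-16LE bytes.
    ubyte[] demo = [0xFF, 0xFE, 0x6F, 0, 0x6E, 0, 0x65, 0, 0x0D, 0, 0x0A, 0,
                    0x74, 0, 0x77, 0, 0x6F, 0, 0x0A, 0];
    std.file.write("demo-utf16le.txt", demo); // placeholder file name
    eachLineUtf16LE("demo-utf16le.txt", (string line) { writeln(line); });
}
```

Only one chunk plus the current line is held in memory at a time, which is the property the original question asked for. The chunk size of 4096 is arbitrary.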
|
January 04, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to Nestor
| Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote on Wed, Jan 4, 2017 at 8:20:
> On Wednesday, 4 January 2017 at 18:48:59 UTC, Daniel Kozák wrote:
>> Ok, I've done some testing and you are right byLine is broken, so please file a bug
>
> A bug? I was under the impression that this function was *intended* to work only with UTF-8 encoded files.
That impression is reasonable, but the documentation says nothing of the sort, so anyone who reads it will expect byLine to work with any encoding. The documentation also shows a way to select the encoding, and even the Terminator and its type, and that does not work, so I consider it a bug.

Another weird behaviour: when you read a file as wstring, it tries to decode the file as UTF-8 and then encodes it to UTF-16. Even when that works (for UTF-8 files) and you end up with wstring lines (wstring[]), saving them automatically writes UTF-8 again. This is really wrong, and if it is intended, it should be documented better.

Right now it is really hard to work with D's stdio. I hope it will be deprecated someday and replaced with something that supports ranges and async I/O.
|
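To illustrate the point about output encoding: if UTF-16LE output is wanted, one option is to do the encoding explicitly and write the resulting bytes with `rawWrite`, so that no text-mode conversion is involved. This is a sketch only; `encodeUtf16LE` is a hypothetical helper name and the output path is a placeholder.

```d
import std.stdio : File;
import std.utf : toUTF16;

/// Hypothetical helper: encodes `text` as UTF-16LE bytes
/// (low byte of each code unit first).
ubyte[] encodeUtf16LE(string text)
{
    ubyte[] bytes;
    foreach (wchar unit; toUTF16(text))
    {
        bytes ~= cast(ubyte)(unit & 0xFF);
        bytes ~= cast(ubyte)(unit >> 8);
    }
    return bytes;
}

void main()
{
    auto output = File("output-utf16le.txt", "wb"); // placeholder path
    ubyte[2] bom = [0xFF, 0xFE];                    // UTF-16LE byte order mark
    output.rawWrite(bom[]);
    output.rawWrite(encodeUtf16LE("first line\n"));
    output.rawWrite(encodeUtf16LE("second line\n"));
}
```

Building the bytes one code unit at a time keeps the result little-endian regardless of the host's byte order.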
January 05, 2017 Re: Parsing a UTF-16LE file line by line?
Posted in reply to Nestor | On 1/4/17 6:03 AM, Nestor wrote:
> Hi,
>
> I was just trying to parse a UTF-16LE file using byLine, but apparently
> this function doesn't work with anything other than UTF-8, because I get
> this error:
>
> "Invalid UTF-8 sequence (at index 1)"
>
> How can I achieve what I want, without loading the entire file into memory?
>
> Thanks in advance.
I have not tested much with UTF-16 and std.stdio, but I don't believe the underlying FILE * being used by Phobos has good support for it.
In my testing, for instance, byLine with a non-ASCII delimiter didn't work at all.
On Windows 64-bit, MSVC simply ignores any attempts to change the width of the stream.
I wouldn't hold out much hope for this to be fixed.
-Steve
|
January 06, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to pineapple |
>
> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>
> import std.stdio;
> import core.stdc.wchar_ : fwide;
> void main(){
> auto file = File("UTF-16LE encoded file.txt");
> fwide(file.getFP(), 1);
> foreach(line; file.byLine){
> writeln(file.readln);
> }
> }
fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx
|
January 06, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to rumbu | On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>
>> I'm not sure if this works quite as intended, but I was at least able to produce a UTF-16 decode error rather than a UTF-8 decode error by setting the file orientation before reading it.
>>
>> import std.stdio;
>> import core.stdc.wchar_ : fwide;
>> void main(){
>> auto file = File("UTF-16LE encoded file.txt");
>> fwide(file.getFP(), 1);
>> foreach(line; file.byLine){
>> writeln(file.readln);
>> }
>> }
>
> fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx
That's odd. It was on Windows 7 64-bit that I put together and tested that example, and calling fwide definitely had an effect on program behavior.
|
January 06, 2017 Re: Parsing a UTF-16LE file line by line, BUG?
Posted in reply to pineapple | On 01/06/2017 11:33 AM, pineapple wrote:
> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>>
>>> I'm not sure if this works quite as intended, but I was at least able
>>> to produce a UTF-16 decode error rather than a UTF-8 decode error by
>>> setting the file orientation before reading it.
>>>
>>> import std.stdio;
>>> import core.stdc.wchar_ : fwide;
>>> void main(){
>>> auto file = File("UTF-16LE encoded file.txt");
>>> fwide(file.getFP(), 1);
>>> foreach(line; file.byLine){
>>> writeln(file.readln);
>>> }
>>> }
>>
>> fwide is not implemented in Windows:
>> https://msdn.microsoft.com/en-us/library/aa985619.aspx
>
> That's odd. It was on Windows 7 64-bit that I put together and tested
> that example, and calling fwide definitely had an effect on program
> behavior.
Are you compiling a 32-bit binary? In that case you would be using the Digital Mars C runtime, which might have an implementation of fwide.
--
Mike Wey
|
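A quick way to check which C runtime a given build targets is to branch on D's predefined `CRuntime_*` version identifiers. The sketch below only reports the runtime selected at compile time; `cRuntimeName` is a hypothetical helper name, and the check says nothing about whether that runtime's fwide actually changes the stream orientation.

```d
import std.stdio : writeln;

/// Hypothetical helper: names the C runtime chosen at compile time,
/// based on the predefined CRuntime_* version identifiers.
string cRuntimeName()
{
    version (CRuntime_DigitalMars)
        return "Digital Mars C runtime";
    else version (CRuntime_Microsoft)
        return "Microsoft C runtime";
    else version (CRuntime_Glibc)
        return "glibc";
    else
        return "another C runtime";
}

void main()
{
    writeln("Built against: ", cRuntimeName());
}
```

Building the same file once as a 32-bit and once as a 64-bit Windows binary and comparing the output would show whether the two builds use different runtimes.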
Copyright © 1999-2021 by the D Language Foundation