Parsing a UTF-16LE file line by line, BUG? (page 2)

On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
> On 01/06/2017 11:33 AM, pineapple wrote:
>> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
>>>>
>>>> I'm not sure if this works quite as intended, but I was at least able
>>>> to produce a UTF-16 decode error rather than a UTF-8 decode error by
>>>> setting the file orientation before reading it.
>>>>
>>>>     import std.stdio;
>>>>     import core.stdc.wchar_ : fwide;
>>>>     void main(){
>>>>         auto file = File("UTF-16LE encoded file.txt");
>>>>         fwide(file.getFP(), 1);
>>>>         foreach(line; file.byLine){
>>>>             writeln(file.readln);
>>>>         }
>>>>     }
>>>
>>> fwide is not implemented in Windows:
>>> https://msdn.microsoft.com/en-us/library/aa985619.aspx
>>
>> That's odd. It was on Windows 7 64-bit that I put together and tested
>> that example, and calling fwide definitely had an effect on program
>> behavior.
>
> Are you compiling a 32bit binary? Because in that case you would be using the digital mars c runtime which might have an implementation for fwide.

After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line. Compile the following example with and without -debug and run to see what I mean:

    import std.stdio, std.string;

    enum
       EXIT_SUCCESS = 0,
       EXIT_FAILURE = 1;

    int main() {
       version(Windows) {
         import core.sys.windows.wincon;
         SetConsoleOutputCP(65001);
       }
       auto f = File("utf16le.txt", "r");
       foreach (line; f.byLine()) try {
         string s;
         debug s = cast(string)strip(line); // this is the one causing problems
         if (1 > s.length) continue;
         writeln(s);
       } catch(Exception e) {
         writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
         return EXIT_FAILURE;
       }
       return EXIT_SUCCESS;
    }

On Sunday, 15 January 2017 at 14:48:12 UTC, Nestor wrote:
> After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line. Compile the following example with and without -debug and run to see what I mean:
>
> import std.stdio, std.string;
>
> enum
>    EXIT_SUCCESS = 0,
>    EXIT_FAILURE = 1;
>
> int main() {
>    version(Windows) {
>      import core.sys.windows.wincon;
>      SetConsoleOutputCP(65001);
>    }
>    auto f = File("utf16le.txt", "r");
>    foreach (line; f.byLine()) try {
>      string s;
>      debug s = cast(string)strip(line); // this is the one causing problems
>      if (1 > s.length) continue;
>      writeln(s);
>    } catch(Exception e) {
>      writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
>      return EXIT_FAILURE;
>    }
>    return EXIT_SUCCESS;
> }

By the way, when caught, the exception says it's in file src/phobos/std/utf.d line 1217, but that file only has 784 lines. That's quite odd. (I am compiling with dmd 2.072.2)

January 15, 2017

Re: Parsing a UTF-16LE file line by line, BUG?

Posted by Daniel Kozák
in reply to Nestor

On Sun, 15 Jan 2017 14:48:12 +0000
Nestor via Digitalmars-d-learn <digitalmars-d-learn@puremagic.com> wrote:

> On Friday, 6 January 2017 at 11:42:17 UTC, Mike Wey wrote:
> > On 01/06/2017 11:33 AM, pineapple wrote:
> >> On Friday, 6 January 2017 at 06:24:12 UTC, rumbu wrote:
> >>>>
> >>>> I'm not sure if this works quite as intended, but I was at least able
> >>>> to produce a UTF-16 decode error rather than a UTF-8 decode error by
> >>>> setting the file orientation before reading it.
> >>>>
> >>>>     import std.stdio;
> >>>>     import core.stdc.wchar_ : fwide;
> >>>>     void main(){
> >>>>         auto file = File("UTF-16LE encoded file.txt");
> >>>>         fwide(file.getFP(), 1);
> >>>>         foreach(line; file.byLine){
> >>>>             writeln(file.readln);
> >>>>         }
> >>>>     }
> >>>
> >>> fwide is not implemented in Windows: https://msdn.microsoft.com/en-us/library/aa985619.aspx
> >>
> >> That's odd. It was on Windows 7 64-bit that I put together and tested
> >> that example, and calling fwide definitely had an effect on program
> >> behavior.
> >
> > Are you compiling a 32bit binary? Because in that case you would be using the digital mars c runtime which might have an implementation for fwide.
> 
> After some testing I realized that byLine was not the one failing, but any string manipulation done to the obtained line. Compile the following example with and without -debug and run to see what I mean:
> 
> import std.stdio, std.string;
> 
> enum
>    EXIT_SUCCESS = 0,
>    EXIT_FAILURE = 1;
> 
> int main() {
>    version(Windows) {
>      import core.sys.windows.wincon;
>      SetConsoleOutputCP(65001);
>    }
>    auto f = File("utf16le.txt", "r");
>    foreach (line; f.byLine()) try {
>      string s;
>      debug s = cast(string)strip(line); // this is the one causing problems
>      if (1 > s.length) continue;
>      writeln(s);
>    } catch(Exception e) {
>      writefln("Error. %s\nFile \"%s\", line %s.", e.msg, e.file, e.line);
>      return EXIT_FAILURE;
>    }
>    return EXIT_SUCCESS;
> }

This is because byLine returns a range, so until you do something with its elements it does not cause any harm :)
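One way to read that remark: byLine hands back the file's bytes typed as char[] without checking them, and the UTF-8 decoding error only appears once something (strip, writeln, and so on) actually decodes them. The following is a minimal sketch of that effect on an in-memory buffer; `isValidUtf8` is a hypothetical helper written for this illustration, not a Phobos function.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "Hi" encoded as UTF-16LE with a BOM: FF FE 48 00 69 00
    immutable ubyte[] raw = [0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00];

    // Roughly what a byte-oriented reader yields: the same bytes viewed as chars.
    auto line = cast(const(char)[]) raw;

    assert(line.length == 6);    // holding or slicing the bytes is harmless
    assert(!isValidUtf8(line));  // decoding them as UTF-8 is what fails
    assert(isValidUtf8("Hi"));
}
```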

On Sunday, 15 January 2017 at 16:29:23 UTC, Daniel Kozák wrote:
> This is because byLine returns a range, so until you do something with its elements it does not cause any harm :)

I see. So correcting my original doubt:

How could I parse a UTF-16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?
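A minimal sketch of one possible answer, assuming the input is known to be UTF-16LE: read the file in fixed-size chunks with byChunk, assemble 16-bit code units from byte pairs, and convert each completed line to UTF-8 with std.utf.toUTF8. The function name `eachLineUtf16LE` and the callback-style interface are assumptions made for this illustration; nothing here is a built-in Phobos facility, and the temporary file name is arbitrary.

```d
import std.stdio : File;
import std.utf : toUTF8;

/// Hypothetical helper: calls `sink` once per line of a UTF-16LE file,
/// passing the line as a UTF-8 string. Only one chunk is held in memory.
void eachLineUtf16LE(File f, scope void delegate(string) sink)
{
    wchar[] line;          // code units of the line being assembled
    ubyte lowByte;         // first byte of a pair, kept across chunk borders
    bool haveLow = false;
    bool atStart = true;

    void emit()
    {
        if (line.length && line[$ - 1] == '\r')
            line = line[0 .. $ - 1];   // treat CRLF like LF
        sink(toUTF8(line));            // allocates a fresh string
        line.length = 0;
        line.assumeSafeAppend();       // reuse the allocation for the next line
    }

    foreach (ubyte[] chunk; f.byChunk(4096))
    {
        foreach (b; chunk)
        {
            if (!haveLow) { lowByte = b; haveLow = true; continue; }
            haveLow = false;
            wchar unit = cast(wchar)(lowByte | (b << 8)); // little-endian pair

            if (atStart)
            {
                atStart = false;
                if (unit == 0xFEFF) continue;  // skip the byte order mark
            }
            if (unit == '\n') emit();
            else line ~= unit;
        }
    }
    if (line.length) emit();  // last line without a trailing newline
}

void main()
{
    import std.file : remove, tempDir, write;
    import std.path : buildPath;

    // BOM, then "añ\r\n" and "b\n" in UTF-16LE
    immutable ubyte[] data = [0xFF, 0xFE, 0x61, 0x00, 0xF1, 0x00,
                              0x0D, 0x00, 0x0A, 0x00, 0x62, 0x00, 0x0A, 0x00];
    auto path = buildPath(tempDir, "utf16le_demo.txt");
    write(path, data);
    scope (exit) remove(path);

    string[] lines;
    eachLineUtf16LE(File(path, "rb"), (string s) { lines ~= s; });
    assert(lines == ["añ", "b"]);
}
```

Decoding byte by byte keeps the sketch short; a tuned version would probably scan each chunk for the 0A 00 pair and convert whole slices at once.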

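On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
> I see. So correcting my original doubt:
>
> How could I parse a UTF-16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?

Could... roll your own? Although if you wanted UTF-8 output instead, it would require a second pass or, better yet, changing how i is iterated.

    import std.stdio;

    char[] getLine16LE(File inp = stdin) {
        static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
        size_t i;
        // Read one 2-byte UTF-16LE code unit at a time; stop at EOF or when the buffer is full.
        while(i + 2 <= buffer.length && inp.rawRead(buffer[i .. i+2]).length == 2) {
            // A line feed is the byte pair 0A 00; testing only the low byte would also match e.g. U+010A.
            if (buffer[i] == '\n' && buffer[i+1] == 0)
                break;

            i+=2;
        }

        return buffer[0 .. i];  // raw UTF-16LE bytes; BOM and any '\r' are still included
    }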

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> On Sunday, 15 January 2017 at 19:48:04 UTC, Nestor wrote:
>> I see. So correcting my original doubt:
>>
>> How could I parse a UTF-16LE file line by line (producing a proper string in each iteration) without loading the entire file into memory?
>
> Could... roll your own? Although if you wanted UTF-8 output instead, it would require a second pass or, better yet, changing how i is iterated.
>
> import std.stdio;
>
> char[] getLine16LE(File inp = stdin) {
>     static char[1024*4] buffer;  //4k reusable buffer, NOT thread safe
>     size_t i;
>     // Read one 2-byte UTF-16LE code unit at a time; stop at EOF or when the buffer is full.
>     while(i + 2 <= buffer.length && inp.rawRead(buffer[i .. i+2]).length == 2) {
>         // A line feed is the byte pair 0A 00; testing only the low byte would also match e.g. U+010A.
>         if (buffer[i] == '\n' && buffer[i+1] == 0)
>             break;
>
>         i+=2;
>     }
>
>     return buffer[0 .. i];  // raw UTF-16LE bytes; BOM and any '\r' are still included
> }

Thanks, but unfortunately this function does not produce proper UTF-8 strings; as a matter of fact, the output even starts with the BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines it doesn't seem to work for lines other than the first.

I guess I have to code encoding detection, buffered reads, and transcoding by hand. The only problem is that the result could be sub-optimal, which is why I was looking for a built-in solution.

On Tuesday, 17 January 2017 at 11:40:15 UTC, Nestor wrote:
> Thanks, but unfortunately this function does not produce proper UTF-8 strings; as a matter of fact, the output even starts with the BOM. Also, it doesn't handle CRLF, and even for LF-terminated lines it doesn't seem to work for lines other than the first.

I thought you wanted to get the contents line by line, which would then remain as UTF-16. Translating between the two types shouldn't be hard; probably to!string, or a foreach that appends code units to a char array, would convert to UTF-8. Skipping the BOM is just a matter of skipping the first two bytes identifying it...

> I guess I have to code encoding detection, buffered reads, and transcoding by hand. The only problem is that the result could be sub-optimal, which is why I was looking for a built-in solution.

Maybe. Honestly I'm not nearly as familiar with the library or functions as I would love to be, so often home-made solutions seem more prevalent until I learn the lingo. A disadvantage of being self taught.
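The to!string route mentioned above can be sketched as follows, assuming the line arrives as raw UTF-16LE bytes (for example from a function like getLine16LE). The helper name `utf16leToString` is hypothetical; std.conv.to does the actual UTF-16 to UTF-8 transcoding once the bytes have been combined into wchar code units.

```d
import std.conv : to;

/// Hypothetical helper: decodes raw UTF-16LE bytes into a UTF-8 string,
/// dropping a leading byte order mark if there is one.
string utf16leToString(const(ubyte)[] raw)
{
    // The UTF-16LE BOM is the two bytes FF FE.
    if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
        raw = raw[2 .. $];

    // Combine byte pairs explicitly so the result does not depend on
    // the endianness of the machine running the code.
    auto units = new wchar[raw.length / 2];
    foreach (i, ref u; units)
        u = cast(wchar)(raw[2 * i] | (raw[2 * i + 1] << 8)); // low byte first

    return units.to!string; // transcodes UTF-16 to UTF-8
}

void main()
{
    // BOM followed by "Hi€" (U+20AC is stored as AC 20)
    immutable ubyte[] raw = [0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00, 0xAC, 0x20];
    assert(utf16leToString(raw) == "Hi€");
}
```

A trailing odd byte is silently ignored here; real input handling would want to report it.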

On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
> static char[1024*4] buffer; //4k reusable buffer, NOT thread safe

Maybe I'm wrong, but I think it's thread safe, because static mutable non-shared variables are stored in TLS.

On Friday, 27 January 2017 at 07:02:52 UTC, Jack Applegame wrote:
> On Monday, 16 January 2017 at 14:47:23 UTC, Era Scarecrow wrote:
>> static char[1024*4] buffer; //4k reusable buffer, NOT thread safe
>
> Maybe I'm wrong, but I think it's thread safe, because static mutable non-shared variables are stored in TLS.

Perhaps, but fibers or other instances of sharing the buffer wouldn't be safe/reliable, at least not for long.
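Both points can be checked with a small experiment. The sketch below uses a static counter in place of the buffer (the function `bump` is made up for this illustration): a new thread starts with its own fresh copy of the static variable, while a fiber running on the same thread sees and modifies the caller's copy.

```d
import core.thread : Fiber, Thread;

/// Hypothetical stand-in for a function with a static buffer.
int bump()
{
    static int calls; // thread-local by default (not shared, not __gshared)
    return ++calls;
}

void main()
{
    bump();
    bump(); // the main thread's copy is now 2

    // A new thread gets its own, freshly initialised copy.
    int seenInThread;
    auto t = new Thread({ seenInThread = bump(); });
    t.start();
    t.join();
    assert(seenInThread == 1);

    // A fiber runs on the calling thread, so it shares that thread's copy.
    int seenInFiber;
    auto f = new Fiber({ seenInFiber = bump(); });
    f.call();
    assert(seenInFiber == 3);

    assert(bump() == 4); // the fiber's increment is visible here
}
```

So a slice returned from a static buffer is isolated between threads, but any later call on the same thread, whether from a fiber or from ordinary code, overwrites it.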

On Friday, 27 January 2017 at 04:26:31 UTC, Era Scarecrow wrote:
> Skipping the BOM is just a matter of skipping the first two bytes identifying it...

AFAIK in some cases the BOM takes up to 4 bytes (for UTF-32), so when the input encoding is unknown one must perform some kind of detection in order to apply the correct transcoding later.

I thought by now dmd had this functionality built-in and exposed, since the compiler itself seems to do it for source code units.
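For the detection step, Phobos does appear to expose something usable: std.encoding.getBOM inspects the start of a ubyte range and reports which byte order mark, if any, it found. Whether it is available depends on the Phobos release in use, so treat that as an assumption to verify. The wrapper `detectAndSkipBOM` below is a hypothetical convenience written for this sketch.

```d
import std.encoding : BOM, getBOM;

/// Hypothetical helper: reports the BOM scheme found at the start of `data`
/// and advances `data` past the mark (by 0 bytes when there is none).
BOM detectAndSkipBOM(ref const(ubyte)[] data)
{
    auto found = getBOM(data);
    data = data[found.sequence.length .. $];
    return found.schema;
}

void main()
{
    // UTF-16LE: FF FE, then 'H'
    const(ubyte)[] utf16 = [0xFF, 0xFE, 0x48, 0x00];
    assert(detectAndSkipBOM(utf16) == BOM.utf16le);
    assert(utf16.length == 2);

    // UTF-32LE: FF FE 00 00, a four-byte mark that starts like the UTF-16LE one
    const(ubyte)[] utf32 = [0xFF, 0xFE, 0x00, 0x00, 0x48, 0x00, 0x00, 0x00];
    assert(detectAndSkipBOM(utf32) == BOM.utf32le);
    assert(utf32.length == 4);

    // UTF-8: EF BB BF
    const(ubyte)[] utf8 = [0xEF, 0xBB, 0xBF, 0x48];
    assert(detectAndSkipBOM(utf8) == BOM.utf8);

    // No mark at all
    const(ubyte)[] plain = [0x48, 0x69];
    assert(detectAndSkipBOM(plain) == BOM.none);
    assert(plain.length == 2);
}
```

A BOM is only a hint: a file without one still needs a default or a heuristic, and a UTF-16LE file whose first character is U+0000 would look like UTF-32LE to this check.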
