Jump to page: 1 2
Thread overview
[Issue 17229] File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts utf-8 data (and is very slow)
[Issue 17229] File.byChunk w/ stdout.lockingTextWriter is very slow
Feb 27, 2017
Jon Degenhardt
Mar 01, 2017
Jon Degenhardt
Mar 01, 2017
Jon Degenhardt
Mar 01, 2017
anonymous4
February 27, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #1 from Jon Degenhardt <jrdemail2000-dlang@yahoo.com> ---
(In reply to Jon Degenhardt from comment #0)

> Results for the count 9's program, against the 2.7, 14 million line file:
>    byLine:  8.98 seconds
>    byChunk: 1.64 seconds

Update: The byLine Count 9s program given above is being affected by autodecode. Changing counting line as follows:

  from: chunkedStream.each!(x => count += x.count('9'));
  to:   chunkedStream.each!(x => count += (cast(ubyte[])x).count('9'));

Gives updated times:
  byLine:  3.13
  byChunk: 1.64

This is more consistent. byChunk is faster than byLine in the Count 9s task (read and access data), by not by 5x. The 15x performance deficit of byChunk relative to byLine on the file copy task remains.

--
February 28, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

Steven Schveighoffer <schveiguy@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |schveiguy@yahoo.com

--- Comment #2 from Steven Schveighoffer <schveiguy@yahoo.com> ---
The issue is that LockingTextWriter accepts anything but character range. What is happening (I think) is that lockingTextWriter.put(someUbyteArray) is actually putting each ubyte into the stream one at a time after converting each ubyte into a dchar through integer promotion.

This can't be the intention. I think instead of the constraint

is(ElementType!A : const(dchar))

what we need is something more like:

is(Unqual!(ElementType!A) == dchar)

This might break code, but I think code that is already broken, such as the example, no?

--
March 01, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #3 from Jon Degenhardt <jrdemail2000-dlang@yahoo.com> ---
I've confirmed that File.byChunk with lockingTextWriter corrupts utf-8 encoded files.

I used the unicode test file:
    http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt

and the example given with the File.byChunk documentation:

    // Efficient file copy, 1MB at a time.
    import std.algorithm, std.stdio;
    void main()
    {
        stdin.byChunk(1024 * 1024).copy(stdout.lockingTextWriter());
    }

This file copy program corrupts the unicode characters as described in Steven's comment. This is a quite problematic, both because of character corruption and because it is an example in the documentation.

The new method, lockingBinaryWriter, copies the file correctly. It is available starting with 2.073.1. lockingBinaryWriter also copies the file quickly, eliminating the performance issue.

It is appears from the PR for lockingBinaryWriter (https://github.com/dlang/phobos/pull/2011) that there was discussion of the roles of Binary and Text writer.

Regardless of availability of the lockingBinaryWriter, the lockingTextWriter certainly looks broken when used with the ubyte data type. Personally, I think it makes sense for lockingTextWriter to assume ubyte arrays are correctly encoded, or perhaps are utf-8 encoded. This would potentially allow newline translation, something that the lockingBinaryWriter would presumably not do.

--
March 01, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

Jon Degenhardt <jrdemail2000-dlang@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|File.byChunk w/             |File.byChunk (ubyte) w/
                   |stdout.lockingTextWriter is |stdout.lockingTextWriter
                   |very slow                   |corrupts utf-8 data (and is
                   |                            |very slow)

--- Comment #4 from Jon Degenhardt <jrdemail2000-dlang@yahoo.com> ---
Changing the subject to reflect the more serious problem, corruption of utf-8 encoded data. As described in one of the comments, this corruption occurs in a file copy example in the documentation.

--
March 01, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #5 from anonymous4 <dfj1esp02@sneakemail.com> ---
static assert(is(ubyte:dchar));
This assert succeeds. Is it intended? It's why LockingTextWriter accepts bytes
event though it's a text writer.

--
March 01, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #6 from Steven Schveighoffer <schveiguy@yahoo.com> ---
(In reply to anonymous4 from comment #5)
> static assert(is(ubyte:dchar));
> This assert succeeds. Is it intended? It's why LockingTextWriter accepts
> bytes event though it's a text writer.

Yes, in the compiler, ubyte and dchar are fundamentally unsigned integer types. You can implicitly convert via integer promotion.

Technically, char can integer promote to dchar as well. However, foreach(dchar d; someCharArray) is specialized in the compiler to go through auto decoding.

The forums contain volumes of battles on auto-decoding and how the choice has affected D, I don't think we need to rehash it here. What we need to do is make it so obviously wrong code is not accepted for this function. I'll try and get a PR together.

--
March 02, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

Steven Schveighoffer <schveiguy@yahoo.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |pull
           Hardware|x86                         |All
                 OS|Mac OS X                    |All
           Severity|enhancement                 |major

--- Comment #7 from Steven Schveighoffer <schveiguy@yahoo.com> ---
PR: https://github.com/dlang/phobos/pull/5229

--
March 17, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #8 from github-bugzilla@puremagic.com ---
Commits pushed to master at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006
Fix issue 17229 - do not use integer promotion to translate ubyte ranges
to dchar ranges.

https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229

https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter

Fix issue 17229 - File.byChunk (ubyte) w/ stdout.lockingTextWriter corrupts
utf-8 data (and is very slow)
merged-on-behalf-of: Jack Stouffer <jack@jackstouffer.com>

--
March 22, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #9 from github-bugzilla@puremagic.com ---
Commits pushed to stable at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges

https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229

https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter

--
August 07, 2017
https://issues.dlang.org/show_bug.cgi?id=17229

--- Comment #10 from github-bugzilla@puremagic.com ---
Commits pushed to newCTFE at https://github.com/dlang/phobos

https://github.com/dlang/phobos/commit/9af2d0267f054503dc5f73410b1ab0d0e418f006 Fix issue 17229 - do not use integer promotion to translate ubyte ranges

https://github.com/dlang/phobos/commit/ca20f0b667fa7f9602eee7ed18c23f56b2f06f34 Add unittest for Bug 17229

https://github.com/dlang/phobos/commit/3568cb38e709d03dcc894358a977c042248aab2d Merge pull request #5229 from schveiguy/fixtextwriter

--
« First   ‹ Prev
1 2