| Thread overview | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
November 05, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #1 from HeiHon <heiko.honrath@web.de> --- Created attachment 1562 --> https://issues.dlang.org/attachment.cgi?id=1562&action=edit The D source code generator.d The D source code file generator.d -- | ||||
November 05, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 ag0aep6g@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |ag0aep6g@gmail.com Severity|normal |regression --- Comment #2 from ag0aep6g@gmail.com --- A first reduction: ---- static import std.file; import std.stdio: File; import std.string: toUpper; void main() { std.file.write("gentest.txt", "aaaaaaaa\naaaaaaa\n00000000a"); foreach(ln; File("gentest.txt").byLine()) { toUpper(ln); } } ---- Works with 2.068.2. -- | ||||
November 06, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #3 from HeiHon <heiko.honrath@web.de> --- Thanks for this reduction! It also crashes with: static import std.file; import std.stdio: File, writeln; import std.string: toUpper, toLower; void main() { std.file.write("gen.txt", "12345678\n1234567\n12345678a"); foreach(ln; File("gen.txt").byLine()) { writeln(ln); writeln("> ", toLower(ln)); } foreach(ln; File("gen.txt").byLine()) { writeln(ln); writeln("> ", toUpper(ln)); } } If the last character is lowercase then toUpper crashes. If it is uppercase then toLower crashes. -- | ||||
November 06, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #4 from HeiHon <heiko.honrath@web.de> --- The shortest reductions I could find are with: std.file.write("gen.txt", "12\n\n1x\n"); or std.file.write("gen.txt", "1\n\n1x"); So it does not depend on the lacking \n at the end. -- | ||||
November 06, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #5 from ag0aep6g@gmail.com --- Reduced further: ---- static import std.file; import std.stdio: File, writeln; void main() { std.file.write("gen.txt", "a\n\naa"); auto file = File("gen.txt"); char[] buffer; char[] line; file.readln(buffer, '\n'); line = buffer; file.readln(line, '\n'); line = buffer; file.readln(line, '\n'); writeln(line.capacity); /* "0" */ writeln(line[0 .. 1].capacity); /* "255" -- nonsense */ } ---- This reminds me more and more of issue 13856. -- | ||||
November 10, 2015 [Issue 15293] File().byLine().map!toUpper throws UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence on pure ASCII file | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 ag0aep6g@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Keywords| |pull Component|druntime |phobos Assignee|nobody@puremagic.com |ag0aep6g@gmail.com --- Comment #6 from ag0aep6g@gmail.com --- https://github.com/D-Programming-Language/phobos/pull/3802 -- | ||||
November 12, 2015 [Issue 15293] [REG2.069.0] std.stdio.readln(buffer) messes up buffer's capacity | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 ag0aep6g@gmail.com changed: What |Removed |Added ---------------------------------------------------------------------------- Summary|File().byLine().map!toUpper |[REG2.069.0] |throws |std.stdio.readln(buffer) |UnicodeException@src\rt\uti |messes up buffer's capacity |l\utf.d(290): invalid UTF-8 | |sequence on pure ASCII file | -- | ||||
November 12, 2015 [Issue 15293] [REG2.069.0] std.stdio.readln(buffer) messes up buffer's capacity | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 Steven Schveighoffer <schveiguy@yahoo.com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |schveiguy@yahoo.com --- Comment #7 from Steven Schveighoffer <schveiguy@yahoo.com> --- (In reply to ag0aep6g from comment #5) > writeln(line.capacity); /* "0" */ > writeln(line[0 .. 1].capacity); /* "255" -- nonsense */ Not nonsense. When you are passing in empty buffer, it is allocated to 256 bytes with a capacity of 255. The second call is given a buffer which readln can see that it has full control over (it has capacity, which means all data after it is unused). Since passing a buffer into readln disowns the data (readln now owns and can resize the used portion down if necessary), the error is continuing to use 'buffer' as the buffer -- buffer contains 2 bytes, the second byte is essentially garbage after assumeSafeAppend is called since it's not part of the capacity. readln is meant to be passed in the same buffer over and over, this is why it uses assumeSafeAppend. Doing that, the final readln call sees that it has no capacity, so it treats this as an un-extendable buffer. It can't resize the buffer, nor claim that some of the buffer space is now usable (if the full line fits in less bytes). Basically, it treats this like a stack buffer. On the last read, the line is shorter, but assumeSafeAppend isn't called because it doesn't know where the buffer comes from! > This reminds me more and more of issue 13856. That issue was fixed by the addition of the ReadlnAppender. Also, I'll note that on my test of using dmd 2.069.0 or 2.069.1 on Macos X, I do not get an error with the first reduction, or HeiHon's additional issue. The changes you have in your pull are not windows specific, so I'm not sure if they are required to fix this issue. -- | ||||
November 12, 2015 [Issue 15293] [REG2.069.0] std.stdio.readln(buffer) messes up buffer's capacity | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #8 from ag0aep6g@gmail.com --- (In reply to Steven Schveighoffer from comment #7) > Since passing a buffer into readln disowns the data (readln now owns and can resize the used portion down if necessary), the error is continuing to use 'buffer' as the buffer -- buffer contains 2 bytes, the second byte is essentially garbage after assumeSafeAppend is called since it's not part of the capacity. readln is meant to be passed in the same buffer over and over, this is why it uses assumeSafeAppend. Sorry, but that's ridiculous. It's unnecessarily unsafe, overly complicated, and not documented. The other readlnImpl variants don't work like that, and it apparently wasn't thought of when byLine was implemented. If readln wants to take the same buffer over and over, then it should not shrink it, but return a slice of it. But breaking the API is probably not an option now. So we're stuck with either 1) fundamentally flawed, surprising behavior, or 2) more allocations when readln is used naively. In my opinion, 2 is the way to go. Also, byLine is nicer anyway, and doesn't have the problem. And if std.io ever gets done, we have an opportunity not to fall in the same trap. -- | ||||
November 12, 2015 [Issue 15293] [REG2.069.0] std.stdio.readln(buffer) messes up buffer's capacity | ||||
|---|---|---|---|---|
| ||||
https://issues.dlang.org/show_bug.cgi?id=15293 --- Comment #9 from Steven Schveighoffer <schveiguy@yahoo.com> --- The behavior is expected, and actually intentional. The original behavior was that regardless of current capacity, if a GC buffer was passed, ALL the space was claimed by readln. In this case, what is passed is not a buffer that ends on an appendable boundary, and so it chooses to do nothing as far as updating the appendability (the right choice). The reason it was done this way is to allow for a common usage of readln (namely, reading into a buffer over and over again) to continue to have good performance. If we took the "safe" route (and believe me, it was debated and considered), then this code would reallocate every time a larger line was read. When it was originally written, this is what byLine did, but now it's changed. > The other readlnImpl variants don't work like that looking at it now, the other variants do exactly as I said above -- they reallocate when the line is longer. So technically we could move to this, we just have to accept a drop in performance for the code. Have you tested the performance of your new code vs. the current implementation? -- | ||||
Copyright © 1999-2021 by the D Language Foundation
Permalink
Reply