November 03, 2014 Reading unicode string with readf ("%s")

Hi!

The following code does not correctly handle Unicode strings.
-----
import std.stdio;
void main () {
    string s;
    readf ("%s", &s);
    write (s);
}
-----

Example input ("Test." in cyrillic):
-----
Тест.
-----
(hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)

Example output:
-----
ТеÑÑ.
-----
(hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)

Here, the input bytes are handled separately: D0 -> C3 90, A2 -> C2 A2, etc.

On the bright side, reading the file with readln works properly.

Is this an expected shortcoming of "%s"-reading a string?
Could it be made to work somehow?
Is it worth a bug report?

Ivan Kazmenko.
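
The byte-by-byte mapping described above (D0 -> C3 90, A2 -> C2 A2) can be reproduced in isolation. The following is only a sketch, assuming the reader treats every input byte as a complete code point and then re-encodes it as UTF-8; `misdecode` is a hypothetical helper written for this illustration, not something from Phobos.

```d
import std.stdio;
import std.utf : encode;

// Hypothetical helper: treats each byte as a full code point and
// re-encodes it as UTF-8, which mimics the observed corruption.
ubyte[] misdecode(const(ubyte)[] input)
{
    ubyte[] result;
    foreach (b; input)
    {
        char[4] buf;
        immutable len = encode(buf, cast(dchar) b);
        result ~= cast(ubyte[]) buf[0 .. len];
    }
    return result;
}

void main()
{
    // UTF-8 bytes of the example input, including the trailing CR LF.
    immutable ubyte[] input =
        [0xD0, 0xA2, 0xD0, 0xB5, 0xD1, 0x81, 0xD1, 0x82, 0x2E, 0x0D, 0x0A];
    // Prints the bytes as space-separated hex pairs.
    writefln("%(%02X %)", misdecode(input));
}
```

If that assumption holds, the printed bytes should match the hex dump of the example output in the post.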

November 03, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ivan Kazmenko

On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
> readf ("%s", &s);
Worth noting: this reads to end-of-file (not end-of-line or whitespace), and reading the whole file into a string was what I indeed expected it to do.
So, if there is an idiomatic way to read the whole file into a string which is Unicode-compatible, it would be great to learn that, too.

November 03, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ivan Kazmenko

On 11/03/2014 11:47 AM, Ivan Kazmenko wrote:
> On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
>> readf ("%s", &s);
>
> Worth noting: this reads to end-of-file (not end-of-line or whitespace),
> and reading the whole file into a string was what I indeed expected it
> to do.
>
> So, if there is an idiomatic way to read the whole file into a string
> which is Unicode-compatible, it would be great to learn that, too.
I don't know the answer to the Unicode issue with readf but you can read the file by chunks:
import std.stdio;
import std.array;
import std.exception;

string readAll(File file)
{
    char[666] buffer;
    char[] contents;
    char[] piece;

    do {
        piece = file.rawRead(buffer);
        contents ~= piece;

    } while (!piece.empty);

    return assumeUnique(contents);
}

void main () {
    string s = stdin.readAll();

    write (s);
}
Ali

November 03, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ivan Kazmenko

On Monday, 3 November 2014 at 19:47:17 UTC, Ivan Kazmenko wrote:
> So, if there is an idiomatic way to read the whole file into a string which is Unicode-compatible, it would be great to learn that, too.
Maybe something like this:
import std.stdio;
import std.array;
import std.conv;

string text = stdin
    .byLine(KeepTerminator.yes)
    .join()
    .to!(string);

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ivan Kazmenko

https://issues.dlang.org/show_bug.cgi?id=12990 this?

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ivan Kazmenko

On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
> Hi!
>
> The following code does not correctly handle Unicode strings.
> -----
> import std.stdio;
> void main () {
>     string s;
>     readf ("%s", &s);
>     write (s);
> }
> -----
>
> Example input ("Test." in cyrillic):
> -----
> Тест.
> -----
> (hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)
>
> Example output:
> -----
> ТеÑÑ.
> -----
> (hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)
>
> Here, the input bytes are handled separately: D0 -> C3 90, A2 -> C2 A2, etc.
>
> On the bright side, reading the file with readln works properly.
>
> Is this an expected shortcoming of "%s"-reading a string?

No.

> Could it be made to work somehow?

Yes. std.stdio.LockingTextReader is to blame:

void main()
{
    import std.stdio;
    auto ltr = LockingTextReader(std.stdio.stdin);
    write(ltr);
}
----
$ echo Тест | rdmd test.d
ТеÑÑ

LockingTextReader has a dchar front. But it doesn't do any decoding. The dchar front is really a char front.

> Is it worth a bug report?

Yes.

> Ivan Kazmenko.
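
The difference between a front that decodes and a front that merely widens each char to dchar can be sketched without touching LockingTextReader itself. This assumes nothing about its internals; `decodedFronts` and `widenedFronts` are hypothetical names used only for this illustration.

```d
import std.stdio;
import std.utf : decode;

// Decodes UTF-8 properly: one dchar per code point.
dchar[] decodedFronts(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i); // decode advances i past the whole sequence
    return result;
}

// Widens each UTF-8 code unit to a dchar without decoding,
// which is the behaviour described in the post above.
dchar[] widenedFronts(string s)
{
    dchar[] result;
    foreach (char c; s)
        result ~= cast(dchar) c;
    return result;
}

void main()
{
    string s = "Тест";
    writeln(decodedFronts(s).length); // prints 4
    writeln(widenedFronts(s).length); // prints 8
}
```

Writing the widened sequence back out as UTF-8 encodes each of the eight values separately, which would explain the doubled byte count in the example output.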

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Ali Çehreli

On Monday, 3 November 2014 at 20:03:03 UTC, Ali Çehreli wrote:
> On 11/03/2014 11:47 AM, Ivan Kazmenko wrote:
>> On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
>>> readf ("%s", &s);
>>
>> Worth noting: this reads to end-of-file (not end-of-line or whitespace),
>> and reading the whole file into a string was what I indeed expected it
>> to do.
>>
>> So, if there is an idiomatic way to read the whole file into a string
>> which is Unicode-compatible, it would be great to learn that, too.
>
> I don't know the answer to the Unicode issue with readf but you can read the file by chunks:
>
> import std.stdio;
> import std.array;
> import std.exception;
>
> string readAll(File file)
> {
>     char[666] buffer;
>     char[] contents;
>     char[] piece;
>
>     do {
>         piece = file.rawRead(buffer);
>         contents ~= piece;
>
>     } while (!piece.empty);
>
>     return assumeUnique(contents);
> }
>
> void main () {
>     string s = stdin.readAll();
>
>     write (s);
> }
>
> Ali
Thank you for suggesting an alternative!
Looks like it would be an efficient one, too.
I believe it can be made a bit more efficient if using an appender, right?
Still, that's a lot of code for a minute scripting task, albeit one has to write the readAll function only once.
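
For reference, the appender variant asked about above could look roughly like this. It is a sketch only, not a measured improvement: the 4096-byte chunk size is an arbitrary choice, and whether it is actually faster than `~=` would need benchmarking.

```d
import std.array : appender;
import std.stdio;

// Same idea as readAll above, but accumulating into an Appender
// instead of concatenating with ~=.
string readAll(File file)
{
    auto contents = appender!string();
    char[4096] buffer; // chunk size is an arbitrary assumption

    for (;;)
    {
        // rawRead returns the filled part of the buffer; empty at end-of-file.
        auto piece = file.rawRead(buffer[]);
        if (piece.length == 0)
            break;
        contents.put(piece); // copies the chunk, so reusing buffer is fine
    }
    return contents.data;
}

void main()
{
    write(stdin.readAll());
}
```

Because the appender is created as `appender!string()`, its `data` is already a string and no `assumeUnique` call is needed.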

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Gary Willoughby

On Monday, 3 November 2014 at 20:10:02 UTC, Gary Willoughby wrote:
> On Monday, 3 November 2014 at 19:47:17 UTC, Ivan Kazmenko wrote:
>> So, if there is an idiomatic way to read the whole file into a string which is Unicode-compatible, it would be great to learn that, too.
>
> Maybe something like this:
>
> import std.stdio;
> import std.array;
> import std.conv;
>
> string text = stdin
>     .byLine(KeepTerminator.yes)
>     .join()
>     .to!(string);
And thanks for a short alternative!
At first glance, it looks like it sacrifices a bit of efficiency along the way: the "remove line breaks, then add line breaks" path looks redundant.
Still, it does not store an intermediate split representation, so the inefficiency is in fact not catastrophic, right?
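
If the line splitting is the concern, a chunk-based variant avoids it altogether. The sketch below assumes the input is meant to be UTF-8 and checks that explicitly; `slurp` is a hypothetical name and the 4096-byte chunk size is arbitrary.

```d
import std.algorithm : joiner;
import std.array : array;
import std.stdio;
import std.utf : validate;

// Hypothetical helper: reads the whole file in fixed-size chunks,
// without looking for line terminators at all.
string slurp(File file)
{
    // byChunk yields ubyte[] chunks; joiner flattens them lazily
    // and array copies the bytes into one freshly allocated ubyte[].
    auto bytes = file.byChunk(4096).joiner.array;

    // The cast is acceptable only because nothing else references bytes.
    auto text = cast(string) bytes;
    validate(text); // throws UTFException if the input is not valid UTF-8
    return text;
}

void main()
{
    write(slurp(stdin));
}
```

The bytes are passed through unchanged, so "\r\n" terminators in the input survive as they are.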

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to Kagamin

On Tuesday, 4 November 2014 at 11:46:24 UTC, Kagamin wrote:
> https://issues.dlang.org/show_bug.cgi?id=12990 this?
Similar, but not quite that. Bugs 12990 and 1448 (linked from there) seem to have the Windows console as an important part of the process. For me, the example does not work even with files, either redirected via "test.exe <one.txt >two.txt" or using File structs inside a D program.
Still, thank you for the link!

November 04, 2014 Re: Reading unicode string with readf ("%s")
Posted in reply to anonymous

On Tuesday, 4 November 2014 at 13:01:48 UTC, anonymous wrote:
> On Monday, 3 November 2014 at 19:37:20 UTC, Ivan Kazmenko wrote:
>> Hi!
>>
>> The following code does not correctly handle Unicode strings.
>> -----
>> import std.stdio;
>> void main () {
>>     string s;
>>     readf ("%s", &s);
>>     write (s);
>> }
>> -----
>>
>> Example input ("Test." in cyrillic):
>> -----
>> Тест.
>> -----
>> (hex: D0 A2 D0 B5 D1 81 D1 82 2E 0D 0A)
>>
>> Example output:
>> -----
>> ТеÑÑ.
>> -----
>> (hex: C3 90 C2 A2 C3 90 C2 B5 C3 91 C2 81 C3 91 C2 82 2E 0D 0A)
>>
>> Here, the input bytes are handled separately: D0 -> C3 90, A2 -> C2 A2, etc.
>>
>> On the bright side, reading the file with readln works properly.
>>
>> Is this an expected shortcoming of "%s"-reading a string?
>
> No.
>
>> Could it be made to work somehow?
>
> Yes. std.stdio.LockingTextReader is to blame:
>
> void main()
> {
>     import std.stdio;
>     auto ltr = LockingTextReader(std.stdio.stdin);
>     write(ltr);
> }
> ----
> $ echo Тест | rdmd test.d
> ТеÑÑ
>
> LockingTextReader has a dchar front. But it doesn't do any decoding. The dchar front is really a char front.
>
>> Is it worth a bug report?
>
> Yes.
>
>> Ivan Kazmenko.

You nailed it!

Reported the bug in original form:
https://issues.dlang.org/show_bug.cgi?id=13686

Perhaps your reduction would be useful.