Reading text (I mean "real" text...)
June 20, 2020
THE PROBLEM

UTF-8 validation alone is insufficient for ensuring that a file contains only human-readable text, because control characters are valid UTF-8. Apart from tab, newline, carriage return, and a few other, less commonly used characters that count as whitespace, human-readable text files should not normally contain embedded control characters.
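To make the distinction concrete, here is a minimal byte-level sketch in C (the function name is my own). Note that C1 controls (U+0080–U+009F) cannot be caught at the byte level in UTF-8, because those byte values serve as lead/continuation bytes; they only show up after decoding.

```c
#include <stdbool.h>

/* Hypothetical byte-level check: accept a byte unless it is a C0 control
 * character or DEL, with tab/newline/CR as whitespace exceptions. */
static bool is_text_byte(unsigned char c)
{
    if (c == '\t' || c == '\n' || c == '\r')
        return true;               /* whitespace control characters we keep */
    if (c < 0x20 || c == 0x7F)     /* remaining C0 controls, and DEL */
        return false;
    return true;                   /* printable ASCII, and UTF-8 multibyte bytes */
}
```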

In the standard library, the read functions for "text" files (as opposed to binary files) that I looked at are not actually based on "human readable text", but on "characters". For example:

 - In std.stdio, readln accepts everything. Lines are simply separated by the occurrence of a newline or user-designated character.
 - In std.file, readText accepts all valid UTF-8 characters.

This means, for example, that all of these functions will happily try to read an enormous file of zeroes in its entirety (something that should not even be considered "text") into a string variable on the very first call to the read function. Not good. A function that reads only "human-readable text" should instead throw an exception immediately upon encountering a disallowed control character or an invalid UTF-8 sequence.

THE OBJECTIVE

The objective is to read a file one line at a time (reading each line into a string), while checking for human-readable text character by character. Disallowed control characters and invalid UTF-8 sequences should generate an exception.

Unless there's already an existing function that works as described, I'd like to write one. I expect that this will require combining an existing read-by-UTF8-char or read-by-byte function with the additional validation.

Q1: Which existing functions (D or C) would you suggest leveraging? For example, there are quite a few variants of "read" and in different libraries too. For a newcomer, it can be difficult to intuit which one is best suited for what.

Q2: Any source code (D or C) you might suggest I look at, to get ideas for how parts of this could be written?

Thanks for your help.
June 20, 2020
On Saturday, 20 June 2020 at 01:35:56 UTC, Denis wrote:
>
> THE OBJECTIVE
>
> The objective is to read a file one line at a time (reading each line into a string), while checking for human-readable text character by character. Disallowed control characters and invalid UTF-8 sequences should generate an exception.
>
> Unless there's already an existing function that works as described, I'd like to write one. I expect that this will require combining an existing read-by-UTF8-char or read-by-byte function with the additional validation.

It sounds like maybe what you are looking for is Unicode character categories:

https://en.wikipedia.org/wiki/Unicode_character_property#General_Category
June 20, 2020
On Saturday, 20 June 2020 at 01:41:50 UTC, Paul Backus wrote:
> It sounds like maybe what you are looking for is Unicode character categories:
>
> https://en.wikipedia.org/wiki/Unicode_character_property#General_Category

The character validation step could indeed be expressed using Unicode properties:

  Allow Unicode White_Space
  Reject Unicode Control
  Allow everything else
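In C terms, the same three-rule filter could be approximated with the wide-character classification functions. This is only an approximation: iswspace/iswcntrl are locale-dependent and do not exactly match the Unicode White_Space and Control properties.

```c
#include <stdbool.h>
#include <wctype.h>

/* Sketch of the three-rule filter, applied to one decoded code point.
 * The order matters: tab and newline are control characters too, so the
 * whitespace test must come first. */
static bool accept_codepoint(wint_t c)
{
    if (iswspace(c)) return true;   /* allow whitespace */
    if (iswcntrl(c)) return false;  /* reject remaining controls */
    return true;                    /* allow everything else */
}
```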

June 20, 2020
Digging into this a bit further --

POSIX defines a "print" class, which I believe is an exact fit. The Unicode spec doesn't define this class, which I presume is why D's std.uni library also omits it. But there is an isprint() function in libc, which I should be able to use (POSIX here). This function refers to the system locale, so it isn't limited to ASCII characters (unlike std.ascii:isPrintable).
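One caveat worth noting: plain isprint() classifies single bytes, so for decoded code points the wide-character variant iswprint() is the libc function to reach for. A minimal locale-aware check might look like this (printable_or_tab is my own name):

```c
#include <stdbool.h>
#include <wctype.h>

/* Accept a decoded code point if the locale says it is printable, or if it
 * is a tab (not "printable", but normal in text files). Newline and CR are
 * assumed to be consumed by the line-reading loop itself. */
static bool printable_or_tab(wint_t c)
{
    return c == L'\t' || iswprint(c);
}
```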

So that's one down, two to go:

  Loop until newline or EOF
   (1) Read bytes or character             } Possibly
   (2) Decode UTF-8, exception if invalid  } together
   (3) Call isprint(), exception if invalid
  Return line

(This simplified outline obviously doesn't show how to deal with the complications arising from using buffers, handling codepoints that straddle the end of the buffer, etc.)

Where I'm still stuck is the read or read-and-auto-decode: this is where the waters get really muddy for me. Three different techniques for reading characters are suggested in this thread (iopipe, ranges, rawRead): https://forum.dlang.org/thread/cgteipqqfxejngtpgbbt@forum.dlang.org

I'd like to stick with standard D or C libraries initially, so that rules out iopipe for now. What would really help is some details about what one read technique does particularly well vs. another. And is there a technique that seems more suited to this use case than the rest?

Thanks again