Thread overview
Prevent opening binary/other garbage files
Sep 29, 2018
helxi
Sep 29, 2018
Adam D. Ruppe
Sep 29, 2018
helxi
Sep 30, 2018
Adam D. Ruppe
Oct 01, 2018
helxi
Oct 01, 2018
Adam D. Ruppe
Sep 30, 2018
bauss
September 29, 2018
I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?

If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?
September 29, 2018
On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?

Simplest might be to read the first few bytes (like couple hundred probably) and if any of them are < 32 && != '\t' && != '\r' && != '\n' && != 0, there's a good chance it is a binary file.

Text files are frequently going to have tabs and newlines, but not so frequently other low bytes.

If you do find a bunch of 0's, but not the other values, you might have a utf-16 file.
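
A minimal sketch of that check in D, for illustration only. The names looksBinary and fileLooksBinary are made up here, and the 256-byte default is just one reading of "a couple hundred":

```d
import std.stdio : File;

/// Returns true if the sample contains a low control byte that plain
/// text rarely uses. Tab, CR, LF and 0 are deliberately let through,
/// since 0 may simply indicate UTF-16.
bool looksBinary(const(ubyte)[] sample)
{
    foreach (b; sample)
    {
        if (b < 32 && b != '\t' && b != '\r' && b != '\n' && b != 0)
            return true;
    }
    return false;
}

/// Reads up to sampleSize bytes from the start of the file and tests them.
/// sampleSize is an assumed default, not a value from the thread.
bool fileLooksBinary(string path, size_t sampleSize = 256)
{
    auto buffer = new ubyte[](sampleSize);
    auto file = File(path, "rb");
    // rawRead returns the slice that was actually filled, which is
    // shorter than the buffer for small files.
    auto sample = file.rawRead(buffer);
    return looksBinary(sample);
}
```

Because only the first few hundred bytes are read, this stays cheap even for large files.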

> If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?

For text on POSIX systems, files are most likely going to be UTF-8, and you can try using Phobos' readText function. It will throw if it encounters invalid UTF-8, so you catch that and go on to the next file.

But the simpler check described above will also probably work and can read less of the file.
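
For the readText route, a rough sketch might look like the following. The helper name isValidUtf8File is hypothetical; the sketch assumes that std.file.readText reports invalid input via std.utf.UTFException and lets other errors (such as a missing file) propagate:

```d
import std.file : readText;
import std.utf : UTFException;

/// Returns true if the whole file decodes as valid UTF-8.
/// Note that this reads the entire file into memory.
bool isValidUtf8File(string path)
{
    try
    {
        readText(path);
        return true;
    }
    catch (UTFException)
    {
        // Not valid UTF-8: the caller can skip this file.
        return false;
    }
}
```

Keep in mind that a file made up entirely of bytes <= 127 passes this test whether or not it is really text.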
September 29, 2018
On Saturday, 29 September 2018 at 16:01:18 UTC, Adam D. Ruppe wrote:
> On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
>> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?
>
> Simplest might be to read the first few bytes (like couple hundred probably) and if any of them are < 32 && != '\t' && != '\r' && != '\n' && != 0, there's a good chance it is a binary file.
>
> Text files are frequently going to have tabs and newlines, but not so frequently other low bytes.
>
> If you do find a bunch of 0's, but not the other values, you might have a utf-16 file.
>
Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?
September 30, 2018
On Saturday, 29 September 2018 at 23:46:26 UTC, helxi wrote:
> Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?

Eh, not really, most text files will not have one.
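
For reference, a BOM check is only a couple of lines. This is a sketch assuming std.encoding's getBOM and its BOM.none value; hasBom is a made-up name:

```d
import std.encoding : BOM, getBOM;

/// True if the data starts with a byte order mark that getBOM recognises.
bool hasBom(const(ubyte)[] data)
{
    return getBOM(data).schema != BOM.none;
}
```

A typical UTF-8 file without a BOM makes this return false, so a negative result says nothing about whether the file is text.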
September 30, 2018
On Saturday, 29 September 2018 at 15:52:30 UTC, helxi wrote:
> I'm writing a utility that checks for specific keyword(s) found in the files in a given directory recursively. What's the best strategy to avoid opening a bin file or some sort of garbage dump? Check encoding of the given file?
>
> If so, what are the most popular encodings (in POSIX if that matters) and how do I detect them?

What I would do is read the first 512 bytes and the last 512 bytes. If over 50% of those bytes are below 32 and not 8, 9, 10, 11, 12 or 13, then chances are you have a binary file. That said, nothing stops someone from writing "invalid" bytes into a text file. There are no limitations on what a file can hold, and generally the system treats all files the same.

The reason I recommend reading the first 512 and the last 512 bytes is that some binary files contain legitimate text strings, so by sampling two places chances are you won't hit a text segment in both.
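
A sketch of this head-and-tail sampling in D, under a few assumptions: the names countSuspicious and mostlyBinary are invented for illustration, files shorter than two chunks are sampled without counting any byte twice, and an empty file is treated as not binary:

```d
import std.algorithm.comparison : min;
import std.stdio : File;

/// Counts bytes below 32 other than 8..13
/// (backspace, tab, LF, vertical tab, form feed, CR).
size_t countSuspicious(const(ubyte)[] bytes)
{
    size_t n;
    foreach (b; bytes)
    {
        if (b < 32 && (b < 8 || b > 13))
            ++n;
    }
    return n;
}

/// Samples up to `chunk` bytes from each end of the file and returns
/// true when more than half of the sampled bytes are suspicious.
bool mostlyBinary(string path, size_t chunk = 512)
{
    auto file = File(path, "rb");
    immutable ulong size = file.size;
    if (size == 0)
        return false;

    auto buffer = new ubyte[](chunk);
    auto head = file.rawRead(buffer);
    size_t total = head.length;
    size_t bad = countSuspicious(head);

    // Only read a tail if the file extends past the head sample.
    if (size > chunk)
    {
        immutable ulong tailLen = min(cast(ulong) chunk, size - chunk);
        file.seek(cast(long)(size - tailLen));
        // The head has already been counted, so the buffer can be reused.
        auto tail = file.rawRead(buffer[0 .. cast(size_t) tailLen]);
        total += tail.length;
        bad += countSuspicious(tail);
    }
    return bad * 2 > total;
}
```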
October 01, 2018
On Sunday, 30 September 2018 at 03:19:11 UTC, Adam D. Ruppe wrote:
> On Saturday, 29 September 2018 at 23:46:26 UTC, helxi wrote:
>> Thanks. Would you say https://dlang.org/library/std/encoding/get_bom.html is useful in this context?
>
> Eh, not really, most text files will not have one.

Hi,

I tried out https://dlang.org/library/std/utf/validate.html before manually checking for encoding myself so I ended up with the code below. I was fairly surprised that "*.o" (object) files are UTF encoded! Is it normal?

import std.stdio : File, lines, stdout;

void panic(in string message, int exitCode = 1) {
	import core.stdc.stdlib : exit;
	import std.stdio : stderr, writeln;

	stderr.writeln(message);
	exit(exitCode);
}

void writeFunc(ulong occurrenceNumber, ulong lineNumber, in string fileName,
		in string line, File ofile = stdout) {
	import std.stdio : writef;

	ofile.writef("%s: L:%s: F:\"%s\":\n%s\n", occurrenceNumber, lineNumber, fileName, line);
}

void traverseDirectories(in string path, in string term)
in {
	import std.file : isDir;

	if (!isDir(path))
		panic("Cannot access directory: " ~ path);
}
do {
	import std.file : dirEntries, SpanMode;

	ulong occurrenceNumber, filesChecked, filesIgnored; // = 0;
	File currentFile;
	foreach (string fileName; dirEntries(path, SpanMode.breadth)) {
		try {
			currentFile = File(fileName, "r");
			++filesChecked;
			foreach (ulong lineNumber, string currentLine; lines(currentFile)) {
				if (lineNumber == 0) {
					// check if the file is encoded with proper UTF
					// if Line 0 is not UTF encoded, move on to the next file

					// I hope the compiler unrolls this if condition
					import std.utf : validate;

					validate(currentLine);
					// throws exception if the file is not UTF encoded
				}
				import std.algorithm : canFind;

				if (canFind(currentLine, term)) {
					writeFunc(++occurrenceNumber, lineNumber, fileName, currentLine);
				}
			}
		}
		catch (Exception e) {
			filesIgnored++;
		}
	}
	// summarize
	import std.stdio : writefln;

	writefln("Total matches found:\t%s\nTotal files checked:\t%s\nTotal files ignored:\t%s\n",
			occurrenceNumber, filesChecked, filesIgnored);
}

void main(string[] args) {
	import std.getopt : getopt;

	string term, directory;
	getopt(args, "term|t", &term, "directory|d", &directory);

	if (!directory) {
		// if directory not specified, start working with the current directory
		import std.file : getcwd;

		directory = getcwd();
	}

	if (!term)
		panic("Term not specified.");

	traverseDirectories(directory, term);
}


Output: https://pastebin.com/PZ8nCaYf
October 01, 2018
On Monday, 1 October 2018 at 15:21:24 UTC, helxi wrote:
> I tried out https://dlang.org/library/std/utf/validate.html before manually checking for encoding myself so I ended up with the code below. I was fairly surprised that "*.o" (object) files are UTF encoded! Is it normal?

Yes. Any random collection of bytes <= 127 is valid UTF-8. lines() will read until it sees a byte 10 and cut the line off there, so your check only validates the bytes before the first newline.
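
To illustrate the first point, here is a small sketch with a hypothetical wrapper, passesValidate, around std.utf.validate. A string made only of bytes <= 127, control bytes included, goes through without complaint:

```d
import std.utf : UTFException, validate;

/// Returns true if std.utf.validate accepts the text.
bool passesValidate(const(char)[] text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

For example, passesValidate("\x7fELF\x02\x01\x01") is true even though those bytes look like the start of an object file, because every one of them is a valid single-byte UTF-8 code unit.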

Quite a few file formats have a 10 early on to detect text/binary transmission corruption, but even if they don't, it is a fairly common byte to see before too long and that cuts off your scan for later bytes.


You really are better off looking for those <32 bytes like I described earlier. A .o file will likely have some 1's and 3's early on, which that check will quickly detect, whereas those same bytes pass the validate test.