Splitting up large dirty file
May 15
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:

- You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.
- decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units

Is there a simple way to do this?

May 15
On 5/15/18 4:36 PM, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)
> 
> I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:
> 
> - You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.
> - decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units
> 
> Is there a simple way to do this?
> 

Using iopipe, you can split on N lines (iopipe doesn't autodecode when searching for newlines), or split on a pre-determined chunk size (and ensure you don't split a code point).

Splitting on N lines:

import iopipe.bufpipe;
import iopipe.textpipe;

auto infile = openDev("filename").bufd.assumeText.byLine;

foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer

Splitting on pre-determined chunk size:

auto infile = openDev("filename")
    .bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
    .assumeText // it's text, not ubyte
    .ensureDecodeable; // do not end in the middle of a codepoint

The output isn't as straightforward. Ideally you would simply create an output pipe that splits into multiple files, and process the whole thing at once. I haven't created such a thing yet though (will add an enhancement request to do so).

Easiest thing to do is to write the entire window of the input pipe into an output pipe, or cast it back to ubyte[] and write directly to an output device.

e.g.:

auto infile = ... // one of the above ideas
   .encodeText; // convert to ubyte

auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // flush the input buffer
... // refill the buffer using the chosen technique above.

-Steve
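
For readers without iopipe, the "do not end in the middle of a code point" idea can be sketched by hand on raw bytes. This is not how iopipe's `ensureDecodeable` is implemented; `safeSplitPoint` is a hypothetical helper, and treating stray bytes in a dirty file as standalone is an assumption made here.

```d
/// Returns how many leading bytes of `chunk` can be written out without
/// cutting a UTF-8 sequence in half. Bytes after that point belong to an
/// incomplete sequence and should be carried over to the next chunk.
size_t safeSplitPoint(const(ubyte)[] chunk)
{
    size_t i = chunk.length;
    size_t back = 0;
    // Step back over at most 3 continuation bytes (bit pattern 10xxxxxx).
    while (i > 0 && back < 3 && (chunk[i - 1] & 0xC0) == 0x80)
    {
        --i;
        ++back;
    }
    if (i == 0)
        return chunk.length; // nothing but continuation bytes: dirty, keep all

    immutable ubyte lead = chunk[i - 1];
    // Sequence length announced by the lead byte. Invalid lead bytes
    // (0xF8 and up) and stray continuation bytes count as length 1.
    immutable size_t need = lead >= 0xF8 ? 1
                          : lead >= 0xF0 ? 4
                          : lead >= 0xE0 ? 3
                          : lead >= 0xC0 ? 2
                          : 1;
    // Complete sequence: split at the end. Otherwise split before the lead.
    return back + 1 >= need ? chunk.length : i - 1;
}

void main()
{
    // 'a' followed by the first byte of a 2-byte sequence: cut before it.
    assert(safeSplitPoint([0x61, 0xC3]) == 1);
}
```

The caller would write `chunk[0 .. n]` and prepend `chunk[n .. $]` to the next read, so at most three bytes are ever carried over.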
May 15
On Tuesday, May 15, 2018 20:36:21 Dennis via Digitalmars-d-learn wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in
> the middle of lines)
>
> I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and `byLine`
> (or `readln`) throws an Exception upon encountering an invalid
> character.
> - decodeFront doesn't work on inputRanges like
> `byChunk(4096).joiner`
> - std.algorithm.splitter doesn't work on inputRanges either
> - When you convert chunks to arrays, you have the risk of a split
> being in the middle of a character with multiple code units
>
> Is there a simple way to do this?

If you're on a *nix system, and you're simply looking for a solution to split files and don't necessarily care about writing one, I'd suggest trying the split utility:

https://linux.die.net/man/1/split

If I had to write it in D, I'd probably just use std.mmfile and operate on the file as a dynamic array of ubytes, since if what you care about is '\n', that can easily be searched for without needing any decoding, and using mmap avoids having to chunk anything.

- Jonathan M Davis
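
A minimal sketch of the memory-mapped approach, assuming Phobos's `std.mmfile.MmFile` and a build that has enough address space for the mapping (a 32-bit process may not be able to map 1.5 GB). The helper `splitOffsets` and the `part%03d.bin` output names are made up for illustration.

```d
import std.conv : to;
import std.format : format;
import std.mmfile : MmFile;
import std.stdio : File;

/// Returns the end offset of each part when `data` is cut after every
/// `linesPerPart`-th '\n'. Plain byte search, so no UTF decoding happens.
size_t[] splitOffsets(const(ubyte)[] data, size_t linesPerPart)
{
    size_t[] ends;
    size_t lines = 0;
    foreach (i, b; data)
    {
        if (b == '\n' && ++lines == linesPerPart)
        {
            ends ~= i + 1; // keep the newline with its line
            lines = 0;
        }
    }
    // Whatever follows the last full part becomes a final, shorter part.
    if (data.length && (ends.length == 0 || ends[$ - 1] != data.length))
        ends ~= data.length;
    return ends;
}

void main(string[] args)
{
    if (args.length < 3)
        return; // usage: prog <file> <linesPerPart>

    auto mm = new MmFile(args[1]);          // read-only mapping
    auto data = cast(const(ubyte)[]) mm[];  // the whole file as bytes
    size_t start = 0;
    foreach (n, end; splitOffsets(data, args[2].to!size_t))
    {
        // rawWrite copies the bytes untouched, invalid UTF-8 included.
        File(format("part%03d.bin", n), "wb").rawWrite(data[start .. end]);
        start = end;
    }
}
```

Dropping header or footer lines then comes down to slicing `data` before the offsets are computed.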

May 16
On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)
>
> I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.

Can you show the program you are using that throws when using byLine? I tried a very simple program that reads and outputs line-by-line, then fed it a file that contained invalid utf-8. I did not see an exception. The invalid utf-8 was created by taking part of this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a commonly used file with utf-8 edge cases), plus adding a number of random hex characters, including null. I don't see exceptions thrown.

The program I used:

int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes)) write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}



May 16
On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
> Can you show the program you are using that throws when using byLine?

Here's a version that only outputs the first chunk:
```
import std.stdio;
import std.range;
import std.algorithm;
import std.file;
import std.exception;

void main(string[] args) {
	enforce(args.length == 2, "Pass one filename as argument");
	auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
	new File("output.txt", "w").write(lineChunks.front.joiner);
}
```

dmd splitFile -g
./splitFile.exe UTF-8-test.txt

std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
----------------
0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range
.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)
May 16
On 16.05.2018 10:06, Dennis wrote:
> 
> Here's a version that only outputs the first chunk:
> ```
> import std.stdio;
> import std.range;
> import std.algorithm;
> import std.file;
> import std.exception;
> 
> void main(string[] args) {
>      enforce(args.length == 2, "Pass one filename as argument");
>      auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
>      new File("output.txt", "w").write(lineChunks.front.joiner);
> }
> ```
> 
> dmd splitFile -g
> ./splitFile.exe UTF-8-test.txt
> 
> std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1380): Invalid UTF-8 sequence (at index 4)
> ----------------
> 0x004038D2 in pure dchar std.utf.decodeImpl!(true, 0, char[]).decodeImpl(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1529)
> 0x00403677 in pure @trusted dchar std.utf.decode!(0, char[]).decode(ref char[], ref uint) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1076)
> 0x00403575 in pure @property @safe dchar std.range.primitives.front!(char).front(char[]) at C:\D\dmd2\windows\bin\..\..\src\phobos\std\range\primitives.d(2333)
> 0x0040566D in pure @property dchar std.algorithm.iteration.joiner!(std.range.Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).joiner(std.range
> .Chunks!(std.stdio.File.ByLineImpl!(char, char).ByLineImpl).Chunks.Chunk).Result.front() at C:\D\dmd2\windows\bin\..\..\src\phobos\std\algorithm\iteration.d(2491)

What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for the exception.
May 16
On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
> What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for the exception.

The file in question is a .json database dump with an array "rows" of 10 million 8-line objects. The newlines in the string fields are escaped, but they still contain other invalid characters which makes std.json reject it.

The first 4 lines of the file are basically "header" and the last 2 lines are a closing ] and }, so I want to split every 4 + 8*(10_000_000/amountOfFiles) lines and also remove the trailing comma, add brackets, drop the last 2 lines etc.

I thought it wouldn't be hard to crudely split this file using D's range functions and basic string manipulation, but the combination of being too large for a string and having invalid encoding seems to defeat most simple solutions. For now I decided to use Git Bash and do:
```
tail -n80000002 inputfile.json | split -l 8000000 - outputfile
```
And now I have files that do fit in memory. I'm still interested in complete D solutions though, thanks for the iopipe and memory mapped file suggestions Steven and Jonathan. I will check those out.
May 16
On Wednesday, May 16, 2018 08:57:10 Dennis via Digitalmars-d-learn wrote:
> I thought it wouldn't be hard to crudely split this file using D's range functions and basic string manipulation, but the combination of being too large for a string and having invalid encoding seems to defeat most simple solutions.

D is designed with the idea that a string is valid UTF-8, a wstring is valid UTF-16, and dstring is valid UTF-32. For various reasons, that doesn't always hold true like it should, but pretty much all of Phobos is written with that assumption and will generally throw an exception if it isn't. If you're ever dealing with a different encoding (or with invalid Unicode), you really need to use integral types like ubyte (e.g. by using std.string.representation or by reading the data in as ubytes rather than as a string) and not try to use character types like char or string. If you try to use char or string with invalid UTF-8 without having it throw any exceptions, you're pretty much guaranteed to fail.

- Jonathan M Davis
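
The difference between character types and integral types can be seen in a small sketch, assuming current Phobos behaviour where range primitives auto-decode `char` arrays. The helper `frontThrows` is hypothetical; `std.string.representation` is the function named above, and `std.utf.byCodeUnit` is shown as an additional way to skip decoding.

```d
import std.range.primitives : front;
import std.string : representation;
import std.utf : UTFException, byCodeUnit;

/// True if asking a char array for its first element throws, which is
/// what happens when auto-decoding meets an invalid UTF-8 sequence.
bool frontThrows(const(char)[] s)
{
    try
    {
        auto c = s.front; // decodes a whole code point
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

void main()
{
    // 0xFF never occurs in well-formed UTF-8.
    char[] dirty = [cast(char) 0xFF, 'a', '\n'];

    assert(frontThrows(dirty));  // char[]: decoded, so it throws
    assert(!frontThrows("abc")); // valid text decodes fine

    // The same memory viewed as ubyte[] is just numbers.
    ubyte[] raw = dirty.representation;
    assert(raw.front == 0xFF);

    // byCodeUnit keeps the char element type but never decodes.
    assert(dirty.byCodeUnit.front == cast(char) 0xFF);
}
```

This matches the stack trace earlier in the thread, where the exception comes from `front` being called on a `char[]` inside `joiner`.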

May 16
On Wednesday, 16 May 2018 at 07:06:45 UTC, Dennis wrote:
> On Wednesday, 16 May 2018 at 02:47:50 UTC, Jon Degenhardt wrote:
>> Can you show the program you are using that throws when using byLine?
>
> Here's a version that only outputs the first chunk:
> ```
> import std.stdio;
> import std.range;
> import std.algorithm;
> import std.file;
> import std.exception;
>
> void main(string[] args) {
> 	enforce(args.length == 2, "Pass one filename as argument");
> 	auto lineChunks = File(args[1], "r").byLine.drop(4).chunks(10_000_000/10);
> 	new File("output.txt", "w").write(lineChunks.front.joiner);
> }
> ```

If you write it in the style of my earlier example and use counters and if-tests it will work. byLine by itself won't try to interpret the characters (won't auto-decode them), so it won't trigger an exception if there are invalid utf-8 characters.
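
A sketch of that counter-based style, assuming (as described above) that `byLine` hands back the raw line without decoding it on your platform. The function `splitByLines` and the numbered output naming are made up for illustration; `rawWrite` is used so that nothing on the output side interprets the bytes either.

```d
import std.conv : to;
import std.format : format;
import std.stdio : File, KeepTerminator;

/// Copies `inputName` into files of at most `linesPerPart` lines each
/// and returns how many parts were written.
size_t splitByLines(string inputName, string outPrefix, size_t linesPerPart)
{
    size_t parts = 0;
    size_t linesInPart = 0;
    File output;
    foreach (line; File(inputName, "rb").byLine(KeepTerminator.yes))
    {
        if (!output.isOpen)
        {
            // Start the next part lazily, only once there is a line for it.
            output = File(format("%s%03d", outPrefix, parts), "wb");
            ++parts;
        }
        output.rawWrite(line); // bytes are copied as-is
        if (++linesInPart == linesPerPart)
        {
            output.close();
            linesInPart = 0;
        }
    }
    return parts;
}

void main(string[] args)
{
    // usage: prog <input> <outputPrefix> <linesPerPart>
    if (args.length == 4)
        splitByLines(args[1], args[2], args[3].to!size_t);
}
```

Skipping the 4 header lines would be one more counter and an `if` at the top of the loop, rather than `.drop(4)` followed by `joiner`.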
