March 13, 2008
BCS Wrote:

> Sean Kelly wrote:
> > == Quote from N/A (NA@NA.na)'s article
> > 
> > >I generally handle files that are a few hundred MB to a few gigs, and I noticed that the input is a char[]. Do you also plan on adding file streams as input?
> > 
> > 
> > I believe the suggested approach in this case is to access the input as a memory-mapped file. This does place some restrictions on file size in 32-bit applications, but then those are ideally in decline.
> > 
> > 
> > Sean
> 
What might be interesting is to make a version that works with slices of the file rather than RAM (make the current version into a template specialized on char[] and the new one on some new type?). That way only the parsed metadata needs to stay in RAM. It would take a lot of games mapping stuff in and out of RAM, but it would be interesting to see if it could be done.

It would be interesting, but isn't that kinda what memory-mapped files provide? You can operate on files up to 4GB in size (on a 32-bit system), even with DOM, where the slices are virtual addresses within paged file blocks. Effectively, each paged segment of the file is a lower-level slice?
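
For illustration, a minimal D sketch of that approach, using Phobos's std.mmfile (the file name and the final parse call are placeholders, not any particular parser's API):

import std.mmfile;

void main()
{
    // Map the whole file into the address space; the OS pages it in
    // lazily, so nothing is copied into RAM up front.
    auto mmf = new MmFile("huge.xml");     // "huge.xml" is a placeholder
    auto content = cast(char[]) mmf[];     // view the mapping as a char[]

    // Slices taken by the parser then point straight into the mapped
    // pages rather than into heap-allocated copies.
    // parse(content);  // hand off to whatever parser accepts a char[]
}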
March 14, 2008
Reply to kris,

> BCS Wrote:
> 
>> What might be interesting is to make a version that works with slices
>> of the file rather than RAM (make the current version into a
>> template specialized on char[] and the new one on some new type?).
>> That way only the parsed metadata needs to stay in RAM. It would
>> take a lot of games mapping stuff in and out of RAM, but it would be
>> interesting to see if it could be done.
>> 
> It would be interesting, but isn't that kinda what memory-mapped files
> provide? You can operate on files up to 4GB in size (on a 32-bit
> system), even with DOM, where the slices are virtual addresses within
> paged file blocks. Effectively, each paged segment of the file is a
> lower-level slice?
> 

Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits, you can't map in a full 4GB because you need space for the program's code (and on Windows you normally get only 2GB of user address space, or 3GB with the /3GB switch, as the OS reserves the rest). Also, what about a 10GB file?

My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would map stuff in on demand. Some sort of GC-ish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file. I'm not sure it's even possible or how it would work, but it would be cool (and highly useful).
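
If memory serves, Phobos's std.mmfile already approximates part of this: MmFile takes an optional window size and transparently re-maps when an index falls outside the current window, so even a 10GB file is reachable through a 32-bit address space. A minimal sketch under that assumption (the file name is a placeholder; the slice objects and the GC-ish eviction would still have to be built on top):

import std.mmfile;

void main()
{
    // Map at most 16MB at a time; opIndex re-maps the window whenever
    // it touches a byte outside the currently mapped region.
    auto big = new MmFile("10gb.xml", MmFile.Mode.read, 0, null,
                          16 * 1024 * 1024);

    // Syntactically an array index, but it may trigger a re-map.
    ubyte b = big[5_000_000_000];   // a byte 5GB into the file
}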


March 14, 2008
Reply to BCS:

"BCS" <ao@pathlink.com> wrote in message news:55391cb32a6178ca5358fd65a320@news.digitalmars.com...
> Reply to kris,
>
>> BCS Wrote:
>>
>>> What might be interesting is to make a version that works with slices of the file rather than RAM (make the current version into a template specialized on char[] and the new one on some new type?). That way only the parsed metadata needs to stay in RAM. It would take a lot of games mapping stuff in and out of RAM, but it would be interesting to see if it could be done.
>>>
>> It would be interesting, but isn't that kinda what memory-mapped files provide? You can operate on files up to 4GB in size (on a 32-bit system), even with DOM, where the slices are virtual addresses within paged file blocks. Effectively, each paged segment of the file is a lower-level slice?
>>
>
> Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits, you can't map in a full 4GB because you need space for the program's code (and on Windows you normally get only 2GB of user address space, or 3GB with the /3GB switch, as the OS reserves the rest).

Doh. You're right, of course. Thank goodness for 64-bit machines :)



March 14, 2008
BCS wrote:
> Reply to kris,
> 
>> BCS Wrote:
>>
>>> What might be interesting is to make a version that works with slices
>>> of the file rather than RAM (make the current version into a
>>> template specialized on char[] and the new one on some new type?).
>>> That way only the parsed metadata needs to stay in RAM. It would
>>> take a lot of games mapping stuff in and out of RAM, but it would be
>>> interesting to see if it could be done.
>>>
>> It would be interesting, but isn't that kinda what memory-mapped files
>> provide? You can operate on files up to 4GB in size (on a 32-bit
>> system), even with DOM, where the slices are virtual addresses within
>> paged file blocks. Effectively, each paged segment of the file is a
>> lower-level slice?
>>
> 
> Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits, you can't map in a full 4GB because you need space for the program's code (and on Windows you normally get only 2GB of user address space, or 3GB with the /3GB switch, as the OS reserves the rest). Also, what about a 10GB file? My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would map stuff in on demand. Some sort of GC-ish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file. I'm not sure it's even possible or how it would work, but it would be cool (and highly useful).


I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "ifs" and "whens", but still: if you find yourself needing such a beast of an XML file, you might want to consider other forms of data structuring (a database, perhaps?).

March 14, 2008
On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek <alexander.panek@brainsware.org> wrote:

> BCS wrote:
>> Reply to kris,
>>
>>> BCS Wrote:
>>>
>>>> What might be interesting is to make a version that works with slices
>>>> of the file rather than RAM (make the current version into a
>>>> template specialized on char[] and the new one on some new type?).
>>>> That way only the parsed metadata needs to stay in RAM. It would
>>>> take a lot of games mapping stuff in and out of RAM, but it would be
>>>> interesting to see if it could be done.
>>>>
>>> It would be interesting, but isn't that kinda what memory-mapped files
>>> provide? You can operate on files up to 4GB in size (on a 32-bit
>>> system), even with DOM, where the slices are virtual addresses within
>>> paged file blocks. Effectively, each paged segment of the file is a
>>> lower-level slice?
>>>
>> Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits, you can't map in a full 4GB because you need space for the program's code (and on Windows you normally get only 2GB of user address space, or 3GB with the /3GB switch, as the OS reserves the rest). Also, what about a 10GB file? My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would map stuff in on demand. Some sort of GC-ish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file. I'm not sure it's even possible or how it would work, but it would be cool (and highly useful).
>
>
> I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "ifs" and "whens", but still: if you find yourself needing such a beast of an XML file, you might want to consider other forms of data structuring (a database, perhaps?).
>


It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.
March 14, 2008
Koroskin Denis wrote:
> On Fri, 14 Mar 2008 11:40:20 +0300, Alexander Panek <alexander.panek@brainsware.org> wrote:
>>
>> I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "ifs" and "whens", but still: if you find yourself needing such a beast of an XML file, you might want to consider other forms of data structuring (a database, perhaps?).
> 
> It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.

That does, indeed, sound strange. :X
March 14, 2008
== Quote from Alexander Panek (alexander.panek@brainsware.org)'s article
> BCS wrote:
> > Reply to kris,
> >
> >> BCS Wrote:
> >>
> >>> What might be interesting is to make a version that works with slices of the file rather than RAM (make the current version into a template specialized on char[] and the new one on some new type?). That way only the parsed metadata needs to stay in RAM. It would take a lot of games mapping stuff in and out of RAM, but it would be interesting to see if it could be done.
> >>>
> >> It would be interesting, but isn't that kinda what memory-mapped files provide? You can operate on files up to 4GB in size (on a 32-bit system), even with DOM, where the slices are virtual addresses within paged file blocks. Effectively, each paged segment of the file is a lower-level slice?
> >>
> >
> > Not as I understand it (I looked this up about a year ago, so I'm a bit rusty). On 32 bits, you can't map in a full 4GB because you need space for the program's code (and on Windows you normally get only 2GB of user address space, or 3GB with the /3GB switch, as the OS reserves the rest). Also, what about a 10GB file? My idea is to make some sort of lib that lets you handle large data sets (64-bit?). You would ask for a file to be "mapped in" and then you would get an object that syntactically looks like an array. Index ops would actually map in pieces; slices would generate new objects (with a ref to the parent) that would map stuff in on demand. Some sort of GC-ish thing would start unmapping/moving strings when space gets tight. If you never have to actually convert the data to a "real" array, you don't ever need to copy the stuff; you can just leave it in the file. I'm not sure it's even possible or how it would work, but it would be cool (and highly useful).
> I've got this strange feeling in my stomach that shouts out "WTF?!" when I read about >3-4GB XML files. I know, it's about the "ifs" and "whens", but still: if you find yourself needing such a beast of an XML file, you might want to consider other forms of data structuring (a database, perhaps?).

It's quite possible that an XML stream could be used as the transport mechanism for the result of a database query. In such an instance, I wouldn't be at all surprised if a response were more than 3-4GB. In fact, I've designed such a system, and the proper query would definitely have produced such a dataset.
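
With a streaming (pull) parser, only the current token ever needs to be in memory, which is what makes responses of that size tractable. A rough D sketch of the shape such code takes; the names are from memory of Tango's tango.text.xml.PullParser and should be treated as assumptions rather than its exact API, and "row" is a hypothetical element name:

import tango.text.xml.PullParser;

void consume (char[] content)
{
    // Walk the token stream; only the current token needs to be live.
    auto parser = new PullParser!(char) (content);
    while (parser.next)
    {
        if (parser.type == XmlTokenType.StartElement &&
            parser.localName == "row")   // one result row per element
        {
            // handle one row here, then let it go
        }
    }
}

Pair that with a memory-mapped (or incrementally read) source, and the peak footprint stays roughly constant regardless of response size.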


Sean
March 14, 2008
Koroskin Denis wrote:
> It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.

It's a shame the "O RLY?" owl died out years ago...
March 14, 2008
"Robert Fraser" <fraserofthenight@gmail.com> wrote in message news:freg27$1m7l$1@digitalmars.com...
> Koroskin Denis wrote:
>> It sounds strange, but even large companies like Google or Yahoo store their temporary search indexes in ULTRA large XML files, and many of them can easily be tens or even hundreds of GBs in size (just an ordinary daily index) before they get "repacked" into a more compact format.
>
> It's a shame the "O RLY?" owl died out years ago...

O RLY?

Good internet memes never die, they just go into hibernation ;)


March 15, 2008
Reply to Alexander,

> BCS wrote:
> 
>> Not as I understand it (I looked this up about a year ago, so I'm a
>> bit rusty). On 32 bits, you can't map in a full 4GB because you need
>> space for the program's code (and on Windows you normally get only
>> 2GB of user address space, or 3GB with the /3GB switch, as the OS
>> reserves the rest). Also, what about a 10GB file? My idea is to make
>> some sort of lib that lets you handle large data sets (64-bit?). You
>> would ask for a file to be "mapped in" and then you would get an
>> object that syntactically looks like an array. Index ops would
>> actually map in pieces; slices would generate new objects (with a
>> ref to the parent) that would map stuff in on demand. Some sort of
>> GC-ish thing would start unmapping/moving strings when space gets
>> tight. If you never have to actually convert the data to a "real"
>> array, you don't ever need to copy the stuff; you can just leave it
>> in the file. I'm not sure it's even possible or how it would work,
>> but it would be cool (and highly useful).
>> 
> I've got this strange feeling in my stomach that shouts out "WTF?!"
> when I read about >3-4GB XML files. I know, it's about the "ifs" and
> "whens", but still: if you find yourself needing such a beast of an
> XML file, you might want to consider other forms of data structuring
> (a database, perhaps?).
> 

Truth be told, I'm not that far from agreeing with you (on seeing that, I'd think: "WTF?!?!.... Um... OoooK.... well..."). I can't think of a justification for the lib I described if the only thing it would be used for were an XML parser. It might be used for managing parts of something like... a database table. <G>