October 03, 2008
On Sat, Oct 4, 2008 at 1:10 AM, Benji Smith <dlanguage@benjismith.net> wrote:
> Fawzi Mohamed wrote:
>> Probably if you want least square approximation (that is what you meant
>> with LSA?) some kind of direct minimizer is the way to go (but be careful
>> the problem might ill conditioned).
>> Fawzi
> Sorry. I meant "Latent Semantic Analysis". I'm building a table of synonyms from a large document repository.
> There are about 100,000 documents in the repository, and the total corpus has about 100,000 unique words (after stemming). The LSA algorithm uses a matrix, where each row represents a single term, and each column represents a single document. Each cell contains the TF-IDF (Term Frequency / Inverse Document Frequency).
> Performing an SVD on the matrix yields a collelation matrix, representing the synonymy of all terms, in all documents.

How many nonzeros?  Assuming each document has maybe 10K terms that's
still a billion nonzero entries, and the SVD's U and V matrices will
be dense.
That is a lot of data.  Fawzi's right, you probably need some bigger
hammers for that, like a distributed or out of core SVD routine.

But given the application I'm also guessing you only care about the vectors associated with the largest few singular values, so you should be looking for a routine that can find a few singular values at a time.  The Arnoldi methods in ARPACK can do that kind of thing.   I can't remember if that handles sparse matrices too, but I think it does.  I seem to recall that you provide your own Mat*Vector calculation (and maybe dot or vector addition ops too?) and it does the rest.

Next ›   Last »
1 2