October 03, 2008 Re: Matrix/Linear Algebra Library? | |
---|---|

Posted in reply to Benji Smith | ```
On Sat, Oct 4, 2008 at 1:10 AM, Benji Smith <dlanguage@benjismith.net> wrote:
> Fawzi Mohamed wrote:
>>
>> Probably if you want least square approximation (that is what you meant
>> with LSA?) some kind of direct minimizer is the way to go (but be careful
>> the problem might ill conditioned).
>> Fawzi
>
> Sorry. I meant "Latent Semantic Analysis". I'm building a table of synonyms
> from a large document repository.
>
> There are about 100,000 documents in the repository, and the total corpus
> has about 100,000 unique words (after stemming). The LSA algorithm uses a
> matrix, where each row represents a single term, and each column represents
> a single document. Each cell contains the TF-IDF (Term Frequency / Inverse
> Document Frequency).
>
> Performing an SVD on the matrix yields a collelation matrix, representing
> the synonymy of all terms, in all documents.
>
How many nonzeros? Assuming each document has maybe 10K terms that's
still a billion nonzero entries, and the SVD's U and V matrices will
be dense.
That is a lot of data. Fawzi's right, you probably need some bigger
hammers for that, like a distributed or out of core SVD routine.
But given the application I'm also guessing you only care about the
vectors associated with the largest few singular values, so you should
be looking for a routine that can find a few singular values at a
time. The Arnoldi methods in ARPACK can do that kind of thing. I
can't remember if that handles sparse matrices too, but I think it
does. I seem to recall that you provide your own Mat*Vector
calculation (and maybe dot or vector addition ops too?) and it does
the rest.
--bb
``` |

Next › Last » 1 2 |
---|