On 2016-01-01 13:22, Nicholas Wilson via Digitalmars-d-learn wrote:

On Friday, 1 January 2016 at 00:41:56 UTC, brian wrote:
I have a large list, B, of string items. For each item in that large list, I need to see if it is in the smaller list, A. I have been using a simple string array for the storage of A string[] A and then using foreach to go through all the items of B and check they are in A foreach(string;B) /* this looks hacky but wasn't working without the !=0 bit ) if(find(A,string) != 0) writeln("Found a line: ", string); While this works for small datasets, but when either A or B get large (A could be up to 150k records, B in the millions) it takes quite a while to run. I'd like to know what is the best way to store lists of text for searching? Is there a better container than a simply array? Neither A nor B need to be ordered for my purpose, but would sorting help the search? Would it help enough to be worth the CPU expense? Regards B
Your problem is that your algorithm is O(m*n). (B is m ,A is n)
A simple speedup would be to sort A and then use a binary search (O(m*log(n)))
(partially sorting B may also help but only for cache)

Fully sorting may be worth doing if you have other steps after that also benefit in speed when working on sorted data (like searching does).

Changing A to a trie is another possibility.

But as always it help to know your data. How long are the strings(are they of the same length)? Are there any similarities (e.g. are they all email addresses)? Are there duplicates allowed in B?

Nic

Thanks. I'll try your suggestions.

The data I am looking at is the Microsoft Academic Graph data.
https://academicgraph.blob.core.windows.net/graph-2015-11-06/index.html

I have picked a topic of interest, and am pulling the IDs for papers of those topics.
There will be up to 150,000 in the list (A) and each id will look something like:

000241C7
0018BC77

I then have a "Papers.txt" file, which has a list (B) of (a couple of hundred million) papers and details about the papers.
I am then searching through this list for the items in A.The Papers.txt file has records that look like:

007D0259    Business Intelligence analytics without conventional data warehousing    business intelligence analytics without conventional data warehousing    2010    2010                        19542
0018BC77    Data mining email to discover social networks and communities of practice    data mining email to discover social networks and communities of practice    2003                            17584

So I need to iterate through B, and see if the first entry, is one of the papers I want, by comparing it to list A.
In the above example, 007D0259 I don't want, 0018BC77 I do want, and will write that record to another file.

I have started to delete entries from list A as I find them, so that the search doesn't have as many records to look through each time, but I amn't sure that that will decrease time, because it'll take time to find the record on each delete.