movups is a poor choice here. It will be a lot faster (and more portable) if you use movaps, which requires 16-byte-aligned addresses.
The process looks something like this:
* process the first few elements as scalars, from a[0] until a reaches its alignment boundary
* load the left half of b's aligned pair (the aligned vector that contains b's current element)
* loop over each aligned vector in a:
- load a[n..n+4] aligned
- load the right half of b's pair (the next aligned vector of b)
- combine left and right, shifting left so the elements line up against a
- left = right
* process the stragglers as scalars
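
The steps above can be sketched roughly as follows. This is only an illustration under assumptions of my own, not your actual function: I assume float elements, a dot product as the operation, and plain SSE intrinsics (`_mm_load_ps` compiles to movaps). The names `dot_aligned` and `combine` are made up for the example. The extra scalar step before the vector loop is there so that every aligned load of b stays inside the array.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; _mm_load_ps is an aligned load (movaps) */

/* Hypothetical helper: return the 4 floats that start k elements into
 * `left` and continue into `right` (k is 1..3; k == 0 returns left). */
static __m128 combine(__m128 left, __m128 right, unsigned k)
{
    __m128 t;
    switch (k) {
    case 1:  /* want [L1 L2 L3 R0] */
        t = _mm_move_ss(left, right);                 /* [R0 L1 L2 L3] */
        return _mm_shuffle_ps(t, t, _MM_SHUFFLE(0, 3, 2, 1));
    case 2:  /* want [L2 L3 R0 R1] */
        return _mm_shuffle_ps(left, right, _MM_SHUFFLE(1, 0, 3, 2));
    case 3:  /* want [L3 R0 R1 R2] */
        t = _mm_shuffle_ps(left, right, _MM_SHUFFLE(0, 0, 3, 3)); /* [L3 L3 R0 R0] */
        return _mm_shuffle_ps(t, right, _MM_SHUFFLE(2, 1, 2, 0));
    default:
        return left;
    }
}

/* Hypothetical example operation: dot product of a and b using only
 * aligned vector loads, whatever the alignment of the two pointers. */
float dot_aligned(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i = 0;

    /* Scalar until a + i sits on a 16-byte boundary. */
    while (i < n && ((uintptr_t)(a + i) & 15) != 0) {
        sum += a[i] * b[i];
        ++i;
    }

    /* How many elements b + i lies past its own 16-byte boundary. */
    unsigned k = (unsigned)(((uintptr_t)(b + i) & 15) / sizeof(float));
    __m128 acc = _mm_setzero_ps();

    if (k == 0) {
        /* Both pointers aligned: no combining needed. */
        for (; i + 4 <= n; i += 4)
            acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)));
    } else {
        /* Keep b's left block inside the array: if it would start before
         * b[0], do one vector's worth as scalar (a stays aligned). */
        if (i < k)
            for (size_t j = 0; j < 4 && i < n; ++j, ++i)
                sum += a[i] * b[i];

        /* Loop only while b's right block also ends inside the array. */
        if (i + 8 - k <= n) {
            __m128 left = _mm_load_ps(b + i - k);           /* left of b's pair */
            for (; i + 8 - k <= n; i += 4) {
                __m128 va = _mm_load_ps(a + i);             /* a[i..i+3], aligned */
                __m128 right = _mm_load_ps(b + i - k + 4);  /* right of b's pair */
                acc = _mm_add_ps(acc, _mm_mul_ps(va, combine(left, right, k)));
                left = right;
            }
        }
    }

    /* Horizontal sum of the four lanes. */
    _Alignas(16) float lanes[4];
    _mm_store_ps(lanes, acc);
    sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Stragglers as scalar. */
    for (; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The switch in `combine` is loop-invariant, so a real implementation would probably hoist it and write one loop per offset. With SSSE3 available, palignr (`_mm_alignr_epi8`) does the combine in a single instruction, but it also needs the shift as a compile-time constant.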
Your benchmark is probably misleading too, because I suspect you are passing freshly allocated arrays directly into the function, and those are 16-byte aligned.
movups will be significantly slower if the pointers supplied are not 16-byte aligned.
Also, results vary significantly between chip manufacturers and revisions.
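
To see the unaligned case in your benchmark, one option is to allocate on a 16-byte boundary and then hand the function a pointer a few elements in. A minimal sketch, assuming float arrays and C11 `aligned_alloc`; the helper name `alloc_offset_floats` is made up for the example:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a pointer to n floats that starts `offset`
 * elements past a 16-byte boundary. *base receives the pointer to free(). */
static float *alloc_offset_floats(size_t n, size_t offset, void **base)
{
    size_t bytes = (n + offset) * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;  /* aligned_alloc wants a multiple of the alignment */
    void *p = aligned_alloc(16, bytes);
    *base = p;
    return p ? (float *)p + offset : NULL;
}
```

Running the benchmark with offsets 0 to 3 for each input covers every relative alignment of the two arrays.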