I don't have time right now to do the full research I'd need to answer completely, so a lot of this will be conjecture; I apologize. I have seen the Gollum example you're referring to, though, so I do have a good understanding of the quality and scenario you describe.
I think the two products could probably be made to coexist with some elbow grease and perhaps a little coding. To take full advantage of Movimento, you'd probably want to write a little app for the OptiTrack camera that writes the objects (blips) to disk in a simple format and then, offline, turns them into a high-res AVI or MOV of white dots on black. Then you'd just import those movies into Movimento as if they were any video source, say from a camcorder. Movimento's human-assisted tracking would eat up the virtual video like candy and you'd be able to move forward from there. That's the kludgy way to do it, anyway. It might also be possible to write some code on the Movimento end to skip the video step, but I'd need to see their SDK docs to know more.
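To make the offline blip-to-video step concrete, here's a minimal sketch of what that little app could do, assuming your capture app can dump per-frame 2D blip positions; the resolution, dot radius, and file layout are all my own placeholders, not anything OptiTrack or Movimento prescribe:

```python
# Hypothetical sketch: rasterize per-frame 2D blip positions as white
# dots on black grayscale frames a tracker can ingest as ordinary video.

W, H, RADIUS = 1920, 1080, 3  # output resolution and dot size (assumptions)

def render_frame(blips, w=W, h=H, r=RADIUS):
    """Draw each (x, y) pixel position as a filled white disc on black."""
    buf = bytearray(w * h)  # 8-bit grayscale, initialized to black
    for x, y in blips:
        cx, cy = int(round(x)), int(round(y))
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if dx * dx + dy * dy <= r * r:
                    px, py = cx + dx, cy + dy
                    if 0 <= px < w and 0 <= py < h:
                        buf[py * w + px] = 255
    return bytes(buf)

def write_pgm(path, frame, w=W, h=H):
    """Write one frame as a binary PGM (P5) file."""
    with open(path, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (w, h))
        f.write(frame)
```

Then something like `ffmpeg -framerate 100 -i frame_%05d.pgm -c:v qtrle out.mov` would assemble the frame sequence into a MOV that Movimento should accept like any other footage.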
I would guess that the Gollum example is a combination of point-driven deformation (a skinned mesh) and shape animation. The shape animation could be driven by the facial joints using links (driven keys). If it were a VFX shot rather than a purist's tech demo, there would also be a layer of hand-animated fixes and enhancements on top.
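To make the "links (driven keys)" idea concrete: a driven key is essentially a remap curve from a driver channel (a facial joint's rotation, say) to a driven channel (a blendshape weight). A minimal piecewise-linear sketch, with all names and ranges hypothetical:

```python
def driven_key(value, key_in, key_out):
    """Piecewise-linear driven key: remap a driver channel (e.g. jaw
    rotation in degrees) to a driven channel (e.g. a blendshape weight).
    key_in and key_out are matched ascending keyframe lists; the driver
    value is clamped to the keyed range."""
    if value <= key_in[0]:
        return key_out[0]
    if value >= key_in[-1]:
        return key_out[-1]
    for i in range(1, len(key_in)):
        if value <= key_in[i]:
            t = (value - key_in[i - 1]) / (key_in[i] - key_in[i - 1])
            return key_out[i - 1] + t * (key_out[i] - key_out[i - 1])

# e.g. jaw rotation 0..25 degrees drives a "mouthOpen" shape weight 0..1
weight = driven_key(12.5, [0.0, 25.0], [0.0, 1.0])  # halfway open
```

In a real rig you'd set up one such curve per link, and the solved facial joints would then push the blendshape weights automatically every frame.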
MotionBuilder's "Actor Face" is a strange beast that doesn't make sense to me. I think it's just an old solution from long ago that they haven't replaced or removed yet. It expects a fairly simple set of markers and uses them to drive a specific set of blendshapes. I would think you'd approach it like the human body, where you attempt to "retarget" the data geometrically and hierarchically; instead, it tries to boil everything down to parameters like "smile" and "smirk".
I think attempts at using OptiTrack cameras to do face, body, and hands all at once are a little beyond the hardware's specs in typical usage scenarios. If you're talking about setting up the cameras on tripods and capturing everything at once, they simply don't have the resolution to cover the space for a full-body capture while also resolving the minutiae of the face and a detailed hand setup in one volume of reasonable size. You simply need higher-res cameras, or something like 50-100 V100s. Don't get me wrong: I'd rather have to build a 50-camera V100 setup than a 16-camera Vicon MX40 setup, because the price difference between those two is huge. However, the fact of the matter is, Arena and NP are not quite ready for that kind of scale yet, IMHO. I do know that scalability is high on the priority list and they want to get there, and I think they can. It will take a little time, though.
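As a back-of-envelope illustration of the resolution problem (simple pinhole math only; the VGA sensor width and the ~45 degree horizontal FOV are my assumptions, not exact V100 specs):

```python
import math

def mm_per_pixel(distance_mm, hfov_deg=45.0, h_pixels=640):
    """Pinhole approximation: millimeters of scene covered by one pixel
    at a given distance, for a sensor h_pixels wide with horizontal
    field of view hfov_deg."""
    scene_width_mm = 2.0 * distance_mm * math.tan(math.radians(hfov_deg) / 2.0)
    return scene_width_mm / h_pixels

# At 4 m (a modest full-body volume) each pixel spans roughly 5 mm, so a
# 3 mm facial marker is sub-pixel; at 1 m it's down to roughly 1.3 mm.
print(mm_per_pixel(4000), mm_per_pixel(1000))
```

That gap is why the face either needs its own close-range cameras or much higher sensor resolution than a VGA-class camera offers.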
I could see some specialized scenarios where you have a normal volume but you also have a camera or two head mounted to do face.
I could see a very complex scenario where you have a normal volume and some cameras on auto pan/tilt mounts with telephoto lenses that follow an actor's head and/or hands to get better resolution for those body parts.
Both of those uncommon setups would require some pretty hefty custom software, though. A year or so of development, I'd think.
My immediate pragmatic approach would be to try to break out the face and hands as separate passes of capture if I simply needed to get the job done and keep the budget low.
If I really needed facial at capture time, I'd set up a couple of hi-def witness cams (Sony handycams, nothing too serious) and subcontract to Image Metrics:
http://www.image-metrics.com/
If high-res hands were particularly critical, I'd look into a data glove or ShapeTape solution.
All in all, though, I don't think Movimento will gain you much here. Remember that the Gollum piece was done as a facial capture, not a full-body capture. Facial capture is MUCH easier for the mocap system and much harder for the rigger/animator; full body is harder on the system and easier on the rigger/animator. Arena doesn't support it yet, but there have already been suggestions that simply setting up a rigid body with all your facial markers and a high slack factor in Arena might be enough to get some facial capture going. I've not tried it myself; I've been focusing more on my own software.
If you have a specific project or scenario you are thinking about, feel free to contact me further, though unfortunately, my schedule just got really tight in the past few days and threatens to stay that way for the next year. Also, you may want to contact NaturalPoint directly as they are pretty good about helping productions with needs (and it sounds like you've got something specific in mind). They are actively working on features and fixes, and some of those may be applicable to your project.
Obviously, if you need full body, facial, and hands right now, you need to go to House of Moves or Giant Studios. They're the vendors that have done it and can do it for you at the going market rate, assuming they're not booked solid.