On 2017-04-13 19:10, Ivo Cavalcante wrote:
1. Our software used to prepare the datasets to be processed directly on the shared filesystem (NAS), which took a long time. So we switched to generating the datasets on the workstations' local disks, which greatly reduced the time spent. OTOH, we then had to move this data to a place where the execution nodes could see it, and decided to use their local disks too, since the shared NAS could become a bottleneck again.
At one point we were preparing the search dataset on an SSD to deal with the first bottleneck. If you're copying it to worker nodes afterwards, there is no reason to prepare it on a network share.
I tried various ways to push it out to the worker nodes once it was ready, but so far I have failed to come up with a good way to make a node advertise "I have the complete and up-to-date dataset and can run jobs that need it". :(
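
For what it's worth, the "complete and up-to-date" part could perhaps be checked on each node along the lines below. This is only a rough sketch, not something I have running: it assumes the machine that prepares the dataset publishes a single expected digest, and that the batch system can run a periodic node-side script and pick up "name = value" lines from its output. The names dataset_digest, advertise_line and HasSearchDataset are made up for illustration.

```python
import hashlib
import os


def dataset_digest(root):
    """Return a SHA-256 hex digest over the relative paths and contents
    of all files under root, visited in a deterministic order."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place fixes the traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            h.update(rel.encode("utf-8") + b"\0")
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            h.update(b"\0")
    return h.hexdigest()


def advertise_line(root, expected_digest, attr="HasSearchDataset"):
    """Build a 'name = value' line stating whether the local copy under
    root matches expected_digest. The attribute name is hypothetical."""
    ok = os.path.isdir(root) and dataset_digest(root) == expected_digest
    return "%s = %s" % (attr, "True" if ok else "False")
```

A missing, partially copied, or stale copy yields a different digest, so the node would report False until the push has finished. Hashing everything on every run may be too slow for a large dataset; writing a small marker file containing the digest as the very last step of the copy, and only comparing that, would be a cheaper variant of the same idea.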
FWIW Dimitri