I am trying to run an executable using Hadoop streaming 2.4.
My executable is the mapper, a Groovy script. The script uses a class from a jar file that I pass via the -libjars argument.
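For reference, the job submission looks roughly like the sketch below; the jar, script, and path names are placeholders, not my actual ones:

```bash
# Rough shape of the submission (all names are placeholders).
# -libjars is a generic option, so it comes before the streaming-specific options.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar \
    -libjars mydeps.jar \
    -input /user/me/input.txt \
    -output /user/me/output \
    -mapper mapper.groovy \
    -file mapper.groovy
```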
Hadoop streaming spawns the map tasks from an input file, with each line feeding one map task. Although the job ultimately executes the use case successfully, I see that some map tasks fail and are restarted later. The failures are caused by the class not being located: the script has some imports, and the imported classes are not found, even though they are all in the jar file.
I am tempted to think that when Hadoop executes the first few map tasks, the jar file is not yet "prepared" (i.e., distributed and made available to the tasks), so those initial map tasks fail to locate the class; later, when they are restarted, the class is found and the tasks execute smoothly.
Is this correct? If not, can someone explain why this happens, and how I can work around it? Because of these retries, the use case takes a little longer to execute, and I fear that when I scale up the use case, this will cause a noticeable performance delay.