Execute a Hadoop job remotely and programmatically

+1 vote
1,220 views

My project requires executing a Hadoop job remotely, and the job depends on some third-party libraries (jar files). I tried:
1. Copying these jar files to HDFS.
2. Adding them to the distributed cache with DistributedCache.addFileToClassPath so that Hadoop can distribute these jar files to each of the slave nodes.

However, my program still throws ClassNotFoundException, indicating that some of the classes cannot be found while the job is running.

So I am looking for:
1. The correct way to run a job remotely and programmatically when the job requires some third-party jar files.
2. An alternative class, since I found that DistributedCache is deprecated (I'm using Hadoop 1.2.0).

posted Dec 9, 2013 by Meenal Mishra

Put them in a lib directory inside the jar you pass to Hadoop and they will be found...

1 Answer

+1 vote

Please have a look at the -libjars option of the hadoop command. It tells the system which additional libraries have to be shipped to the cluster before the job can start. This distribution happens again each time you submit the job, so it is not a good idea for really large libraries; those you should deploy on all nodes instead, and then configure the classpath of the JVMs that run the tasks.
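
For illustration, here is a minimal sketch of such a submission driven from a Python script (to match the "programmatically" requirement). The jar paths, driver class and job arguments below are made up, and -libjars is only honored when the driver parses the generic options, for example by implementing Tool and being launched through ToolRunner:

import subprocess

# Hypothetical paths, class name and arguments -- replace with your own.
cmd = [
    "hadoop", "jar", "/local/path/myjob.jar",      # the job jar
    "com.example.MyDriver",                        # driver class (should go through ToolRunner)
    "-libjars", "/local/path/lib1.jar,/local/path/lib2.jar",  # third-party jars
    "/input/path", "/output/path",                 # job arguments
]

# Raises CalledProcessError if the hadoop command exits non-zero.
subprocess.run(cmd, check=True)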

answered Dec 10, 2013 by Sonu Jindal
Similar Questions
+1 vote

I have a requirement where I need to kill a process on a remote Windows machine. The following command works fine if I have to kill a process on the local machine:

os.system('taskkill /f /im processName.exe')

However, I am not able to figure out how to execute this command on a remote Windows machine. Is there any way I can run a command from one Windows machine on a remote Windows machine?
Note: my local machine is also Windows (the machine from which I have to execute the command).
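
One option, since the local machine is also Windows: taskkill itself can target a remote machine with /s, plus /u and /p for credentials, so the same call just needs a few more arguments. The host name and credentials below are placeholders:

import subprocess

# Placeholders -- replace with the real host name and credentials.
remote_host = "REMOTE-PC"
user = "DOMAIN\\username"
password = "secret"

# /s targets the remote system, /u and /p supply credentials;
# /f and /im work exactly as in the local case.
subprocess.run([
    "taskkill",
    "/s", remote_host,
    "/u", user,
    "/p", password,
    "/f", "/im", "processName.exe",
], check=True)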

0 votes

I want to ask: what is the best way to implement a job that imports files into HDFS?

I have an external system offering data accessible through a REST API. My goal is to have a job running in Hadoop that periodically (maybe started by cron?) checks the REST API for new data.

It would also be nice if this job could run on multiple data nodes. But unlike all the MapReduce examples I found, my job looks for new or changed data from an external interface and compares it with the data that already exists.

This is a conceptual example of the job:

  1. The job asks the REST API whether there are new files.
  2. If so, the job fetches the first file in the list.
  3. It checks whether the file already exists in HDFS.
  4. If not, the job imports the file.
  5. If yes, the job compares the data with the data already stored.
  6. If the data has changed, the job updates the file.
  7. If more files exist, the job continues with step 2.
  8. Otherwise it ends.

Can anybody give me a little help on how to start (it's the first job I've written...)?
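
As a starting point, here is a rough Python sketch of that loop. The REST endpoints and JSON layout are invented, and it simply shells out to the hadoop fs commands; a real job would add proper change detection (e.g. checksums) and error handling:

import subprocess
import requests  # any HTTP client would do

API = "http://example.com/api"  # hypothetical REST endpoint

def hdfs_exists(path):
    # "hadoop fs -test -e" exits with 0 if the path exists.
    return subprocess.run(["hadoop", "fs", "-test", "-e", path]).returncode == 0

def import_file(name):
    local_path = "/tmp/" + name
    hdfs_path = "/data/imports/" + name
    # Download the file from the (hypothetical) REST API to a local temp file.
    with open(local_path, "wb") as f:
        f.write(requests.get(API + "/files/" + name).content)
    if hdfs_exists(hdfs_path):
        # File already stored: naive update by replacing it.
        # A real job would first compare checksums or timestamps.
        subprocess.run(["hadoop", "fs", "-rm", hdfs_path], check=True)
    subprocess.run(["hadoop", "fs", "-put", local_path, hdfs_path], check=True)

def run_once():
    # Step 1: ask the REST API for the list of new files (hypothetical endpoint).
    for name in requests.get(API + "/new-files").json():
        import_file(name)  # steps 2-6 of the list above
    # Scheduling (cron or similar) takes care of running this periodically.

if __name__ == "__main__":
    run_once()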

...