Python Hadoop streaming: importing packages that are not installed on the data nodes
I tried to import scikit-image in a Python Hadoop streaming job. I've tried the suggestions in existing posts on Stack Overflow (here and here), but none of them solved the problem.
The real question is: if I distribute a zip/mod file containing the packaged scikit-image folder using -file, how do the Python scripts running on the data nodes know how to extract the package and import the code? Note that I've installed scikit-image for Python on the name node, and I am able to run local experiments.
My script is trivial: the classic word count example for Python streaming, with "import skimage" added to mapper.py. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
My command:
hadoop jar hadoop-streaming.jar \
  -file mapper.py -mapper mapper.py \
  -file reducer.py -reducer reducer.py \
  -file ./skimage.mod \
  -input /user/text/* \
  -output /user/textoutput/
Screen printouts:
packagejobjar: [mapper.py, reducer.py, ./skimage.zip] [/usr/lib/gphd/hadoop-mapreduce-2.0.2_alpha_gphd_2_0_1_0/hadoop-streaming-2.0.2-alpha-gphd-2.0.1.0.jar] /tmp/streamjob6159562120374599467.jar tmpdir=null
14/04/04 18:00:02 info service.abstractservice: service:org.apache.hadoop.yarn.client.yarnclientimpl inited.
14/04/04 18:00:02 info service.abstractservice: service:org.apache.hadoop.yarn.client.yarnclientimpl started.
14/04/04 18:00:03 info service.abstractservice: service:org.apache.hadoop.yarn.client.yarnclientimpl inited.
14/04/04 18:00:03 info service.abstractservice: service:org.apache.hadoop.yarn.client.yarnclientimpl started.
14/04/04 18:00:03 warn snappy.loadsnappy: snappy native library not loaded
14/04/04 18:00:03 info mapred.fileinputformat: total input paths process : 1
14/04/04 18:00:03 info mapreduce.jobsubmitter: number of splits:2
14/04/04 18:00:03 warn conf.configuration: mapred.jar deprecated. instead, use mapreduce.job.jar
14/04/04 18:00:03 warn conf.configuration: mapred.cache.files deprecated. instead, use mapreduce.job.cache.files
14/04/04 18:00:03 warn conf.configuration: mapred.output.value.class deprecated. instead, use mapreduce.job.output.value.class
14/04/04 18:00:03 warn conf.configuration: mapred.mapoutput.value.class deprecated. instead, use mapreduce.map.output.value.class
14/04/04 18:00:03 warn conf.configuration: mapred.job.name deprecated. instead, use mapreduce.job.name
14/04/04 18:00:03 warn conf.configuration: mapred.input.dir deprecated. instead, use mapreduce.input.fileinputformat.inputdir
14/04/04 18:00:03 warn conf.configuration: mapred.output.dir deprecated. instead, use mapreduce.output.fileoutputformat.outputdir
14/04/04 18:00:03 warn conf.configuration: mapred.map.tasks deprecated. instead, use mapreduce.job.maps
14/04/04 18:00:03 warn conf.configuration: mapred.cache.files.timestamps deprecated. instead, use mapreduce.job.cache.files.timestamps
14/04/04 18:00:03 warn conf.configuration: mapred.output.key.class deprecated. instead, use mapreduce.job.output.key.class
14/04/04 18:00:03 warn conf.configuration: mapred.mapoutput.key.class deprecated. instead, use mapreduce.map.output.key.class
14/04/04 18:00:03 warn conf.configuration: mapred.working.dir deprecated. instead, use mapreduce.job.working.dir
14/04/04 18:00:03 info mapreduce.jobsubmitter: submitting tokens job: job_1384839777050_0106
14/04/04 18:00:04 info client.yarnclientimpl: submitted application application_1384839777050_0106 resourcemanager at hdm3.gphd.local/172.28.9.252:8032
14/04/04 18:00:04 info mapreduce.job: url track job: http://hdm3.gphd.local:8088/proxy/application_1384839777050_0106/
14/04/04 18:00:04 info mapreduce.job: running job: job_1384839777050_0106
14/04/04 18:00:08 info mapreduce.job: job job_1384839777050_0106 running in uber mode : false
14/04/04 18:00:08 info mapreduce.job: map 0% reduce 0%
14/04/04 18:00:12 info mapreduce.job: task id : attempt_1384839777050_0106_m_000001_0, status : failed
error: java.lang.runtimeexception: pipemapred.waitoutputthreads(): subprocess failed code 1
    at org.apache.hadoop.streaming.pipemapred.waitoutputthreads(pipemapred.java:320)
    at org.apache.hadoop.streaming.pipemapred.mapredfinished(pipemapred.java:533)
    at org.apache.hadoop.streaming.pipemapper.close(pipemapper.java:130)
I checked the error log for the Hadoop job. It complains that it cannot find the module at "import skimage", which means the package is not being picked up on the data nodes.
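One way to narrow down a failure like this is to have the mapper report where Python actually looked. The following is only a sketch: the helper name report_import is hypothetical, and it relies on the fact that whatever a streaming script writes to stderr ends up in the task logs.

```python
import importlib
import os
import sys


def report_import(module_name, stream=sys.stderr):
    """Try to import `module_name`; on failure, log where Python looked.

    `report_import` is a hypothetical helper, not part of Hadoop or Python.
    Returns True if the import worked, False otherwise.
    """
    try:
        importlib.import_module(module_name)
        return True
    except ImportError as exc:
        # stderr of a streaming task is captured in the task's log files.
        stream.write("import %s failed: %s\n" % (module_name, exc))
        # Files shipped with the job should show up in the working directory.
        stream.write("cwd contents: %s\n" % sorted(os.listdir(os.getcwd())))
        stream.write("sys.path: %s\n" % sys.path)
        return False
```

Calling report_import("skimage") at the top of mapper.py would show whether the shipped file reached the task's working directory and whether that location is on sys.path.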
Have you tried the zipimport solution?
Here is an example: Hadoop: How to include third party library in Python MapReduce
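As a rough illustration of the idea, the mapper can make the shipped archive importable itself before importing anything from it. This is a sketch under assumptions, not a tested recipe for this cluster: the helper name add_archive_to_path and the directory name "deps" are hypothetical, and the archive is assumed to have the skimage/ folder at its root. It extracts the zip rather than importing straight from it, because zipimport cannot load compiled extension modules, which scikit-image contains.

```python
import os
import sys
import zipfile


def add_archive_to_path(archive, extract_dir="deps"):
    """Extract `archive` into `extract_dir` and put that directory on sys.path.

    Hypothetical helper. Extracting (rather than adding the zip itself to
    sys.path) is needed for packages with compiled extension modules.
    Returns the absolute path that was added.
    """
    if not os.path.isdir(extract_dir):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(extract_dir)
    path = os.path.abspath(extract_dir)
    if path not in sys.path:
        sys.path.insert(0, path)
    return path


# In mapper.py, before any import from the package (file name is an assumption):
#     add_archive_to_path("skimage.zip")
#     import skimage
```

Even with this in place, the compiled parts of the package have to match the Python version and platform on the data nodes, and its own dependencies (NumPy, SciPy) must be importable there as well.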