This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

How to run distributed training using yarn? #2959

Open
zhangshiyu01 opened this issue Aug 8, 2016 · 6 comments

Comments


zhangshiyu01 commented Aug 8, 2016

../../tools/launch.py -n 2 --launcher yarn python train_mnist.py --network lenet --kv-store dist_sync

Traceback (most recent call last):
  File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 81, in <module>
    main()
  File "/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py", line 30, in main
    assert cluster is not None, 'need to have DMLC_JOB_CLUSTER'
AssertionError: need to have DMLC_JOB_CLUSTER
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 365, in <lambda>
    target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
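[Editor's note] The assertion above comes from launcher.py requiring the DMLC_JOB_CLUSTER environment variable, which the tracker is supposed to export into each process's environment before the launcher runs. A minimal sketch of that guard (hypothetical function name, mirroring only the assertion shown in the traceback):

```python
def resolve_cluster(environ):
    # Mirrors the guard in dmlc_tracker/launcher.py: the tracker must
    # export DMLC_JOB_CLUSTER before the worker-side launcher starts.
    cluster = environ.get('DMLC_JOB_CLUSTER', None)
    assert cluster is not None, 'need to have DMLC_JOB_CLUSTER'
    return cluster

# A correctly submitted YARN job would hand each worker something like:
print(resolve_cluster({'DMLC_JOB_CLUSTER': 'yarn'}))  # prints "yarn"
```

Seeing the assertion fire on the submitting host suggests launcher.py was executed directly, without the environment the tracker would normally provide.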

yarn 2 -------------/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python -tempdir /tmp -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync

16/08/08 15:59:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2567)
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2536)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:835)
at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:831)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:831)
at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:824)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1815)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:595)
at org.apache.hadoop.yarn.dmlc.Client.setupCacheFiles(Client.java:134)
at org.apache.hadoop.yarn.dmlc.Client.run(Client.java:282)
at org.apache.hadoop.yarn.dmlc.Client.main(Client.java:348)

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=ads, access=WRITE, inode="/tmp":hadoop:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:274)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:260)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:241)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5546)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:5528)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAncestorAccess(FSNamesystem.java:5493)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:3632)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3602)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3576)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:760)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:560)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)

at org.apache.hadoop.ipc.Client.call(Client.java:1410)
at org.apache.hadoop.ipc.Client.call(Client.java:1363)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:502)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2565)
... 11 more

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/yarn.py", line 114, in run
    subprocess.check_call(cmd, shell=True, env=env)
  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '/usr/local/jdk1.6.0_45/bin/java -cp /usr/local/hadoop-2.4.0/etc/hadoop:/usr/local/hadoop-2.4.0/share/hadoop/common/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/common/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/hdfs/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/yarn/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.4.0/share/hadoop/mapreduce/*:/usr/local/hadoop-2.4.0/contrib/capacity-scheduler/*.jar:/data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar org.apache.hadoop.yarn.dmlc.Client -file /data0/ads/chenglei/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/../yarn/dmlc-yarn.jar -file train_mnist.py -file /data0/ads/chenglei/mxnet/dmlc-core/tracker/dmlc_tracker/launcher.py -jobname DMLC[nworker=2,nsever=2]:python -tempdir /tmp -queue default ./launcher.py python ./train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
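[Editor's note] The underlying failure is the AccessControlException above: the DMLC YARN client stages its files under the HDFS path given by -tempdir (here /tmp), and user ads has no write access to it (/tmp is owned hadoop:supergroup, mode drwxr-xr-x). Two possible fixes, sketched with standard hadoop fs commands (the paths are examples, not from the thread):

```shell
# Option 1 (requires HDFS superuser): open /tmp as shared scratch space,
# world-writable with the sticky bit, as on a typical Hadoop install.
hadoop fs -chmod 1777 /tmp

# Option 2: stage into a directory the submitting user already owns,
# then point the client's -tempdir flag at it instead of /tmp.
hadoop fs -mkdir -p /user/ads/dmlc-temp
```

How the temp directory is plumbed through launch.py may differ by dmlc-core version; the flag the Java client itself accepts, as shown in the log above, is -tempdir.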


zhangshiyu01 commented Aug 9, 2016

Should I add a hosts file?
I see this dispatch code:
if args.cluster == 'local' or args.host_file is None or args.host_file == 'None':
    from dmlc_tracker import local
    local.submit(args)
elif args.cluster == 'sge':
    from dmlc_tracker import sge
    sge.submit(args)
elif args.cluster == 'yarn':
    from dmlc_tracker import yarn
    print '------- go yarn ---'
    yarn.submit(args)
elif args.cluster == 'ssh':
    from dmlc_tracker import ssh
    ssh.submit(args)
elif args.cluster == 'mpi':
    from dmlc_tracker import mpi
    mpi.submit(args)
else:
    raise RuntimeError('Unknown submission cluster type %s' % args.cluster)

Does it mean I need to add a hosts file when I use sge, yarn, ssh, or mpi?
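[Editor's note] Reading the dispatch as quoted (with the apparently missing `or` restored), the answer would be yes: without a host file the first branch wins and the job falls back to the local tracker regardless of --launcher. A minimal sketch that exercises that logic with a stub args object (hypothetical Args class; backend names returned instead of submitting):

```python
class Args(object):
    def __init__(self, cluster, host_file=None):
        self.cluster = cluster
        self.host_file = host_file

def pick_backend(args):
    # Mirrors the quoted dispatch: a missing host file falls back to the
    # local tracker; otherwise the cluster type selects the backend.
    if args.cluster == 'local' or args.host_file is None or args.host_file == 'None':
        return 'local'
    elif args.cluster == 'sge':
        return 'sge'
    elif args.cluster == 'yarn':
        return 'yarn'
    elif args.cluster == 'ssh':
        return 'ssh'
    elif args.cluster == 'mpi':
        return 'mpi'
    else:
        raise RuntimeError('Unknown submission cluster type %s' % args.cluster)

print(pick_backend(Args('yarn')))               # prints "local" (no host file)
print(pick_backend(Args('yarn', 'hosts.txt')))  # prints "yarn"
```

If that fallback is what is happening here, it would also explain the DMLC_JOB_CLUSTER assertion in the first traceback.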

@pono pono unassigned mli Jul 28, 2017

szha commented Oct 30, 2017

This issue is closed due to lack of activity in the last 90 days. Feel free to ping me to reopen if this is still an active issue. Thanks!
Also, do please check out our forum (and Chinese version) for general "how-to" questions.

@szha szha closed this as completed Oct 30, 2017
@everwind

I'm hitting the same error. How do I fix it?

@szha szha reopened this Jan 24, 2018

szha commented Jan 24, 2018

@zhangshiyu01 were you able to resolve the issue?

@lanking520

@zhangshiyu01 @everwind I know it's kind of late, but did you resolve the issues?

@PayneJoe

This error still happens. Did you ever get it resolved? @lanking520 @everwind @zhangshiyu01
