DLrover support ps failure #392

Merged
test worker failure
hxdtest committed May 8, 2023
commit f2a7af003eaa0c0da251326fd33c3d12430e9682
1 change: 1 addition & 0 deletions dlrover/examples/deepctr_auto_scale_job.yaml
@@ -64,4 +64,5 @@ spec:
restartPolicy: Never
containers:
- name: main
image: registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:test
imagePullPolicy: Always
15 changes: 6 additions & 9 deletions dlrover/examples/deepctr_manual_scale_job.yaml
@@ -11,7 +11,7 @@ spec:
   replicaSpecs:
     ps:
       autoScale: False
-      replicas: 3
+      replicas: 2
       template:
         spec:
           restartPolicy: Never
@@ -30,8 +30,7 @@ spec:
               command:
                 - /bin/bash
                 - -c
-                - "pip install pyhocon \
-                  && cd /home/model_zoo/tf_estimator/criteo_deeprec \
+                - "cd /home/model_zoo/tf_estimator/criteo_deeprec \
                   && python -m dlrover.trainer.entry.local_entry \
                   --platform=Kubernetes --conf=train_conf.TrainConf \
                   --enable_auto_scaling=True"
@@ -44,7 +43,7 @@ spec:
                 claimName: pvc-nas
     worker:
       autoScale: False
-      replicas: 3
+      replicas: 2
       template:
         spec:
           restartPolicy: Never
@@ -63,8 +62,7 @@ spec:
               command:
                 - /bin/bash
                 - -c
-                - "pip install pyhocon \
-                  && cd /home/model_zoo/tf_estimator/criteo_deeprec \
+                - "cd /home/model_zoo/tf_estimator/criteo_deeprec \
                   && python -m dlrover.trainer.entry.local_entry \
                   --platform=Kubernetes --conf=train_conf.TrainConf \
                   --enable_auto_scaling=True"
@@ -81,6 +79,5 @@ spec:
           restartPolicy: Never
           containers:
             - name: main
-              imagePullPolicy: Always
-              # yamllint disable-line rule:line-length
-              image: registry.cn-hangzhou.aliyuncs.com/dlrover_deeprec/deeprec:v11
+              image: registry.cn-hangzhou.aliyuncs.com/intell-ai/dlrover:test
+              imagePullPolicy: Always
8 changes: 4 additions & 4 deletions dlrover/examples/scale_plan.yaml
@@ -1,12 +1,12 @@
 apiVersion: elastic.iml.github.io/v1alpha1
 kind: ScalePlan
 metadata:
-  name: deepctr-auto-scaling-job-i22
+  name: deepctr-auto-scaling-job-01
   labels:
     elasticjob-name: deepctr-auto-scale
     scale-type: manual
 spec:
-  ownerJob: deepctr-auto-scaling-job
+  ownerJob: deepctr-auto-scale
   replicaResourceSpecs:
-    worker:
-      replicas: 4
+    ps:
+      replicas: 3
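
This hunk changes `ownerJob` to the same value as the `elasticjob-name` label. Below is a small sketch of a pre-apply check for that consistency, assuming the two fields are meant to agree (the diff suggests this but does not state it). `check_scale_plan` is a hypothetical helper, not a DLRover API, and the manifests are written as plain dicts, trimmed to the fields the check reads, to avoid a YAML dependency.

```python
def check_scale_plan(plan):
    """Return a list of problems found in a ScalePlan-like dict.

    Assumption: spec.ownerJob should equal the elasticjob-name label.
    """
    problems = []
    spec = plan.get("spec", {})
    label = plan.get("metadata", {}).get("labels", {}).get("elasticjob-name")
    owner = spec.get("ownerJob")
    if owner != label:
        problems.append(
            "ownerJob {!r} does not match elasticjob-name {!r}".format(
                owner, label
            )
        )
    # Every replica count must be a non-negative integer.
    for role, res in spec.get("replicaResourceSpecs", {}).items():
        replicas = res.get("replicas")
        if not isinstance(replicas, int) or replicas < 0:
            problems.append("{}: invalid replicas {!r}".format(role, replicas))
    return problems


# Values after this commit.
new_plan = {
    "metadata": {
        "name": "deepctr-auto-scaling-job-01",
        "labels": {
            "elasticjob-name": "deepctr-auto-scale",
            "scale-type": "manual",
        },
    },
    "spec": {
        "ownerJob": "deepctr-auto-scale",
        "replicaResourceSpecs": {"ps": {"replicas": 3}},
    },
}

# Values before this commit: ownerJob and the label disagree.
old_plan = {
    "metadata": {"labels": {"elasticjob-name": "deepctr-auto-scale"}},
    "spec": {"ownerJob": "deepctr-auto-scaling-job"},
}
```

`check_scale_plan(new_plan)` returns an empty list, while `check_scale_plan(old_plan)` reports the `ownerJob` mismatch.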
7 changes: 3 additions & 4 deletions dlrover/trainer/constants/tf_constants.py
@@ -73,7 +73,6 @@ class TFConstants(object):
     KeepCheckpointMax = Constant("keep_checkpoint_max", 5)
     DataShardClient = Constant("data_shard_client", None)
     ExitRecoverableSession = Constant("exit_recoverable_session", None)
-    DataShardCheckpoint = Constant("data_shard_checkpoint", "data_shard_checkpoint.json")
-
-
-
+    DataShardCheckpoint = Constant(
+        "data_shard_checkpoint", "data_shard_checkpoint.json"
+    )
@@ -209,6 +209,7 @@ def _prepare_estimator_config_and_params(self):
     def _prepare_train_dataset(self):
         """prepare_train_dataset"""
         train_set = self._task_conf.get(TFConstants.TrainSet.name)
+        logger.info("Prepare training dataset with {}".format(train_set))
         self.train_dataset = DatasetUtil.create(train_set)

     def _prepare_eval_dataset(self):
@@ -74,6 +74,7 @@ def _start_failover_monitor(self):
         def monitor_fun():
             logger.info("Successfully to start failover monitor!")
             while True:
+                logger.info("querying master for ps cluster info")
                 ps_address_changed, change_type = self.ps_addresses_changed()
                 if ps_address_changed:
                     self.refresh_env()
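
The loop in this hunk polls for PS address changes and refreshes the environment when one is detected. Below is a minimal sketch of that polling pattern, assuming a caller-supplied `fetch_addresses` callable in place of the master query. `poll_for_changes`, `on_change`, and the `max_polls` bound (added so the example terminates) are hypothetical names, not DLRover APIs.

```python
import time


def poll_for_changes(fetch_addresses, on_change, interval=0.0, max_polls=5):
    """Call on_change(old, new) whenever the fetched address list changes."""
    current = list(fetch_addresses())
    for _ in range(max_polls):
        time.sleep(interval)
        latest = list(fetch_addresses())
        if latest != current:
            on_change(current, latest)
            current = latest
    return current


def demo():
    # Simulated master responses: ps-1 is replaced by ps-2 on the third query.
    responses = iter([
        ["ps-0:2222", "ps-1:2222"],
        ["ps-0:2222", "ps-1:2222"],
        ["ps-0:2222", "ps-2:2222"],
        ["ps-0:2222", "ps-2:2222"],
    ])
    events = []
    final = poll_for_changes(
        fetch_addresses=lambda: next(responses),
        on_change=lambda old, new: events.append((old, new)),
        max_polls=3,
    )
    return final, events
```

In this sketch the callback fires once, when the simulated address list changes, which mirrors the single `refresh_env()` call per detected change in the hunk.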
15 changes: 9 additions & 6 deletions dlrover/trainer/tensorflow/reader/file_reader.py
@@ -17,20 +17,23 @@

 class FileReader(ElasticReader):
     def __init__(self, path=None, skip_header=True):
+        print("FileReader is initiating path is {}".format(path))
         self._skip_header = skip_header
         self._file_handler = open(path, "r")
+        self.data = self._file_handler.readlines()
         self._file_name = path
         super().__init__(
             path=path,
         )
+        self._data_nums = None

     def count_data(self):
-        self.data = self._file_handler.readlines()
-        if self._skip_header:
-            self._data_nums = len(self.data) - 1
-            self.data = self.data[1:]
-        else:
-            self._data_nums = len(self.data)
+        if self._data_nums is None:
+            if self._skip_header:
+                self._data_nums = len(self.data) - 1
+                self.data = self.data[1:]
+            else:
+                self._data_nums = len(self.data)

     def read_data_by_index_range(self, start_index, end_index):
         for i in range(start_index, end_index):
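
Below is a minimal, self-contained sketch of the behaviour this hunk moves toward: the lines are read once in the constructor, and `count_data` strips the header and counts records only on its first call, so repeated calls are idempotent. `SimpleFileReader` is a hypothetical stand-in. It does not inherit from `ElasticReader` (whose interface is not shown in this diff), and it adds a return value and closes the file handle, which the code in the diff does not do.

```python
import os
import tempfile


class SimpleFileReader:
    """Hypothetical stand-in for FileReader; not the DLRover class."""

    def __init__(self, path, skip_header=True):
        self._skip_header = skip_header
        # Read all lines once, up front, and close the handle.
        with open(path, "r") as f:
            self.data = f.readlines()
        self._data_nums = None  # computed lazily by count_data()

    def count_data(self):
        # Only the first call strips the header and counts records.
        if self._data_nums is None:
            if self._skip_header:
                self.data = self.data[1:]
            self._data_nums = len(self.data)
        return self._data_nums

    def read_data_by_index_range(self, start_index, end_index):
        for i in range(start_index, end_index):
            yield self.data[i]


def demo():
    # Write a small CSV with one header row and three records.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f:
        f.write("id,label\n1,0\n2,1\n3,0\n")
    try:
        reader = SimpleFileReader(path)
        first = reader.count_data()
        second = reader.count_data()  # no-op: the count is cached
        rows = list(reader.read_data_by_index_range(0, 2))
        return first, second, rows
    finally:
        os.remove(path)
```

For comparison, calling `readlines()` a second time on an already exhausted file handle returns an empty list, which is why the pre-change `count_data` could not safely be called twice.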