update dev-weight-sharing to latest master #391

leckie-chn · 2018-11-23T07:16:19Z

will rebase leckie-chn/nni: dev-weight-sharing to Microsoft/nni: dev-weight-sharing later

* Exp stop refactor (#161) * Update RemoteMachineMode.md (#63) * Remove unused classes for SQuAD QA example. * Remove more unused functions for SQuAD QA example. * Fix default dataset config. * Add Makefile README (#64) * update document (#92) * Edit readme.md * updated a word * Update GetStarted.md * Update GetStarted.md * refact readme, getstarted and write your trial md. * Update README.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Fix nnictl bugs and add new feature (#75) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * remove Buffer warning (#100) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * Add support for debugging mode * fix setup.py (#115) * Add DAG model configuration format for SQuAD example. * Explain config format for SQuAD QA model. * Add more detailed introduction about the evolution algorithm. 
* Fix install.sh add add trial log path (#109) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * show trial log path * update document * fix install.sh * set default vallue for maxTrialNum and maxExecDuration * fix nnictl * Dev smac (#116) * support package install (#91) * fix nnictl bug * support package install * update * update package install logic * Fix package install issue (#95) * fix nnictl bug * fix pakcage install * support SMAC as a tuner on nni (#81) * update doc * update doc * update doc * update hyperopt installation * update doc * update doc * update description in setup.py * update setup.py * modify encoding * encoding * add encoding * remove pymc3 * update doc * update builtin tuner spec * support smac in sdk, fix logging issue * support smac tuner * add optimize_mode * update config in nnictl * add __init__.py * update smac * update import path * update setup.py: remove entry_point * update rest server validation * fix bug in nnictl launcher * support classArgs: optimize_mode * quick fix bug * test travis * add dependency * add dependency * add dependency * add dependency * create smac python package * fix trivial points * optimize import of tuners, modify nnictl accordingly * fix bug: incorrect algorithm_name * trivial refactor * for debug * support virtual * update doc of SMAC * update smac requirements * update requirements * change debug mode * update doc * update doc * refactor based on comments * fix comments * modify example config path to relative path and increase maxTrialNum (#94) * modify example config path to relative path and increase maxTrialNum * add document * support conda (#90) (#110) * support install from venv and travis CI * support install from venv and travis CI * support install from venv and travis CI * support conda * 
support conda * modify example config path to relative path and increase maxTrialNum * undo messy commit * undo messy commit * Support pip install as root (#77) * Typo on #58 (#122) * PAI Training Service implementation (#128) * PAI Training service implementation **1. Implement PAITrainingService **2. Add trial-keeper python module, and modify setup.py to install the module **3. Add PAItrainingService rest server to collect metrics from PAI container. * fix datastore for multiple final result (#129) * Update NNI v0.2 release notes (#132) Update NNI v0.2 release notes * Update setup.py Makefile and documents (#130) * update makefile and setup.py * update makefile and setup.py * update document * update document * Update Makefile no travis * update doc * update doc * fix convert from ss to pcs (#133) * Fix bugs about webui (#131) * Fix webui bugs * Fix tslint * webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo 
based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d17483. * Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc * Merge branch V0.2 to Master (#143) * webui logpath and document (#135) * Add webui document and logpath as a href * fix tslint * fix comments by Chengmin * Pai training service bug fix and enhancement (#136) * Add NNI installation scripts * Update pai script, update NNI_out_dir * Update NNI dir in nni sdk local.py * Create .nni folder in nni sdk local.py * Add check before creating .nni folder * Fix typo for PAI_INSTALL_NNI_SHELL_FORMAT * Improve annotation (#138) * Improve annotation * Minor bugfix * Selectively install through pip (#139) Selectively install through pip * update setup.py * fix paiTrainingService bugs (#137) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * Add documentation for NNI PAI mode experiment (#141) * Add documentation for NNI PAI mode * Fix typo based on PR comments * Exit with subprocess return code of trial keeper * Remove additional exit code * Fix typo based on PR comments * update doc for smac tuner (#140) * Revert "Selectively install through pip (#139)" due to potential pip install issue (#142) * Revert "Selectively install through pip (#139)" This reverts commit 1d17483. 
* Add exit code of subprocess for trial_keeper * Update README, add link to PAImode doc * fix bug (#147) * Refactor nnictl and add config_pai.yml (#144) * fix nnictl bug * add hdfs host validation * fix bugs * fix dockerfile * fix install.sh * update install.sh * fix dockerfile * Set timeout for HDFSUtility exists function * remove unused TODO * fix sdk * add optional for outputDir and dataDir * refactor dockerfile.base * Remove unused import in hdfsclientUtility * add config_pai.yml * refactor nnictl create logic and add colorful print * fix nnictl stop logic * add annotation for config_pai.yml * add document for start experiment * fix config.yml * fix document * Fix trial keeper wrongly exit issue (#152) * Fix trial keeper bug, use actual exitcode to exit rather than 1 * Fix bug of table sort (#145) * Update doc for PAIMode and v0.2 release notes (#153) * Update v0.2 documentation regards to release note and PAI training service * Update document to describe NNI docker image * fix antd (#159) * refactor experiment stopping logic * support change concurrency * remove trialJobs.ts * trivial changes * fix bugs * fix bug * support updating maxTrialNum * Modify IT scripts for supporting multiple experiments * Update ci (#175) * Update RemoteMachineMode.md (#63) * Remove unused classes for SQuAD QA example. * Remove more unused functions for SQuAD QA example. * Fix default dataset config. * Add Makefile README (#64) * update document (#92) * Edit readme.md * updated a word * Update GetStarted.md * Update GetStarted.md * refact readme, getstarted and write your trial md. 
* Update README.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Update WriteYourTrial.md * Fix nnictl bugs and add new feature (#75) * fix nnictl bug * fix nnictl create bug * add experiment status logic * add more information for nnictl * fix Evolution Tuner bug * refactor code * fix code in updater.py * fix nnictl --help * fix classArgs bug * update check response.status_code logic * remove Buffer warning (#100) * update readme in ga_squad * update readme * fix typo * Update README.md * Update README.md * Update README.md * Add support for debugging mode * modify CI cuz of refracting exp stop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * update CI for expstop * file saving * fix issues from code merge * remove $(INSTALL_PREFIX)/nni/nni_manager before install * fix indent * fix merge issue * socket close * update port * fix merge error * modify ci logic in nnimanager * fix ci * fix bug * change suspended to done * update ci (#229) * update ci * update ci * update ci (#232) * update ci * update ci * update azure-pipelines * update azure-pipelines * update ci (#233) * update ci * update ci * update azure-pipelines * update azure-pipelines * update azure-pipelines * run.py (#238) * Nnupdate ci (#239) * run.py * test ci * Nnupdate ci (#240) * run.py * test ci * test ci * Udci (#241) * run.py * test ci * test ci * test ci * update ci (#242) * run.py * test ci * test ci * test ci * update ci * revert install.sh (#244) * run.py * test ci * test ci * test ci * update ci * revert install.sh * add comments * remove assert * trivial change * trivial change

* add pycharm project files to .gitignore list * update pylintrc to conform vscode settings * fix RemoteMachineMode for wrong trainingServicePlatform * add python cache files to gitignore list * move extract scalar reward logic from dispatcher to tuner * update tuner code corresponding to last commit * update doc for receive_trial_result api change * add numpy to package whitelist of pylint * distinguish param value from return reward for tuner.extract_scalar_reward * update pylintrc * add comments to dispatcher.handle_report_metric_data * refactor extract reward from dict by tuner
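As a reading aid for the reward-extraction items above, here is a minimal Python sketch of what pulling a scalar reward out of a reported metric might look like once that logic lives on the tuner side. The `"default"` key, the `scalar_key` parameter, and the error behaviour are assumptions made for illustration; the commit message does not spell them out, and this is not the code from the commit.

```python
from numbers import Real


def extract_scalar_reward(value, scalar_key="default"):
    """Return a single float reward from a reported trial result.

    Assumption: a result is either a plain number or a dict that stores
    the reward under `scalar_key` (hypothetical default: "default").
    """
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, Real) and not isinstance(value, bool):
        return float(value)
    if isinstance(value, dict) and scalar_key in value:
        reward = value[scalar_key]
        if isinstance(reward, Real) and not isinstance(reward, bool):
            return float(reward)
    raise ValueError(
        "cannot extract a scalar reward from %r (expected a number or a "
        "dict with a numeric %r entry)" % (value, scalar_key)
    )
```

Keeping this in one helper means a tuner can accept either a bare number or a richer dict of metrics without the dispatcher needing to know which one a trial reported.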

* update local demo doc and configuration * change folder name * Update tutorial_1_CR_exp_local_api.md no need to have a new training file * Delete mnist_gpu.py no need to have a new training file * Update config_gpu.yml no need to have a new training file

* update local demo doc and configuration * change folder name * Update tutorial_1_CR_exp_local_api.md no need to have a new training file * Delete mnist_gpu.py no need to have a new training file * Update config_gpu.yml no need to have a new training file * add PyTorch to Dockerfile

* add gridsearch tuner * add gridsearchtuner * add gridsearchtuner * add gridsearchtuner * update gridsearch tuner * update gridsearch tuner * update gridsearch tuner * update gridsearch tuner * update gridsearch tuner * update gridsearch tuner * update gridsearch tuner * update gridsearch and pylint

…branch (#382) * Kubeflow TrainingService support, v1 (#373)
1. Create new Training Service: Kubeflow training service, which uses 'kubectl' and the Kubeflow tfjobs CRD to submit and manage jobs
2. Update nni python SDK to support the new Kubeflow platform
3. Update nni python SDK's get_sequence_id() implementation to read the NNI_TRIAL_SEQ_ID env variable instead of reading the .nni/sequence_id file
4. This version only supports the TensorFlow operator. Support for more operators will be added in future versions
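To illustrate item 3, the following is a hedged Python sketch of reading the trial sequence ID from the `NNI_TRIAL_SEQ_ID` environment variable rather than from a file. Only the variable name comes from the commit message; the integer return type, the `environ` parameter, and the error handling are assumptions, and the SDK's real implementation may differ.

```python
import os


def get_sequence_id(environ=None):
    """Return the trial sequence ID taken from NNI_TRIAL_SEQ_ID.

    `environ` is a hypothetical injection point for testing; by default
    the process environment is used.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NNI_TRIAL_SEQ_ID")
    if raw is None:
        raise RuntimeError("NNI_TRIAL_SEQ_ID is not set for this trial")
    # Assumption: the ID is a decimal integer; int() raises ValueError otherwise.
    return int(raw)
```

An environment variable has the practical advantage that the training service can set it per container or per process, with no shared `.nni` directory required.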

* fix sdk's unittest and add medianstop, batchtuner to ci * fix sdk's unittest and add medianstop, batchtuner to ci * remove debug info * update azure-pipelines * remove useless code * add some checks * fix pylint * update ci test * update ci

…support distributed training (#387) * Support distributed training on tf-operator, for worker and ps * Update validation rule for kubeflow config * small code refactor adjustment for private methods * Use different output folder for ps and worker

SparkSnail and others added 30 commits October 18, 2018 16:17

Quick fix bug: nnictl port value error (#245)

b183c3d

* fix port bug

update Makefile (#246)

d058a84

* update Makefile * update Makefile

quick fix for ci (#248)

30e2352

add update trialNum and fix bugs (#261)

ac5fda4

Add builtin tuner to CI (#247)

8d866b5

* update Makefile * update Makefile * add builtin-tuner test * add builtin-tuner test * refactor ci * update azure.yml * add built-in tuner test * fix bugs

Doc refactor (#258)

32478a1

* doc refactor * image name refactor

Refactor nnictl to support listing stopped experiments. (#256)

ebbadfe

Refactor nnictl to support listing stopped experiments.

Show experiment parameters more beautifully (#262)

a26c46b

fix error on example of RemoteMachineMode (#269)

82aa37b

* add pycharm project files to .gitignore list * update pylintrc to conform vscode settings * fix RemoteMachineMode for wrong trainingServicePlatform

Update docker file to use latest nni release (#263)

60ad940

fix bug about execDuration and endTime (#270)

8c93890

* fix bug about execDuration and endTime * modify time interval to 30 seconds * refactor based on Gems's suggestion * for triggering ci

Refactor dockerfile (#264)

ad8afc5

* refactor Dockerfile

Support nnictl tensorboard (#268)

a101461

support tensorboard

Sdk update (#272)

0c17e2d

* Rename get_parameters to get_next_parameter * annotations add get_next_parameter * updates * updates * updates * updates * updates

add experiment log path to experiment profile (#276)

da21bf2

Add sequenceId to TrialJobInfo (#283)

b07309d

Show error information and fix paramiko installation (#282)

67453d1

* fix paramiko install

Refactor pip installation logic for supporting uninstall

4a54e11

Update documents due to new pip installation approach

684fc31

Refactor Makefile for consistent with pip installation approach

17d8566

Add README for building and uploading NNI package

02744b5

Fix issues for pip installation

52aa66a

Minor fix on #41 (#280)

ee390c0

Typo on #12 (#281)

b702854

* plus minor proposals

Quick fix resume logic (#285)

44ab774

Tgs salt example (#286)

e240c7a

* TGS salt example * updates * updates

Hide install via pip prompt, since 0.3 has not been published (#287)

2ce0083

Update README.md (#288)

ff6a7df

added License badge

noklam and others added 26 commits November 12, 2018 13:39

Update nnictl.py (#347)

48b91c4

* Update nnictl.py * modify help message for nnictl stop

update doc for docker image (#353)

35e0832

* update doc for docker image * update

[PAI training service] Support running multiple PAI experiment (#348)

b1d4c12

* Change base image from devel to runtime to reduce Docker image size * Support running multiple experiments for PAI * Fix a recursive-reference bug between paiRestServer and paiTrainingService

update makefile (#350)

b345da0

* update makefile * update launcher.py to fix the problem of finding main.js * remove duplicated lib

Add Pytorch and set sklearn version in Dockerfile (#346)

e390125

1. Set scikit-learn==0.20.0 in Dockerfile
2. Update readme.md of Dockerfile
3. Add PyTorch 0.4.1
4. Add description for 'nnictl stop all'

Quick fix Docker (#363)

183763e

Remove "RUN python3 -m pip --no-cache-dir install torch torchvision"

Updated document for "write a trial" related fixes. (#351)

9380e68

- Updated document for "write a trial" related fixes per Quanlu's feedback; - Fix wrong links in Get started per Meng's feedback.

Fix the issue#211: WebUI does not support search for a specific Trial (#355)

5b24f04

* Fix the issue#211: WebUI does not support search for a specific Trial * delete unused code * Update * default 20

add more details for remote mode docs (#366)

9f62bf6

add more details for remote mode docs (#365)

e45db62

update tutorial for remote machine as well (#367)

1e9cd5f

Support hyper-band (#358)

f253576

Fix nni stop (#368)

8f71617

Fix "nnictl stop"

Add more tooltips in default metric graph (#370)

9cc234b

* Add more tooltip in default metric graph and fix bug * update

Update README.md (#371)

b749266

Add Gitter badge (#376)

76277db

Show intermediate result (#384)

8d63b10

Asynchronous dispatcher (#372)

a5d614d

* Asynchronous dispatcher * updates * updates * updates * updates

add gpuNum check for local TS (#378)

1df750e

* add gpuNum check for local TS * set CUDA_VISIBLE_DEVICES to empty string when gpuNum is 0 * remove redundant code
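The gpuNum handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from the commit: `build_trial_env`, `assigned_gpus`, and `base_env` are hypothetical names, and the commit's actual code is not shown here. Setting `CUDA_VISIBLE_DEVICES` to an empty string hides every GPU from CUDA, so a trial that requests zero GPUs cannot occupy one by accident.

```python
import os


def build_trial_env(gpu_num, assigned_gpus=(), base_env=None):
    """Return a copy of the environment for launching one trial.

    gpu_num       -- number of GPUs the trial requested (gpuNum)
    assigned_gpus -- GPU indices a scheduler picked (hypothetical input)
    base_env      -- environment to copy; defaults to os.environ
    """
    if gpu_num < 0:
        raise ValueError("gpuNum must be non-negative")
    env = dict(os.environ if base_env is None else base_env)
    if gpu_num == 0:
        # An empty string makes all GPUs invisible to CUDA.
        env["CUDA_VISIBLE_DEVICES"] = ""
    else:
        if len(assigned_gpus) < gpu_num:
            raise ValueError("not enough GPUs assigned for gpuNum")
        env["CUDA_VISIBLE_DEVICES"] = ",".join(
            str(index) for index in assigned_gpus[:gpu_num]
        )
    return env
```

The function copies the environment instead of mutating it, so concurrent trials with different gpuNum values do not affect each other.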

[Kubeflow Training Service] Explicitly set cuda_visible_devices env var (#388)

28e26ae

* Use different output folder for ps and worker * Add cuda_visible_devices env var if gpuNum is 0

yds05 requested review from yds05, scarlett2018 and SparkSnail November 23, 2018 07:17

yds05 merged commit 43d2dbd into dev-weight-sharing Nov 23, 2018
