# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased] - YYYY-MM-DD

### Added

### Changed

### Deprecated

### Removed

### Fixed


## [v0.28.1] - 2023-06-12

### Fixed

- Fixed build with gcc 12. ([#3925](https://github.com/horovod/horovod/pull/3925))
- PyTorch: Fixed build on ROCm. ([#3928](https://github.com/horovod/horovod/pull/3928))
- TensorFlow: Fixed local_rank_op. ([#3940](https://github.com/horovod/horovod/pull/3940))


## [v0.28.0] - 2023-05-10

### Added

- TensorFlow: Added new `get_local_and_global_gradients` to PartialDistributedGradientTape to retrieve local and non-local gradients separately. ([#3859](https://github.com/horovod/horovod/pull/3859))

### Changed

- Improved reducescatter performance by allocating output tensors before enqueuing the operation. ([#3824](https://github.com/horovod/horovod/pull/3824))
- TensorFlow: Ensured that `tf.logical_and` within allreduce `tf.cond` runs on CPU. ([#3885](https://github.com/horovod/horovod/pull/3885))
- TensorFlow: Added support for Keras 2.11+ optimizers. ([#3860](https://github.com/horovod/horovod/pull/3860))
- `CUDA_VISIBLE_DEVICES` environment variable is no longer passed to remote nodes. ([#3865](https://github.com/horovod/horovod/pull/3865))

### Fixed

- Fixed build with ROCm. ([#3839](https://github.com/horovod/horovod/pull/3839), [#3848](https://github.com/horovod/horovod/pull/3848))
- Fixed build of Docker image horovod-nvtabular. ([#3851](https://github.com/horovod/horovod/pull/3851))
- Fixed linking recent NCCL by defaulting CUDA runtime library linkage to static and ensuring that weak symbols are overridden. ([#3867](https://github.com/horovod/horovod/pull/3867), [#3846](https://github.com/horovod/horovod/pull/3846))
- Fixed compatibility with TensorFlow 2.12 and recent nightly versions. ([#3864](https://github.com/horovod/horovod/pull/3864), [#3894](https://github.com/horovod/horovod/pull/3894), [#3906](https://github.com/horovod/horovod/pull/3906), [#3907](https://github.com/horovod/horovod/pull/3907))
- Fixed missing arguments of Keras allreduce function. ([#3905](https://github.com/horovod/horovod/pull/3905))
- Updated with_device functions in MXNet and PyTorch to skip unnecessary cudaSetDevice calls. ([#3912](https://github.com/horovod/horovod/pull/3912))


## [v0.27.0] - 2023-02-01

### Added

- Keras: Added `PartialDistributedOptimizer` API. ([#3738](https://github.com/horovod/horovod/pull/3738))
- Added `HOROVOD_SPARK_USE_LOCAL_RANK_GPU_INDEX` environment variable to ignore GPU device indices assigned by Spark and always use local rank GPU device in Spark estimators. ([#3737](https://github.com/horovod/horovod/pull/3737))
- Added support for reducescatter arguments `prescale_factor` and `postscale_factor` and moved averaging into Horovod backend. ([#3815](https://github.com/horovod/horovod/pull/3815))
- Spark Estimator: Added support for custom data loaders in TorchEstimator. ([#3787](https://github.com/horovod/horovod/pull/3787))
- Spark Estimator: Added NVTabular data loader for TorchEstimator. ([#3787](https://github.com/horovod/horovod/pull/3787))

### Changed

- Improved NCCL performance for fused allgather operations through padding for better memory alignment. ([#3727](https://github.com/horovod/horovod/pull/3727))
- Improved look-ahead tensor fusion buffer size estimates when allgather and other operations are mixed. ([#3727](https://github.com/horovod/horovod/pull/3727))

### Fixed

- ROCm: Fixed GPU MPI operations support in build. ([#3746](https://github.com/horovod/horovod/pull/3746))
- PyTorch: Fixed linking order to avoid using Gloo from PyTorch dynamic libraries. ([#3750](https://github.com/horovod/horovod/pull/3750))
- Fixed memory leak in `MPI_GPUAllgather`. ([#3727](https://github.com/horovod/horovod/pull/3727))
- TensorFlow: Fixed deprecation warnings when building with TensorFlow 2.11. ([#3767](https://github.com/horovod/horovod/pull/3767))
- Keras: Added support for additional arguments to `SyncBatchNormalization._moments()`. ([#3775](https://github.com/horovod/horovod/pull/3775))
- Fixed version number parsing with pypa/packaging 22.0. ([#3794](https://github.com/horovod/horovod/pull/3794))
- TensorFlow: Fixed linking with nightly versions leading up to TensorFlow 2.12. ([#3755](https://github.com/horovod/horovod/pull/3755))
- TensorFlow: Fixed handling of `tf.IndexedSlices` types when scaling local gradients. ([#3786](https://github.com/horovod/horovod/pull/3786))
- Added missing `MEMCPY_IN_FUSION_BUFFER` timeline event for reducescatter. ([#3808](https://github.com/horovod/horovod/pull/3808))
- Fixed build of Docker image horovod-nvtabular. ([#3817](https://github.com/horovod/horovod/pull/3817))
- TensorFlow: Several fixes for allreduce and grouped allreduce handling of `tf.IndexedSlices`. ([#3813](https://github.com/horovod/horovod/pull/3813))
- Spark: Restricted PyArrow to versions < 11.0. ([#3830](https://github.com/horovod/horovod/pull/3830))
- TensorFlow: Resolved conflicts between multiple optimizer wrappers reusing the same gradient accumulation counter. ([#3783](https://github.com/horovod/horovod/pull/3783))
- TensorFlow/Keras: Fixed `DistributedOptimizer` with Keras 2.11+. ([#3822](https://github.com/horovod/horovod/pull/3822))
- PyTorch, ROCm: Fixed allreduce average on process sets. ([#3815](https://github.com/horovod/horovod/pull/3815))

## [v0.26.1] - 2022-10-14

### Fixed

- Fixed packaging import during install to occur after install_requires. ([#3741](https://github.com/horovod/horovod/pull/3741))

## [v0.26.0] - 2022-10-13

### Added

- Spark Estimator: Added support for custom data loaders in KerasEstimator. ([#3603](https://github.com/horovod/horovod/pull/3603))
- Spark Estimator: Added NVTabular data loader for KerasEstimator. ([#3603](https://github.com/horovod/horovod/pull/3603))
- Spark Estimator: Added gradient accumulation support to Spark torch estimator. ([#3681](https://github.com/horovod/horovod/pull/3681))
- TensorFlow: Added `register_local_var` functionality to distributed optimizers and local gradient aggregators. ([#3695](https://github.com/horovod/horovod/pull/3695))
- TensorFlow: Added support for local variables for `BroadcastGlobalVariablesCallback`. ([#3703](https://github.com/horovod/horovod/pull/3703))
- Enabled use of native `ncclAvg` op for NCCL allreduces. ([#3646](https://github.com/horovod/horovod/pull/3646))
- Added support for additional reduction operations for `allreduce` (min, max, product). ([#3660](https://github.com/horovod/horovod/pull/3660))
- Added 2D torus `allreduce` using NCCL. ([#3608](https://github.com/horovod/horovod/pull/3608))
- Added support for Petastorm reader level parallel shuffling. ([#3665](https://github.com/horovod/horovod/pull/3665))
- Added random seed support for Lightning datamodule to generate reproducible data loading outputs. ([#3665](https://github.com/horovod/horovod/pull/3665))
- Added support for `int8` and `uint8` `allreduce` and `grouped_allreduce` in TensorFlow. ([#3649](https://github.com/horovod/horovod/pull/3649))
- Added support for batched memory copies in `GPUAllgather`. ([#3590](https://github.com/horovod/horovod/pull/3590))
- Added support for batched memory copies in `GPUReducescatter`. ([#3621](https://github.com/horovod/horovod/pull/3621))
- Added `hvd.grouped_allgather()` and `hvd.grouped_reducescatter()` operations. ([#3594](https://github.com/horovod/horovod/pull/3594))
- Added warning messages if output tensor memory allocations fail. ([#3594](https://github.com/horovod/horovod/pull/3594))
- Added `register_local_source` and `use_generic_names` functionality to `DistributedGradientTape`. ([#3628](https://github.com/horovod/horovod/pull/3628))
- Added `PartialDistributedGradientTape()` API for model parallel use cases. ([#3643](https://github.com/horovod/horovod/pull/3643))
- Spark/Lightning: Added `reader_worker_count` and `reader_pool_type`. ([#3612](https://github.com/horovod/horovod/pull/3612))
- Spark/Lightning: Added `transformation_edit_fields` and `transformation_removed_fields` param for `EstimatorParams`. ([#3651](https://github.com/horovod/horovod/pull/3651))
- TensorFlow: Added doc string for `hvd.grouped_allreduce()`. ([#3594](https://github.com/horovod/horovod/pull/3594))
- ROCm: Enabled `alltoall`. ([#3654](https://github.com/horovod/horovod/pull/3654))

### Changed

- Default Petastorm reader pool is changed from `process` to `thread` for lower memory usage. ([#3665](https://github.com/horovod/horovod/pull/3665))
- Keras: Support only legacy optimizers in Keras 2.11+. ([#3725](https://github.com/horovod/horovod/pull/3725))
- Gloo: When negotiating, use `gather` rather than `allgather`. ([#3633](https://github.com/horovod/horovod/pull/3633))
- Use `packaging.version` instead of `distutils` version classes. ([#3700](https://github.com/horovod/horovod/pull/3700))

### Deprecated

- Deprecated field `shuffle_buffer_size` from `EstimatorParams`. Use `shuffle` to enable or disable shuffling. ([#3665](https://github.com/horovod/horovod/pull/3665))

### Removed

- Build: Removed std::regex use for better cxxabi11 compatibility. ([#3584](https://github.com/horovod/horovod/pull/3584))

### Fixed

- TensorFlow: Fixed the optimizer iteration increments when `backward_passes_per_step > 1`. ([#3631](https://github.com/horovod/horovod/pull/3631))
- Fixed `FuseResponses()` on `BATCHED_D2D_PADDING` edge cases for Reducescatter and/or ROCm. ([#3621](https://github.com/horovod/horovod/pull/3621))
- PyTorch: Fixed Reducescatter functions to raise `HorovodInternalError` rather than `RuntimeError`. ([#3594](https://github.com/horovod/horovod/pull/3594))
- PyTorch on GPUs without GPU operations: Fixed grouped allreduce to set CPU device in tensor table. ([#3594](https://github.com/horovod/horovod/pull/3594))
- Fixed race condition in PyTorch allocation handling. ([#3639](https://github.com/horovod/horovod/pull/3639))
- Build: Fixed finding `nvcc` (if not in `$PATH`) with older versions of CMake. ([#3682](https://github.com/horovod/horovod/pull/3682))
- Fixed `reducescatter()` and `grouped_reducescatter()` to raise clean exceptions for scalar inputs. ([#3699](https://github.com/horovod/horovod/pull/3699))
- Updated Eigen submodule to fix build on macOS with aarch64. ([#3619](https://github.com/horovod/horovod/pull/3619))
- Build: Correctly select files in `torch/` directory to be hipified. ([#3588](https://github.com/horovod/horovod/pull/3588))
- Build: Modify regex match for CUDA|ROCm in `FindPytorch.cmake`. ([#3593](https://github.com/horovod/horovod/pull/3593))
- Build: Fixed ROCm-specific build failure. ([#3630](https://github.com/horovod/horovod/pull/3630))

## [v0.25.0] - 2022-06-20

### Added

- Added `hvd.reducescatter()` operation with implementations in NCCL, MPI, and Gloo. ([#3299](https://github.com/horovod/horovod/pull/3299), [#3574](https://github.com/horovod/horovod/pull/3574))
- Added AMD GPU XLA Op Implementation. ([#3486](https://github.com/horovod/horovod/pull/3486))
- Added Horovod job to spin up distributed TensorFlow Data Service. ([#3525](https://github.com/horovod/horovod/pull/3525))
- Spark: Expose random seed as an optional parameter. ([#3517](https://github.com/horovod/horovod/pull/3517))
- Add Helm Chart. ([#3546](https://github.com/horovod/horovod/pull/3546))
- Elastic: Add elastic run API. ([#3503](https://github.com/horovod/horovod/pull/3503))
- Spark Estimator: Expose random seed for model training reproducibility. ([#3517](https://github.com/horovod/horovod/pull/3517))
- Spark Estimator: Add option whether to use GPUs at all. ([#3526](https://github.com/horovod/horovod/pull/3526))
- Spark Estimator: Expose parameter to set start method for `multiprocessing`. ([#3580](https://github.com/horovod/horovod/pull/3580))

### Changed

- MXNet: Updated allreduce functions to newer `op` API. ([#3299](https://github.com/horovod/horovod/pull/3299))
- TensorFlow: Make TensorFlow output allocations asynchronous when using NCCL backend. ([#3464](https://github.com/horovod/horovod/pull/3464))
- TensorFlow: Clear locally accumulated gradients by assigning with `zeros_like`, so that infinite gradients are cleared correctly. ([#3505](https://github.com/horovod/horovod/pull/3505))
- Make `HorovodVersionMismatchError` subclass `ImportError` instead of just a standard `Exception`. ([#3549](https://github.com/horovod/horovod/pull/3549))
- Elastic: Catch any exception to prevent the discovery thread from silently dying. ([#3436](https://github.com/horovod/horovod/pull/3436))
- Horovodrun: Exit check_build (`--check-build`) via `sys.exit` to flush stdout. ([#3272](https://github.com/horovod/horovod/pull/3272))
- Spark: Use `env` to set environment vars in remote shell. ([#3489](https://github.com/horovod/horovod/pull/3489))
- Build: Avoid redundant ptx generation for maximum specified compute capability. ([#3509](https://github.com/horovod/horovod/pull/3509))
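
To illustrate the `HorovodVersionMismatchError` change listed above: the class in this sketch is a stand-in defined only for illustration, not Horovod's actual class. It only mirrors the base class described in the entry, which is enough to show that a plain `except ImportError` handler now catches the error.

```python
class HorovodVersionMismatchError(ImportError):
    """Stand-in for illustration only; Horovod defines its own class."""


def load_extension():
    # Simulate an extension that was built against a different framework version.
    raise HorovodVersionMismatchError("framework version mismatch")


def extension_available() -> bool:
    try:
        load_extension()
    except ImportError:
        # Also catches the version-mismatch error, since it subclasses ImportError.
        return False
    return True


print(extension_available())
```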

### Deprecated

- MXNet: Deprecated `average` argument of allreduce functions. ([#3299](https://github.com/horovod/horovod/pull/3299))
- Public and internal APIs: Deprecated use of `np`, `min_np`, and `max_np`. Use `num_proc`, `min_num_proc`, and `max_num_proc`, respectively, instead. ([#3409](https://github.com/horovod/horovod/pull/3409))
- Horovodrun: Providing multiple NICs as comma-separated string via `--network-interface` is deprecated,
  use `--network-interface` multiple times or `--network-interfaces` instead. ([#3506](https://github.com/horovod/horovod/pull/3506))
- horovod.run: Argument `network_interface` with comma-separated string is deprecated,
  use `network_interfaces` with `Iterable[str]` instead. ([#3506](https://github.com/horovod/horovod/pull/3506))
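
As a migration sketch for the `network_interface` deprecation above: the helper below is hypothetical (it is not part of Horovod) and only shows one way a legacy comma-separated value could be converted into the list form that `network_interfaces` expects.

```python
from typing import Iterable, List, Union


def to_network_interfaces(value: Union[str, Iterable[str]]) -> List[str]:
    """Hypothetical helper: normalize NIC names into a list of strings."""
    if isinstance(value, str):
        # Legacy form: one comma-separated string, e.g. "eth0,eth1".
        parts = value.split(",")
    else:
        parts = list(value)
    # Strip surrounding whitespace and drop empty entries (e.g. a trailing comma).
    return [p.strip() for p in parts if p.strip()]


legacy = "eth0, eth1"
nics = to_network_interfaces(legacy)
print(nics)
```

The resulting list can then be passed as the `network_interfaces` argument instead of the deprecated comma-separated `network_interface` string.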

### Fixed

- Fallback to NCCL shared lib if static one is not found. ([#3500](https://github.com/horovod/horovod/pull/3500))
- Spark/Lightning: Added missing `transform_spec` for Petastorm datamodule. ([#3543](https://github.com/horovod/horovod/pull/3543))
- Spark/Lightning: Fixed PTL Spark example with checkpoint usage by calling `save_hyperparameters()`. ([#3527](https://github.com/horovod/horovod/pull/3527))
- Elastic: Fixed empty hostname returned from `HostDiscoveryScript`. ([#3490](https://github.com/horovod/horovod/pull/3490))
- TensorFlow 2.9: Fixed build for API change related to `tensorflow_accelerator_device_info`. ([#3513](https://github.com/horovod/horovod/pull/3513))
- TensorFlow 2.10: Bumped build partially to C++17. ([#3558](https://github.com/horovod/horovod/pull/3558))
- TensorFlow: Fixed gradient update timing in TF `AggregationHelperEager`. ([#3496](https://github.com/horovod/horovod/pull/3496))
- TensorFlow: Fixed resource `NotFoundError` in TF `AggregationHelper`. ([#3499](https://github.com/horovod/horovod/pull/3499))

## [v0.24.3] - 2022-04-21

### Fixed

- Make DBFSLocalStore support "file:/dbfs/...", implement get_localized_path. ([#3510](https://github.com/horovod/horovod/pull/3510))

## [v0.24.2] - 2022-03-10

### Fixed

- Setup: Require fsspec >= 2021.07.0. ([#3451](https://github.com/horovod/horovod/pull/3451))
- Fixed ignored CUDA arch flags. ([#3462](https://github.com/horovod/horovod/pull/3462))

## [v0.24.1] - 2022-03-03

### Fixed

- Extended CMake build script to often find CUDA even if `nvcc` is not in `$PATH`. ([#3444](https://github.com/horovod/horovod/pull/3444))

## [v0.24.0] - 2022-03-01

### Added

- Ray: Added elastic keyword parameters to RayExecutor API: This API supports both static (non-elastic) and elastic Horovod jobs. ([#3190](https://github.com/horovod/horovod/issues/3190))
- TensorFlow: Added in-place broadcasting of variables. ([#3128](https://github.com/horovod/horovod/pull/3128))
- Elastic: Added support for resurrecting blacklisted hosts. ([#3319](https://github.com/horovod/horovod/pull/3319))
- MXNet: Added support for MXNet async dependency engine. ([#3242](https://github.com/horovod/horovod/pull/3242), [#2963](https://github.com/horovod/horovod/pull/2963))
- Spark/Lightning: Added history to lightning estimator. ([#3214](https://github.com/horovod/horovod/pull/3214))

### Changed

- Moved to CMake version 3.13 with first-class CUDA language support and re-enabled parallelized builds. Uses a temporary installation of CMake if CMake 3.13 is not found. ([#3261](https://github.com/horovod/horovod/pull/3261), [#3371](https://github.com/horovod/horovod/pull/3371))
- Moved released Docker image `horovod` and `horovod-cpu` to Ubuntu 20.04 and Python 3.8. ([#3393](https://github.com/horovod/horovod/pull/3393))
- Spark Estimator: Don't shuffle row groups if training data requires non-shuffle. ([#3369](https://github.com/horovod/horovod/pull/3369))
- Spark/Lightning: Reduced memory footprint of async dataloader. ([#3239](https://github.com/horovod/horovod/pull/3239))
- Elastic: Improved handling of NCCL errors in elastic scenarios. ([#3112](https://github.com/horovod/horovod/pull/3112))
- Spark/Lightning: Do not overwrite model with checkpoint by default. ([#3201](https://github.com/horovod/horovod/pull/3201))
- Make checkpoint name optional so that user can save to h5 format. ([#3411](https://github.com/horovod/horovod/pull/3411))

### Deprecated

- Deprecated ElasticRayExecutor APIs in favor of the new RayExecutor API. ([#3190](https://github.com/horovod/horovod/issues/3190))

### Removed

- Spark: Removed `h5py<3` constraint as this is not needed anymore for TensorFlow >2.5.0. ([#3301](https://github.com/horovod/horovod/pull/3301))

### Fixed

- Elastic Spark: Fixed indices in initial task-to-task registration. ([#3410](https://github.com/horovod/horovod/pull/3410))
- PyTorch: Fixed GIL-related deadlock with PyTorch 1.10.1. ([#3352](https://github.com/horovod/horovod/issues/3352))
- PyTorch: Fixed finalization of ProcessSetTable. ([#3351](https://github.com/horovod/horovod/pull/3351))
- Fixed remote trainers to point to the correct shared lib path. ([#3258](https://github.com/horovod/horovod/pull/3258))
- Fixed imports from `tensorflow.python.keras` with TensorFlow 2.6.0+. ([#3403](https://github.com/horovod/horovod/pull/3403))
- Fixed Adasum communicator init logic. ([#3379](https://github.com/horovod/horovod/pull/3379))
- Lightning: Fixed resume logger. ([#3375](https://github.com/horovod/horovod/pull/3375))
- Fixed the checkpoint directory structure for PyTorch and PyTorch Lightning. ([#3362](https://github.com/horovod/horovod/pull/3362))
- Fixed possible integer overflow in multiplication. ([#3368](https://github.com/horovod/horovod/pull/3368))
- Fixed the `pytorch_lightning_mnist.py` example. ([#3245](https://github.com/horovod/horovod/pull/3245), [#3290](https://github.com/horovod/horovod/pull/3290))
- Fixed barrier segmentation fault. ([#3313](https://github.com/horovod/horovod/pull/3313))
- Fixed `hvd.barrier()` tensor queue management. ([#3300](https://github.com/horovod/horovod/pull/3300))
- Fixed PyArrow "list index out of range" IndexError. ([#3274](https://github.com/horovod/horovod/pull/3274))
- Elastic: Fixed all workers sometimes failing on elastic Horovod failure. ([#3264](https://github.com/horovod/horovod/issues/3264))
- Spark/Lightning: Fixed setting `limit_train_batches` and `limit_val_batches`. ([#3237](https://github.com/horovod/horovod/pull/3237))
- Elastic: Fixed ElasticSampler and `hvd.elastic.state` losing some indices of processed samples when nodes dropped. ([#3143](https://github.com/horovod/horovod/issues/3143))
- Spark/Lightning: Fixed history metrics for estimator serialization. ([#3216](https://github.com/horovod/horovod/pull/3216))
- Ray: Fixed RayExecutor to fail when `num_workers=0` and `num_hosts=None`. ([#3210](https://github.com/horovod/horovod/pull/3210))
- Spark/Lightning: Fixed checkpoint callback `dirpath` typo. ([#3204](https://github.com/horovod/horovod/pull/3204))

## [v0.23.0] - 2021-10-06

### Added

- Added process sets to concurrently run collective operations on subsets of Horovod processes in TensorFlow, PyTorch, and MXNet. ([#2839](https://github.com/horovod/horovod/pull/2839), [#3042](https://github.com/horovod/horovod/pull/3042), [#3043](https://github.com/horovod/horovod/pull/3043), [#3054](https://github.com/horovod/horovod/pull/3054), [#3083](https://github.com/horovod/horovod/pull/3083), [#3090](https://github.com/horovod/horovod/pull/3090))
- Added XLA support for Allreduce via `tf.function(jit_compile=True)`. ([#3053](https://github.com/horovod/horovod/pull/3053))
- Added fused buffer scaling and unpack/pack kernels on GPU. ([#2973](https://github.com/horovod/horovod/pull/2973))
- Added support for NCCL on CUDA 11.4. ([#3182](https://github.com/horovod/horovod/issues/3182))
- Added fp16 compression for MXNet. ([#2987](https://github.com/horovod/horovod/issues/2987))
- Added `terminate_on_nan` flag to Spark Lightning estimator. ([#3088](https://github.com/horovod/horovod/issues/3088))
- Added `barrier()` API to torch module to support simple synchronization among ranks and to achieve parity with PyTorch DDP and similar frameworks. ([#3139](https://github.com/horovod/horovod/pull/3139))
- Added params for customizing Tensorboard callback. ([#3153](https://github.com/horovod/horovod/issues/3153))
- Added `hvd.cross_rank()` for keras. ([#3008](https://github.com/horovod/horovod/issues/3008))

### Changed

- Implemented more asynchronous dependency handling on GPU. ([#2963](https://github.com/horovod/horovod/pull/2963))
- Ray: RayExecutor will now use the current placement group instead of always creating a new one. ([#3134](https://github.com/horovod/horovod/pull/3134))
- Lightning: turned off shuffling for validation dataset. ([#2974](https://github.com/horovod/horovod/pull/2974))
- Extended `hvd.join()` to return the last rank that joined. ([#3097](https://github.com/horovod/horovod/pull/3097))

### Deprecated

### Removed

- Spark/Keras: remove bare Keras support. ([#3191](https://github.com/horovod/horovod/pull/3191))

### Fixed

- Fix Horovod develop/editable install mode and incremental builds. ([#3074](https://github.com/horovod/horovod/pull/3074))
- Estimator/Lightning: use lightning datamodule. ([#3084](https://github.com/horovod/horovod/pull/3084))
- Fix Horovod Spark StringType and numpy type mapping issue. ([#3146](https://github.com/horovod/horovod/pull/3146))
- Fixed error in Keras LearningRateScheduler. ([#3135](https://github.com/horovod/horovod/pull/3135))
- Fixed bug in Lightning Profiler on Ray. ([#3122](https://github.com/horovod/horovod/pull/3122))
- Fixed torch op lazy release to prevent OOM in elastic training. ([#3110](https://github.com/horovod/horovod/pull/3110))
- Lightning: Fixed usage of the checkpoint callback. ([#3186](https://github.com/horovod/horovod/pull/3186))
- Fixed MPICH support to use Intel MPI's implementation. ([#3148](https://github.com/horovod/horovod/pull/3148))
- Fixed race condition in PyTorch async dataloader. ([#3120](https://github.com/horovod/horovod/pull/3120))
- Keras: Fixed learning rate scheduler. ([#3142](https://github.com/horovod/horovod/pull/3142), [#3135](https://github.com/horovod/horovod/pull/3135))

## [v0.22.1] - 2021-06-10

### Added

- Estimator: added support for loading data from S3, GCS, ADLS, and other remote filesystems. ([#2927](https://github.com/horovod/horovod/issues/2927))
- Estimator: added custom Spark data loader interface. ([#2938](https://github.com/horovod/horovod/issues/2923))
- LightningEstimator: added support to supply a logger and associated parameter to control the frequency of logging. ([#2926](https://github.com/horovod/horovod/pull/2926))
- Estimator: added check to ensure all ranks have the same device type. ([#2942](https://github.com/horovod/horovod/pull/2942))

### Changed

- Changed behavior from using TensorBoardLogger to now using it as a fallback if a logger is not supplied. ([#2926](https://github.com/horovod/horovod/pull/2926))
- Ray: disabled capturing child tasks in placement group. ([#2920](https://github.com/horovod/horovod/pull/2920))

### Fixed

- Fixed `hvd.tensorflow.keras.Compression`, accidentally removed in v0.22.0. ([#2945](https://github.com/horovod/horovod/pull/2945))
- TorchEstimator: fixed usage of `validation_steps` in place of `validation_steps_per_epoch`. ([#2918](https://github.com/horovod/horovod/pull/2918))
- TensorFlow: fixed C++ API for TF v2.6.0. ([#2932](https://github.com/horovod/horovod/pull/2932))
- PyTorch: fixed `sparse_allreduce_async` for PyTorch v1.10.0. ([#2965](https://github.com/horovod/horovod/pull/2965))

## [v0.22.0] - 2021-05-18

### Added

- Added pytorch_lightning spark estimator which enables training pytorch_lightning models. ([#2713](https://github.com/horovod/horovod/pull/2713))
- Added NVTX tracing hooks for profiling with Nsight Systems. ([#2723](https://github.com/horovod/horovod/pull/2723))
- Added a generic `num_workers` API for `RayExecutor`. ([#2870](https://github.com/horovod/horovod/pull/2870))
- Supports Ray Client without code changes. ([#2882](https://github.com/horovod/horovod/pull/2882))
- Supports in-memory cache option for Keras Estimator. ([#2896](https://github.com/horovod/horovod/pull/2896))
- Added FP16 support for GPU tensor in MXNet. ([#2915](https://github.com/horovod/horovod/pull/2915))
- Added response caching for allgather operations. ([#2872](https://github.com/horovod/horovod/pull/2872))
- Estimator: Added Petastorm `reader_pool_type` to constructor. ([#2903](https://github.com/horovod/horovod/pull/2903))

### Changed

- Changed `alltoall` to return the received splits as a second return value if non-uniform splits are sent. ([#2631](https://github.com/horovod/horovod/pull/2631))
- Changed `RayExecutor` to use [Ray Placement Groups](https://docs.ray.io/en/master/placement-group.html) for worker colocation. ([#2824](https://github.com/horovod/horovod/pull/2824))
- Changed `Inmemory dataloader` usage for Torch Estimator with petastorm v0.11.0 release. ([#2896](https://github.com/horovod/horovod/pull/2896))

### Fixed

- Changed RayExecutor to use Ray node ID to enable multi-container:single-host setups. ([#2883](https://github.com/horovod/horovod/pull/2883))
- Support sparse gradients aggregation in TF1 Keras. ([#2879](https://github.com/horovod/horovod/pull/2879))
- Respect `global_step` parameter for LegacyOptimizers when aggregating gradients. ([#2879](https://github.com/horovod/horovod/pull/2879))
- Fixed compatibility with PyTorch 1.9.0. ([#2829](https://github.com/horovod/horovod/pull/2829))

## [v0.21.3] - 2021-02-15

### Added

- Add `groups` parameter in `DistributedOptimizer` for custom allreduce groups. ([#2523](https://github.com/horovod/horovod/pull/2523))

### Removed

- Removed `num_groups` parameter in `DistributedOptimizer`, replaced with `groups`. ([#2523](https://github.com/horovod/horovod/pull/2523))

### Fixed

- Fixed worker desynchronization deadlock issue in TensorFlow 2.4. ([#2647](https://github.com/horovod/horovod/pull/2647))
- Deduped Keras `LearningRateWarmupCallback` log after gradual learning rate warmup. ([#2661](https://github.com/horovod/horovod/pull/2661))

## [v0.21.2] - 2021-02-08

### Added

- Added support for Intel(R) MPI in horovodrun. ([#2374](https://github.com/horovod/horovod/pull/2374))
- Add support for callbacks in Ray Elastic Executor. ([#2639](https://github.com/horovod/horovod/pull/2639))
- Added forwarding of stdout/stderr captured to driver over Gloo. ([#2646](https://github.com/horovod/horovod/pull/2646))

### Fixed

- Fixed broadcast_optimizer_state to handle NoneType params for PyTorch 1.8. ([#2624](https://github.com/horovod/horovod/pull/2624))
- Fixed `local_rank` support for Ray. ([#2596](https://github.com/horovod/horovod/pull/2596))
- Fixed DL estimators to obtain the output df schema without sampling the input. ([#2611](https://github.com/horovod/horovod/pull/2611))
- Fixed wrong default for `horovod.tensorflow.keras.allreduce` `average`. ([#2627](https://github.com/horovod/horovod/pull/2627))

## [v0.21.1] - 2021-01-06

### Added

- Added in-memory dataset caching param to `TorchEstimator`. ([#2434](https://github.com/horovod/horovod/pull/2434))
- Added `val_batch_size` param to the Estimator API. ([#2505](https://github.com/horovod/horovod/pull/2505))
- Added support for TorchScript modules when using `TorchEstimator`. ([#2494](https://github.com/horovod/horovod/pull/2494))

### Changed

- Migrated to oneCCL aligned with oneAPI specification v1.0. ([#2513](https://github.com/horovod/horovod/pull/2513))
- Added knob to set cache hint for oneCCL allreduce. ([#2560](https://github.com/horovod/horovod/pull/2560))
- Renamed `horovodrun` arg `--ccl-bgt-affinity` to `--thread-affinity`. ([#2562](https://github.com/horovod/horovod/pull/2562))
- Changed default build parallelism from `-j8` to `-j1` to address potential race condition. ([#2572](https://github.com/horovod/horovod/pull/2572))

### Fixed

- Fixed building Horovod for ROCm PyTorch with newer hipify script. ([#2360](https://github.com/horovod/horovod/pull/2360))
- Fixed "Executable class" support for Ray. ([#2510](https://github.com/horovod/horovod/pull/2510))
- Fixed TorchEstimator returning model without switching to eval mode. ([#2517](https://github.com/horovod/horovod/pull/2517))
- Remove ssh reliance for Ray elastic training. ([#2528](https://github.com/horovod/horovod/pull/2528))
- Fixed error handling for changing framework without reinstalling horovod. ([#2529](https://github.com/horovod/horovod/pull/2529))
- Fixed "Intermediate path does not exist" error with DBFSLocalStore. ([#2526](https://github.com/horovod/horovod/pull/2526))
- Avoid synchronization if workers are only removed in elastic mode. ([#2514](https://github.com/horovod/horovod/pull/2514))
- Fixed Ray resource test. ([#2575](https://github.com/horovod/horovod/pull/2575))
- Fixed usage of env variable `HOROVOD_GLOO_TIMEOUT_SECONDS` with `horovodrun`. ([#2571](https://github.com/horovod/horovod/pull/2571))

## [v0.21.0] - 2020-11-23

### Added

- Added support for backward_passes_per_step > 1 for TF Keras graph mode. ([#2346](https://github.com/horovod/horovod/pull/2346))
- Added support for backward_passes_per_step > 1 for TF Keras eager execution. ([#2371](https://github.com/horovod/horovod/pull/2371))
- Added support for backward_passes_per_step > 1 for TF LegacyOptimizer in graph mode. ([#2401](https://github.com/horovod/horovod/pull/2401))
- Added grouped allreduce to enable more efficient tensor fusion and deterministic training. ([#2453](https://github.com/horovod/horovod/pull/2453))
- Add support for specifying `op` and `compression` in `horovod.tensorflow.keras.allreduce()`. ([#2423](https://github.com/horovod/horovod/pull/2423))
- Added support for batched D2D memcopy kernel on GPU. ([#2435](https://github.com/horovod/horovod/pull/2435))
- Added schema inference in Spark Estimator without sampling. ([#2373](https://github.com/horovod/horovod/pull/2373))
- Added `Store.create("dbfs:/")` mapping to `DBFSLocalStore("/dbfs/...")`. ([#2376](https://github.com/horovod/horovod/pull/2376))

### Changed

- Changed Keras callbacks to require parameter `initial_lr` of `LearningRateScheduleCallback` and `LearningRateWarmupCallback`. ([#2459](https://github.com/horovod/horovod/pull/2459))
- Changed default cycle time from 5ms to 1ms and fusion threshold from 64MB to 128MB. ([#2468](https://github.com/horovod/horovod/pull/2468))

### Fixed

- Fixed support for TensorFlow v2.4.0. ([#2381](https://github.com/horovod/horovod/pull/2381))
- Fixed averaging with the CUDA half2 implementation for one-element half buffers. ([#2375](https://github.com/horovod/horovod/pull/2375))
- Fixed `HOROVOD_THREAD_AFFINITY` when using oneCCL. ([#2350](https://github.com/horovod/horovod/pull/2350))
- Added timeout to SSH check in horovodrun to prevent hanging. ([#2448](https://github.com/horovod/horovod/pull/2448))
- Added `HOROVOD_GLOO_TIMEOUT_SECONDS` value to error messages. ([#2436](https://github.com/horovod/horovod/pull/2436))
- Fixed race condition in dynamic timeline API. ([#2341](https://github.com/horovod/horovod/pull/2341))
- Fixed `--log-hide-timestamp` to apply to driver logs with Gloo. ([#2388](https://github.com/horovod/horovod/pull/2388))
- Fixed the search order of Eigen and Flatbuffers paths. ([#2473](https://github.com/horovod/horovod/pull/2473))
- Fixed type checks in `TorchEstimator` to correctly use `isinstance()`. ([#2480](https://github.com/horovod/horovod/pull/2480))

## [0.20.3] - 2020-10-01

### Added

- Added Elastic Ray integration. ([#2291](https://github.com/horovod/horovod/pull/2291))

### Changed

- Removed dependency on SSH access for Ray. ([#2275](https://github.com/horovod/horovod/pull/2275))

## [0.20.2] - 2020-09-25

### Fixed

- Fixed building Horovod without `HOROVOD_WITHOUT_MXNET` when MXNet is not installed. ([#2334](https://github.com/horovod/horovod/pull/2334))

## [0.20.1] - 2020-09-25

### Added

- Added Databricks storage `DBFSLocalStore` and support for GPU-aware scheduling to horovod.spark Estimator. ([#2234](https://github.com/horovod/horovod/pull/2234))
- Added ElasticSampler and PyTorch Elastic ImageNet example. ([#2297](https://github.com/horovod/horovod/pull/2297))
- Added ability to dynamically start and stop timeline programmatically. ([#2215](https://github.com/horovod/horovod/pull/2215))
- Added support for Gloo on macOS. ([#2254](https://github.com/horovod/horovod/pull/2254))
- Exposed name argument to TensorFlow allreduce operation. ([#2325](https://github.com/horovod/horovod/pull/2325))
- Added option to strip outer name scope from Horovod ops in TensorFlow. ([#2328](https://github.com/horovod/horovod/pull/2328))

### Fixed

- Fixed usage of `VERBOSE=1` when setting custom `MAKEFLAGS`. ([#2239](https://github.com/horovod/horovod/pull/2239))
- Fixed bugs in Keras Elastic Callback classes. ([#2289](https://github.com/horovod/horovod/pull/2289))
- Fixed `RelWithDebInfo` build and made it the default with `-O3` optimizations. ([#2305](https://github.com/horovod/horovod/pull/2305))
- Fixed usage of `tf.cond` in TensorFlow alltoall gradient. ([#2327](https://github.com/horovod/horovod/pull/2327))
- Fixed allreduce averaging for TF IndexedSlices in ROCm path. ([#2279](https://github.com/horovod/horovod/pull/2279))
- Included `stdexcept` to handle certain compilers and frameworks that don't include it already. ([#2238](https://github.com/horovod/horovod/pull/2238))
- Fixed Debug builds by setting compiler options based on CMake build type. ([#2263](https://github.com/horovod/horovod/pull/2263))
- Skipped launching zero-sized send/recvs for NCCLAlltoall. ([#2273](https://github.com/horovod/horovod/pull/2273))
- Fixed missing run in TF Keras elastic mode. ([#2272](https://github.com/horovod/horovod/pull/2272))
- Fixed loss function in TensorFlow2 elastic synthetic benchmark. ([#2265](https://github.com/horovod/horovod/pull/2265))
- Fixed usage of `HOROVOD_MIXED_INSTALL` env var in alltoall tests. ([#2266](https://github.com/horovod/horovod/pull/2266))
- Removed keras requirement from Ray example. ([#2262](https://github.com/horovod/horovod/pull/2262))

## [0.20.0] - 2020-09-02

### Added

- Added bare-metal elastic mode implementation to enable auto-scaling and fault tolerance. ([#1849](https://github.com/horovod/horovod/pull/1849))
- Added Elastic Horovod support for Spark auto-scaling. ([#1956](https://github.com/horovod/horovod/pull/1956))
- Added All-to-All operation for TensorFlow, PyTorch, and MXNet. ([#2143](https://github.com/horovod/horovod/pull/2143))
- Added support for `gradient_predivide_factor` and averaging in Horovod backend. ([#1949](https://github.com/horovod/horovod/pull/1949))
- Added NCCL implementation of the allgather operation. ([#1952](https://github.com/horovod/horovod/pull/1952))
- Added `HOROVOD_GPU_OPERATIONS` installation variable to simplify enabling NCCL support for all GPU operations. ([#1960](https://github.com/horovod/horovod/pull/1960))
- Added TensorFlow implementation of `SyncBatchNormalization` layer. ([#2075](https://github.com/horovod/horovod/pull/2075))
- Added `hvd.is_initialized()` method. ([#2020](https://github.com/horovod/horovod/pull/2020))
- Added `hvd.allgather_object` function for TensorFlow, PyTorch, and MXNet. ([#2166](https://github.com/horovod/horovod/pull/2166))
- Added `hvd.broadcast_object` function for MXNet. ([#2122](https://github.com/horovod/horovod/pull/2122))
- Added `label_shapes` parameter to KerasEstimator and TorchEstimator. ([#2140](https://github.com/horovod/horovod/pull/2140))
- Added optional `modelCheckPoint` callback to KerasEstimator params. ([#2124](https://github.com/horovod/horovod/pull/2124))
- Added `ssh_identity_file` argument to `horovodrun`. ([#2201](https://github.com/horovod/horovod/pull/2201))
- Added support for `horovodrun` on `kubeflow/mpi-job`. ([#2199](https://github.com/horovod/horovod/pull/2199))
- Added Ray integration. ([#2218](https://github.com/horovod/horovod/pull/2218))

### Changed

- Moved `horovod.run.runner.run` to `horovod.run`. ([#2099](https://github.com/horovod/horovod/pull/2099))
- `HOROVOD_THREAD_AFFINITY` now accepts multiple values, one for every Horovod rank. ([#2131](https://github.com/horovod/horovod/pull/2131))
- Migrated build system for native libraries to CMake. ([#2009](https://github.com/horovod/horovod/pull/2009))

### Deprecated

- `HOROVOD_CCL_BGT_AFFINITY` is deprecated. Use `HOROVOD_THREAD_AFFINITY` instead. ([#2131](https://github.com/horovod/horovod/pull/2131))

### Removed

- Dropped support for Python 2. ([#1954](https://github.com/horovod/horovod/pull/1954))
- Dropped support for TensorFlow < 1.15. ([#2169](https://github.com/horovod/horovod/pull/2169))
- Dropped support for PyTorch < 1.2. ([#2086](https://github.com/horovod/horovod/pull/2086))

### Fixed

- Fixed MXNet allgather implementation to correctly handle resizing the output buffer. ([#2092](https://github.com/horovod/horovod/pull/2092))
- Fixed Keras Spark Estimator incompatibility with TensorFlow 1.15 due to `tf.autograph`. ([#2069](https://github.com/horovod/horovod/pull/2069))
- Fixed API compatibility with PyTorch 1.6. ([#2051](https://github.com/horovod/horovod/pull/2051))
- Fixed Keras API compatibility with TensorFlow 2.4.0. ([#2178](https://github.com/horovod/horovod/pull/2178))
- Fixed allgather gradient for TensorFlow 2 in cases where the tensor shape is not known during graph construction. ([#2121](https://github.com/horovod/horovod/pull/2121))
- Fixed running using Gloo with an imbalanced number of workers per host. ([#2212](https://github.com/horovod/horovod/pull/2212))