Yiğit Ekin, Ahmet Burak Yildirim, Erdem Eren Çağlar, Aykut Erdem, Erkut Erdem, Aysegul Dundar
This repository contains the official implementation of the paper CLIPAway, which was accepted to NeurIPS 2024. CLIPAway is a novel framework that manipulates CLIP embeddings via projection to remove objects using a Stable Diffusion prior.
Advanced image editing techniques, particularly inpainting, are essential for seamlessly removing unwanted elements while preserving visual integrity. Traditional GAN-based methods have achieved notable success, but recent advancements in diffusion models have produced superior results due to their training on large-scale datasets, enabling the generation of remarkably realistic inpainted images. Despite their strengths, diffusion models often struggle with object removal tasks without explicit guidance, leading to unintended hallucinations of the removed object. To address this issue, we introduce CLIPAway, a novel approach leveraging CLIP embeddings to focus on background regions while excluding foreground elements. CLIPAway enhances inpainting accuracy and quality by identifying embeddings that prioritize the background, thus achieving seamless object removal. Unlike other methods that rely on specialized training datasets or costly manual annotations, CLIPAway provides a flexible, plug-and-play solution compatible with various diffusion-based inpainting techniques.
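To make the idea of excluding foreground elements via projection more concrete, here is a minimal sketch. It assumes the operation is an orthogonal rejection: the component of the background-focused embedding that lies along the foreground-focused embedding is subtracted. The helper names and the toy 3-dimensional vectors are illustrative assumptions, not the repository's API; real embeddings come from Alpha-CLIP and are much higher-dimensional.

```python
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_foreground_component(bg: List[float], fg: List[float]) -> List[float]:
    """Return bg minus its projection onto fg (hypothetical helper).

    The result is orthogonal to fg, i.e. it carries no component
    along the foreground direction.
    """
    fg_norm_sq = dot(fg, fg)
    if fg_norm_sq == 0.0:
        return list(bg)  # nothing to remove for a zero foreground vector
    coeff = dot(bg, fg) / fg_norm_sq
    return [b - coeff * f for b, f in zip(bg, fg)]


# Toy vectors standing in for background- and foreground-focused embeddings.
background = [2.0, 1.0, 0.0]
foreground = [1.0, 0.0, 0.0]
projected = remove_foreground_component(background, foreground)
```

In this toy case the first coordinate, which is the only one shared with the foreground direction, is zeroed out while the rest of the background vector is kept.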
- Paper Accepted to NeurIPS 2024 (22.09.2024)
- Training and inference codes are released. (22.06.2024)
git clone git@github.com:YigitEkin/CLIPAway.git
cd CLIPAway
Our model uses pretrained Alpha-CLIP networks. Make sure that you are in the root directory of the repository, then clone the Alpha-CLIP source code with:
git clone https://github.com/SunzeY/AlphaCLIP.git
Ensure that the cloned Alpha-CLIP directory is in the repository's root directory (see Source Code Downloads). After that, the environment setup can be started.
Anaconda is recommended to install the required dependencies. These dependencies are specified in the conda environment named clipaway, which can be created and activated as follows:
conda env create -f environment.yaml
conda activate clipaway
To download the pretrained models for Alpha-CLIP, IP-Adapter, and our MLP projection network, use the script that we provide:
./download_pretrained_models.sh
Alternatively, the pretrained models can be downloaded manually as follows. NOTE: If you run the following commands manually, make sure that you are in the root directory of the repository.
mkdir ckpts ckpts/AlphaCLIP ckpts/IPAdapter ckpts/CLIPAway
cd ckpts/AlphaCLIP
gdown 1JfzOTvjf0tqBtKWwpBJtjYxdHi-06dbk
cd ../IPAdapter
wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/ip-adapter_sd15.bin
mkdir image_encoder && cd image_encoder
wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/image_encoder/pytorch_model.bin
wget https://huggingface.co/h94/IP-Adapter/resolve/main/models/image_encoder/config.json
cd ../../CLIPAway
gdown 1lFHAT2dF5GVRJLxkF1039D53gixHXaTx
cd ../../
It is important to note that these scripts will download the pretrained models that we have used in our experiments. However, other pretrained models can be used according to users' preferences.
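After downloading, it can help to confirm that the checkpoint folders look as the manual commands above would leave them. The following is a minimal sketch under that assumption; `missing_checkpoints` is a hypothetical helper, not part of the repository. The files fetched with gdown are not checked, because their names are not stated here.

```python
from pathlib import Path
from typing import List

# Paths taken from the manual download commands. The files fetched by gdown
# are not listed because their names are not stated in this README.
EXPECTED_DIRS = ["ckpts/AlphaCLIP", "ckpts/IPAdapter", "ckpts/CLIPAway"]
EXPECTED_FILES = [
    "ckpts/IPAdapter/ip-adapter_sd15.bin",
    "ckpts/IPAdapter/image_encoder/pytorch_model.bin",
    "ckpts/IPAdapter/image_encoder/config.json",
]


def missing_checkpoints(repo_root: str = ".") -> List[str]:
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_checkpoints()` from the repository root returns an empty list when every listed path is present.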
For training the MLP projection network, we provide a dataset class which can be edited according to the dataset of choice. The dataset class is located in dataset/dataset.py.
For training and validation datasets the expected file structure is as follows:
root_path
├── image1.jpg
├── image2.jpg
└── ...
As masks are static full masks, they are not required to be in the dataset folder. Only the images are required.
For the test dataset, the expected file structure is as follows:
root_path
├── images
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
└── masks
  ├── image1.png
  ├── image2.png
  └── ...
Each mask should be stored in the masks folder under the same file name as its corresponding image. The masks should be in .png format.
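The test layout above can be checked before running inference. The sketch below pairs each file in `images` with a `.png` of the same name in `masks` and reports images that have no mask. `pair_images_with_masks` and the set of accepted image extensions are assumptions made for illustration; the repository's own dataset class may load files differently.

```python
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def pair_images_with_masks(root_path: str) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Match root_path/images/<name>.* to root_path/masks/<name>.png.

    Returns (pairs, images_without_mask).
    """
    root = Path(root_path)
    pairs: List[Tuple[Path, Path]] = []
    unmatched: List[Path] = []
    for image in sorted((root / "images").iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip files that are not images
        mask = root / "masks" / f"{image.stem}.png"
        if mask.is_file():
            pairs.append((image, mask))
        else:
            unmatched.append(image)
    return pairs, unmatched
```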
Note: We trained our MLP projection network on the COCO 2017 training set. For calculating the validation loss, we used the COCO 2017 test set. The dataset can be downloaded from the provided link or by using the following commands:
wget https://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip
wget https://images.cocodataset.org/zips/test2017.zip
unzip test2017.zip
rm train2017.zip test2017.zip
Make sure that you update the configuration file with the correct paths to the training and validation datasets.
We provide code for training the MLP projection network. The training code can be run using the following command:
python train.py --config <path_to_training_config>
It is important to note that the training code uses the training configuration file to set the parameters of the training process. The configuration file should be edited according to the users' preferences.
The description of the parameters in the training configuration file is as follows:
Args | Description |
---|---|
--data_path | Path to the training data directory, where the training images are stored. |
--val_path | Path to the validation data directory, used for evaluating the model during training. |
--train_batch_size | Batch size for training, which determines the number of samples processed before the model is updated. |
--val_batch_size | Batch size for validation, which determines the number of samples processed during evaluation. |
--lr | Learning rate for the optimizer, which controls how much to change the model in response to the estimated error each time the model weights are updated. |
--weight_decay | Weight decay (L2 regularization) parameter, which helps to prevent overfitting by penalizing large weights. |
--eval_interval | Interval (in steps) at which to evaluate the model. |
--save_interval | Interval (in steps) at which to save the model checkpoint. |
--save_path | Directory path where the model checkpoints will be saved during training. |
--image_encoder_path | Path to the pre-trained image encoder (CLIP) model, which is used for encoding the input images into embeddings. |
--alpha_vision_ckpt_pth | Path to the AlphaCLIP checkpoint, which contains pre-trained weights for the AlphaCLIP model used in conjunction with CLIPAway. |
--mlp_projection_layer_ckpt_path | Path to the MLP projection layer checkpoint if applicable. Used for projecting embeddings if a specific layer is required. Set to null if not used. |
--alpha_clip_id | Identifier for the AlphaCLIP model variant used. Specifies the version and configuration of the AlphaCLIP model. |
--epochs | Number of training epochs, which defines how many times the training process will iterate over the entire dataset. |
--number_of_hidden_layers | Number of hidden layers in the MLP architecture. Affects model capacity and complexity. |
--alpha_clip_embed_dim | Embedding dimension for the AlphaCLIP model, which defines the size of the feature vectors produced by the AlphaCLIP encoder. |
--ip_adapter_embed_dim | Embedding dimension for the IP-Adapter model, which defines the size of the feature vectors expected as input by the IP-Adapter. |
--device | Device used for training, typically set to "cuda" for GPU acceleration, which speeds up the training process. |
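Since training depends on every parameter in the table, a quick sanity check of the configuration can catch a missing entry before a run starts. The sketch below assumes the configuration has already been loaded into a Python dictionary (how the repository loads it is not shown here); `check_training_config` is a hypothetical helper, and the key names are copied from the table above.

```python
from typing import Any, Dict, List

# Keys listed in the parameter table above.
TRAINING_KEYS = [
    "data_path", "val_path", "train_batch_size", "val_batch_size",
    "lr", "weight_decay", "eval_interval", "save_interval", "save_path",
    "image_encoder_path", "alpha_vision_ckpt_pth",
    "mlp_projection_layer_ckpt_path", "alpha_clip_id", "epochs",
    "number_of_hidden_layers", "alpha_clip_embed_dim",
    "ip_adapter_embed_dim", "device",
]
# Keys that the table explicitly allows to be null.
NULLABLE_KEYS = {"mlp_projection_layer_ckpt_path"}


def check_training_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a loaded training config."""
    problems = []
    for key in TRAINING_KEYS:
        if key not in config:
            problems.append(f"missing key: {key}")
        elif config[key] is None and key not in NULLABLE_KEYS:
            problems.append(f"key has no value: {key}")
    return problems
```

An empty list means that every documented key is present and that only the optional checkpoint path is left unset.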
Note: Our best-performing model is built upon SD-Inpaint, which downscales the provided mask to the dimensions of the latent to determine the inpainting region. This downscaling can cause inconsistencies and result in artifacts. To prevent this, we advise you to dilate your masks before using them in the inference process. We provide a script for dilating the masks, which can be run as follows:
python3 dilate.py --directory <path_to_masks> --kernel-size 5 --iterations 5
The arguments are described as follows:
Args | Description |
---|---|
--directory | Path to the directory containing the masks |
--kernel-size | Size of the kernel for dilation |
--iterations | Number of iterations for dilation |
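To illustrate what these arguments do, here is a pure-Python sketch of binary dilation with a square kernel. It is written for illustration only: whether dilate.py uses a square kernel, and which library it relies on, are assumptions. With a 5x5 kernel, each iteration grows the mask by 2 pixels on every side, so 5 iterations add 10 pixels per side. Stable Diffusion's latent is typically 8 times smaller per side than the image, so this margin exceeds one latent cell, which is consistent with the motivation given in the note above.

```python
from typing import List

Mask = List[List[int]]  # 1 = region to remove, 0 = keep


def dilate(mask: Mask, kernel_size: int = 5, iterations: int = 5) -> Mask:
    """Binary dilation with a square kernel_size x kernel_size kernel."""
    radius = kernel_size // 2
    height, width = len(mask), len(mask[0])
    for _ in range(iterations):
        grown = [[0] * width for _ in range(height)]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Mark every pixel covered by the kernel centred at (y, x).
                    for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                        for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                            grown[ny][nx] = 1
        mask = grown
    return mask


# A single masked pixel in the centre of a 31x31 image.
mask = [[0] * 31 for _ in range(31)]
mask[15][15] = 1
dilated = dilate(mask, kernel_size=5, iterations=5)
masked_width = sum(dilated[15])  # masked pixels in the centre row
```

Here `masked_width` is 21: the original pixel plus 10 pixels on each side.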
After dilation, the inference code can be run on a directory of images using the following command:
python3 inference.py --config <path_to_inference_config>
It is important to note that the inference code uses the inference configuration file to set the parameters of the inference process. The configuration file should be edited according to the users' preferences.
The description of the parameters in the inference configuration file is as follows:
Args | Description |
---|---|
--device | Device used for inference, typically set to "cuda" for GPU acceleration, which speeds up the process. |
--root_path | Path to the directory containing the images to be inpainted and masks |
--image_encoder_path | Path to the pre-trained image encoder (CLIP) model, which is used for encoding the input images into embeddings. |
--alpha_clip_ckpt_pth | Path to the AlphaCLIP checkpoint, which contains pre-trained weights for the AlphaCLIP model used in conjunction with CLIPAway. |
--alpha_clip_id | Identifier for the AlphaCLIP model variant used. Specifies the version and configuration of the AlphaCLIP model. |
--ip_adapter_ckpt_pth | Path to the IP-Adapter checkpoint, which contains pre-trained weights for the IP-Adapter model used in conjunction with CLIPAway. |
--sd_model_key | Key for the SD-Inpaint model variant used. Specifies the version and configuration of the SD-Inpaint model. |
--number_of_hidden_layers | Number of hidden layers in the MLP architecture. Affects model capacity and complexity. |
--alpha_clip_embed_dim | Embedding dimension for the AlphaCLIP model, which defines the size of the feature vectors produced by the AlphaCLIP encoder. |
--ip_adapter_embed_dim | Embedding dimension for the IP-Adapter model, which defines the size of the feature vectors expected as input by the IP-Adapter. |
--mlp_projection_layer_ckpt_path | Path to the MLP projection layer checkpoint if applicable. Used for projecting embeddings if a specific layer is required. Set to null if not used. |
--save_path_prefix | Prefix for the output directory where the inpainted images will be saved. |
--seed | Seed for the random number generator, which ensures reproducibility of the results. |
--scale | Scale parameter of the IP-Adapter model. Determines how much focus is put on the image embeddings. Expects a value in the range [0, 1]. |
--strength | Strength parameter, which determines how much forward diffusion is applied. Expects a value in the range [0, 1]. |
--display_focused_embeds | If set to True, the saved outputs will include unconditional image generations of the focused embeddings as well as the projection block. |
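Because `--scale` and `--strength` are both expected to lie in [0, 1], a small range check can guard against configuration mistakes. The sketch below also illustrates how a strength value commonly translates into denoising steps in img2img-style pipelines. That this rule applies to the pipeline used here is an assumption, and `num_inference_steps` is a hypothetical parameter that is not part of the table above.

```python
def check_unit_interval(name: str, value: float) -> float:
    """Raise ValueError unless 0 <= value <= 1; otherwise return the value."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"{name} must be in [0, 1], got {value}")
    return value


def denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Steps actually run when only part of the forward diffusion is applied.

    Follows the usual img2img-style rule (an assumption about the pipeline
    used here): strength 1.0 starts from pure noise, while lower values keep
    more of the input image.
    """
    check_unit_interval("strength", strength)
    return min(int(num_inference_steps * strength), num_inference_steps)
```

For example, under this rule a strength of 0.5 with 50 scheduled steps runs 25 denoising steps, while a strength of 1.0 runs all 50.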
We provide a Gradio interface for the inference process. The interface can be run using the following command:
python3 app.py --config <path_to_inference_config>
If you want to get a shareable link for the interface, you can use the following command:
python3 app.py --config <path_to_inference_config> --share
CLIPAway is implemented on top of IP-Adapter, which relies heavily on the Diffusers repository. In addition, AlphaCLIP is used for obtaining focused embeddings. We would like to thank the authors of these repositories for their contributions.
@misc{ekin2024clipaway,
title={CLIPAway: Harmonizing Focused Embeddings for Removing Objects via Diffusion Models},
author={Yigit Ekin and Ahmet Burak Yildirim and Erdem Eren Caglar and Aykut Erdem and Erkut Erdem and Aysegul Dundar},
year={2024},
eprint={2406.09368},
archivePrefix={arXiv},
primaryClass={cs.CV}
}