Implementing YOLOv5 for Parallel and Distributed Computing (AWS)

This repo is forked and modified from Ultralytics's YOLOv5 Pytorch implementation. Visit here for the original README.

Table of Content

Project structure
How to use
Putting everything together: DAGMan Workflow
Common Issues

Project structure

The project is structured as follows:

<root>
|-- adr (architectural decision records)
|-- archive
|-- classify (classification)
|-- data
|   |-- hyps (hyperparam YAMLs)
|   |   |-- hyp.*.yaml
|   |   |-- ...
|   |
|   |-- images (images folder)
|   |-- scripts (bash scripts folder)
|   |-- pricetag.yaml (dataset config for this project)
|   |-- *.yaml (configs for other datasets)
|
|-- models (model configs)
|   |-- hub
|   |-- segment
|   |-- *.py
|   |-- yolov5*.yaml (YOLOv5 configs)
|
|-- runs
|   |-- detect
|   |   |-- exp* (folder containing detection results)
|
|-- segment (Python files for segmentation)
|-- utils (utility functions)
|-- detect.py (run this to perform inference)
|-- train.py (run this to train your model)
|-- *.py (other Python files)
|-- requirements.txt

Note: Some files are omitted in the project tree above for the sake of brevity.

How to use

This section outlines the step-by-step procedure to replicate this project.

Amazon Web Services (AWS) Setup

Spin up EC2 instances with Ubuntu AMI 20.04. Configure them as follows:
- Instance 1:
  - Name: YOLO-CentralMgt
  - Type: t2.micro
- Instance 2:
  - Name: YOLO-SubmHost
  - Type: t2.micro
- Instance 3:
  - Name: YOLO-Executor
  - Type: t2.micro (Can be upgraded to t2.large if needed)
- Group all instances into single security group, and private key pair .pem is used to login in this project.
Access the instances via terminal by using the command at the Connect button at each instances.
```
$ cd .ssh
$ ssh -i "private_key.pem" [email protected]
```

HTCondor Setup

Install HTCondor in all three (3) instances.

Perform update on all instances via sudo apt-get update on all instances.

Install HTCondor using this guide on respective machines as follows (Note that the variables marked $ should be replaced with user-defined values):

YOLO-CentralMgt:

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --central-manager $central_manager_name

YOLO-SubmHost:

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --submit $central_manager_name

YOLO-Executor:

$ curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name

At YOLO-SubmHost, run $ condor_status to see execute machines in the pool, $ condor_submit to submit jobs and $ condor_q to see the jobs run.

At each instance, update /etc/hosts with IP address and machine names.

$ sudo nano /etc/hosts

Add the CentralMgt, SubmHost and Executor IP address and machine name like below:

127.0.0.1 localhost
172.31.92.114 CentralMgt
172.31.91.11 SubmHost
172.31.88.90 Executor

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
...

Edit inbound rules for security group to allow all traffic to pass within the pool group.
- At sidebar, go to Network & Security and select Security Groups.
- Choose the security group that applies to the pool.
- At Inbound rules, select Edit inbound rules, then Add rule.
- Choose All traffic for Type, Custom for Source, and select the security group in the box next to Source. Then Save rules.
- Test HTCondor using this sleep.sh example.

Network File System (NFS) Setup

The following setup procedure can be found here.

Install NFS Server on SubmHost, and then start it.

$ sudo apt install nfs-kernel-server
$ sudo systemctl start nfs-kernel-server.service

Create yolo folder at /home/ubuntu in SubmHost and Executor instances.
```
$ mkdir /yolo
```

Add the yolo folder to /etc/exports file.

$ sudo nano /etc/exports

Add /home/ubuntu/yolo *(rw,sync,no_subtree_check) into the file like below:

# /etc/exports: the access control list for filesystems which may be exported
#               to NFS clients.  See exports(5).
...
#
/home/ubuntu/yolo *(rw,sync,no_subtree_check)

Then apply the new config via

$ sudo exportfs -a

At Executor instance, install NFS Client and then start it if it is not active.

$ sudo apt install nfs-common
$ sudo systemctl status nfs-common.service
$ sudo systemctl start nfs-common.service

Still at Executor, mount the created /home/ubuntu/yolo directory to the exported directory in NFS Server.

$ sudo mount SubmHost:/home/ubuntu/yolo /home/ubuntu/yolo

Test by creating a file at SubmHost and echo its content in the Executor side.

# At SubmHost
$ cd /home/ubuntu/yolo
$ echo "Hello" > testfile.txt

# At Executor
$ cd /home/ubuntu/yolo
$ echo testfile.txt

See Common Issues if you run into any issues during NFS setup.

You Only Look Once (YOLO) Model Inference Setup

Clone this repository to Executor because there are several Ubuntu-based packages needed to run YOLO model. Note that Git and Python 3.8 come pre-installed with Ubuntu 20.04 AMI.
```
$ git clone https://github.com/yuenherny/um-wqd7008-pdc-yolov5.git
```

Install required Ubuntu packages for OpenCV and venv.

$ sudo apt install python3-opencv
$ sudo apt install python3.8-venv

At the cloned local repository, create and activate Python environment.
```
$ cd um-wqd7008-pdc-yolov5
$ python3 -m venv venv
$ source venv/bin/activate
```
Check if environment is activated. You should see a list of pre-installed packages.
```
$ pip list
```
Before installing other dependencies using requirements.txt, install torch and torchvision packages from the official PyTorch docs, as downloads via PyPi wheel can be slow on AWS.
- At START LOCALLY section, choose:
  - PyTorch Build: Stable
  - Your OS: Linux
  - Package: Pip
  - Language: Python
  - Compute Platform: CPU
- Then copy the command with torchaudio removed.
```
$ pip3 install torch torchvision --extra-index-url https://download.pytorch.org/whl/cpu
```
Amend requirements.txt and comment out thop>=0.1.1, then save.
```
$ nano requirements.txt
```
Then, install dependencies.
```
$ pip install -r requirements.txt
```

Now that required dependencies are installed, we can check if things could be run like normal - invoking from terminal.

$ python3 detect.py --weights yolov5s.pt --source data/images/zidane.jpg

This would download the YOLOv5s weights and perform inference using data/images/zidane.jpg input source. You should see something like below:

detect: weights=['yolov5s.pt'], source=data/images/bus.jpg, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.25, iou_thres=0.45, max_det=1000, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1

Fusing layers...
YOLOv5s summary: 213 layers, 7225885 parameters, 0 gradients
image 1/1 /home/ubuntu/yolo/um-wqd7008-pdc-yolov5/data/images/bus.jpg: 640x480 4 persons, 1 bus, 309.3ms
Speed: 2.9ms pre-process, 309.3ms inference, 3.7ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/exp

which means inference is successful and the result is saved to runs/detect/exp folder.

Submit a HTCondor Job to Perform Inference

At SubmHost, create a bash file (see yolo.sh).

$ nano yolo.sh

Then paste this into the bash file, and save:

#!/usr/bin/bash
# file name: yolo.sh

# check python version
echo "$(python3 --version)"

# change to directory
cd /home/ubuntu/yolo/um-wqd7008-pdc-yolov5
echo "Directory changed"

# activate python env
source venv/bin/activate
echo "Python env activated"

# run detect.py on an image
echo "Execution start"
python3 detect.py --weights yolov5s.pt --source data/images/bus.jpg
echo "Execution complete"

# deactivate python env
deactivate
echo "Python env deactivated"

Without the shebang (the first line in bash file - #!/usr/bin/bash), you might get the following error:

...
007 (019.000.000) 2023-01-09 10:23:53 Shadow exception!
        Error from [email protected]: Failed to execute '/var/lib/condor/execute/dir_1257/condor_exec.exe': (errno=8: 'Exec format error')
        0  -  Run Bytes Sent By Job
        298  -  Run Bytes Received By Job
...
012 (019.000.000) 2023-01-09 10:23:53 Job was held.
        Error from [email protected]: Failed to execute '/var/lib/condor/execute/dir_1257/condor_exec.exe': (errno=8: 'Exec format error')
        Code 6 Subcode 8
...

This shebang-induced error is documented here and here.

Still at SubmHost, create a HTCondor submit file (see yolo.sub).

$ nano yolo.sub

Then paste the following into it, and save:

# YOLO detection on an image

executable   = yolo.sh

output       = yolo.out
error        = yolo.err
log          = yolo.log

should_transfer_files = yes
when_to_transfer_output = ON_EXIT

queue

Still at SubmHost, submit yolo.sh as a job using yolo.sub submit file.
```
$ condor_submit yolo.sub
```

If there were no errors, you should see your result in the Executor instance's um-wqd7008-pdc-yolov5/runs/detect/exp directory. However, you might get the following output in yolo.err:

Traceback (most recent call last):
File "detect.py", line 261, in <module>
    main(opt)
File "detect.py", line 256, in main
    run(**vars(opt))
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
File "detect.py", line 98, in run
    model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/models/common.py", line 345, in __init__
    model = attempt_load(weights if isinstance(weights, list) else w, device=device, inplace=True, fuse=fuse)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/models/experimental.py", line 79, in attempt_load
    ckpt = torch.load(attempt_download(w), map_location='cpu')  # load
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 771, in load
    with _open_file_like(f, 'rb') as opened_file:
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 270, in _open_file_like
    return _open_file(name_or_buffer, mode)
File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/serialization.py", line 251, in __init__
    super(_open_file, self).__init__(open(name, mode))
PermissionError: [Errno 13] Permission denied: 'yolov5s.pt'

or:

Traceback (most recent call last):
  File "detect.py", line 261, in <module>
    main(opt)
  File "detect.py", line 256, in main
    run(**vars(opt))
  File "/home/ubuntu/yolo/um-wqd7008-pdc-yolov5/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "detect.py", line 94, in run
    (save_dir / 'labels' if save_txt else save_dir).mkdir(parents=True, exist_ok=True)  # make dir
  File "/usr/lib/python3.8/pathlib.py", line 1288, in mkdir
    self._accessor.mkdir(self, mode)
PermissionError: [Errno 13] Permission denied: 'runs/detect/exp2'

The [Errno 13] error means that the program tried to create new file or folder but it was denied due to write permission issues. We need to assign other users with write access permissions in both project root and the runs/detect/ directory. At the parent directory of um-wqd7008-pdc-yolov5:

$ chmod 777 um-wqd7008-pdc-yolov5
$ chmod 777 um-wqd7008-pdc-yolov5/runs

Change permissions for other files and directories if needed. Read more about chmod here.

Google Drive Setup

To support uploading to Google Drive, we will be using gdrive.

Before proceeding, please create Google OAuth Client credentials by following this guide.

Due to a limitation of this package which requires a web browser, you are required to perform this setup on a GUI-supported local machine. After the setup, we will then export the account and import it on any remote server.

Part 1: Setup on Local Machine

Download the latest release from Github. v3.1.0 as of Jan 10, 2023.
Unzip the Archive
Open Terminal where gdrive is located, add Google Account to gdrive
```
$ ./gdrive account add
```
- This will prompt you for your google Client ID and Client Secret
- Next you will be presented with an url
- Open the url in your browser and give approval for gdrive to access your Google Drive
- You will be redirected to http:https://localhost:8085 (gdrive starts a temporary web server) which completes the setup
- Gdrive is now ready to use!
Test upload a file
```
$ ./gdrive files upload <FILE_PATH>
```
Export your gdrive Account (i.e. [email protected])
```
$ ./gdrive account export <ACCOUNT_NAME>
```

Copy the exported archive to the remote server(s) using the command below:

$ scp <Options> <PATH/ON/LOCAL> <SERVER_NAME>@<HOST>:<PATH/ON/SERVER>

Example:

$ scp -i "nicholasleezt-7008.pem" ~/Downloads/gdrive_export-s2132376_siswa_um_edu_my.tar [email protected]:/home/ubuntu

Part 2: On Remote Server(s)

Download the latest release from Github. v3.1.0 as of Jan 10, 2023.

$ wget https://github.com/glotlabs/gdrive/releases/download/3.1.0/gdrive_linux-x64.tar.gz

Unzip the archive.
```
$ tar -xvf gdrive_linux-x64.tar.gz
```
Put the gdrive binary at your PATH (i.e. /usr/local/bin)
```
$ sudo mv /home/ubuntu/gdrive /usr/local/bin
```

Import the gdrive account.

$ gdrive account import <EXPORTED_ACCOUNT>

Test upload a file.
```
$ gdrive files upload <FILE/TO/PATH>
```

Alternatively, to upload to a specific folder in your gdrive, run:

$ gdrive files list

to get the folder ID. You will see something like:

 Id                                                                          Name                                      Type       Size       Created
 1N6Hs1Tx1f0ubfKphI4cfGyZCGsjJCEz4                                           Folder1                                    folder                2022-12-31 09:24:37
 1i7uFBCbqgw176TftRsws6z0BTr64zB1m                                           Folder2                                folder                2022-11-07 06:43:56
 1GNCqI5lG-6wd_XQnYz0Z1z3ReUTGUQUY                                           Folder3                             folder                2022-07-20 12:48:50

then do:

$ gdrive files upload <PATH/TO/FILE> --parent <FOLDER_ID>

Note that as of 15 Jan 2023, gdrive package is only able to upload files. Folders or multiple files are not supported.

Putting Everything Together: HTCondor Workflow Execution using DAGMan

The DAGMan workflow yolo-gdrive.dag aims to:

Perform YOLO inference on images on parallel - see yolo.sh and yolo.sub
Upload the results to a Google Drive folder - see gdrive.sh and gdrive.sub

Unfortunately, the shell script for gdrive execution was unable to complete due to permission issues:

Error: Failed to create directory '/nonexistent/.config/gdrive3': Permission denied (os error 13)
Error: No account has been selected
Use `gdrive account list` to show all accounts.
Use `gdrive account switch` to select an account.
...
Error: No account has been selected
Use `gdrive account list` to show all accounts.
Use `gdrive account switch` to select an account.
sudo: a terminal is required to read the password; either use the -S option to read from standard input or configure an askpass helper

Common Issues

Issue 1: NFS Client (on Executor instance) inactive after instance stopped and started

This could happen when you stopped and then started the instance after complete setting up NFS Server and Client (you took a break and resumed this project).

Solution: Start the NFS Client service on Executor.

At Executor, start the NFS Client:

$ sudo systemctl start nfs-common.service

At Executor, perform mounting:

$ sudo mount <NFS_SERVER_IP_ADDRESS_OR_MACHINE_NAME>:<DIR_ON_NFS_SERVER> <DIR_ON_NFS_CLIENT>

Issue 2: NFS Client could not be started due to masked

This could happen when your NFS Client service unit file was symlinked to /dev/null.

Solution: Remove the symlink, unmask and start the service.

At Executor, navigate to /lib/systemd/system/ and check if service unit file was symlinked to /dev/null.

$ file /lib/systemd/system/nfs-common.service
OR
$ file /etc/systemd/system/nfs-common.service

It should return:

/lib/systemd/system/nfs-common.service: symbolic link to /dev/null

Delete the symlink.

$ sudo rm /lib/systemd/system/nfs-common.service

Reload the systemd daemon:
```
$ sudo systemctl daemon-reload
```

Unmask, start and check the service:

$ sudo systemctl unmask nfs-common.service
$ sudo systemctl start nfs-common.service
$ $ sudo systemctl status nfs-common.service

As documented here and here.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Implementing YOLOv5 for Parallel and Distributed Computing (AWS)

Table of Content

Project structure

How to use

Amazon Web Services (AWS) Setup

HTCondor Setup

Network File System (NFS) Setup

You Only Look Once (YOLO) Model Inference Setup

Submit a HTCondor Job to Perform Inference

Google Drive Setup

Part 1: Setup on Local Machine

Part 2: On Remote Server(s)

Putting Everything Together: HTCondor Workflow Execution using DAGMan

Common Issues

Issue 1: NFS Client (on Executor instance) inactive after instance stopped and started

Issue 2: NFS Client could not be started due to masked

About

Releases

Packages

Contributors 294

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2,553 Commits
.github		.github
adr		adr
archive		archive
classify		classify
data		data
models		models
segment		segment
utils		utils
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
benchmarks.py		benchmarks.py
detect.py		detect.py
export.py		export.py
gdrive.sh		gdrive.sh
gdrive.sub		gdrive.sub
hubconf.py		hubconf.py
nfs-setup.sh		nfs-setup.sh
requirements.txt		requirements.txt
setup.cfg		setup.cfg
train.py		train.py
tutorial.ipynb		tutorial.ipynb
val.py		val.py
yolo-gdrive.dag		yolo-gdrive.dag
yolo.sh		yolo.sh
yolo.sub		yolo.sub

License

yuenherny/um-wqd7008-pdc-yolov5

Folders and files

Latest commit

History

Repository files navigation

Implementing YOLOv5 for Parallel and Distributed Computing (AWS)

Table of Content

Project structure

How to use

Amazon Web Services (AWS) Setup

HTCondor Setup

Network File System (NFS) Setup

You Only Look Once (YOLO) Model Inference Setup

Submit a HTCondor Job to Perform Inference

Google Drive Setup

Part 1: Setup on Local Machine

Part 2: On Remote Server(s)

Putting Everything Together: HTCondor Workflow Execution using DAGMan

Common Issues

Issue 1: NFS Client (on Executor instance) inactive after instance stopped and started

Issue 2: NFS Client could not be started due to masked

About

Topics

Resources

License

Code of conduct

Security policy

Stars

Watchers

Forks

Releases

Packages 0

Contributors 294

Languages

Packages