AzureFileCopier

Copy vast, complex and deep file directory structures to Azure Files, when AzCopy doesn't get the job done.
This tool was written as a POC to determine if there are faster ways to copy particular folder structures and their contents to Azure than the current tooling allows for.

It provides an option for Azure sysadmins and for time-critical projects that can't wait for an Azure DataBox to arrive.

The current version was written as a POC in .NET Core 3.1 and is therefore cross-platform; if you run into any problems, feel free to open an issue.

For details of the optimizations and design choices we made, see here.

NOTE: Use of this code is at your own risk, and no warranties are given. Alternatives such as AzCopy, Azure File Sync, and Azure DataBox exist to help migrate data to Azure, and may be a better choice for your scenario.

After the copy job is finished, please remember to destroy the supporting control services: the Azure Storage account and the Azure Cache for Redis.

For more information on these cloud services, please see:

Azure Storage Documentation
Azure Cache for Redis Documentation

Requirements / Infrastructure

  1. Create an extra Azure Storage Account to hold the Azure Storage queues used to copy files and folders to the target Azure Files Share.
  2. Create an Azure Cache for Redis; for most purposes a Basic C2 cache with 2.5 GB should be sufficient (see the example commands after this list).
  3. You will need the connection string for the target storage account, the name of the target share, and the connection strings for the control storage account and the Redis cache; place these in appsettings.json.
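
For example, the control resources could be provisioned with the Azure CLI; the account name, cache name, resource group, and region below are placeholders:

az storage account create --name mycontrolstorage --resource-group myrg --location westeurope --sku Standard_LRS
az redis create --name mycopiercache --resource-group myrg --location westeurope --sku Basic --vm-size c2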

Hosting the copy runtime

To copy Azure Files to Azure Blob storage, please use a VM running in the same region as your storage accounts.
Copy performance will depend heavily on the latency to the target storage accounts, as well as on the latency of the storage account providing the Azure Storage queues used for job control.

For datacenter to Azure copy tasks, please ensure separate storage accounts for control and target storage.

[Diagram: datacenter to Azure copy]

When copying from Azure Files to Azure Blob, mount the Azure Files share in an Azure VM and run the job in the same region as the Azure storage.
Currently no direct Azure Files to Azure Blob copy is implemented, so jobs need to use a VM with a mount point.
Performance will vary with the VM type, the number of cores, and the number of storage queues chosen.
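
On a Windows VM, for example, the share can be mounted with net use before pointing the copier at the drive letter; the account name, share name, and key are placeholders:

net use Z: \\mystorageaccount.file.core.windows.net\myshare /user:AZURE\mystorageaccount <storage-account-key>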

[Diagram: Azure to Azure copy]

Compilation

Use Visual Studio 2019!

If you want to build with VS Code, you will need to compile and install MSBuild, which itself needs Visual Studio...

https://docs.microsoft.com/dotnet/core/tutorials/with-visual-studio-code

Build Task for Visual Studio Code has not been added to this project, nor has a launch.json been configured.
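
That said, since the project targets .NET Core 3.1, a plain SDK build from the command line should also be possible; this is an untested assumption:

dotnet build -c Release
dotnet publish -c Release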

Architecture

The copier logic is based on "at least once" queue message processing and Redis SET data structures.

Because it uses a competing-consumers pattern for work placed in Azure Storage queues, the individual copiers jitter their backoff on queue message retrieval to reduce collisions; even so, there is still a chance that they will occasionally do the same work.
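
A minimal sketch of one such consumer, written against the Azure.Storage.Queues v12 SDK, is shown below; the queue name, the CopyFolder handler, and the backoff constants are illustrative assumptions, not the tool's actual code:

  using System;
  using System.Threading;
  using Azure.Storage.Queues;

  class CopyWorker
  {
      static void Main()
      {
          var queue = new QueueClient("<control-storage-connection-string>", "folderqueue-0");
          var rng = new Random();
          int backoffMs = 100;
          while (true)
          {
              // Hide the message from competing consumers while we work on it.
              var msg = queue.ReceiveMessage(TimeSpan.FromMinutes(5)).Value;
              if (msg == null)
              {
                  // Queue empty: back off with jitter so idle consumers
                  // don't all poll in lock-step.
                  Thread.Sleep(backoffMs + rng.Next(0, backoffMs));
                  backoffMs = Math.Min(backoffMs * 2, 10000);
                  continue;
              }
              backoffMs = 100;
              CopyFolder(msg.MessageText); // hypothetical work-item handler
              // Delete only after the work succeeds: at-least-once semantics.
              queue.DeleteMessage(msg.MessageId, msg.PopReceipt);
          }
      }

      static void CopyFolder(string path) { /* create the folder / copy the file */ }
  }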

Using Azure Storage and Redis services to maintain the copy job state allows us to parallelize the work, restart in the case of errors and failures, and buffer in the case of slow performance.

By being able to parallelize both the analysis and the copying of files, we can greatly increase copy performance for very large and complex folder structures.

The current implementation does not use MD5 hashing to check the validity of copied files; this could be implemented in a future version.
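
Such a check might, for example, compare a locally computed digest with the Content-MD5 property Azure Storage keeps for uploaded content; a rough sketch of the local side, using a hypothetical helper rather than anything in the current code:

  using System;
  using System.IO;
  using System.Security.Cryptography;

  // Compute the Base64-encoded MD5 digest of a local file, the format
  // Azure Storage uses for the Content-MD5 property.
  static string Md5Base64(string path)
  {
      using var md5 = MD5.Create();
      using var stream = File.OpenRead(path);
      return Convert.ToBase64String(md5.ComputeHash(stream));
  }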

Configuration

Configuration settings are stored in the appsettings.json, and are as follows:

  "RedisCacheConnectionString": "",

Connection string for Azure Redis Cache used to store copy job data structures and progress

  "ControlStorageConnectionString": "",

Connection string for storage account used to control copy jobs

  "LARGE_FILE_SIZE_BYTES": 10000000,

The size at which we treat a file as a large file, since large files slow down copy jobs. Approximately 10 MB keeps the main copy jobs fast.

  "LARGE_FILE_COPY_TIMEOUT": 300,

Times out a message in the queue after 5 minutes (300 seconds), making it reappear for another copy job to try.
If you are copying very large files to Azure Files, then this may need to be adjusted.

  "TargetStorageConnectionString": "",

Connection string for target storage account where files should be copied to

  "TargetAzureFilesShareName": "",

This is the target share name in Azure Files where the files should be copied to

  "InstrumentationKey" :  "",

If you want telemetry to be collected in Azure App Insights

  "TargetStorageAccountName" : "",

The name of the target storage account

  "TargetStorageKey" : ""

The access key for the target storage account. Access keys should be rolled over after they have been used in this fashion.

  "TargetAzureBlobContainerName": ""

The target container name if copying to Azure Blob Storage.
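
Taken together, a populated appsettings.json might look like the following; all values are placeholders:

  {
    "RedisCacheConnectionString": "<redis-connection-string>",
    "ControlStorageConnectionString": "<control-storage-connection-string>",
    "LARGE_FILE_SIZE_BYTES": 10000000,
    "LARGE_FILE_COPY_TIMEOUT": 300,
    "TargetStorageConnectionString": "<target-storage-connection-string>",
    "TargetAzureFilesShareName": "myshare",
    "InstrumentationKey": "<app-insights-instrumentation-key>",
    "TargetStorageAccountName": "mytargetaccount",
    "TargetStorageKey": "<target-storage-key>",
    "TargetAzureBlobContainerName": "mycontainer"
  }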

Usage

GENERAL OPTIONS

--quietmode or -q
Reduces messages sent to standard output.

COPY MODE: Local File System to Azure Files

Analyzes the folder structure under the given path and creates file and folder creation tasks in Azure Storage queues.
Once all folders assigned to a particular batch have been created, worker jobs start to copy files from the file queues.

Example Usage:

aafccore.exe localtofiles -p E:\testing\testcontent4 -w 16 --pathtoremove testing --excludefolders E:\testing\testcontent4\2,E:\testing\testcontent4\1 --excludefiles office.jpg

COPY MODE: Local File System to Azure Blob

Analyzes the folder structure under the given path and creates file and folder creation tasks in Azure Storage queues.
Once all folders assigned to a particular batch have been created, worker jobs start to copy files from the file queues.

Can be used to copy from Azure Files to Azure Blob by mounting Azure Files in a VM in the same region as the storage account, and transferring from there.

Example Usage:

aafccore.exe localtoblob -p E:\testing\testcontent4 -w 16 --pathtoremove testing --excludefolders E:\testing\testcontent4\2,E:\testing\testcontent4\1 --excludefiles office.jpg

Arguments

--path or -p followed by a path to analyze
Denotes the path containing the directories and files to be analyzed and copied to Azure.

--excludefolders or -x Exclude a comma-separated list of folder paths.

--excludefiles Exclude a comma-separated list of files.

--workercount or -w Determines how many jobs will be used for folder analysis.
Splits the top-level folder list into this number of batches.
Each batch is assigned a number, which is used to name the worker queue in Azure (see the sketch after the batch options below).

--batchclient Determines the client number for batch processing, i.e., which batch will be processed by this job.

--batchmode Will only start one of a subset of batches; used to start individual copy processes.
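
The batching described above can be pictured roughly as follows; this is a hypothetical illustration, not the tool's actual code, and the queue naming scheme is an assumption:

  using System.Collections.Generic;

  static class Batching
  {
      // Distribute the top-level folder list round-robin into --workercount
      // batches; each batch number then names a worker queue in Azure,
      // e.g. batch 3 -> "folderqueue-3" (naming scheme assumed).
      public static List<string>[] SplitIntoBatches(IReadOnlyList<string> topLevelFolders, int workerCount)
      {
          var batches = new List<string>[workerCount];
          for (int i = 0; i < workerCount; i++) batches[i] = new List<string>();
          for (int i = 0; i < topLevelFolders.Count; i++)
              batches[i % workerCount].Add(topLevelFolders[i]);
          return batches;
      }
  }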

--destinationsubfolder Sets the destination subfolder in the target share.

--pathtoremove Will try to remove this prefix string from the path when copying to the target share.

Example usage:

aafccore.exe localtoblob --destinationsubfolder mysub --pathtoremove "\Code\AzureAsyncFileCopier\testing\"


--largefiles or -l
Sets this job to copy large files from the large files queue.

Copying large files takes much longer than copying small files, so we use a separate Azure Storage queue for these files.
On this queue we have a 5 minute timeout for the copying of large files. After 5 minutes, the copy message will reappear in the queue. If your copy jobs take longer than 5 minutes, you can increase this timeout in the appsettings.json. There are two settings for large files in the appsettings.json:

  • LARGE_FILE_COPY_TIMEOUT
  • LARGE_FILE_SIZE_BYTES

Use these to tune your copier's performance and throughput. Do not reduce the large file copy timeout, otherwise the copier will just get stuck on large files and keep retrying them!

RESETTING

If you have made a mistake, or think you need to restart, feel free to reach out to me with issues / questions on GitHub. The copier will overwrite files by default, and jobs can be started / batched. It maintains a set of copied folders so as not to copy them again, and is not designed to sync changing directory structures; that can be done with Azure File Sync.
If you need to reset, use reset mode to wipe the Azure Storage queues and clear the Redis cache.
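
In effect, a reset amounts to something like the sketch below; the queue naming scheme is assumed, and FlushDatabase requires connecting with allowAdmin=true:

  using Azure.Storage.Queues;
  using StackExchange.Redis;

  class Reset
  {
      static void Main()
      {
          // Wipe all copy-job state in Redis, including the copied-folder sets.
          var redis = ConnectionMultiplexer.Connect("<redis-connection-string>,allowAdmin=true");
          redis.GetServer(redis.GetEndPoints()[0]).FlushDatabase();

          // Clear the control queues; match workerCount to your --workercount.
          int workerCount = 16;
          for (int batch = 0; batch < workerCount; batch++)
          {
              var queue = new QueueClient("<control-storage-connection-string>", $"folderqueue-{batch}");
              if (queue.Exists())
                  queue.ClearMessages();
          }
      }
  }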
