C4CDuplicateAnalyzer is a mass duplicate finder for SAP Cloud for Customer (C4C) based on Spring Batch. It is highly configurable and uses multithreading for high performance.
Make sure that Maven and a Java IDE are installed on your system.
Clone this git repository into your working directory
git clone https://github.com/overlinden/C4CDuplicateAnalyzer.git
Open the project in your IDE
Compile the application
Create an application.properties file with the following properties. Replace the placeholders with your C4C URL, username and password. For each job you want to execute, add one job definition block to your application.properties file. A job is defined by a country code, a comma-separated list of roles, a similarity threshold and an output file name.
Example:
logging.level = INFO
logging.file = application.log
logging.level.de.wpsverlinden.c4caccountduplicate=INFO
c4cduplicateanalyzer.endpoint = https://myXXXXXX.crm.ondemand.com/sap/c4c/odata/v1/c4codataapi
c4cduplicateanalyzer.user = YOUR_USERNAME
c4cduplicateanalyzer.password = YOUR_PASSWORD
c4cduplicateanalyzer.chunksize = 10000
c4cduplicateanalyzer.jobs[0].countrycode = AT
c4cduplicateanalyzer.jobs[0].roles = Z01
c4cduplicateanalyzer.jobs[0].threshold = 0.75
c4cduplicateanalyzer.jobs[0].outputfilename = AT_075_Z01_Duplicates.txt
c4cduplicateanalyzer.jobs[1].countrycode = DE
c4cduplicateanalyzer.jobs[1].roles = Z02
c4cduplicateanalyzer.jobs[1].threshold = 0.85
c4cduplicateanalyzer.jobs[1].outputfilename = DE_085_Z02_Duplicates.txt
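To illustrate how the indexed job blocks above map to job definitions, here is a minimal sketch that reads them with plain java.util.Properties. The application itself most likely relies on Spring's own property binding, so the class JobConfigSketch, the nested Job type and the helper methods are hypothetical names used only for this illustration. The sketch assumes that every job block contains all four keys.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class JobConfigSketch {

    // Hypothetical holder for one job definition block.
    static final class Job {
        final String countryCode;
        final List<String> roles;
        final double threshold;
        final String outputFileName;

        Job(String countryCode, List<String> roles, double threshold, String outputFileName) {
            this.countryCode = countryCode;
            this.roles = roles;
            this.threshold = threshold;
            this.outputFileName = outputFileName;
        }
    }

    // Parses properties text; wraps the checked exception for convenience.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Reads jobs[0], jobs[1], ... until an index has no countrycode key.
    // Assumes each present block defines roles, threshold and outputfilename.
    static List<Job> readJobs(Properties props) {
        List<Job> jobs = new ArrayList<>();
        for (int i = 0; ; i++) {
            String prefix = "c4cduplicateanalyzer.jobs[" + i + "].";
            String country = props.getProperty(prefix + "countrycode");
            if (country == null) {
                break;
            }
            // Split the comma-separated role list and drop empty entries.
            List<String> roles = Arrays.stream(props.getProperty(prefix + "roles", "").split(","))
                    .map(String::trim)
                    .filter(s -> !s.isEmpty())
                    .collect(Collectors.toList());
            double threshold = Double.parseDouble(props.getProperty(prefix + "threshold").trim());
            String output = props.getProperty(prefix + "outputfilename").trim();
            jobs.add(new Job(country.trim(), roles, threshold, output));
        }
        return jobs;
    }

    public static void main(String[] args) {
        Properties props = parse(
                "c4cduplicateanalyzer.jobs[0].countrycode = AT\n"
              + "c4cduplicateanalyzer.jobs[0].roles = Z01, Z02\n"
              + "c4cduplicateanalyzer.jobs[0].threshold = 0.75\n"
              + "c4cduplicateanalyzer.jobs[0].outputfilename = AT_075_Z01_Duplicates.txt\n");
        for (Job job : readJobs(props)) {
            System.out.println(job.countryCode + " " + job.roles + " " + job.threshold + " " + job.outputFileName);
        }
    }
}
```

The indices must be contiguous and start at 0 in this sketch; a gap in the numbering ends the scan.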
The application downloads all customers matching the defined country and roles from your C4C environment. It then calculates the similarity (based on the Levenshtein distance) between all customers, using several hardcoded fields. If the similarity exceeds the defined threshold, the pair is recognized as a potential duplicate and reported.
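The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the application's actual code: the class SimilaritySketch and its methods are hypothetical, it compares a single string instead of several fields, and the normalization (1 minus the distance divided by the longer string's length) is an assumption about how a Levenshtein distance is turned into a similarity between 0 and 1.

```java
public class SimilaritySketch {

    // Levenshtein distance via dynamic programming with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        return 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // A pair is flagged when the similarity exceeds the job's threshold.
    static boolean isPotentialDuplicate(String a, String b, double threshold) {
        return similarity(a, b) > threshold;
    }

    public static void main(String[] args) {
        String first = "Mueller GmbH";
        String second = "Müller GmbH";
        System.out.println("similarity = " + similarity(first, second));
        System.out.println("duplicate at 0.75: " + isPotentialDuplicate(first, second, 0.75));
        System.out.println("duplicate at 0.85: " + isPotentialDuplicate(first, second, 0.85));
    }
}
```

With this normalization, a higher threshold such as 0.85 reports fewer and closer matches than 0.75, which is the trade-off to keep in mind when choosing the threshold for a job.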
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.