chinkitp/entity-matching-dataset

Benchmarked dataset for entity resolution

This repository helps create a dataset for entity resolution. Each record has the following schema:

    root
    |-- recid: string (nullable = true)
    |-- givename: string (nullable = true)
    |-- surname: string (nullable = true)
    |-- suburb: string (nullable = true)
    |-- postcode: string (nullable = true)

recId: records with the same recId (the recid column in the schema above) refer to the same entity.
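Because a shared recId marks records as the same entity, the ground-truth matches for an evaluation can be derived by grouping on that column. The following is a minimal sketch in plain Scala collections rather than Spark. The `Person` case class, the `GroundTruth` object, and the sample rows are hypothetical and only mirror the schema above.

```scala
// Hypothetical record type mirroring the schema above.
final case class Person(
    recid: String,
    givename: String,
    surname: String,
    suburb: String,
    postcode: String
)

object GroundTruth {
  // All unordered pairs of records that share a recid, i.e. the true matches.
  def matchingPairs(people: Seq[Person]): Seq[(Person, Person)] =
    people
      .groupBy(_.recid)
      .values
      .flatMap(group => group.combinations(2).map { case Seq(a, b) => (a, b) })
      .toSeq

  // Made-up sample rows, for illustration only (not taken from the dataset).
  val sample: Seq[Person] = Seq(
    Person("rec-1", "john", "smith", "sydney", "2000"),
    Person("rec-1", "jon", "smith", "sydney", "2000"),
    Person("rec-2", "mary", "jones", "perth", "6000")
  )
}
```

Calling `GroundTruth.matchingPairs(GroundTruth.sample)` pairs the two `rec-1` records and leaves `rec-2` unmatched. An entity-resolution system's predicted pairs can then be compared against this set.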

Download the dataset

Remove duplicates

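    import org.apache.spark.sql.SaveMode

    // `people` is assumed to be a DataFrame with the schema above.
    // distinct() drops rows that are identical in every column.
    people.distinct()
        .repartition(4)                  // write 4 part files
        .write
        .option("compression", "gzip")   // gzip each part file
        .format("csv")
        .mode(SaveMode.Overwrite)        // replace any existing output
        .save("file:/home/jovyan/work/data/de-duplicated/")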
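Spark's `distinct()` compares entire rows, so this step removes only exact copies. Records that share a recid but differ in some field are kept, and those are the records an entity-resolution benchmark needs. The sketch below shows the difference using plain Scala collections as a stand-in for the Spark Dataset. The `DistinctDemo` object and its rows are hypothetical.

```scala
object DistinctDemo {
  // Hypothetical rows as (recid, givename, surname, suburb, postcode); values are made up.
  type Row = (String, String, String, String, String)

  val rows: Seq[Row] = Seq(
    ("rec-1", "john", "smith", "sydney", "2000"),
    ("rec-1", "john", "smith", "sydney", "2000"), // exact copy: removed by distinct
    ("rec-1", "jon", "smith", "sydney", "2000")   // same recid, different spelling: kept
  )

  // Whole-row comparison, analogous to people.distinct() in Spark.
  val deduplicated: Seq[Row] = rows.distinct

  // Keeping one row per recid instead would discard the variant spellings.
  val onePerRecid: Seq[Row] = rows.groupBy(_._1).values.map(_.head).toSeq
}
```

Here `deduplicated` keeps two rows and `onePerRecid` keeps one, which is why whole-row de-duplication is the safer choice when the variants are the data of interest.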
