GitHub - codingconcepts/dg: A fast data generator that produces CSV files from generated relational data

A fast data generator that produces CSV files from generated relational data.

$ dg
Usage dg:
  -c string
        the absolute or relative path to the config file
  -cpuprofile string
        write cpu profile to file
  -i string
        write import statements to file
  -o string
        the absolute or relative path to the output dir (default ".")
  -p int
        port to serve files from (omit to generate without serving)
  -version
        display the current version number

Create a config file. In the following example, we create 10,000 people, 50 events, 5 person types, and then populate the many-to-many person_event resolver table with 500,000 rows that represent the Cartesian product between the person and event tables:

tables:
  - name: person
    count: 10000
    columns:
      # Generate a random UUID for each person
      - name: id
        type: gen
        processor:
          value: ${uuid}

  - name: event
    count: 50
    columns:
      # Generate a random UUID for each event
      - name: id
        type: gen
        processor:
          value: ${uuid}

  - name: person_type
    count: 5
    columns:
      # Generate a random UUID for each person_type
      - name: id
        type: gen
        processor:
          value: ${uuid}

      # Generate a random 16 bit number and left-pad it to 5 digits
      - name: name
        type: gen
        processor:
          value: ${uint16}
          format: "%05d"

  - name: person_event
    columns:
      # Generate a random UUID for each person_event
      - name: id
        type: gen
        processor:
          value: ${uuid}

      # Select a random id from the person_type table
      - name: person_type
        type: ref
        processor:
          table: person_type
          column: id

      # Generate a person_id column for each id in the person table
      - name: person_id
        type: each
        processor:
          table: person
          column: id

      # Generate an event_id column for each id in the event table
      - name: event_id
        type: each
        processor:
          table: event
          column: id

Run the application:

$ dg -c your_config_file.yaml -o your_output_dir -p 3000
loaded config file                       took: 428µs
generated table: person                  took: 41ms
generated table: event                   took: 159µs
generated table: person_type             took: 42µs
generated table: person_event            took: 1s
generated all tables                     took: 1s
wrote csv: person                        took: 1ms
wrote csv: event                         took: 139µs
wrote csv: person_type                   took: 110µs
wrote csv: person_event                  took: 144ms
wrote all csvs                           took: 145ms

This will output and dg will then run an HTTP server allow you to import the files from localhost.

your_output_dir
├── event.csv
├── person.csv
├── person_event.csv
└── person_type.csv

Import via HTTP

Then import the files as you would any other; here's an example insert into CockroachDB:

IMPORT INTO "person" ("id")
CSV DATA (
    'https://localhost:3000/person.csv'
)
WITH skip='1', nullif = '', allow_quoted_null;

IMPORT INTO "event" ("id")
CSV DATA (
    'https://localhost:3000/event.csv'
)
WITH skip='1', nullif = '', allow_quoted_null;

IMPORT INTO "person_type" ("id", "name")
CSV DATA (
    'https://localhost:3000/person_type.csv'
)
WITH skip='1', nullif = '', allow_quoted_null;

IMPORT INTO "person_event" ("person_id", "event_id", "id", "person_type")
CSV DATA (
    'https://localhost:3000/person_event.csv'
)
WITH skip='1', nullif = '', allow_quoted_null;

Import via psql

If you're working with a remote database and have access to the psql binary, try importing the CSV file as follows:

psql "postgres:https://root@localhost:26257/defaultdb?sslmode=disable" \
-c "\COPY public.person (id, full_name, date_of_birth, user_type, favourite_animal) FROM './csvs/person/person.csv' WITH DELIMITER ',' CSV HEADER NULL E''"

Import via nodelocal

If you're working with a remote database and have access to the cockroach binary, try importing the CSV file as follows:

cockroach nodelocal upload ./csvs/person/person.csv imports/person.csv \
  --url "postgres:https://root@localhost:26257?sslmode=disable"

Then importing the file as follows:

IMPORT INTO person ("id", "full_name", "date_of_birth", "user_type", "favourite_animal")
  CSV DATA (
    'nodelocal:https://1/imports/person.csv'
  ) WITH skip = '1';

Tables

Table elements instruct dg to generate data for a single table and output it as a csv file. Here are the configuration options for a table:

tables:
  - name: person
    unique_columns: [col_a, col_b]
    count: 10
    columns: ...

This config generates 10 random rows for the person table. Here's a breakdown of the fields:

Field Name	Optional	Description
name	No	Name of the table. Must be unique.
unique_columns	Yes	Removes duplicates from the table based on the column names provided
count	Yes	If provided, will determine the number of rows created. If not provided, will be calculated by the current table size.
suppress	Yes	If `true` the table won't be written to a CSV. Useful when you need to generate intermediate tables to combine data locally.
columns	No	A collection of columns to generate for the table.

Processors

dg takes its configuration from a config file that is parsed in the form of an object containing arrays of objects; tables and inputs. Each object in the tables array represents a CSV file to be generated for a named table and contains a collection of columns to generate data for.

gen

Generate a random value for the column. Here's an example:

- name: sku
  type: gen
  processor:
    value: SKU${uint16}
    format: "%05d"

This configuration will generate a random left-padded uint16 with a prefix of "SKU" for a column called "sku". value contains zero or more function placeholders that can be used to generate data. A list of available functions can be found here.

Generate a pattern-based value for the column. Here's an example:

- name: phone
  type: gen
  processor:
    pattern: \d{3}-\d{3}-\d{4}

This configuration will generate US-format phone number, like 123-456-7890.

const

Provide a constant set of values for a column. Here's an example:

- name: options
  type: const
  processor:
    values: [bed_breakfast, bed]

This configuration will create a column containing two rows.

set

Select a value from a given set. Here's an example:

- name: user_type
  type: set
  processor:
    values: [admin, regular, read-only]

This configuration will select between the values "admin", "regular", and "read-only"; each with an equal probability of being selected.

Items in a set can also be given a weight, which will affect their likelihood of being selected. Here's an example:

- name: favourite_animal
  type: set
  processor:
    values: [rabbit, dog, cat]
    weights: [10, 60, 30]

This configuration will select between the values "rabbit", "dog", and "cat"; each with different probabilities of being selected. Rabbits will be selected approximately 10% of the time, dogs 60%, and cats 30%. The total value doesn't have to be 100, however, you can use whichever numbers make most sense to you.

inc

Generates an incrementing number. Here's an example:

- name: id
  type: inc
  processor:
    start: 1
    format: "P%03d"

This configuration will generate left-padded ids starting from 1, and format them with a prefix of "P".

ref

References a value from a previously generated table. Here's an example:

- name: ptype
  type: ref
  processor:
    table: person_type
    column: id

This configuration will choose a random id from the person_type table and create a ptype column to store the values.

Use the ref type if you need to reference another table but don't need to generate a new row for every instance of the referenced column.

each

Creates a row for each value in another table. If multiple each columns are provided, a Cartesian product of both columns will be generated.

Here's an example of one each column:

- name: person
  count: 3
  columns:
    - name: id
      type: gen
      processor:
        value: ${uuid}

# person
#
# id
# c40819f8-2c76-44dd-8c44-5eef6a0f2695
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9

- name: pet
  columns:
    - name: person_id
      type: each
      processor:
        table: person
        column: id
    - name: name
      type: gen
      processor:
        value: first_name
# pet
#
# person_id                            name
# c40819f8-2c76-44dd-8c44-5eef6a0f2695 Carlo
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea Armando
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9 Kailey

Here's an example of two each columns:

- name: person
  count: 3
  columns:
    - name: id
      type: gen
      processor:
        value: ${uuid}

# person
#
# id
# c40819f8-2c76-44dd-8c44-5eef6a0f2695
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9

- name: event
  count: 3
  columns:
    - name: id
      type: gen
      processor:
        value: ${uuid}

# event
#
# id
# 39faeb54-67d1-46db-a38b-825b41bfe919
# 7be981a9-679b-432a-8a0f-4a0267170c68
# 9954f321-8040-4cd7-96e6-248d03ee9266

- name: person_event
  columns:
    - name: person_id
      type: each
      processor:
        table: person
        column: id
    - name: event_id
      type: each
      processor:
        table: event
        column: id
# person_event
#
# person_id
# c40819f8-2c76-44dd-8c44-5eef6a0f2695 39faeb54-67d1-46db-a38b-825b41bfe919
# c40819f8-2c76-44dd-8c44-5eef6a0f2695 7be981a9-679b-432a-8a0f-4a0267170c68
# c40819f8-2c76-44dd-8c44-5eef6a0f2695 9954f321-8040-4cd7-96e6-248d03ee9266
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea 39faeb54-67d1-46db-a38b-825b41bfe919
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea 7be981a9-679b-432a-8a0f-4a0267170c68
# 58f42be2-6cc9-4a8c-b702-c72ab1decfea 9954f321-8040-4cd7-96e6-248d03ee9266
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9 39faeb54-67d1-46db-a38b-825b41bfe919
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9 7be981a9-679b-432a-8a0f-4a0267170c68
# ccbc2244-667b-4bb5-a5cd-a1e9626a90f9 9954f321-8040-4cd7-96e6-248d03ee9266

Use the each type if you need to reference another table and need to generate a new row for every instance of the referenced column.

range

Generates data within a given range. Note that a number of factors determine how this generator will behave. The step (and hence, number of rows) will be generated in the following priority order:

If an each generator is being used, step will be derived from that
If a count is provided, step will be derived from that
Otherwise, step will be used

Here's an example that generates monotonically increasing ids for a table, starting from 1:

- name: users
  count: 10000
  columns:
    - name: id
      type: range
      processor:
        type: int
        from: 1
        step: 1

Here's an example that generates all dates between 2020-01-01 and 2023-01-01 at daily intervals:

- name: event
  columns:
    - name: date
      type: range
      processor:
        type: date
        from: 2020-01-01
        to: 2023-01-01
        step: 24h
        format: 2006-01-02

Here's an example that generates 10 dates between 2020-01-01 and 2023-01-02:

- name: event
  count: 10
  columns:
    - name: date
      type: range
      processor:
        type: date
        from: 2020-01-01
        to: 2023-01-01
        format: 2006-01-02
        step: 24h # Ignored due to table count.

Here's an example that generates 20 dates (one for every row found from an each generator) between 2020-01-01 and 2023-01-02:

- name: person
  count: 20
  columns:
    - name: id
      type: gen
      processor:
        value: ${uuid}

- name: event
  count: 10 # Ignored due to resulting count from "each" generator.
  columns:
    - name: person_id
      type: each
      processor:
        table: person
        column: id

    - name: date
      type: range
      processor:
        type: date
        from: 2020-01-01
        to: 2023-01-01
        format: 2006-01-02

The range generate currently supports the following data types:

date - Generate dates between a from and to value
int - Generate integers between a from and to value

match

Generates data by matching data in another table. In this example, we'll assume there's a CSV file for the significant_event input that generates the following table:

date	event
2023-01-10	abc
2023-01-11
2023-01-12	def

inputs:
  - name: significant_event
    type: csv
    source:
      file_name: significant_dates.csv

tables:
  - name: events
    columns:
      - name: timeline_date
        type: range
        processor:
          type: date
          from: 2023-01-09
          to: 2023-01-13
          format: 2006-01-02
          step: 24h
      - name: timeline_event
        type: match
        processor:
          source_table: significant_event
          source_column: date
          source_value: events
          match_column: timeline_date

dg will match rows in the significant_event table with rows in the events table based on the match between significant_event.date and events.timeline_date, and take the value from the significant_events.event column where there's a match (otherwise leaving NULL). This will result in the following events table being generated:

timeline_date	timeline_event
2023-01-09
2023-01-10	abc
2023-01-11
2023-01-12	def
2023-01-13

Inputs

dg takes its configuration from a config file that is parsed in the form of an object containing arrays of objects; tables and inputs. Each object in the inputs array represents a data source from which a table can be created. Tables created via inputs will not result in output CSVs.

csv

Reads in a CSV file as a table that can be referenced from other tables. Here's an example:

- name: significant_event
  type: csv
  source:
    file_name: significant_dates.csv

This configuration will read from a file called significant_dates.csv and create a table from its contents. Note that the file_name should be relative to the config directory, so if your CSV file is in the same directory as your config file, just include the file name.

Functions

Name	Type	Example
${ach_account}	string	586981797546
${ach_routing}	string	441478502
${adjective_demonstrative}	string	there
${adjective_descriptive}	string	eager
${adjective_indefinite}	string	several
${adjective_interrogative}	string	whose
${adjective_possessive}	string	her
${adjective_proper}	string	Iraqi
${adjective_quantitative}	string	sufficient
${adjective}	string	double
${adverb_degree}	string	far
${adverb_frequency_definite}	string	daily
${adverb_frequency_indefinite}	string	always
${adverb_manner}	string	unexpectedly
${adverb_place}	string	here
${adverb_time_definite}	string	yesterday
${adverb_time_indefinite}	string	just
${adverb}	string	far
${animal_type}	string	mammals
${animal}	string	ape
${app_author}	string	RedLaser
${app_name}	string	SlateBlueweek
${app_version}	string	3.2.10
${bitcoin_address}	string	16YmZ5ol5aXKjilZT2c2nIeHpbq
${bitcoin_private_key}	string	5JzwyfrpHRoiA59Y1Pd9yLq52cQrAXxSNK4QrGrRUxkak5Howhe
${bool}	bool	true
${breakfast}	string	Awesome orange chocolate muffins
${bs}	string	leading-edge
${car_fuel_type}	string	LPG
${car_maker}	string	Seat
${car_model}	string	Camry Solara Convertible
${car_transmission_type}	string	Manual
${car_type}	string	Passenger car mini
${chrome_user_agent}	string	Mozilla/5.0 (X11; Linux i686) AppleWebKit/5310 (KHTML, like Gecko) Chrome/37.0.882.0 Mobile Safari/5310
${city}	string	Memphis
${color}	string	DarkBlue
${company_suffix}	string	LLC
${company}	string	PlanetEcosystems
${connective_casual}	string	an effect of
${connective_complaint}	string	i.e.
${connective_examplify}	string	for example
${connective_listing}	string	next
${connective_time}	string	soon
${connective}	string	for instance
${country_abr}	string	VU
${country}	string	Eswatini
${credit_card_cvv}	string	315
${credit_card_exp}	string	06/28
${credit_card_type}	string	Mastercard
${currency_long}	string	Mozambique Metical
${currency_short}	string	SCR
${date}	time.Time	2005-01-25 22:17:55.371781952 +0000 UTC
${day}	int	27
${dessert}	string	Chocolate coconut dream bars
${dinner}	string	Creole potato salad
${domain_name}	string	centralb2c.net
${domain_suffix}	string	com
${email}	string	[email protected]
${emoji}	string	♻️
${file_extension}	string	csv
${file_mime_type}	string	image/vasa
${firefox_user_agent}	string	Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/1951-07-21 Firefox/37.0
${first_name}	string	Kailee
${flipacoin}	string	Tails
${float32}	float32	2.7906555e+38
${float64}	float64	4.314310154193861e+307
${fruit}	string	Eggplant
${gender}	string	female
${hexcolor}	string	#6daf06
${hobby}	string	Bowling
${hour}	int	18
${http_method}	string	DELETE
${http_status_code_simple}	int	404
${http_status_code}	int	503
${http_version}	string	HTTP/1.1
${int16}	int16	18940
${int32}	int32	2129368442
${int64}	int64	5051946056392951363
${int8}	int8	110
${ipv4_address}	string	191.131.155.85
${ipv6_address}	string	1642:94b:52d8:3a4e:38bc:4d87:846e:9c83
${job_descriptor}	string	Senior
${job_level}	string	Identity
${job_title}	string	Executive
${language_abbreviation}	string	kn
${language}	string	Bengali
${last_name}	string	Friesen
${latitude}	float64	45.919913
${longitude}	float64	-110.313125
${lunch}	string	Sweet and sour pork balls
${mac_address}	string	bd:e8:ce:66:da:5b
${minute}	int	23
${month_string}	string	April
${month}	int	10
${name_prefix}	string	Ms.
${name_suffix}	string	I
${name}	string	Paxton Schumm
${nanosecond}	int	349669923
${nicecolors}	[]string	[#490a3d #bd1550 #e97f02 #f8ca00 #8a9b0f]
${noun_abstract}	string	timing
${noun_collective_animal}	string	brace
${noun_collective_people}	string	mob
${noun_collective_thing}	string	orchard
${noun_common}	string	problem
${noun_concrete}	string	town
${noun_countable}	string	cat
${noun_uncountable}	string	wisdom
${noun}	string	case
${opera_user_agent}	string	Opera/10.10 (Windows NT 5.01; en-US) Presto/2.11.165 Version/13.00
${password}	string	1k0vWN 9Z
${pet_name}	string	Bernadette
${phone_formatted}	string	(476)455-2253
${phone}	string	2692528685
${phrase}	string	I'm straight
${preposition_compound}	string	ahead of
${preposition_double}	string	next to
${preposition_simple}	string	at
${preposition}	string	outside of
${programming_language}	string	PL/SQL
${pronoun_demonstrative}	string	those
${pronoun_interrogative}	string	whom
${pronoun_object}	string	us
${pronoun_personal}	string	I
${pronoun_possessive}	string	mine
${pronoun_reflective}	string	yourself
${pronoun_relative}	string	whom
${pronoun}	string	those
${quote}	string	"Raw denim tilde cronut mlkshk photo booth kickstarter." - Gunnar Rice
${rgbcolor}	[]int	[152 74 172]
${safari_user_agent}	string	Mozilla/5.0 (Windows; U; Windows 95) AppleWebKit/536.41.5 (KHTML, like Gecko) Version/5.2 Safari/536.41.5
${safecolor}	string	gray
${second}	int	58
${snack}	string	Crispy fried chicken spring rolls
${ssn}	string	783135577
${state_abr}	string	AL
${state}	string	Kentucky
${street_name}	string	Way
${street_number}	string	6234
${street_prefix}	string	Port
${street_suffix}	string	stad
${street}	string	11083 Lake Fall mouth
${time_zone_abv}	string	ADT
${time_zone_full}	string	(UTC-02:00) Coordinated Universal Time-02
${time_zone_offset}	float32	3
${time_zone_region}	string	Asia/Aqtau
${time_zone}	string	Mountain Standard Time (Mexico)
${uint128_hex}	string	0xcd50930d5bc0f2e8fa36205e3d7bd7b2
${uint16_hex}	string	0x7c80
${uint16}	uint16	25076
${uint256_hex}	string	0x61334b8c51fa841bf9a3f1f0ac3750cd1b51ca2046b0fb75627ac73001f0c5aa
${uint32_hex}	string	0xfe208664
${uint32}	uint32	783098878
${uint64_hex}	string	0xc8b91dc44e631956
${uint64}	uint64	5722659847801560283
${uint8_hex}	string	0x65
${uint8}	uint8	192
${url}	string	https://www.leadcutting-edge.net/productize
${user_agent}	string	Opera/10.64 (Windows NT 5.2; en-US) Presto/2.13.295 Version/10.00
${username}	string	Gutmann2845
${uuid}	string	e6e34ff4-1def-41e5-9afb-f697a51c0359
${vegetable}	string	Tomato
${verb_action}	string	knit
${verb_helping}	string	did
${verb_linking}	string	has
${verb}	string	be
${weekday}	string	Tuesday
${word}	string	month
${year}	int	1962
${zip}	string	45618

Building releases locally

$ VERSION=0.1.0 make release

Thanks

Thanks to the maintainers of the following fantastic packages, whose code this tools makes use of:

Todos

Improve code coverage
Write file after generating, then only keep columns that other tables need
Support for range without a table count (e.g. the following results in zero rows unless a count is provided)

- name: bet_types
  count: 3
  columns:
    - name: id
      type: range
      processor:
        type: int
        from: 1
        step: 1
    - name: description
      type: const
      processor:
        values: [Win, Lose, Draw]

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
.github/workflows		.github/workflows
assets		assets
examples		examples
internal/pkg		internal/pkg
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
dg.go		dg.go
go.mod		go.mod
go.sum		go.sum

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Table of Contents

Installation

Usage

Import via HTTP

Import via psql

Import via nodelocal

Tables

Processors

gen

const

set

inc

ref

each

range

match

Inputs

csv

Functions

Building releases locally

Thanks

Todos

About

Releases 27

Contributors 2

Languages

License

codingconcepts/dg

Folders and files

Latest commit

History

Repository files navigation

Table of Contents

Installation

Usage

Import via HTTP

Import via psql

Import via nodelocal

Tables

Processors

gen

const

set

inc

ref

each

range

match

Inputs

csv

Functions

Building releases locally

Thanks

Todos

About

Resources

License

Stars

Watchers

Forks

Releases 27

Contributors 2

Languages