[Feature] [VDP] [Pipeline] Data movement tools #1023

Open · 1 task done
chuang8511 opened this issue Jun 20, 2024 · 1 comment
Labels
feature (New feature or request), need-triage (Need to be investigated further)

Is There an Existing Issue for This?

  • I have searched the existing issues

Where do you intend to apply this feature?

Instill Core, Instill Cloud

Is your Proposal Related to a Problem?

Background

When a company has multiple data sources, its data engineers need to migrate data from one source to another.

Because the data is scattered across applications such as Gmail, Slack, and others, it is time-consuming for a company to write separate tools to collect it from each of them.

Describe Your Proposed Solution

User stories

Story 1

  • As a data engineer, I want to transform raw data into analysable data and migrate it to another data source

Possible pipelines
[diagram: possible pipelines]

Concrete examples
[diagram: concrete examples]

e.g. raw transaction data is not analysable, but weekly transaction amount & transaction count are (sketched below).
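For illustration only, here is a minimal Go sketch of that aggregation step, assuming made-up `Transaction` and `WeeklySummary` shapes. This is not Instill component code, just the transformation a pipeline would perform between the source and the destination.

```go
package main

import (
	"fmt"
	"time"
)

// Transaction is a made-up raw record pulled from the source database.
type Transaction struct {
	Amount    float64
	Timestamp time.Time
}

// WeeklySummary is the analysable shape loaded into the destination.
type WeeklySummary struct {
	Year, Week int
	Amount     float64
	Count      int
}

// summarize groups raw transactions by ISO week and computes the
// weekly transaction amount and transaction count from the example.
func summarize(txs []Transaction) []WeeklySummary {
	byWeek := map[[2]int]*WeeklySummary{}
	for _, tx := range txs {
		y, w := tx.Timestamp.ISOWeek()
		key := [2]int{y, w}
		if byWeek[key] == nil {
			byWeek[key] = &WeeklySummary{Year: y, Week: w}
		}
		byWeek[key].Amount += tx.Amount
		byWeek[key].Count++
	}
	out := make([]WeeklySummary, 0, len(byWeek))
	for _, s := range byWeek {
		out = append(out, *s)
	}
	return out
}

func main() {
	txs := []Transaction{
		{Amount: 10, Timestamp: time.Date(2024, 6, 17, 9, 0, 0, 0, time.UTC)},
		{Amount: 25, Timestamp: time.Date(2024, 6, 18, 9, 0, 0, 0, time.UTC)},
		{Amount: 40, Timestamp: time.Date(2024, 6, 24, 9, 0, 0, 0, time.UTC)},
	}
	for _, s := range summarize(txs) {
		fmt.Printf("%d-W%02d amount=%.2f count=%d\n", s.Year, s.Week, s.Amount, s.Count)
	}
}
```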

Story 2

As a data engineer, I want to transform unstructured data into analysable data and load it into another data source.

Possible pipelines
[diagram: possible pipelines]

Concrete example
[diagram: concrete example]
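Again as a sketch only, the extraction step of such a pipeline could look like the following, assuming a hypothetical support-email use case where an order ID is pulled out of free text before loading. The `TicketRow` shape and the regex are invented for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

// TicketRow is a made-up analysable record; a data component would
// load rows like this into the destination table.
type TicketRow struct {
	Customer string
	OrderID  string
}

// orderRe pulls an order ID such as "order #10423" out of free text.
var orderRe = regexp.MustCompile(`order\s+#(\d+)`)

// extract turns an unstructured message into a structured row,
// reporting false when no order ID can be found.
func extract(customer, body string) (TicketRow, bool) {
	m := orderRe.FindStringSubmatch(body)
	if m == nil {
		return TicketRow{}, false
	}
	return TicketRow{Customer: customer, OrderID: m[1]}, true
}

func main() {
	body := "Hi, I still have not received order #10423, can you check?"
	if row, ok := extract("alice@example.com", body); ok {
		fmt.Printf("%+v\n", row) // row would then be inserted into the destination
	}
}
```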

Highlight the Benefits

It solves a real-world problem: instead of writing one-off tools for every source, data engineers can compose reusable components to collect, transform, and move their data.

Anything Else?

Possible components

  • Note: the order below reflects priority.

Data components

RDBMS

  • AWS
    • RDS
  • GCP
    • Cloud SQL / BigQuery
  • Postgres
  • MySQL
  • MSSQL
  • Oracle DB

NoSQL

  • AWS
    • NoSQL (DynamoDB / MongoDB)
  • GCP
    • Datastore
  • MongoDB
  • Elasticsearch
  • Cassandra

Vector DB

  • Weaviate
  • Qdrant
  • Chroma
  • Zilliz
  • Milvus

Others

  • AWS
    • S3
  • GCP
    • Google Cloud Storage
  • AWS Datalake
  • Google Sheet

Application components

  • Discord / X / Slack / … components are expected to be built as part of other tools. However, you may need to build a specific TASK for an application component depending on your usage (a rough sketch of what such a task's input/output could look like follows this list).
  • Please notify us in Slack if you have further concrete ideas for specific application components you want to build. We can discuss them in detail.
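To make the component discussion concrete, below is a purely hypothetical sketch of the input/output a destination data component's insert task could carry. This is not the actual pipeline-backend component interface; every name here (`InsertInput`, `InsertOutput`, `DataDestination`) is an assumption for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// InsertInput is an assumed task input: a destination table plus rows
// produced by an upstream transform or application component.
type InsertInput struct {
	Table string
	Rows  []map[string]any
}

// InsertOutput is an assumed task output.
type InsertOutput struct {
	InsertedRows int
}

// DataDestination is an assumed shape for a destination data component.
type DataDestination interface {
	Insert(ctx context.Context, in InsertInput) (InsertOutput, error)
}

// fakeDestination only counts rows; a real component would write them
// to the configured database instead.
type fakeDestination struct{}

func (fakeDestination) Insert(_ context.Context, in InsertInput) (InsertOutput, error) {
	return InsertOutput{InsertedRows: len(in.Rows)}, nil
}

func main() {
	var dst DataDestination = fakeDestination{}
	out, err := dst.Insert(context.Background(), InsertInput{
		Table: "weekly_transactions",
		Rows: []map[string]any{
			{"year": 2024, "week": 25, "amount": 35.0, "count": 2},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("inserted rows:", out.InsertedRows)
}
```

In a real component, `Insert` would open a connection to the configured destination (Postgres, BigQuery, MongoDB, ...) and write the rows produced by the upstream transform.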

Reference tools

  • Airbyte
    • Data source -> Data destination

Milestones

  1. Read the current pipelines.
  2. Design the pipelines according to the user stories.
  • Please draw the concrete pipelines first and ask us to review them before delving into development.
  • Timeline: 5 working days
  3. Check which components are missing according to the designed pipelines.
  • Please create a skeleton PR first for the incoming components.
  • Timeline: 2~3 working days
  4. Connect those components.
  • Timeline: 10 working days
  5. Build the designed pipelines after connecting those components.
  • Timeline: 1 working day

Note

  • Regarding the timeline, let's adjust it dynamically if issues turn out to be much more complicated than we expect.
  • Milestones 2~5 form a cycle. Let's finish one complete user story first and then iterate on it.
linear bot commented Jun 20, 2024
