feat(docs): refactor source and sink docs #3031

Merged
merged 40 commits into from
Aug 8, 2021
Commits
0b2f343
Begin reorg
kevinhu Jul 27, 2021
0916b75
Add links
kevinhu Jul 27, 2021
2bb1d79
Fix link
kevinhu Jul 27, 2021
487a2b6
Fix glue link
kevinhu Jul 27, 2021
a24dc59
Add module installs to each page
kevinhu Jul 27, 2021
5c6a19a
Consistency
kevinhu Jul 27, 2021
2382c30
Standardize sqlalchemy pattern
kevinhu Jul 27, 2021
34fbccf
Add missing sql options
kevinhu Jul 27, 2021
9808735
More consistent recipes
kevinhu Jul 27, 2021
9af3cab
Finish consistency checks for recipes
kevinhu Jul 27, 2021
9dc365f
As above
kevinhu Jul 28, 2021
9afa393
Typo fixes
kevinhu Jul 28, 2021
c6388cb
More typo fixes
kevinhu Jul 28, 2021
8588cb9
More consistency fixes
kevinhu Jul 28, 2021
63691dd
Fix broken links
kevinhu Jul 28, 2021
f186b49
Merge branch 'master' of github.com:kevinhu/datahub into reorganize-docs
kevinhu Jul 28, 2021
410b9b8
Merge
kevinhu Aug 2, 2021
59623e4
Merge
kevinhu Aug 2, 2021
eef2a62
Note on allow/deny
kevinhu Aug 2, 2021
bee872f
Add questions section
kevinhu Aug 2, 2021
124c0a3
Merge branch 'master' of github.com:kevinhu/datahub into reorganize-docs
kevinhu Aug 2, 2021
6ffd8a1
Fix inconsistencies
kevinhu Aug 3, 2021
ba3cb36
Merge branch 'master' of github.com:kevinhu/datahub into reorganize-docs
kevinhu Aug 3, 2021
8a4de6d
Begin separation of quickstart and config details
kevinhu Aug 3, 2021
8bf27a5
Write generic sqlalchemy options
kevinhu Aug 3, 2021
3dbb736
Up to looker
kevinhu Aug 3, 2021
186235f
Add all config vars
kevinhu Aug 4, 2021
35ecc45
Add source config docs
kevinhu Aug 4, 2021
73a42fd
Clean up quickstart configs
kevinhu Aug 4, 2021
b1bf7e7
Update usage docs
kevinhu Aug 4, 2021
5933f1f
Formatting
kevinhu Aug 4, 2021
bbbe612
Revise capabilities
kevinhu Aug 4, 2021
30f9e6f
Merge branch 'master' of github.com:kevinhu/datahub into reorganize-docs
kevinhu Aug 4, 2021
9cf1acb
Merge
kevinhu Aug 6, 2021
aa608b6
PR fixes
kevinhu Aug 6, 2021
f429324
Add link back to main readme
kevinhu Aug 6, 2021
5fbac7b
Add link back to recipe section
kevinhu Aug 6, 2021
387137f
Add sink config placeholder
kevinhu Aug 6, 2021
34d6c57
Categories
kevinhu Aug 6, 2021
625baa0
Remove sink compatibility
kevinhu Aug 6, 2021
7 changes: 7 additions & 0 deletions docs-website/generateDocsDir.ts
@@ -159,6 +159,13 @@ function markdown_guess_title(
  } else {
    // Find first h1 header and use it as the title.
    const headers = contents.content.match(/^# (.+)$/gm);

    if (!headers) {
      throw new Error(
        `${filepath} must have at least one h1 header for setting the title`
      );
    }

    if (headers.length > 1 && contents.content.indexOf("```") < 0) {
      throw new Error(`too many h1 headers in ${filepath}`);
    }
8 changes: 8 additions & 0 deletions docs-website/sidebars.js
@@ -55,6 +55,14 @@ module.exports = {
      "docs/architecture/metadata-serving",
      //"docs/what/gms",
    ],
    "Metadata Ingestion": [
      {
        Sources: list_ids_in_directory("metadata-ingestion/source_docs"),
      },
      {
        Sinks: list_ids_in_directory("metadata-ingestion/sink_docs"),
      },
    ],
    "Metadata Modeling": [
      "docs/modeling/metadata-model",
      "docs/modeling/extending-the-metadata-model",
2 changes: 1 addition & 1 deletion docs/features.md
@@ -40,7 +40,7 @@ Our open sourcing [blog post](https://engineering.linkedin.com/blog/2020/open-so
- **Schema history**: view and diff historic versions of schemas
- **GraphQL**: visualization of GraphQL schemas

### Jos/flows [*coming soon*]
### Jobs/flows [*coming soon*]
- **Search**: full-text & advanced search, search ranking
- **Browse**: browsing through a configurable hierarchy
- **Basic information**:
952 changes: 48 additions & 904 deletions metadata-ingestion/README.md

Large diffs are not rendered by default.

1 change: 0 additions & 1 deletion metadata-ingestion/examples/recipes/mongodb_to_datahub.yml
@@ -13,7 +13,6 @@ source:
    collection_pattern: {}
    enableSchemaInference: True
    schemaSamplingSize: 1000
    # database_pattern/collection_pattern are similar to schema_pattern/table_pattern from above
sink:
  type: "datahub-rest"
  config:
33 changes: 33 additions & 0 deletions metadata-ingestion/sink_docs/console.md
@@ -0,0 +1,33 @@
# Console

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

Works with `acryl-datahub` out of the box.

## Capabilities

Simply prints each metadata event to stdout. Useful for experimentation and debugging purposes.

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  # source configs

sink:
  type: "console"
```

## Config details

None!

## Questions

If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
87 changes: 87 additions & 0 deletions metadata-ingestion/sink_docs/datahub.md
@@ -0,0 +1,87 @@
# DataHub

## DataHub Rest

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

### Setup

To install this plugin, run `pip install 'acryl-datahub[datahub-rest]'`.

### Capabilities

Pushes metadata to DataHub using the GMA REST API. The advantage of the REST-based interface
is that any errors can immediately be reported.

### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  # source configs

sink:
  type: "datahub-rest"
  config:
    server: "https://localhost:8080"
```

### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field    | Required | Default | Description                  |
| -------- | -------- | ------- | ---------------------------- |
| `server` | ✅       |         | URL of DataHub GMS endpoint. |

Review comment (Collaborator): This is amazing.

## DataHub Kafka

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

### Setup

To install this plugin, run `pip install 'acryl-datahub[datahub-kafka]'`.

### Capabilities

Pushes metadata to DataHub by publishing messages to Kafka. The advantage of the Kafka-based
interface is that it's asynchronous and can handle higher throughput.

### Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  # source configs

sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      schema_registry_url: "https://localhost:8081"
```

### Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Default | Description |
| -------------------------------------------- | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `connection.bootstrap` | ✅ | | Kafka bootstrap URL. |
| `connection.producer_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.SerializingProducer |
| `connection.schema_registry_url` | ✅ | | URL of schema registry being used. |
| `connection.schema_registry_config.<option>` | | | Passed to https://docs.confluent.io/platform/current/clients/confluent-kafka-python/html/index.html#confluent_kafka.schema_registry.SchemaRegistryClient |

The options in the producer config and schema registry config are passed to the Kafka SerializingProducer and SchemaRegistryClient respectively.

For a full example with a number of security options, see this [example recipe](../examples/recipes/secured_kafka.yml).
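
As a hedged sketch of how these pass-through options might look in a recipe: the keys under `producer_config` and `schema_registry_config` are handed to the underlying clients, so they are written as literal dotted option names (for example `security.protocol`) rather than as nested YAML fields. The certificate path below is a hypothetical placeholder, and which options you actually need depends on your cluster.

```yml
sink:
  type: "datahub-kafka"
  config:
    connection:
      bootstrap: "localhost:9092"
      schema_registry_url: "https://localhost:8081"
      producer_config:
        # Passed through to the Kafka producer as-is.
        security.protocol: "ssl"
        ssl.ca.location: "/path/to/ca-certificate.pem" # hypothetical path
      schema_registry_config:
        # Passed through to the schema registry client as-is.
        ssl.ca.location: "/path/to/ca-certificate.pem" # hypothetical path
```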

## Questions

If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
41 changes: 41 additions & 0 deletions metadata-ingestion/sink_docs/file.md
@@ -0,0 +1,41 @@
# File

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

Works with `acryl-datahub` out of the box.

## Capabilities

Outputs metadata to a file. This can be used to decouple metadata sourcing from the
process of pushing it into DataHub, and is particularly useful for debugging purposes.
Note that the [file source](../source_docs/file.md) can read files generated by this sink.

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  # source configs

sink:
  type: file
  config:
    filename: ./path/to/mce/file.json
```
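
Since the file source can read what this sink writes, a second recipe could later replay the file into DataHub. This is only a minimal sketch: it assumes the file source accepts the same `filename` field as this sink, and it reuses the `datahub-rest` sink settings from the DataHub sink docs. Adjust both to your setup.

```yml
source:
  type: file
  config:
    # Assumed to mirror the sink's `filename` option.
    filename: ./path/to/mce/file.json

sink:
  type: "datahub-rest"
  config:
    # URL of your DataHub GMS endpoint.
    server: "https://localhost:8080"
```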

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field      | Required | Default | Description               |
| ---------- | -------- | ------- | ------------------------- |
| `filename` | ✅       |         | Path to file to write to. |

## Questions

If you've got any questions on configuring this sink, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!
70 changes: 70 additions & 0 deletions metadata-ingestion/source_docs/athena.md
@@ -0,0 +1,70 @@
# Athena

For context on getting started with ingestion, check out our [metadata ingestion guide](../README.md).

## Setup

To install this plugin, run `pip install 'acryl-datahub[athena]'`.

## Capabilities

This plugin extracts the following:

- Metadata for databases, schemas, and tables
- Column types associated with each table

## Quickstart recipe

Check out the following recipe to get started with ingestion! See [below](#config-details) for full configuration options.

For general pointers on writing and running a recipe, see our [main recipe guide](../README.md#recipes).

```yml
source:
  type: athena
  config:
    # Coordinates
    aws_region: my_aws_region_name
    work_group: my_work_group

    # Credentials
    username: my_aws_access_key_id
    password: my_aws_secret_access_key
    database: my_database

    # Options
    s3_staging_dir: "s3://<bucket-name>/<folder>/"

sink:
  # sink configs
```

## Config details

Note that a `.` is used to denote nested fields in the YAML recipe.

| Field | Required | Default | Description |
| ---------------------- | -------- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `username` | | Autodetected | Username credential. If not specified, detected with boto3 rules. See https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html |
| `password` | | Autodetected | Same detection scheme as `username` |
| `database` | | Autodetected | |
| `aws_region` | ✅ | | AWS region code. |
| `s3_staging_dir` | ✅ | | Of format `"s3://<bucket-name>/prefix/"`. The `s3_staging_dir` parameter is needed because Athena always writes query results to S3. <br />See https://docs.aws.amazon.com/athena/latest/ug/querying.html. |
| `work_group` | ✅ | | Name of Athena workgroup. <br />See https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html. |
| `env` | | `"PROD"` | Environment to use in namespace when constructing URNs. |
| `options.<option>` | | | Any options specified here will be passed to SQLAlchemy's `create_engine` as kwargs.<br />See https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine for details. |
| `table_pattern.allow` | | | Regex pattern for tables to include in ingestion. |
| `table_pattern.deny` | | | Regex pattern for tables to exclude from ingestion. |
| `schema_pattern.allow` | | | Regex pattern for schemas to include in ingestion. |
| `schema_pattern.deny` | | | Regex pattern for schemas to exclude from ingestion. |
| `view_pattern.allow` | | | Regex pattern for views to include in ingestion. |
| `view_pattern.deny` | | | Regex pattern for views to exclude from ingestion. |
| `include_tables` | | `True` | Whether tables should be ingested. |
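
As a rough illustration of the dotted field names above, `table_pattern.deny` is written as a `deny` key nested under `table_pattern`. The sketch below assumes each `allow`/`deny` entry takes a list of regex strings; the schema and table names are hypothetical placeholders.

```yml
source:
  type: athena
  config:
    # ...coordinates, credentials, and s3_staging_dir as in the quickstart recipe...

    schema_pattern:
      allow:
        - "^analytics$" # hypothetical schema name
    table_pattern:
      deny:
        - ".*_tmp$" # hypothetical naming convention for scratch tables

sink:
  # sink configs
```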

## Compatibility

Coming soon!

## Questions

If you've got any questions on configuring this source, feel free to ping us on [our Slack](https://slack.datahubproject.io/)!