Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scripts table should be created for each database/schema #665

Open
sunng87 opened this issue Nov 30, 2022 · 22 comments
Open

Scripts table should be created for each database/schema #665

sunng87 opened this issue Nov 30, 2022 · 22 comments
Labels
C-enhancement Category Enhancements

Comments

@sunng87
Copy link
Member

sunng87 commented Nov 30, 2022

What type of enhancement is this?

Refactor

What does the enhancement do?

The scripts table, where we stored python scripts, should be created for each database/schema. This can either be done when creating the database/schema, or lazily created on first script register request.

Also we need a special naming pattern for these built-in tables, for example, to be prefixed with _ or gt_.

Implementation challenges

No response

@sunng87 sunng87 added the C-enhancement Category Enhancements label Nov 30, 2022
@ShenJunkun
Copy link
Contributor

I want to try this task, please assign this task to me if you don't mind.

@MichaelScofield
Copy link
Collaborator

@ShenJunkun assigned

@evenyag
Copy link
Contributor

evenyag commented Jan 3, 2023

Also we need a special naming pattern for these built-in tables, for example, to be prefixed with _ or gt_

We could reserve a schema for all built-in tables, such as a schema named system or greptime.

@ShenJunkun
Copy link
Contributor

This is my plan:

  1. Reserve a schema named system in greptime catalog for all scripts table.
  2. The name of scripts table will be catalogName_schemaName_scripts.
  3. Currently, the codes in standalone mode will generate scripts table in public schema, I need to move it to my new schema.
  4. GreptimeDB has two modes,i.e. standalone mode and distributed mode, I need to consider both scenarios.

I also have some questions.

  1. When I launch greptimedb in distributed mode, I can's insert data into table, the log shows that the logic hasn't implement. So, I want to know whether there is something wrong with me.
  2. For this task, compared to standalone mode, need I do some special in distributed mode? May be I need to write some codes to update catalog and schema info in MetaSrv.

@evenyag
Copy link
Contributor

evenyag commented Jan 4, 2023

There are several approaches

Reserve one scripts table in the system schema

We could reserve a schema for all built-in tables, such as a schema named system or greptime.

Drawbacks

  • users need to access their script table via the fully qualified name in their SQL
  • scripts from different schemas are mixed together
select script from catalog.system.scripts where name='xxx';

select script from catalog.system.scripts where name='xxx' and schema='xxx';

But this could be improved by putting the scripts table or the system catalog in the search path like PostgreSQL

In addition to public and user-created schemas, each database contains a pg_catalog schema, which contains the system tables and all the built-in data types, functions, and operators. pg_catalog is always effectively part of the search path. If it is not named explicitly in the path then it is implicitly searched before searching the path's schemas. This ensures that built-in names will always be findable.

Reserve scripts table for each schema in the system schema

Reserve a schema named system in greptime catalog for all scripts table.
The name of scripts table will be catalogName_schemaName_scripts.

Drawbacks

  • We need to create lots of scripts table
  • The table name would be too long
select script from catalogName_schemaName_scripts where name = 'xxx';

Maybe we need to use a fully qualified name as we are in a different catalog/schema

select script from greptime.system.catalogName_schemaName_scripts where name = 'xxx';

Actually, we could reserve a system schema for each catalog, so the catalogName isn't necessary. We could also add an additional column to represent the schema of each script. Finally, this becomes the first approach.

Reserve scripts table for each schema

Also we need a special naming pattern for these built-in tables, for example, to be prefixed with _ or gt_.

We need to create this table for each schema, then we could access it by typing its name directly

select script from gt_scripts where name = 'xxx';

Drawback

We need to create the table for each schema, which also means that there are lots of scripts table

Conclusion

As for me, I'd prefer the first solution: putting all system tables, including the scripts table in a specific schema. Generally, we don't recommend users operate the system tables directly. We should provide SQL commands or API to do that

Normally, one should not change the system catalogs by hand, there are normally SQL commands to do that. (For example, CREATE DATABASE inserts a row into the pg_database catalog — and actually creates the database on disk.)

I'd be happy to hear your thoughts @sunng87 @killme2008 @ShenJunkun

@evenyag
Copy link
Contributor

evenyag commented Jan 4, 2023

  1. For this task, compared to standalone mode, need I do some special in distributed mode

I think we could implement this in standalone mode. I'm not sure whether scripts table is supported in distributed mode.

When I launch greptimedb in distributed mode, I can's insert data into table, the log shows that the logic hasn't implement. So, I want to know whether there is something wrong with me.

You might need to create a distributed table, as mentioned in this

@ShenJunkun
Copy link
Contributor

As for me, I'd prefer the first solution: putting all system tables, including the scripts table in a specific schema. Generally, we don't recommend users operate the system tables directly. We should provide SQL commands or API to do that

Sounds reasonable! By the way, Is there a difference between a schema and a database in GreptimeDB? In MySQL, a schema is synonymous with a database, but it's not in PostgreSQL!

@evenyag
Copy link
Contributor

evenyag commented Jan 4, 2023

You might need to create a distributed table, as mentioned in this

Oops, @MichaelScofield is refactoring our gPRC protocol, so I'm afraid that the distributed mode may not work until #381 is done

By the way, Is there a difference between a schema and a database in GreptimeDB

Catalog and schema are different in GreptimeDB. But there are still some works to be done if we'd like to support creating a catalog/schema via SQL manually.

@killme2008
Copy link
Contributor

I prefer the first solution too. Let's do it.

@sunng87
Copy link
Member Author

sunng87 commented Jan 4, 2023

By the way, Is there a difference between a schema and a database in GreptimeDB? In MySQL, a schema is synonymous with a database, but it's not in PostgreSQL!

Yes, this is the key issue I have been concerning about. In MySQL schema == database, while in Postgres and some other databases, catalog == database, schema is namespace of a few tables. In GreptimeDB, we follow MySQL convention to treat schema as database: by running create database you created a schema.

From my perspective, I want to keep database self-contained, to be the minimum isolation unit in shared environment, which means:

  • all data/configuration stored within a database, there is no dependency between databases
  • for operation, database can be moved between clusters without caring about any other things
  • we can limit qps or other finer access control unit to a single database without affecting others

So that's the reason I'm +1 for storing scripts as built-in table for each schema(database). But I agree this approach has its cons for example the table is visible and modifiable to user.

@ShenJunkun
Copy link
Contributor

So that's the reason I'm +1 for storing scripts as built-in table for each schema(database). But I agree this approach has its cons for example the table is visible and modifiable to user.

That's a trade-off! We can discuss about it. @killme2008 @sunng87 @evenyag

Oops, @MichaelScofield is refactoring our gPRC protocol, so I'm afraid that the distributed mode may not work until #381 is done

Never mind!

@evenyag
Copy link
Contributor

evenyag commented Jan 4, 2023

@sunng87 @ShenJunkun After taking a brief look at the codes, I find out that adding a schema column to the scripts table is much easier to implement. The ScriptsTable registers itself to the catalog manager and finds the table when it needs to access a script

catalog_manager
.register_system_table(RegisterSystemTableRequest {
create_table_request: request,
open_hook: None,
})

To support schema isolation, we just need to specify the schema column in insert() and find_script_by_name()

pub async fn insert(&self, name: &str, script: &str) -> Result<()> {

pub async fn find_script_by_name(&self, name: &str) -> Result<String> {
// FIXME(dennis): SQL injection
// TODO(dennis): we use sql to find the script, the better way is use a function
// such as `find_record_by_primary_key` in table_engine.
let sql = format!("select script from {} where name='{}'", self.name(), name);

But the catalog manager itself doesn't support creating the system tables for each new schema now.

I think we can fix this issue by simply adding a schema column to the existing scripts table. But the script table is still using the default catalog and default schema.

let request = CreateTableRequest {
id: SCRIPTS_TABLE_ID,
catalog_name: DEFAULT_CATALOG_NAME.to_string(),
schema_name: DEFAULT_SCHEMA_NAME.to_string(),
table_name: SCRIPTS_TABLE_NAME.to_string(),

Maybe we need to move the scripts table to the system catalog like this system table

let request = CreateTableRequest {
id: SYSTEM_CATALOG_TABLE_ID,
catalog_name: SYSTEM_CATALOG_NAME.to_string(),
schema_name: INFORMATION_SCHEMA_NAME.to_string(),
table_name: SYSTEM_CATALOG_TABLE_NAME.to_string(),

For distributed mode, we need some other ways to implement system tables, which is out of the topic.

So that's the reason I'm +1 for storing scripts as built-in table for each schema(database). But I agree this approach has its cons for example the table is visible and modifiable to user.

Actually, the script operation is only available under the HTTP API, so it's meaningless for users to access the table directly. So it should be fine that we only have one script table now.

@sunng87
Copy link
Member Author

sunng87 commented Jan 4, 2023

Agreed. Let's use system schema to store scripts and use a schema column to specify execution context. We can consider special handling for system schema in cluster setup to resolve isolation issue.

@ShenJunkun
Copy link
Contributor

Ok, let's do it!

@ShenJunkun
Copy link
Contributor

Another question:
How can I get data from system catalog?

When I try to use sql to access system catalog like this, select * from system.information_schema.system_catalog;

the log show that:

2023-01-05T06:59:59.491339Z ERROR common_telemetry::panic_hook: panicked at 'System catalog table does not support scan!', src/catalog/src/system.rs:68:9 backtrace=   0: common_telemetry::panic_hook::set_panic_hook::{{closure}}
             at src/common/telemetry/src/panic_hook.rs:30:25
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/alloc/src/boxed.rs:2032:9
      std::panicking::rust_panic_with_hook
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/std/src/panicking.rs:692:13
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/std/src/panicking.rs:577:13
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/std/src/sys_common/backtrace.rs:137:18
   4: rust_begin_unwind
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/std/src/panicking.rs:575:5
   5: core::panicking::panic_fmt
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/core/src/panicking.rs:64:14
   6: <catalog::system::SystemCatalogTable as table::table::Table>::scan::{{closure}}
             at src/catalog/src/system.rs:68:9
   7: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/core/src/future/future.rs:124:9
   8: <table::table::adapter::DfTableProviderAdapter as datafusion::datasource::datasource::TableProvider>::scan::{{closure}}
             at src/table/src/table/adapter.rs:77:65
   9: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/935dc07218b4bf6e20231e44eb9263b612fd649b/library/core/src/future/future.rs:124:9
  10: datafusion::physical_plan::planner::DefaultPhysicalPlanner::create_initial_plan::{{closure}}
             at /home/jkshen/.cargo/git/checkouts/arrow-datafusion-71ae82d9dec9a01c/4917235/datafusion/core/src/physical_plan/planner.rs:488:88

@evenyag
Copy link
Contributor

evenyag commented Jan 5, 2023

What data do you want to get from this table? Now we don't support reading the contents in this table via the SQL interface, so you might need to implement scan() for it if you'd like to select it. I think we should create an issue for this.

I think we should hide system_catalog or some other system tables until they support scan @v0y4g3r .

@ShenJunkun
Copy link
Contributor

What data do you want to get from this table? Now we don't support reading the contents in this table via the SQL interface, so you might need to implement scan()

If I move scripts table to system catalog like this, Should I implement the logic for user to read scripts table via the SQL interface?

Maybe we need to move the scripts table to the system catalog like this system table

@sunng87
Copy link
Member Author

sunng87 commented Jan 8, 2023

It's recommended. We need a way to list stored scripts for a database. It can either be done via a Http REST API, or a dedicated SQL statement like SHOW SCRIPTS

@evenyag
Copy link
Contributor

evenyag commented Jan 9, 2023

We need a way to list stored scripts for a database. It can either be done via a Http REST API, or a dedicated SQL statement like SHOW SCRIPTS

I think we could create another issue for this feature and do it later

@sunng87
Copy link
Member Author

sunng87 commented Feb 8, 2023

There are two tasks to be finished:

  • move scripts table to system scehama
  • provide show script api or statement

@apdong2022
Copy link
Contributor

@ShenJunkun Hi Junkun, this is April from Greptime Devrel team, Thanks a lot for your contribution in our community!! Do you mind joining our Slack https://greptime.com/slack or leaving your email address? so we can contact you and send you important notices. Thanks a lot!

@ShenJunkun
Copy link
Contributor

@ShenJunkun Hi Junkun, this is April from Greptime Devrel team, Thanks a lot for your contribution in our community!! Do you mind joining our Slack https://greptime.com/slack or leaving your email address? so we can contact you and send you important notices. Thanks a lot!

I have already joined GreptimeDB slack. My email address is [email protected].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category Enhancements
Projects
None yet
Development

No branches or pull requests

6 participants