Releases: StarRocks/starrocks
[Candidate] 3.2.0-rc01
Release date: November 15, 2023
New Features
Shared-data cluster
- Supports the persistent index for Primary Key tables on local disks.
- Supports the even distribution of Data Cache among multiple local disks.
Data Lake Analytics
- Supports creating and dropping databases and managed tables in Hive catalogs, and supports exporting data to Hive's managed tables using INSERT or INSERT OVERWRITE.
- Supports Unified Catalog, with which users can access different table formats (Hive, Iceberg, Hudi, and Delta Lake) that share a common metastore like Hive metastore or AWS Glue.
Storage engine, data ingestion, and export
- Added the following features of loading with the table function FILES():
- Loading Parquet and ORC format data from Azure or GCP.
- Extracting the value of a key/value pair from the file path as the value of a column using the parameter columns_from_path.
- Loading complex data types including ARRAY, JSON, MAP, and STRUCT.
- Supports the dict_mapping column property, which can significantly facilitate the loading process during the construction of a global dictionary, accelerating the exact COUNT DISTINCT calculation.
- Supports unloading data from StarRocks to Parquet-formatted files stored in AWS S3 or HDFS by using INSERT INTO FILES. For detailed instructions, see Unload data using INSERT INTO FILES.
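As a rough illustration of what extracting a key/value pair from a file path means, here is a minimal Python sketch. It is not StarRocks code: the helper name extract_columns_from_path and the example path are hypothetical, and it assumes Hive-style key=value directory segments in the path.

```python
def extract_columns_from_path(path, columns):
    """Return {column: value} for each requested column found as a key=value path segment."""
    found = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key in columns:
            found[key] = value
    # Every requested column must appear in the path; otherwise report what is missing.
    missing = [c for c in columns if c not in found]
    if missing:
        raise ValueError(f"path lacks key/value segments for: {missing}")
    return found


# Hypothetical path: the values of rec_dt and region come from the directory names.
row_extras = extract_columns_from_path(
    "s3://bucket/sales/rec_dt=2023-11-15/region=eu/part-0.parquet",
    ["rec_dt", "region"],
)
print(row_extras)  # {'rec_dt': '2023-11-15', 'region': 'eu'}
```

In this model, every row read from the file would carry the extracted values as additional columns.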
SQL reference
Added the following functions:
- String functions: substring_index, url_extract_parameter, url_encode, url_decode, and translate
- Date functions: dayofweek_iso, week_iso, quarters_add, quarters_sub, milliseconds_add, milliseconds_sub, date_diff, jodatime_format, str_to_jodatime, to_iso8601, to_tera_date, and to_tera_timestamp
- Pattern matching function: regexp_extract_all
- Hash function: xx_hash3_64
- Aggregate function: approx_top_k
- Window functions: cume_dist, percent_rank, and session_number
- Utility functions: dict_mapping and get_query_profile
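For readers unfamiliar with the two ranking window functions listed above, the following Python sketch computes percent_rank and cume_dist for a single window partition using the standard SQL definitions. It is only an illustration of the arithmetic, not StarRocks code; the helper name is hypothetical.

```python
def percent_rank_and_cume_dist(values):
    """For one window partition ordered ascending, return (value, percent_rank, cume_dist) rows."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for v in ordered:
        # RANK(): tied rows share the lowest position among their peers.
        rank = 1 + sum(1 for x in ordered if x < v)
        # Rows at or before v in the ordering, peers included.
        at_or_before = sum(1 for x in ordered if x <= v)
        percent_rank = 0.0 if n == 1 else (rank - 1) / (n - 1)
        cume_dist = at_or_before / n
        result.append((v, percent_rank, cume_dist))
    return result


for row in percent_rank_and_cume_dist([10, 20, 20, 40]):
    print(row)
```

Note that tied values receive the same percent_rank and the same cume_dist, and that cume_dist is never 0 while percent_rank starts at 0.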
Privileges and security
StarRocks supports access control through Apache Ranger, providing a higher level of data security and allowing the reuse of existing Ranger Service of external data sources. After integrating with Apache Ranger, StarRocks enables the following access control methods:
- When accessing internal tables, external tables, or other objects in StarRocks, access control can be enforced based on the access policies configured for the StarRocks Service in Ranger.
- When accessing an external catalog, access control can also leverage the corresponding Ranger service of the original data source (such as Hive Service) to control access (currently, access control for exporting data to Hive is not yet supported).
For more information, see Manage permissions with Apache Ranger.
Improvements
Materialized View
Asynchronous materialized view
- Creation: Supports automatic refresh for an asynchronous materialized view created upon views or materialized views when schema changes occur on the views, materialized views, or their base tables.
- Observability: Supports Query Dump for asynchronous materialized views.
- The Spill to Disk feature is enabled by default for the refresh tasks of asynchronous materialized views, reducing memory consumption.
- Data consistency:
- Added the property query_rewrite_consistency for asynchronous materialized view creation. This property defines the query rewrite rules based on the consistency check.
- Added the property force_external_table_query_rewrite for external catalog-based asynchronous materialized view creation. This property defines whether to allow force query rewrite for asynchronous materialized views created upon external catalogs.
For detailed information, see CREATE MATERIALIZED VIEW.
- Added a consistency check for materialized views' partitioning key. When users create an asynchronous materialized view with window functions that include a PARTITION BY expression, the partitioning column of the window function must match that of the materialized view.
Storage engine, data ingestion, and export
- Optimized the persistent index for Primary Key tables by improving memory usage logic while reducing I/O read and write amplification. #24875 #27577 #28769
- Supports data re-distribution across local disks for Primary Key tables.
- Partitioned tables support automatic cooldown based on the partition time range and cooldown time. For detailed information, see Set initial storage medium and automatic storage cooldown time.
- The Publish phase of a load job that writes data into a Primary Key table is changed from asynchronous mode to synchronous mode. As such, the data loaded can be queried immediately after the load job finishes. For detailed information, see enable_sync_publish.
Query
- Optimized StarRocks' compatibility with Metabase and Superset. Supports integrating them with external catalogs.
SQL Reference
- array_agg supports the keyword DISTINCT.
Developer tools
- Supports Trace Query Profile for asynchronous materialized views, which can be used to analyze their transparent rewrite.
Compatibility Changes
Parameters
- Added new parameters for Data Cache.
Bug Fixes
Fixed the following issues:
- BEs crash when libcurl is invoked. #31667
- Schema Change may fail if it takes an excessive period of time, because the specified tablet version is handled by garbage collection. #31376
- Failed to access the Parquet files in MinIO or AWS S3 via file external tables. #29873
- The ARRAY, MAP, and STRUCT type columns are not correctly displayed in information_schema.columns. #33431
- DATA_TYPE and COLUMN_TYPE for BINARY or VARBINARY data types are displayed as unknown in the information_schema.columns view. #32678
2.5.14
Release date: November 14, 2023
Improvements
- The COLUMNS table in the system database INFORMATION_SCHEMA can display ARRAY, MAP, and STRUCT columns. #33431
Bug Fixes
Fixed the following issues:
- The error java.lang.IllegalStateException: null is reported if the ON condition is nested with a subquery. #30876
- The result of COUNT() is inconsistent among replicas if COUNT() is run immediately after INSERT INTO SELECT ... LIMIT is successfully executed. #24435
- BE may crash for specific data types if the target data type specified in CAST is the same as the original data type. #31465
- An error is reported if specific path formats are used during data loading via Broker Load: msg:Fail to parse columnsFromPath, expected: [rec_dt]. #32721
- During an upgrade to 3.x, if some column types are also upgraded (for example, Decimal is upgraded to Decimal v3), BEs crash when Compaction is performed on tables with specific characteristics. #31626
- When data is loaded by using Flink Connector, the load job is suspended unexpectedly if there are highly concurrent load jobs and both the number of HTTP threads and the number of Scan threads have reached their upper limits. #32251
- BEs crash when libcurl is invoked. #31667
- Adding BITMAP columns to a Primary Key table fails with the following error: Analyze columnDef error: No aggregate function specified for 'userid'. #31763
- Long-time, frequent data loading into a Primary Key table with persistent index enabled may cause BEs to crash. #33220
- The query result is incorrect when Query Cache is enabled. #32778
- Specifying a nullable Sort Key when creating a Primary Key table causes compaction to fail. #29225
- The error "StarRocks planner use long time 10000 ms in logical phase" occasionally occurs for complex Join queries. #34177
3.1.4
Release date: November 2, 2023
New Features
- Supports sort keys for Primary Key tables created in shared-data StarRocks clusters.
- Supports using the str2date function to specify partition expressions for asynchronous materialized views. This helps facilitate incremental updates and query rewrites of asynchronous materialized views created on tables that reside in external catalogs and use the STRING-type data as their partitioning expressions. #29923 #31964
- Added a new session variable enable_query_tablet_affinity, which controls whether to direct multiple queries against the same tablet to a fixed replica. This session variable is set to false by default. #33049
- Added the utility function is_role_in_session, which is used to check whether the specified roles are activated in the current session. It supports checking nested roles granted to a user. #32984
- Supports setting resource group-level query queues, controlled by the global variable enable_group_level_query_queue (default value: false). When the global-level or resource group-level resource consumption reaches a predefined threshold, new queries are placed in the queue and are run when both the global-level resource consumption and the resource group-level resource consumption fall below their thresholds.
- Users can set concurrency_limit for each resource group to limit the maximum number of concurrent queries allowed per BE.
- Users can set max_cpu_cores for each resource group to limit the maximum CPU consumption allowed per BE.
- Added two parameters, plan_cpu_cost_range and plan_mem_cost_range, for resource group classifiers.
- plan_cpu_cost_range: the CPU consumption range estimated by the system. The default value NULL indicates no limit is imposed.
- plan_mem_cost_range: the memory consumption range estimated by the system. The default value NULL indicates no limit is imposed.
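The queueing rule described above (a query runs only when both the global level and its resource group have room) can be modeled with a short Python sketch. This is a toy model under stated assumptions, not StarRocks's implementation: it uses a plain count of running queries as a stand-in for resource consumption, and the class name, group names, and limits are hypothetical.

```python
from collections import deque


class QueryQueue:
    """Toy admission control with one global limit and per-group limits."""

    def __init__(self, global_limit, group_limits):
        self.global_limit = global_limit
        self.group_limits = group_limits              # {group name: max running queries}
        self.running = {g: 0 for g in group_limits}   # currently running queries per group
        self.pending = deque()                        # (query_id, group) waiting for room

    def _has_room(self, group):
        # A query may start only if BOTH the global and the group level are below their limits.
        return (sum(self.running.values()) < self.global_limit
                and self.running[group] < self.group_limits[group])

    def submit(self, query_id, group):
        if self._has_room(group):
            self.running[group] += 1
            return "running"
        self.pending.append((query_id, group))
        return "queued"

    def finish(self, group):
        """Mark one query of `group` as finished and start any queued queries that now fit."""
        self.running[group] -= 1
        started, still_waiting = [], deque()
        while self.pending:
            query_id, g = self.pending.popleft()
            if self._has_room(g):
                self.running[g] += 1
                started.append(query_id)
            else:
                still_waiting.append((query_id, g))
        self.pending = still_waiting
        return started


queue = QueryQueue(global_limit=3, group_limits={"etl": 1, "adhoc": 2})
print(queue.submit("q1", "etl"))    # running
print(queue.submit("q2", "etl"))    # queued (group limit reached)
print(queue.submit("q3", "adhoc"))  # running
print(queue.submit("q4", "adhoc"))  # running
print(queue.submit("q5", "adhoc"))  # queued (global and group limits reached)
print(queue.finish("etl"))          # ['q2']
```

After q1 finishes, q2 starts because both levels have room again, while q5 keeps waiting because the global limit is reached once q2 is running.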
Improvements
- Window functions COVAR_SAMP, COVAR_POP, CORR, VARIANCE, VAR_SAMP, STD, and STDDEV_SAMP now support the ORDER BY clause and Window clause. #30786
- An error instead of NULL is returned if a decimal overflow occurs during queries on the DECIMAL type data. #30419
- The number of concurrent queries allowed in a query queue is now managed by the leader FE. Each follower FE notifies the leader FE when a query starts and finishes. If the number of concurrent queries reaches the global-level or resource group-level concurrency_limit, new queries are rejected or placed in the queue.
Bug Fixes
Fixed the following issues:
- Spark or Flink may report data read errors due to inaccurate memory usage statistics. #30702 #30751
- Memory usage statistics for Metadata Cache are inaccurate. #31978
- BEs crash when libcurl is invoked. #31667
- When StarRocks materialized views created on Hive views are refreshed, an error "java.lang.ClassCastException: com.starrocks.catalog.HiveView cannot be cast to com.starrocks.catalog.HiveMetaStoreTable" is returned. #31004
- If the ORDER BY clause contains aggregate functions, an error "java.lang.IllegalStateException: null" is returned. #30108
- In shared-data StarRocks clusters, the information of table keys is not recorded in information_schema.COLUMNS. As a result, DELETE operations cannot be performed when data is loaded by using Flink Connector. #31458
- When data is loaded by using Flink Connector, the load job is suspended unexpectedly if there are highly concurrent load jobs and both the number of HTTP threads and the number of Scan threads have reached their upper limits. #32251
- When a field of only a few bytes is added, executing SELECT COUNT(*) before the data change finishes returns an error that reads "error: invalid field name". #33243
- Query results are incorrect after the query cache is enabled. #32781
- Queries fail during hash joins, causing BEs to crash. #32219
- DATA_TYPE and COLUMN_TYPE for BINARY or VARBINARY data types are displayed as unknown in the information_schema.columns view. #32678
Behavior Change
- From v3.1.4 onwards, persistent indexing is enabled by default for Primary Key tables created in new StarRocks clusters (this does not apply to existing StarRocks clusters whose versions are upgraded to v3.1.4 from an earlier version). #33374
- Added a new FE parameter, enable_sync_publish, which is set to true by default. When this parameter is set to true, the Publish phase of a data load into a Primary Key table returns the execution result only after the Apply task finishes. As such, the data loaded can be queried immediately after the load job returns a success message. However, setting this parameter to true may cause data loads into Primary Key tables to take longer. (Before this parameter was added, the Apply task was asynchronous with the Publish phase.) #27055
2.5.13
Release date: September 28, 2023
Improvements
- Window functions COVAR_SAMP, COVAR_POP, CORR, VARIANCE, VAR_SAMP, STD, and STDDEV_SAMP now support the ORDER BY clause and Window clause. #30786
- An error instead of NULL is returned if a decimal overflow occurs during queries on the DECIMAL type data. #30419
- Executing SQL commands with invalid comments now returns results consistent with MySQL. #30210
- Rowsets corresponding to tablets that have been deleted are cleaned up, reducing the memory usage during BE startup. #30625
Bug Fixes
Fixed the following issues:
- An error "Set cancelled by MemoryScratchSinkOperator" occurs when users read data from StarRocks using the Spark Connector or Flink Connector. #30702 #30751
- An error "java.lang.IllegalStateException: null" occurs during queries with an ORDER BY clause that includes aggregate functions. #30108
- FEs fail to restart when there are inactive materialized views. #30015
- Performing INSERT OVERWRITE operations on duplicate partitions corrupts the metadata, leading to FE restart failures. #27545
- An error "java.lang.NullPointerException: null" occurs when users modify columns that do not exist in a Primary Key table. #30366
- An error "get TableMeta failed from TNetworkAddress" occurs when users load data into a partitioned StarRocks external table. #30124
- If users use CloudCanal to load data into table columns that are set to NOT NULL but have no default value specified, an error "Unsupported dataFormat value is : \N" is thrown. #30799
- An error "current running txns on db xxx is 200, larger than limit 200" occurs when users load data via the Flink Connector or perform DELETE and INSERT operations. #18393
- Asynchronous materialized views which use HAVING clauses that include aggregate functions cannot rewrite queries properly. #29976
3.1.3
Release date: September 25, 2023
New Features
- The aggregate function group_concat supports the DISTINCT keyword and the ORDER BY clause. #28778
- Stream Load, Broker Load, Kafka Connector, Flink Connector, and Spark Connector support partial updates in column mode on a Primary Key table. #28288
- Data in partitions can be automatically cooled down over time. (This feature is not supported for list partitioning.) #29335 #29393
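To make the automatic cooldown idea concrete, the following Python sketch picks the partitions that are due for cooldown. It is a simplified model, not StarRocks code: it assumes that the cooldown interval is counted from the upper bound of each partition's time range, and the function name, partition names, and the 7-day interval are hypothetical.

```python
from datetime import datetime, timedelta


def partitions_due_for_cooldown(partitions, cooldown_ttl, now):
    """Return names of partitions whose (upper bound + cooldown_ttl) has passed.

    partitions maps a partition name to its (lower, upper) time range.
    """
    return sorted(
        name
        for name, (_lower, upper) in partitions.items()
        if upper + cooldown_ttl <= now
    )


partitions = {
    "p20230901": (datetime(2023, 9, 1), datetime(2023, 9, 2)),
    "p20230920": (datetime(2023, 9, 20), datetime(2023, 9, 21)),
}
due = partitions_due_for_cooldown(
    partitions,
    cooldown_ttl=timedelta(days=7),
    now=datetime(2023, 9, 25),
)
print(due)  # ['p20230901']
```

Older partitions become due first, so data cools down partition by partition as time advances.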
Improvements
- Executing SQL commands with invalid comments now returns results consistent with MySQL. #30210
Bug Fixes
Fixed the following issues:
- If the BITMAP or HLL data type is specified in the WHERE clause of a DELETE statement to be executed, the statement cannot be properly executed. #28592
- After a follower FE is restarted, CpuCores statistics are not up-to-date, resulting in query performance degradation. #28472 #30434
- The execution cost of the to_bitmap() function is incorrectly calculated. As a result, an inappropriate execution plan is selected for the function after materialized views are rewritten. #29961
- In certain use cases of the shared-data architecture, after a follower FE is restarted, queries submitted to the follower FE return an error that reads "Backend node not found. Check if any backend node is down". #28615
- If data is continuously loaded into a table that is being altered by using the ALTER TABLE statement, an error "Tablet is in error state" may be thrown. #29364
- Modifying the FE dynamic parameter max_broker_load_job_concurrency using the ADMIN SET FRONTEND CONFIG command does not take effect. #29964 #29720
- BEs crash if the time unit in the date_diff() function is a constant but the dates are not constants. #29937
- In the shared-data architecture, automatic partitioning does not take effect after asynchronous load is enabled. #29986
- If users create a Primary Key table by using the CREATE TABLE LIKE statement, an error "Unexpected exception: Unknown properties: {persistent_index_type=LOCAL}" is thrown. #30255
- Restoring Primary Key tables causes metadata inconsistency after BEs are restarted. #30135
- If users load data into a Primary Key table on which truncate operations and queries are concurrently performed, an error "java.lang.NullPointerException" is thrown in certain cases. #30573
- If predicate expressions are specified in materialized view creation statements, the refresh results of those materialized views are incorrect. #29904
- After users upgrade their StarRocks cluster to v3.1.2, the storage volume properties of the tables created before the upgrade are reset to null. #30647
- If checkpointing and restoration are concurrently performed on tablet metadata, some tablet replicas will be lost and cannot be retrieved. #30603
- If users use CloudCanal to load data into table columns that are set to NOT NULL but have no default value specified, an error "Unsupported dataFormat value is : \N" is thrown. #30799
2.5.12
Release date: September 4, 2023
Improvements
- Comments in an SQL statement are retained in the Audit Log. #29747
- Added CPU and memory statistics of INSERT INTO SELECT to the Audit Log. #29901
Bug Fixes
Fixed the following issues:
- When Broker Load is used to load data, the NOT NULL attribute of some fields may cause BEs to crash or cause the "msg:mismatched row count" error. #29832
- Queries against ORC-formatted files fail because the bugfix ORC-1304 (apache/orc#1299) from Apache ORC is not merged. #29804
- Restoring Primary Key tables causes metadata inconsistency after BEs are restarted. #30135
3.1.2
Release date: August 25, 2023
Bug Fixes
Fixed the following issues:
- If a user specifies which database is to be connected by default and the user only has permissions on tables in the database but does not have permissions on the database, an error stating that the user does not have permissions on the database is thrown. #29767
- The values returned by the RESTful API action show_data for cloud-native tables are incorrect. #29473
- BEs crash if queries are canceled while the array_agg() function is being run. #29400
- The Default field values returned by the SHOW FULL COLUMNS statement for columns of the BITMAP or HLL data type are incorrect. #29510
- If the array_map() function in queries involves multiple tables, the queries fail due to pushdown strategy issues. #29504
- Queries against ORC-formatted files fail because the bugfix ORC-1304 (apache/orc#1299) from Apache ORC is not merged. #29804
Behavior Change
- For a newly deployed StarRocks v3.1 cluster, you must have the USAGE privilege on the destination external catalog if you want to run SET CATALOG to switch to that catalog. You can use GRANT to grant the required privileges.
- For a v3.1 cluster upgraded from an earlier version, you can run SET CATALOG with inherited privilege.
3.1.1
Release date: August 18, 2023
New Features
- Supports Azure Blob Storage for shared-data clusters.
- Supports aggregate functions COVAR_SAMP, COVAR_POP, and CORR.
- Supports the following window functions: COVAR_SAMP, COVAR_POP, CORR, VARIANCE, VAR_SAMP, STD, and STDDEV_SAMP.
Improvements
- Supports implicit conversions for all compound predicates and for all expressions in the WHERE clause. You can enable or disable implicit conversions by using the session variable ENABLE_STRICT_TYPE. The default value of this session variable is false.
Bug Fixes
Fixed the following issues:
- When data is loaded into tables that have multiple replicas, a large number of invalid log records are written if some partitions of the tables are empty. #28824
- Inaccurate estimation of average row size causes partial updates in column mode on Primary Key tables to occupy excessively large memory. #27485
- If clone operations are triggered on tablets in an ERROR state, disk usage increases. #28488
- Compaction causes cold data to be written to the local cache. #28831
3.0.5
Release date: August 16, 2023
New Features
- Supports aggregate functions COVAR_SAMP, COVAR_POP, and CORR.
- Supports the following window functions: COVAR_SAMP, COVAR_POP, CORR, VARIANCE, VAR_SAMP, STD, and STDDEV_SAMP.
Improvements
- Added more prompts in the error message xxx too many versions xxx. #28397
- Dynamic partitioning now also supports year as the partitioning unit. #28386
- The partitioning field is case-insensitive when expression partitioning is used at table creation and INSERT OVERWRITE is used to overwrite data in a specific partition. #28309
Bug Fixes
Fixed the following issues:
- Incorrect table-level scan statistics in FE cause inaccurate metrics for table queries and loading. #27779
- The query result is not stable if the sort key is modified for a partitioned table. #27850
- The version number for a tablet is inconsistent between the BE and FE after data is restored. #26518
- If the bucket number is not specified when users create a Colocation table, the number will be inferred as 0, which causes failures in adding new partitions. #27086
- When the SELECT result set of INSERT INTO SELECT is empty, the load job status returned by SHOW LOAD is CANCELED. #26913
- BEs may crash when the input values of the sub_bitmap function are not of the BITMAP type. #27982
- BEs may crash when the AUTO_INCREMENT column is being updated. #27199
- Outer join and Anti join rewrite errors for materialized views. #28028
- Inaccurate estimation of average row size causes Primary Key partial updates to occupy excessively large memory. #27485
- Activating an inactive materialized view may cause an FE to crash. #27959
- Queries cannot be rewritten to materialized views created based on external tables in a Hudi catalog. #28023
- The data of a Hive table can still be queried even after the table is dropped and the metadata cache is manually updated. #28223
- Manually refreshing an asynchronous materialized view via a synchronous call results in multiple INSERT OVERWRITE records in the information_schema.task_runs table. #28060
- FE memory leak caused by blocked LabelCleaner threads. #28311
3.1.0
Release date: August 7, 2023
New Features
Shared-data cluster
- Added support for Primary Key tables, on which persistent indexes cannot be enabled.
- Supports the AUTO_INCREMENT column attribute, which enables a globally unique ID for each data row and thus simplifies data management.
- Supports automatically creating partitions during loading and using partitioning expressions to define partitioning rules, thereby making partition creation easier to use and more flexible.
Data Lake analytics
- Supports accessing views created on tables within Hive catalogs.
- Supports accessing Parquet-formatted Iceberg v2 tables.
- [Preview] Supports sinking data to Parquet-formatted Iceberg tables.
- [Preview] Supports accessing data stored in Elasticsearch by using Elasticsearch catalogs. This simplifies the creation of Elasticsearch external tables.
- [Preview] Supports performing analytics on streaming data stored in Apache Paimon by using Paimon catalogs.
Storage engine, data ingestion, and query
- Upgraded automatic partitioning to expression partitioning. Users only need to use a simple partition expression (either a time function expression or a column expression) to specify a partitioning method at table creation, and StarRocks will automatically create partitions based on the data characteristics and the rule defined in the partition expression during data loading. This method of partition creation is suitable for most scenarios and is more flexible and user-friendly.
- Supports list partitioning. Data is partitioned based on a list of values predefined for a particular column, which can accelerate queries and manage clearly categorized data more efficiently.
- Added a new table named loads to the Information_schema database. Users can query the results of Broker Load and Insert jobs from the loads table.
- Supports logging the unqualified data rows that are filtered out by Stream Load, Broker Load, and Spark Load jobs. Users can use the log_rejected_record_num parameter in their load job to specify the maximum number of data rows that can be logged.
- Supports random bucketing. With this feature, users do not need to configure bucketing columns at table creation, and StarRocks will randomly distribute the data loaded into it to buckets. Using this feature together with the capability of automatically setting the number of buckets (BUCKETS) that StarRocks has provided since v2.5.7, users no longer need to consider bucket configurations, and table creation statements are greatly simplified. In big data and high performance-demanding scenarios, however, we recommend that users continue using hash bucketing, because this way they can use bucket pruning to accelerate queries.
- Supports using the table function FILES() in INSERT INTO to directly load the data of Parquet- or ORC-formatted data files stored in AWS S3. The FILES() function can automatically infer the table schema, which relieves the need to create external catalogs or file external tables before data loading and therefore greatly simplifies the data loading process.
- Supports generated columns. With the generated column feature, StarRocks can automatically generate and store the values of column expressions and automatically rewrite queries to improve query performance.
- Supports loading data from Spark to StarRocks by using Spark connector. Compared to Spark Load, the Spark connector provides more comprehensive capabilities. Users can define a Spark job to perform ETL operations on the data, and the Spark connector serves as the sink in the Spark job.
- Supports loading data into columns of the MAP and STRUCT data types, and supports nesting Fast Decimal values in ARRAY, MAP, and STRUCT.
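The trade-off between random bucketing and hash bucketing mentioned above can be shown with a toy Python model. It is a sketch under stated assumptions, not StarRocks's storage logic: zlib.crc32 stands in for whatever hash function the engine actually uses, and the bucket count and function names are hypothetical.

```python
import zlib

NUM_BUCKETS = 8


def hash_bucket(key):
    """Deterministically map a bucketing-column value to a bucket (crc32 is a stand-in hash)."""
    return zlib.crc32(str(key).encode()) % NUM_BUCKETS


def buckets_to_scan(bucketing, predicate_key):
    """Buckets a query with `WHERE bucketing_column = predicate_key` has to read."""
    if bucketing == "hash":
        # The same key always lands in the same bucket, so all other buckets can be pruned.
        return [hash_bucket(predicate_key)]
    # With random bucketing a row may be in any bucket, so nothing can be pruned.
    return list(range(NUM_BUCKETS))


print(len(buckets_to_scan("hash", 42)))    # 1
print(len(buckets_to_scan("random", 42)))  # 8
```

This is why hash bucketing remains the recommended choice when point lookups on the bucketing column must be fast, while random bucketing trades that pruning for simpler table definitions.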
SQL reference
- Added the following storage volume-related statements: CREATE STORAGE VOLUME, ALTER STORAGE VOLUME, DROP STORAGE VOLUME, SET DEFAULT STORAGE VOLUME, DESC STORAGE VOLUME, SHOW STORAGE VOLUMES.
- Supports altering table comments using ALTER TABLE. #21035
- Added the following functions:
- Struct functions: struct (row), named_struct
- Map functions: str_to_map, map_concat, map_from_arrays, element_at, distinct_map_keys, cardinality
- Higher-order Map functions: map_filter, map_apply, transform_keys, transform_values
- Array functions: array_agg supports ORDER BY, array_generate, element_at, cardinality
- Higher-order Array functions: all_match, any_match
- Aggregate functions: min_by, percentile_disc
- Table functions: generate_series, FILES
- Date functions: next_day, previous_day, last_day, makedate, date_diff
- Bitmap functions: bitmap_subset_limit, bitmap_subset_in_range
Privileges and security
Added [privilege items](https://docs.starrocks.io/en-us/3.1/administration/privilege_item#...