
dataset: fix dataset performance drops significantly on high dimension data. #15355

Merged
merged 71 commits into master
Aug 22, 2021

Conversation


@pissang pissang commented Jul 15, 2021

Brief Information

This pull request is in the type of:

  • bug fixing
  • new feature
  • others

What does this PR do?

This PR refactors the List module and aims to fix the severe performance degradation when many series share the same high-dimension dataset.

Results of test cases on my MacBook

  1. Case from dataset performance for large data #11907, a 1000-dimension dataset:
    Before: about 45s to render.
    After: about 300ms to render, taking only 20MB of heap memory.

  2. test/sample-compare.html, a 1-million-point, 4-dimension dataset with LTTB sampling:
    Before: about 40ms
    After: about 25ms

Fixed issues

#11907

Detailed explanation

The major reason high-dimension data was so slow previously is that echarts processed all dimensions and recreated storage from the dataset in each series. In most cases this doesn't cost much. But in #11907 there is very high-dimension data and each dimension is assigned to a line series, so echarts needs to process the 1000 dimensions 1000 times. This puts a lot of pressure on the GC.

So the main purpose of this refactoring is quite clear: the dataset shared by all the series should only be processed once. To achieve this, we divide the original List module into two modules:

  • DataStorage: processes and stores the multi-dimension data. It can be shared by multiple series.
  • SeriesData: a data-reading wrapper around the storage for each series. It also manages the relationship between data and graphic info (visuals, elements). The interface is almost the same as the original List to avoid breaking changes.
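The division can be sketched roughly like this. This is a simplified, hypothetical model of the two modules, not the actual echarts source; all class and field names here are illustrative:

```js
// DataStorage: parses the raw source once and keeps one typed column per
// dimension. It can be shared by every series reading the same dataset.
class DataStorage {
  constructor(sourceRows, dimCount) {
    this._columns = [];
    for (let dim = 0; dim < dimCount; dim++) {
      const col = new Float64Array(sourceRows.length);
      for (let i = 0; i < sourceRows.length; i++) {
        col[i] = Number(sourceRows[i][dim]);
      }
      this._columns.push(col);
    }
    this.count = sourceRows.length;
  }
  // Data is always read by dimension index, never by name.
  get(dimIndex, dataIndex) {
    return this._columns[dimIndex][dataIndex];
  }
}

// SeriesData: a thin per-series reader that maps the series' own dimension
// names to storage indices; the visual/graphic bookkeeping it also manages
// in the real module is omitted here.
class SeriesData {
  constructor(storage, dimNames) {
    this._storage = storage;
    this._nameToIndex = new Map(dimNames.map((name, idx) => [name, idx]));
  }
  get(dimName, dataIndex) {
    return this._storage.get(this._nameToIndex.get(dimName), dataIndex);
  }
}
```

In this sketch, two series with different dimension names (e.g. `['x', 'y', 'value']` vs `['x', 'value', 'y']`) can still read from the same `DataStorage` instance, which is the whole point of the split.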

It sounds quite straightforward and simple. So why didn't we do it before? Because we need to collect enough information before parsing data, and it's hard to prepare the parsed data for sharing ahead of time. For example, we need to know which axis of a cartesian coordinate system is a category axis, so we can parse data of that dimension as category data instead of converting it to numbers. We also detect the type from the data, but we don't fully trust the result: detecting every value is costly, so we only detect part of them. Also, we can accept a number as a category, a string as a date, a string number ('123') as a number, etc. So the easiest way was to parse the source when initializing each series, after all the necessary information was prepared.

This PR keeps the idea of preparing the storage ahead of time based on the information the dataset gives. If a series needs the source to be parsed differently, we discard the shared storage and create a new one for that series. Several factors will invalidate the shared storage:

  1. series.seriesLayoutBy is different.
  2. series.sourceHeader is different.
  3. The series also specifies dimensions, and they differ from the dataset's.
  4. The dataset detects a dimension to be number but the series needs it to be ordinal.
  5. The dataset detects a dimension to be ordinal but the series needs it to be number or time.
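For illustration, the invalidation decision could be sketched as a predicate like the following. All names here are hypothetical, not the real echarts internals:

```js
// Sketch of the shared-storage invalidation described above: the shared
// storage is reused only when none of the listed factors differ.
function canShareStorage(datasetMeta, seriesOpt) {
  // Factors 1 and 2: layout and header must match the dataset's.
  if (seriesOpt.seriesLayoutBy !== datasetMeta.seriesLayoutBy) return false;
  if (seriesOpt.sourceHeader !== datasetMeta.sourceHeader) return false;
  // Factor 3: the series declares its own dimensions that differ.
  if (seriesOpt.dimensions &&
      JSON.stringify(seriesOpt.dimensions) !== JSON.stringify(datasetMeta.dimensions)) {
    return false;
  }
  // Factors 4 and 5: a detected dimension type conflicts with what the
  // series needs (number vs ordinal, ordinal vs number/time).
  for (const [dimIndex, neededType] of Object.entries(seriesOpt.dimensionTypes || {})) {
    const detected = datasetMeta.detectedTypes[dimIndex];
    if (detected === 'number' && neededType === 'ordinal') return false;
    if (detected === 'ordinal' && (neededType === 'number' || neededType === 'time')) return false;
  }
  return true;
}
```

When the predicate fails, the series falls back to building its own storage, which is correct but loses the sharing benefit.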

This invalidation strategy ensures correct behavior but may add extra cost in some cases. There is a best practice to avoid it:

Always specify the name and type of dimensions in the dataset.
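For example, a dataset following this best practice declares both name and type for each dimension up front, so every series can share one parsed storage (the data values and series setup below are illustrative):

```js
// Dimension names and types are declared on the dataset itself, so no
// series needs the source re-parsed with different type assumptions.
const option = {
  dataset: {
    dimensions: [
      { name: 'date', type: 'time' },
      { name: 'category', type: 'ordinal' },
      { name: 'value', type: 'number' }
    ],
    source: [
      ['2021-07-15', 'A', 12],
      ['2021-07-16', 'B', 33]
    ]
  },
  xAxis: { type: 'time' },
  yAxis: {},
  series: [{ type: 'line', encode: { x: 'date', y: 'value' } }]
};
```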

pissang added 28 commits July 8, 2021 16:09
 fix dimSize param for typedarray to provider
to reduce the cost of source dimension parse

echarts-bot bot commented Jul 15, 2021

Thanks for your contribution!
The community will review it ASAP. In the meanwhile, please checkout the coding standard and Wiki about How to make a pull request.

The pull request is marked to be PR: author is committer because you are a committer of this project.

pissang and others added 7 commits August 4, 2021 13:42

100pah commented Aug 19, 2021

Some changes have been made from 2bf188f to 39c7111

These issues existed before this commit:

  1. If no dimensions are specified on the dataset, series cannot really share one storage instance, because each series creates its own dimension names (like ['x', 'y', 'value'], ['x', 'value', 'y']) and the storage hash is based on those names, so the hashes cannot match.
  2. Each time setOption updates a series (without changing the dataset), new data stack dimensions (and corresponding chunks) keep being added to the shared data storage, so the chunks grow more and more.
  3. When "unused dimension omit" happens, the index into SeriesData['dimensions'] and SeriesData['getDimension'] is not the dimensionIndex that users know, but some places still use dimensionIndex to visit them (especially in visualMap and similar cases where users can input a dimension index via option).
  4. If the user specifies only type but no name in dimensions, there are bugs when "unused dimension omit" happens, because unused dimensions do not get a dimension name auto-generated by createDimensions, so they have no dimension name in storage and cannot be queried by dimension name.

This commit tries to resolve those issues as follows:

  1. Do not save the "dimName->dimIndex map" in data storage any more, because:
    1. Data storage does not actually need this map to read/write data.
    2. dimNames are usually created based on each series' info (like ['x', 'y', 'value'], ['x', 'value', 'y'], ...) if not specified by the user, so they differ between series. But even series with different generated dimension names can still share one storage, because essentially they visit the same dataset source by dimIndex.
  2. Make SeriesDimensionDefine (that is, each item of SeriesData['dimensionInfos']) contain storageDimensionIndex to indicate its corresponding data store dimension index, and always use storageDimensionIndex rather than dimName to visit data storage. storageDimensionIndex is created in createDimension.
  3. Create a new structure SeriesDataSchema for each series. It contains the info generated by createDimension (like (1) dimensionDefineList, (2) whether dimensions are omitted, (3) the source for this series, (4) ...).
    3.1. sourceManager uses seriesDataSchema to find the shared storage, generating storage dimensions and a hash based on the dimCount and dimensionDefineList (which are created by createDimension and saved in seriesDataSchema).
    3.2. dataStack adds "data stack dimensions" to dimensionDefineList in seriesDataSchema.
    3.3. seriesData uses seriesDataSchema to init its dimensions, and uses seriesDataSchema to query dimName by dimIndex from source, or dimIndex by dimName from source (only when "dimension omit" happens). If different series options specify their own dimensions but share one dataset, the source obtained from sourceManager is a different source instance in each series. Those source instances contain different dimensionDefines but reference the same "raw data". The data storage generated from the "raw data" can be shared between series, but the dimensionDefines should not be shared. SeriesDataSchema encapsulates this mess and lets each series query dimension names or indices.
    3.4. seriesDataSchema avoids creating new data structures as much as possible, referencing shared data structures (like the source instance) instead, so it will not cause memory issues.
  4. Change the previous storage.appendDimension to storage.ensureCalculationDimension for data stack; that is, if the dimension has already been created, reuse it.
  5. Remove the previous canUse method. Whether a storage can be shared by a series is determined entirely by hash. The hash is generated in two ways:
    5.1. For source format "arrayRows" (i.e., [[12, 33], [55, 99], ...]), dimension names do not need to be added to the hash, because this kind of data is actually visited by index. Even if two series have different dimension names (like 'x', 'y') for a single index, they can still share the storage.
    5.2. For source format "objectRows" (i.e., [{a: 12, b: 33}, {b: 55, a: 99}, ...]), the property names 'a', 'b' are added to the hash, because this kind of data is actually visited by property name.
    5.3. As before, dimension type, ordinal meta id, source header and series layout are also added to the hash.
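The hash rules in 5.1-5.3 could be sketched like this (a hypothetical helper, not the actual implementation; the field names are illustrative):

```js
// For "arrayRows" sources, dimension names are excluded from the hash
// because data is visited by index; for "objectRows" sources, property
// names are included because data is visited by property name. Type,
// ordinal meta id, source header and series layout always contribute.
function makeStorageHash(source, dims) {
  const parts = [source.sourceFormat, source.seriesLayoutBy, String(source.sourceHeader)];
  for (const dim of dims) {
    if (source.sourceFormat === 'objectRows') {
      parts.push(dim.property);   // visited by property name
    }
    parts.push(dim.type, dim.ordinalMetaId != null ? String(dim.ordinalMetaId) : '');
  }
  return parts.join('$');
}
```

Under this sketch, two "arrayRows" series with different dimension names produce the same hash and share the storage, while two "objectRows" series reading different properties do not.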
  6. Make DataStorage methods immutable:
    DataStorage['filterSelf'] -> DataStorage['filter']
    DataStorage['selectRange']
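The immutable-filter idea in item 6 might look like this sketch: instead of mutating the shared storage in place (the old filterSelf), filter returns a new storage view, so other series sharing the storage are unaffected. This is a hypothetical simplification, not the real DataStorage:

```js
// Columns are shared and never mutated; each storage view only holds its
// own list of surviving row indices.
class Storage {
  constructor(columns, indices) {
    this._columns = columns;
    this._indices = indices || columns[0].map((_, i) => i);
  }
  get(dim, i) { return this._columns[dim][this._indices[i]]; }
  count() { return this._indices.length; }
  // Returns a NEW storage keeping only the rows that pass the predicate.
  filter(dim, predicate) {
    const kept = this._indices.filter(raw => predicate(this._columns[dim][raw]));
    return new Storage(this._columns, kept);
  }
}
```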

PENDING

  1. Should deprecate dimName = seriesData.getDimension(dimLoose) and series.get(dimName), and always use dimIdx = seriesData.getDimensionIndex(dimLoose) and dataStorage.get(dimIdx) instead. For example:

```js
// Previously
const val = data.get(data.getDimension(dim), dataIdx);
// Now
const val = data.getStorage().get(data.getDimensionIndex(dim), dataIdx);
```

seriesData.getDimension(dimLoose) has a feature of converting dimIdx to dimName, which is not essentially necessary (dimIdx can be used to visit data directly), but this feature requires a "dimIdx->dimName map" in SeriesData (why? because when some dimensions are omitted, we cannot use dimIdx on SeriesData['dimensions'] directly).

  2. Radar has a bug when using series.encode. This commit does not fix that bug but keeps it as is.

Test case

<test/dataset-case.html>
<test/dataset-performance.html>

About SeriesDataSchema

(diagram illustrating the SeriesDataSchema structure omitted)

@100pah 100pah merged commit 4569dc1 into master Aug 22, 2021

echarts-bot bot commented Aug 22, 2021

Congratulations! Your PR has been merged. Thanks for your contribution! 👍
