Chinese blog about this project: 量化系列2 - 众包数据集 (Quant Series 2 - Crowdsourced Dataset)
To install dolt, follow the instructions at https://github.com/dolthub/dolt
Raw data hosted on dolt: https://www.dolthub.com/repositories/chenditc/investment_data
To download as dolt database:
dolt clone chenditc/investment_data
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
export TUSHARE=<Token>
bash daily_update.sh
docker run -v /<some output directory>:/output -it --rm chenditc/investment_data bash -c "bash daily_update.sh && bash dump_qlib_bin.sh && cp ./qlib_bin.tar.gz /output/"
tar -zxvf qlib_bin.tar.gz -C ~/.qlib/qlib_data/cn_data --strip-components=2
- Fill in missing data by combining multiple data sources, for example data for delisted companies.
- Correct data by cross-validating against multiple data sources.
- w (wind): high quality static data source. Only available until 2019.
- c (caihui): high quality static data source. Only available until 2019.
- ts: Tushare data source
- ak: Akshare data source
- yahoo: Qlib's yahoo collector, https://github.com/microsoft/qlib/tree/main/scripts/data_collector/yahoo
- final: Merged final data with validation and correction
Use one_time_db_scripts to import the w_a_stock_eod_price table, which serves as the initial price standard.
- Use tushare/update_stock_list.sh to load stock list
- Use tushare/update_stock_price.sh to load stock price
- Use yahoo collector to load stock price
- Use the w data source as the baseline and validate the other data sources against it.
- Because the w data's adjclose differs from the ts data's adjclose, we use a "link date" to calculate a ratio that maps ts adjclose to w adjclose. The link date can be the latest of the first valid dates of the data sources. We don't use a fixed link date because some stocks might not trade on a specific date, and listing and delisting dates differ from stock to stock. We store the link date and adj_ratio in link_table, where adj_ratio = link_adj_close / w_adj_close.
- Append ts data to the final dataset; its adjclose is ts_adj_close / ts_adj_ratio.
- Generate the final data by concatenating w data and ts data.
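The merge steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's implementation: the function names (`first_valid_date`, `compute_link`, `merge_final`) are hypothetical, prices are held in plain dicts keyed by date, `link_adj_close` is read as the ts adjclose on the link date, and the link date is taken as the first date on or after the later first-valid date where both sources have data.

```python
from datetime import date


def first_valid_date(series):
    """Earliest date whose adjclose is present and positive."""
    return min(d for d, v in series.items() if v is not None and v > 0)


def compute_link(w_adj, ts_adj):
    """Pick a per-stock link date and the ratio mapping ts adjclose to w adjclose."""
    # Start from the later of the two first-valid dates (assumption: this is
    # what "maximum first valid date" means), then require data in both sources.
    start = max(first_valid_date(w_adj), first_valid_date(ts_adj))
    common = sorted(
        d for d in w_adj
        if d >= start and w_adj[d] and ts_adj.get(d)
    )
    if not common:
        raise ValueError("no overlapping trading date to link the two sources")
    link_date = common[0]
    adj_ratio = ts_adj[link_date] / w_adj[link_date]
    return link_date, adj_ratio


def merge_final(w_adj, ts_adj):
    """Keep w data as the baseline and append rescaled ts data after it ends."""
    link_date, adj_ratio = compute_link(w_adj, ts_adj)
    final = dict(w_adj)
    w_last = max(w_adj)
    for d, v in ts_adj.items():
        if d > w_last and v is not None:
            final[d] = v / adj_ratio  # ts_adj_close / adj_ratio
    return link_date, adj_ratio, final


# Made-up prices for one stock, purely for illustration.
w_adj = {date(2019, 12, 27): 10.0, date(2019, 12, 30): 10.5, date(2019, 12, 31): 11.0}
ts_adj = {date(2019, 12, 30): 21.0, date(2019, 12, 31): 22.0,
          date(2020, 1, 2): 23.0, date(2020, 1, 3): 24.0}
link_date, adj_ratio, final = merge_final(w_adj, ts_adj)
print(link_date, adj_ratio, final[date(2020, 1, 2)])
```

Computing the ratio per stock rather than from one global date keeps the sketch usable for stocks that were suspended or not yet listed on any single fixed date.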
- Run validation by pairing two data sources:
- Compare the absolute values of high, low, open, close, and volume.
- Calculate the adjclose conversion ratio using a link date for each stock.
- Calculate the w data adjclose using the link date's ratio, and compare it with the final data.
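The pairwise validation above might look roughly like the following. It is a minimal sketch, not the project's validation code: `compare_raw_fields` and `validate_pair` are hypothetical names, rows are plain dicts keyed by ISO date strings, and the relative tolerance of `1e-3` is an arbitrary assumption rather than a threshold taken from the repository.

```python
import math

RAW_FIELDS = ("high", "low", "open", "close", "volume")


def compare_raw_fields(row_a, row_b, rel_tol=1e-3):
    """Names of raw fields whose values differ beyond the tolerance."""
    return [
        f for f in RAW_FIELDS
        if not math.isclose(row_a[f], row_b[f], rel_tol=rel_tol)
    ]


def validate_pair(source_a, source_b, link_date, rel_tol=1e-3):
    """Compare two sources for one stock; return a list of (date, field) mismatches."""
    # Conversion ratio taken on the link date, mapping source_a adjclose
    # onto source_b's scale.
    ratio = source_b[link_date]["adjclose"] / source_a[link_date]["adjclose"]
    mismatches = []
    for d in sorted(set(source_a) & set(source_b)):
        a, b = source_a[d], source_b[d]
        # Raw fields are compared directly, with no rescaling.
        mismatches += [(d, f) for f in compare_raw_fields(a, b, rel_tol)]
        # adjclose is compared only after conversion with the link-date ratio.
        if not math.isclose(a["adjclose"] * ratio, b["adjclose"], rel_tol=rel_tol):
            mismatches.append((d, "adjclose"))
    return mismatches


# Made-up rows for illustration: the second day's volume disagrees.
base = {"high": 11.0, "low": 10.0, "open": 10.2, "close": 10.5, "volume": 1000.0}
day2 = {"high": 11.5, "low": 10.4, "open": 10.5, "close": 11.0, "volume": 1200.0}
w_rows = {"2019-12-30": {**base, "adjclose": 10.5},
          "2019-12-31": {**day2, "adjclose": 11.0}}
final_rows = {"2019-12-30": {**base, "adjclose": 21.0},
              "2019-12-31": {**day2, "volume": 1500.0, "adjclose": 22.0}}
print(validate_pair(w_rows, final_rows, "2019-12-30"))
```

A relative tolerance is used here because adjusted prices from different vendors rarely agree to the last decimal; the appropriate threshold depends on the data and is left as a parameter.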
To add a new stock index, we need to change:
- Add index weight download script. Change the tushare/dump_index_eod_price.py script to dump the index info. If the index is not available in tushare, write a new script and add it to the daily_update.sh script. Example commit
- Add price download script. Change tushare/dump_index_eod_price.py to add the index price. Example commit
- Modify export script. Change the qlib dump script qlib/dump_index_weight.py#L13 so that the index is dumped and renamed to a txt file for use. Example commit