Zellic 2023 Smart Contract Source Index
Zellic is making publicly available a dataset of known Ethereum mainnet smart contract source code.
Our aim is to provide a contract source code dataset that is readily available to the public to download in bulk. We believe this dataset will help advance the frontier of smart contract security research. Applications include static analysis, machine learning, and more. This effort is part of Zellic’s mission to create a world with no smart contract hacks.
Methodology
First, we accumulated a list of all deployed contracts on Ethereum mainnet as of block 16860349. This does not include contracts that have been SELFDESTRUCTed. We progressively built up this index by performing a full sync from the genesis block using a modified Geth instance. Whenever a new contract was created, we added it to our index. When a contract SELFDESTRUCTed, we removed it from the index. This list is available in this dataset as the file address_bytecodehash_index.
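The add-on-create, remove-on-SELFDESTRUCT bookkeeping described above can be pictured with a toy model. This is only a sketch: the event tuples, the `build_index` name, and the short placeholder addresses and hashes are hypothetical, and it says nothing about how the modified Geth instance actually records these events.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event stream, in block order: (kind, address, bytecode_hash).
# kind is "create" or "selfdestruct"; bytecode_hash is unused for "selfdestruct".
Event = Tuple[str, str, str]


def build_index(events: Iterable[Event]) -> Dict[str, str]:
    """Replay events and return a mapping of live address -> bytecode hash."""
    index: Dict[str, str] = {}
    for kind, address, bytecode_hash in events:
        if kind == "create":
            index[address] = bytecode_hash
        elif kind == "selfdestruct":
            # A destroyed contract is dropped from the index.
            index.pop(address, None)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return index


# Placeholder values, not real addresses or hashes.
events = [
    ("create", "aa", "h1"),
    ("create", "bb", "h2"),
    ("selfdestruct", "aa", ""),
    ("create", "cc", "h2"),
]
index = build_index(events)
print(index)
```

Only contracts still alive at the end of the replay remain in the mapping, which matches the statement that SELFDESTRUCTed contracts are not part of the index.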
Next, we collected contract source code from publicly available online sources. All data was obtained from publicly accessible resources.
Finally, we calculated the Keccak256 hash of the deployed runtime EVM bytecode of each contract. We deduplicated contract source code by bytecode hash. In other words, we organized the contract source code set by the bytecode hash of their corresponding verified contracts. For example, if source codes A and B are both verified against smart contracts X and Y with the same deployed EVM bytecode, we only include one of A or B in this dataset. The choice among duplicates was arbitrary.
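As an illustration of deduplication by bytecode hash, here is a minimal sketch. The function name and the input shape (pairs of bytecode hash and source text) are assumptions made for the example. It keeps the first source seen for each hash, which is just one way to make the arbitrary choice and is not necessarily how this dataset was built.

```python
from typing import Dict, Iterable, Tuple


def dedupe_by_bytecode_hash(verified: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Keep one source per bytecode hash.

    `verified` yields (bytecode_hash, source_code) pairs. The first source
    seen for a hash is kept; later ones with the same hash are ignored.
    """
    unique: Dict[str, str] = {}
    for bytecode_hash, source in verified:
        unique.setdefault(bytecode_hash, source)
    return unique


# Placeholder hashes: sources A and B were verified against identical bytecode.
pairs = [("hash1", "source A"), ("hash1", "source B"), ("hash2", "source C")]
unique_sources = dedupe_by_bytecode_hash(pairs)
print(len(unique_sources))
```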
Dataset Statistics
Number of unique source codes, by bytecode hash: 149,386
Contracts with code available: 3,897,319 (This is more than the previous number, because MANY contracts share identical bytecode)
Number of smart contracts in global index: 30,586,657 (not all have source code available, see Methodology)
Chars (wc -c) | Words (wc -w) | LoC (code) | LoC (comments) | LoC (whitespace) | LoC (total) |
---|---|---|---|---|---|
6,473,548,073 | 712,444,206 | 90,562,628 | 62,503,873 | 24,485,549 | 177,552,050 |
Unique words: 939,288
Dataset Structure
Index
The address_bytecodehash_index file contains a list of known smart contract addresses mapped to the Keccak256 hash of their EVM bytecode.
Look up the smart contract address in this file to find the bytecode hash under which its source is stored. This file also serves as a list of all deployed smart contracts as of block 16860349.
Not all contracts in the index file will have source code available (see Methodology).
Excerpt of data from the index for preview purposes:
...
00012e87fa9172d0c613f69d0abf752bb00310ec:4f5a5f6706dc853cb3ae2279729e0d7e24dda128a77358144e4c0fd3e5d60e98
00012c8ef0fef0a06e1644ab91107fe8584fb91e:a828ef7f5f6d2ebb1203de12878e16aa5ba6984c12ededff4e19876233533505
00012df38ea3a6dabefb8407a59219a0c7dd0bc8:c279544d07d9631b1e37d835cadfe7098d60e508cf8f18a89ddb8b176d56874d
00012d92a0e7ee1b19f8e018267c97a3a7e99aa7:0865cec1e9ac3048b12a85fc3b9fbc682c3831784e3396416635df4cb88c3fdd
00012f07e281c1d8a9d790358050b6015eef942c:ab7af4c77ed6371c7eda04ba317a134f0b06593c0dc2851bf4c709a367ea50ed
00012e198745e53293bf09ddec8da1284963fded:ce33220d5c7f0d09d75ceff76c05863c5e7d6e801c70dfe7d5d45d4c44e80654
00012ec2c9fc4a1692176da5202a44a4aea5e177:ce33220d5c7f0d09d75ceff76c05863c5e7d6e801c70dfe7d5d45d4c44e80654
...
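Judging from the excerpt, each index line has the form `address:bytecodehash`, in lowercase hex without a `0x` prefix. The sketch below parses such lines and groups addresses that share a bytecode hash. The function names are hypothetical, and the format assumptions (40 and 64 hex characters, `...` lines being preview elisions) come from the excerpt only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_index(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'address:bytecodehash' lines into a dict."""
    index: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line == "...":
            continue  # skip blank lines and the elision markers of the preview
        address, _, bytecode_hash = line.partition(":")
        if len(address) != 40 or len(bytecode_hash) != 64:
            raise ValueError(f"unexpected index line: {line!r}")
        index[address.lower()] = bytecode_hash.lower()
    return index


def lookup(index: Dict[str, str], address: str) -> str:
    """Return the bytecode hash for an address, accepting an optional 0x prefix."""
    address = address.lower()
    if address.startswith("0x"):
        address = address[2:]
    return index[address]


def group_by_hash(index: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert the index: bytecode hash -> list of addresses."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for address, bytecode_hash in index.items():
        groups[bytecode_hash].append(address)
    return dict(groups)


# Three lines taken from the excerpt above.
sample = [
    "00012e87fa9172d0c613f69d0abf752bb00310ec:4f5a5f6706dc853cb3ae2279729e0d7e24dda128a77358144e4c0fd3e5d60e98",
    "00012e198745e53293bf09ddec8da1284963fded:ce33220d5c7f0d09d75ceff76c05863c5e7d6e801c70dfe7d5d45d4c44e80654",
    "00012ec2c9fc4a1692176da5202a44a4aea5e177:ce33220d5c7f0d09d75ceff76c05863c5e7d6e801c70dfe7d5d45d4c44e80654",
]
index = parse_index(sample)
groups = group_by_hash(index)
```

In the excerpt, the last two addresses share one bytecode hash, so they would resolve to the same source directory if source code is available for that hash.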
Contract Sources
Smart contract sources are organized by folder in the organized_contracts directory.
For example, a contract with the bytecode hash beef3d7d1884c4fee50548cfe762415fe494e3feb1e6ca181352ef023ba1ff7a would be in the directory organized_contracts/be/beef3d7d1884c4fee50548cfe762415fe494e3feb1e6ca181352ef023ba1ff7a/.
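The directory for a given bytecode hash can be computed as follows. This is a sketch that assumes the intermediate folder is always the first two hex characters of the hash, which is inferred from the single example above. The `contract_dir` helper is a hypothetical name.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def contract_dir(bytecode_hash: str, root: str = "organized_contracts") -> PurePosixPath:
    """Map a bytecode hash to <root>/<first two hex chars>/<full hash>."""
    h = bytecode_hash.lower()
    if h.startswith("0x"):
        h = h[2:]
    if len(h) != 64 or not set(h) <= HEX_DIGITS:
        raise ValueError(f"not a 32-byte hex hash: {bytecode_hash!r}")
    return PurePosixPath(root) / h[:2] / h


path = contract_dir("beef3d7d1884c4fee50548cfe762415fe494e3feb1e6ca181352ef023ba1ff7a")
print(path)
```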
Each folder for a smart contract contains the source files as well as a metadata.json file that contains information about the contract, such as the compiler version and the optimization settings used. These settings can be used to attempt to reproduce the build.
Example of metadata.json for preview purposes (unminified for ease of viewing):
{
  "ContractName": "MageSpace",
  "CompilerVersion": "v0.8.10+commit.fc410830",
  "Runs": 200,
  "OptimizationUsed": false,
  "BytecodeHash": "c2f8f4e79a9d7c23d8a398768e1476f03f0e11c44fc7441c021e098c71678d03"
}
Source Formats
Contracts may come in one of three source formats: single file, multiple files, and Solidity Compiler Input JSON.
For multiple-file contracts, each .sol file will be included in the directory.
Single-file contracts will be named main.sol. Some contracts are written in Vyper, not Solidity. These will be named main.vy.
For Solidity Compiler Input JSON, the compiler input will be stored in contract.json.
Not all contract code is in Solidity. Some contract code is in Vyper or other languages, so check metadata.json.
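To extract all of the source code, you can use this quick-and-dirty bash script:
mkdir -p code
cd organized_contracts/
for f in * ; do
    echo "$f"
    # One output file per prefix directory; start it empty.
    : > ../code/"$f".txt
    # Solidity Compiler Input JSON contracts: pull the source text out of contract.json.
    cat "$f"/*/contract.json 2>/dev/null | jq -r '.sources | to_entries[].value.content' >> ../code/"$f".txt
    # Single-file and multiple-file Solidity contracts (append, so the JSON output is kept).
    cat "$f"/*/*.sol >> ../code/"$f".txt 2>/dev/null
done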
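For contracts stored as Solidity Compiler Input JSON, the same extraction can be sketched in Python while keeping the original file paths. This assumes contract.json has a top-level `sources` object whose entries carry a `content` string, which is the same shape the `jq` filter above relies on. The function name and the tiny sample input are made up for illustration.

```python
import json
from typing import Dict


def sources_from_compiler_input(text: str) -> Dict[str, str]:
    """Return {source path: source text} from compiler input JSON text."""
    data = json.loads(text)
    result: Dict[str, str] = {}
    for path, entry in data.get("sources", {}).items():
        content = entry.get("content")
        if content is not None:  # skip entries that carry no inline source
            result[path] = content
    return result


# Made-up compiler input, only to show the expected shape.
sample_input = json.dumps(
    {
        "language": "Solidity",
        "sources": {
            "contracts/A.sol": {"content": "contract A {}"},
            "contracts/B.sol": {"content": "contract B {}"},
        },
        "settings": {"optimizer": {"enabled": False, "runs": 200}},
    }
)
extracted = sources_from_compiler_input(sample_input)
for path, content in extracted.items():
    print(path, len(content))
```

Unlike concatenating everything into one text file, this keeps one entry per source path, which can help when import resolution matters.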
Other Fun Facts
Top 100 words:
23189252 the 20816285 address 16207663 uint256 14793579 to 13746030 function 9952507 returns 9069124 0 8256548 a 8189582 of 6854095 is 6783298 dev 6363279 return 5555811 if 5497552 memory 5403232 from 5203839 amount 5146685 internal 4838549 value 4753195 be 4700814 external 4676440 owner 4535518 this 4477899 view 4463166 for 4205382 bool 3770805 contract 3732595 token 3719841 and 3578693 public 3447968 string 3422923 tokenid 3243596 require 3134425 1 3063929 in 2996585 bytes 2976900 data 2831472 by 2748878 transfer 2729742 account 2605117 that 2588692 param 2535414 private 2465042 an 2418190 solidity 2377723 uint 2333621 call 2326567 not 2319841 virtual 2295154 zero 2220201 sender 2118342 as 2113922 sol 2024428 target 1945888 event 1919425 s 1901005 or 1899022 pure 1884128 tokens 1859283 must 1850785 it 1796854 with 1783457 contracts 1760318 b 1742610 revert 1711696 spender 1698735 bytes32 1655261 recipient 1645305 i 1608529 indexed 1585283 true 1575421 2 1551352 when 1528254 can 1475879 length 1466789 override 1444666 will 1356364 approve 1355666 8 1314732 notice 1304351 implementation 1293963 are 1291253 import 1290551 on 1267019 balance 1257438 available 1253286 log 1232433 pragma 1211177 since 1193506 msgsender 1193496 result 1190481 liquidity 1185869 msg 1181724 operator 1178211 errormessage 1176497 slot 1156971 set 1154460 openzeppelin 1148764 cannot 1123141 erc20 1115019 abi
Notices
The smart contract source code in this dataset was obtained from publicly available sources. You should always abide by the appropriate code and software licenses, as well as all applicable copyright law.
THE DATASET/SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE DATASET/SOFTWARE OR THE USE OR OTHER DEALINGS IN THE DATASET/SOFTWARE.