Pulse · tinygrad/tinygrad · GitHub

July 19, 2024 – July 26, 2024

Overview

133 Active pull requests

6 Active issues

116 Pull requests merged by 9 people

move CUDA/HIP compilers to their own files [run_process_replay]
#5732 merged Jul 26, 2024
UOp simple mul add div fold
#5726 merged Jul 26, 2024
remove redundant symbolic mod rule [run_process_replay]
#5725 merged Jul 26, 2024
UOp simple mul-add-lt fold
#5721 merged Jul 26, 2024
revert isolated dags scheduling
#5724 merged Jul 25, 2024
UOp more generic div folding
#5722 merged Jul 25, 2024
hcq do not update the same signal
#5719 merged Jul 25, 2024
hcq update_exec with optional params
#5708 merged Jul 25, 2024
remove global_size and local_size from Kernel class [run_process_replay]
#5720 merged Jul 25, 2024
faster beam [run_process_replay]
#5718 merged Jul 25, 2024
halve kernel counts in metal Fuzz Test linearizer
#5716 merged Jul 25, 2024
cleaner uop expand [run_process_replay]
#5715 merged Jul 25, 2024
more test_pattern_matcher fixups
#5714 merged Jul 25, 2024
rename to realize_reduceop
#5713 merged Jul 25, 2024
fixup test_pattern_matcher
#5712 merged Jul 25, 2024
beautiful_mnist -4.3% kernels
#5709 merged Jul 25, 2024
towards NOp as UOp superclass
#5711 merged Jul 25, 2024
map groupable children
#5710 merged Jul 25, 2024
Fix repr upat
#5705 merged Jul 25, 2024
hotfix: compare_schedule defaults to false
#5707 merged Jul 25, 2024
more scheduler process replay tooling
#5706 merged Jul 25, 2024
start work on indexing fusion
#5590 merged Jul 25, 2024
more info on failure 41
#5704 merged Jul 25, 2024
kernel from amd resnet page fault
#5703 merged Jul 25, 2024
enable hip tc
#5702 merged Jul 25, 2024
shorter llvm and ptx rendering [run_process_replay]
#5686 merged Jul 25, 2024
UOp more generic mul -> mod folding
#5698 merged Jul 25, 2024
UOp mod reduction
#5697 merged Jul 25, 2024
UOp vmin/vmax on ADD
#5689 merged Jul 24, 2024
bring unbind back in Variable const
#5687 merged Jul 24, 2024
nv ptx print log
#5691 merged Jul 24, 2024
UOps div folding
#5690 merged Jul 24, 2024
unify UOp min/max default [run_process_replay]
#5688 merged Jul 24, 2024
first fold, then expand
#5673 merged Jul 24, 2024
shorter BufferOps.LOAD creation
#5685 merged Jul 24, 2024
make fusion deterministic
#5684 merged Jul 24, 2024
docs: add more info on HCQProgram
#5683 merged Jul 24, 2024
nv better nvdisasm fail message
#5682 merged Jul 24, 2024
shorter BufferOps.CONST creation
#5681 merged Jul 24, 2024
share fusion behavior for r3 kernels
#5680 merged Jul 24, 2024
scheduling infra for isolated dags
#5679 merged Jul 24, 2024
replace RANGE max fold with generic max fold
#5676 merged Jul 24, 2024
UOp mul lt fold
#5677 merged Jul 24, 2024
generic UOp max folding
#5675 merged Jul 24, 2024
UOp compute min and max in one call [run_process_replay]
#5674 merged Jul 24, 2024
UOp mod folding
#5668 merged Jul 24, 2024
increase amount of float2/float4 folding
#5672 merged Jul 24, 2024
remove MERGE opt, cleanup wmma upcast
#5669 merged Jul 24, 2024
simple TC change [run_process_replay]
#5671 merged Jul 24, 2024
add vmin vmax of SPECIAL
#5670 merged Jul 24, 2024
switch contract arg to match expand arg [run_process_replay]
#5667 merged Jul 24, 2024
remove UOps lt pattern of booleans
#5666 merged Jul 24, 2024
more generic lt folding
#5665 merged Jul 23, 2024
skip interpolate tests for PYTHON=1
#5664 merged Jul 23, 2024
Fix cuda tc emu test
#5663 merged Jul 23, 2024
remove ptx PTXRenderer.gdim gid lid [run_process_replay]
#5662 merged Jul 23, 2024
update UOp.SPECIAL arg spec [run_process_replay]
#5661 merged Jul 23, 2024
fix acc folding for NV tensor cores
#5658 merged Jul 23, 2024
skip test_failure_39 in CI
#5660 merged Jul 23, 2024
reorder UOps.DEFINE_VAR in runtime [run_process_replay]
#5659 merged Jul 23, 2024
simple UOp lt/ge folding
#5657 merged Jul 23, 2024
start scheduler process replay
#5656 merged Jul 23, 2024
uop mod-mod simplification
#5650 merged Jul 23, 2024
hcq profile tests
#5654 merged Jul 23, 2024
more work toward non-blocking process replay
#5653 merged Jul 23, 2024
hcq move out program call to base class
#5638 merged Jul 23, 2024
merge gated stores spec
#5652 merged Jul 23, 2024
amd tiny cleanups
#5651 merged Jul 23, 2024
add tests for uops stats
#5649 merged Jul 23, 2024
uop symbolic simple mul mod
#5648 merged Jul 23, 2024
memory estimate of cache also
#5646 merged Jul 23, 2024
reuse UOp.sparents in UOps.vars [run_process_replay]
#5647 merged Jul 23, 2024
dumb linearizer example that max is not simplified
#5644 merged Jul 22, 2024
typo in ops_amd invalidate_caches
#5643 merged Jul 22, 2024
fix arange 4096 with more folding rules
#5641 merged Jul 22, 2024
UOp.const(x.dtype, y) -> x.const(y) [run_process_replay]
#5642 merged Jul 22, 2024
UOp mul div simplification
#5637 merged Jul 22, 2024
hcq move out synchronize to base class
#5634 merged Jul 22, 2024
amd more accurate cache management
#5631 merged Jul 22, 2024
more actionable verify_lazyop assert
#5635 merged Jul 22, 2024
hcq: remove duplicate allocation of kernel args by abstracting
#5633 merged Jul 22, 2024
hcq cache invalidation for beam
#5630 merged Jul 22, 2024
replace gates in uopgraph [run_process_replay]
#5632 merged Jul 22, 2024
test: put conv in one reduce
#4441 merged Jul 22, 2024
folding without UNMUL
#5628 merged Jul 22, 2024
helpers: remove duplicate data64 helpers in amd/nv
#5627 merged Jul 21, 2024
parallel mcts
#5626 merged Jul 21, 2024
move ufix inside UOp [run_process_replay]
#5621 merged Jul 21, 2024
mcts exit condition wasn't right, also use it with BEAM>=100
#5619 merged Jul 21, 2024
simpler pattern matcher rules [run_process_replay]
#5620 merged Jul 21, 2024
mcts graph and dedup support
#5618 merged Jul 21, 2024
tests if the linearizer is generating dumb code
#5611 merged Jul 21, 2024
MCTS tweaks
#5616 merged Jul 21, 2024
BEAM bugfix, kernels dedup now
#5617 merged Jul 21, 2024
one more test case for symbolic mod mul
#5615 merged Jul 20, 2024
copy mlperf 4.0 to mlperf 4.1
#5614 merged Jul 20, 2024
hcq move map to allocator
#5610 merged Jul 20, 2024
casual work on mcts improvements
#5606 merged Jul 20, 2024
test argmax multi reduce failure in uopgraph
#5609 merged Jul 20, 2024
small input_st reorder
#5608 merged Jul 20, 2024
elf loader touchups
#5607 merged Jul 20, 2024
hcq simpler _gpu2cpu_time
#5605 merged Jul 20, 2024
docs: fix synchronization example in hcq
#5604 merged Jul 20, 2024
mcts search
#5598 merged Jul 20, 2024
move UPat and PatternMatcher from uopgraph.py to uops.py
#5597 merged Jul 19, 2024
CLIP Vision
#5595 merged Jul 19, 2024
remove obsolete code
#5596 merged Jul 19, 2024
fix no locals behavior
#5593 merged Jul 19, 2024
lowerer img index
#5592 merged Jul 19, 2024
doc: variable names in abstractions2.py
#5591 merged Jul 19, 2024
correct IDIV dtype check error msg
#5589 merged Jul 19, 2024
hcq refactor signal into class
#5575 merged Jul 19, 2024
Fix typo in Runtime Overview docs
#5588 merged Jul 19, 2024
careful memory counting (with tests to specify behavior)
#5587 merged Jul 19, 2024
always reverse global dim
#5586 merged Jul 19, 2024
push contract through cast to fix test_float2_acc (try 2)
#5585 merged Jul 19, 2024

17 Pull requests opened by 12 people

allow specify splits in shard, handle multiple different splits in MLB.e
#5599 opened Jul 20, 2024
MLB support reshape for uneven shards
#5600 opened Jul 20, 2024
Intel XMX Tensor Core Support
#5622 opened Jul 21, 2024
merge gated stores
#5636 opened Jul 22, 2024
Shape changing bitcast final
#5640 opened Jul 22, 2024
allow bitcasts types and testing
#5645 opened Jul 22, 2024
Pretty print LazyBuffer
#5655 opened Jul 23, 2024
[WIP] amx support as TC
#5693 opened Jul 24, 2024
start triton backend
#5695 opened Jul 24, 2024
UOp const folding in `__post_init__`
#5696 opened Jul 24, 2024
late load merging gets 144 TFLOPS matmul on 4090
#5699 opened Jul 25, 2024
Multiple gradients for force-matching problems
#5701 opened Jul 25, 2024
skip hashing unrealized children
#5717 opened Jul 25, 2024
optimize symbolic-related updates in graphs
#5727 opened Jul 26, 2024
named UOp class "NOP"
#5728 opened Jul 26, 2024
PTX render vec CONST
#5729 opened Jul 26, 2024
process replay diffs 3 things now
#5731 opened Jul 26, 2024

5 Issues closed by 2 people

Backward pass convs have two reduces
#3572 closed Jul 25, 2024
`UPat` `__repr__` does not include permutation from list
#5700 closed Jul 25, 2024
Matching engine is slow ($500 bounty)
#4878 closed Jul 19, 2024
Unify lazy.py reduce split with OptOpts.GROUP
#4910 closed Jul 19, 2024
IDIV may return float. Integer division by zero does not error like PyTorch
#5005 closed Jul 19, 2024

1 Issue opened by 1 person

Fail to run a simple example with nv backend
#5730 opened Jul 26, 2024

15 Unresolved conversations

Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.

Clang jit
#4492 commented on Jul 21, 2024 • 2 new comments
Make vectorization of CONST explicit [run_process_replay]
#5322 commented on Jul 26, 2024 • 2 new comments
Bounty: Fast parallel scan (Mamba, etc).
#3039 commented on Jul 20, 2024 • 0 new comments
Apple M1 Max cannot load llama3-8b-sfr weights (because no bfloat support?)
#5549 commented on Jul 23, 2024 • 0 new comments
simple linear kernel not fusing
#5527 commented on Jul 24, 2024 • 0 new comments
Improve reduceop elementwise fusion
#4323 commented on Jul 25, 2024 • 0 new comments
Fuse double expands
#4589 commented on Jul 25, 2024 • 0 new comments
[DRAFT PROPOSAL] Outline for AMD >100TFLOPS matmul for 7900XTX bounty
#5569 commented on Jul 26, 2024 • 0 new comments
UNet3D MLPerf
#3470 commented on Jul 25, 2024 • 0 new comments
[MLPERF] Retinanet
#4245 commented on Jul 25, 2024 • 0 new comments
qcom: driver init
#5213 commented on Jul 25, 2024 • 0 new comments
RDNA3 assembler (WIP)
#5232 commented on Jul 24, 2024 • 0 new comments
isolate Tensor.sin error in LLVM and NV=1
#5463 commented on Jul 23, 2024 • 0 new comments
Multireduce Lowerer
#5515 commented on Jul 26, 2024 • 0 new comments
draft: move SPLIT_REDUCEOP into kernel.py
#5572 commented on Jul 24, 2024 • 0 new comments