
ktensor.full() improvements #299

Open

ghbrown opened this issue Jun 21, 2024 · 2 comments
ghbrown commented Jun 21, 2024

The current implementation of .full() for ktensors is extremely memory- and compute-intensive. It's easy to run out of memory with a tensor of modest size but high rank, since the current algorithm requires memory roughly proportional to the rank times the size of the full tensor.
I believe an einsum-based approach would be much more efficient.

Consider the timing results for the code below.

n = 1000:
  full tensor size: ~8 GB
  current .full() profiling: 107 s, 120 GB peak memory usage
  einsum based .full():      65 s,  8 GB peak memory usage

n = 100:  (as evidence that einsum outperforms for small tensors too)
  current .full() profiling: 0.1s
  einsum based .full():      0.06s
import pyttb
import numpy as np
import time

def einfull(w, fm_list):
    # contract the weight vector against all factor matrices in one shot;
    # subscripts are hardcoded for a 3-way tensor
    return np.einsum('r,ir,jr,kr->ijk', w, *fm_list)

n = 1000
r = 10
w = np.random.rand(r)
A = np.random.rand(n,r)
B = np.random.rand(n,r)
C = np.random.rand(n,r)
K = pyttb.ktensor()
K.weights = w
K.factor_matrices = [A,B,C]
t_0 = time.perf_counter()
T = K.full()
# T = einfull(w,[A,B,C])
t_1 = time.perf_counter()
print(t_1 - t_0)
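For what it's worth, the subscripts can also be built programmatically so the same einsum approach works for any number of modes. A minimal sketch (einfull_general is a hypothetical helper, not part of pyttb):

import numpy as np

def einfull_general(w, fm_list):
    # assign one subscript letter per mode (a, b, c, ...) and reserve
    # 'z' for the rank index; supports up to 25 modes
    mode_letters = [chr(ord('a') + i) for i in range(len(fm_list))]
    operands = ['z'] + [letter + 'z' for letter in mode_letters]
    subscripts = ','.join(operands) + '->' + ''.join(mode_letters)
    return np.einsum(subscripts, w, *fm_list)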

tgkolda commented Jun 22, 2024

@ghbrown This is not being done according to the efficient version from MATLAB (aka Version 2). If you translate that algorithm, it should be much better.
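For reference, a minimal Python sketch of that style of algorithm, assuming the efficient MATLAB version amounts to a single matricized product (scale the first factor matrix by the weights, multiply by the transposed Khatri-Rao product of the remaining factors, and reshape):

import numpy as np
import pyttb

def full_matricized(w, fm_list):
    # sketch of a mode-1-unfolding approach, assuming
    # T_(1) = (A_1 diag(w)) (A_d ⊙ ... ⊙ A_2)^T
    dims = tuple(fm.shape[0] for fm in fm_list)
    data = (fm_list[0] * w) @ pyttb.khatrirao(*fm_list[1:], reverse=True).T
    return pyttb.tensor(data, dims, copy=False)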


ghbrown commented Jun 25, 2024

Yes, thanks for the suggestions. I'll write a PR for this shortly, but for now here is a mock-up with benchmarks.

n = 1000:
  full tensor size: ~8 GB
  current .full() profiling: 107 s, 120 GB peak memory usage
       einsum based .full(): 65 s,  8 GB peak memory usage
        split based .full(): 1.6 s,  8 GB peak memory usage

Interestingly, in the case of two splits with the same memory footprint, it happens to be optimal to choose the one with the smaller row dimension (and so larger column dimension). I assume this is because the result of the outer product is laid out in row-major order, but I'm not positive.

Outer product timings:
  (1e3 x 10) @ (10 x 1e6): 1.4 s
  (1e6 x 10) @ (10 x 1e3): 5.6 s
import pyttb
import numpy as np
import time

def einfull(w, fm_list):
    # einsum-based full(); subscripts hardcoded for a 3-way tensor
    return np.einsum('r,ir,jr,kr->ijk', w, *fm_list)

def min_split_dims(dims):
    """
    solve
      min_{i in range(1,d)}  product(dims[:i]) + product(dims[i:])
    note: returning the first minimizer of sum_of_prods (argmin()
          default behavior) seems to reduce the runtime of full(),
          since forming a row-vector-like outer product is faster
          than forming the transposed column-vector-like outer
          product (due to memory layout of the outer product?)
    """
    sum_of_prods = [np.prod(dims[:i]) + np.prod(dims[i:]) for i in range(1, len(dims))]
    i_min = np.argmin(sum_of_prods) + 1  # note: range above starts at 1
    return i_min

def full2(w, fm_list):
    # split the modes into two groups, form the Khatri-Rao product of
    # each group, and combine them with a single weighted matrix product
    dims = [fm.shape[0] for fm in fm_list]
    i_split = min_split_dims(dims)
    data = ((pyttb.khatrirao(*fm_list[:i_split], reverse=True) * w) @
            pyttb.khatrirao(*fm_list[i_split:], reverse=True).T)
    return pyttb.tensor(data, tuple(dims), copy=False)

# benchmark short-fat vs tall-skinny outer products
A_test = np.random.rand(1000,10)
B_test = np.random.rand(1000**2,10)
w_test = np.random.rand(10)
t_0 = time.perf_counter()
C_test_1 = (A_test*w_test) @ B_test.T
t_1 = time.perf_counter()
print(f'row-vector-like time: {t_1 - t_0}')
t_0 = time.perf_counter()
C_test_2 = (B_test*w_test) @ A_test.T
t_1 = time.perf_counter()
print(f'column-vector-like time: {t_1 - t_0}')
del A_test, B_test, C_test_1, C_test_2, w_test  # free benchmark arrays

n = 1000
r = 10
w = np.random.rand(r)
A = np.random.rand(n,r)
B = np.random.rand(n,r)
C = np.random.rand(n,r)
K = pyttb.ktensor()
K.weights = w
K.factor_matrices = [A,B,C]
t_0 = time.perf_counter()
# T_full_1 = K.full()
# T_ein    = einfull(w,[A,B,C])
T_full_2 = full2(w,[A,B,C])
t_1 = time.perf_counter()
print(f'time to construct full tensor: {t_1 - t_0}')
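As a quick sanity check, the variants can be compared elementwise on a small problem (a sketch reusing the definitions above; all three should agree up to floating-point roundoff):

n_s, r_s = 20, 4
w_s = np.random.rand(r_s)
fms = [np.random.rand(n_s, r_s) for _ in range(3)]
K_s = pyttb.ktensor()
K_s.weights = w_s
K_s.factor_matrices = fms
assert np.allclose(K_s.full().data, einfull(w_s, fms))
assert np.allclose(full2(w_s, fms).data, einfull(w_s, fms))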
