Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scan is overly specialized #68371

Open
mydogisbox opened this issue Jan 19, 2020 · 22 comments
Open

Scan is overly specialized #68371

mydogisbox opened this issue Jan 19, 2020 · 22 comments
Labels
A-iterators Area: Iterators C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@mydogisbox
Copy link

Here is the existing type signature for std::iter::scan:
fn scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F> where F: FnMut(&mut St, Self::Item) -> Option<B>,

If I were to use a verbose name for scan as implemented now, I would call it map_with_state_until, this means that if I just want map_with_state, things can get really awkward using the existing scan if I want to return an option which doesn't terminate the Iterator for collect etc. e.g. vecter.iter().scan( |state, x| if x > 1 then Some(None) else Some(Some(x)) ).flatten()

If we instead have F just turn B instead of Option<B> like:
fn scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F> where F: FnMut(&mut St, Self::Item) -> B, then that becomes a lot more simple:
vecter.iter().scan( |state, x| if x > 1 then None else Some(x) )

and achieving the existing scan behavior is trivially achieved by simply adding fuse().

This allows scan to have behavior more in-line with similar functions like map and fold:

fn fold<B, F>(self, init: B, f: F) -> B where
    F: FnMut(B, Self::Item) -> B, 

fn map<B, F>(self, f: F) -> Map<Self, F> where
    F: FnMut(Self::Item) -> B,

and also brings it more in-line with how other languages define scan : (a -> b -> a) -> a -> [b] -> [a] (Haskell), ('State -> 'T -> 'State) -> 'State -> seq<'T> -> seq<'State> (F#) etc.

I think this also gives a clear way forward to addressing other issues which previously ended up unresolved like:
#14425

So I think there ought to be scan as defined above, and scan_until which is the existing implementation.

References:
http:https://zvon.org/other/haskell/Outputprelude/scanl_f.html

https://msdn.microsoft.com/en-us/visualfsharpdocs/conceptual/seq.scan%5b't,'state%5d-function-%5bfsharp%5d

@jonas-schievink jonas-schievink added A-iterators Area: Iterators C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jan 19, 2020
@jonas-schievink
Copy link
Contributor

We cannot change the existing definition of scan, but maybe adding another method makes sense, provided that this is useful enough.

@RustyYato
Copy link
Contributor

RustyYato commented Jan 20, 2020

Isn't map_with_state just map? I don't see what the difference is given that Rust has stateful closures. scan takes an iterator and can arbitrarily change how many elements are yielded (which no other adapter does, and may be too powerful and unweildy). This makes it useful in some niche cases.

@mydogisbox
Copy link
Author

mydogisbox commented Jan 26, 2020

@KrishnaSannasi

can arbitrarily change how many elements are yielded

This is exactly the part of the existing scan implementation which I think is problematic. Its non-standard to not yield an output element for each input element (see https://en.wikipedia.org/wiki/Prefix_sum for the generally accepted definition). This makes it less generally applicable than otherwise.

You can certainly implement the standard scan definition using map and a stateful closure, but its much less obvious what is happening and requires application developers to implement more logic. scan is a standard function in most modern languages.

In my case I'm laying out lines of text with word wrapping so I need to keep track of the y-offset of the end of the previous line and yield the character offset of where to split each line. The wikipedia article I linked lists other applications.

@mydogisbox
Copy link
Author

Building on the naming in the wikipedia article, maybe we could add inclusive_scan and exclusive_scan and deprecate scan.

@0-Kala-0
Copy link

0-Kala-0 commented Nov 21, 2020

If we look at the different map and fold methods that exist, we can infer how scan methods should behave and be named.

Mapping methods

Regular mapping :

fn map<B, F>(self, f: F) -> Map<Self, F>
where
    F: FnMut(Self::Item) -> B

Filter mapping (F returns an Option<B>. When encountering None, the value is skipped and the mapping goes on):

fn filter_map<B, F>(self, f: F) -> FilterMap<Self, F>
where
    F: FnMut(Self::Item) -> Option<B>

While Mapping (F returns an Option<B>. When encountering None, the mapping ends):

fn map_while<B, P>(self, predicate: P) -> MapWhile<Self, P>
where
    P: FnMut(Self::Item) -> Option<B>

Folding methods

Seeded folding:

fn fold<B, F>(self, init: B, f: F) -> B
where
    F: FnMut(B, Self::Item) -> B

First-seeded folding:

fn fold_first<F>(self, f: F) -> Option<Self::Item>
where
    F: FnMut(Self::Item, Self::Item) -> Self::Item

A scan is literally meant to map each step of a folding operation. As it's somehow a combination of both operations, the concept of "scanning" might be any combination of their variants. If we stick to consistency in the naming, the possible scanning methods would be :

Seeded First as seed
Regular
fn scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F>
where
    F: FnMut(&mut St, Self::Item) -> B
fn scan_first<F>(self, f: F) -> ScanFirst<Self, F>
where
    F: FnMut(Self::Item , Self::Item) -> Self::Item
Filter
fn filter_scan<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F>
where
    F: FnMut(&mut St, Self::Item) -> Option<B>
fn filter_scan_first<F>(self, f: F) -> FilterScanFirst<Self, F>
where
    F: FnMut(Self::Item, Self::Item) -> Option<Self::Item>
While
fn scan_while<St, B, F>(self, initial_state: St, f: F) -> Scan<Self, St, F>
where
    F: FnMut(&mut St, Self::Item) -> Option<B>
fn scan_first_while<F>(self, f: F) -> ScanFirstWhile<Self, F>
where
    F: FnMut(Self::Item, Self::Item) -> Option<Self::Item>

(note that I'm not 100% certain that the _first versions actually need to require a function returning a Self::Item. They might be implemented requiring functions returning any B with a Self::Item accmulator internally, but it's non-trivial and I've done enough research one today ^^)

We can see from this table that the current behavior of scan is the behavior we could expect of scan_while(). It's made even more obvious by the observation that map_while() is just a particular case of the current implementation of scan() (where you just ignore the accumulator) : #68537 (comment).

It is indeed very hard to change the current definition of scan as it would completely break existing code. Changing the naming convention could be the way to get around the problem (and indeed depreciating scan()). Additionally, if we want to keep some form of consistency, it would also be needed to change the name of fold() (and depreciate the current one). Here's a proposal :

fold_seed() // currently: fold()
fold_first() // currently: fold_first(), no change here

scan_seed() // no method in rust
scan_first() // no method in rust

filter_scan_seed() // no method in rust
filter_scan_first() // no method in rust

scan_seed_while() // currently: scan()
scan_first_while() // no method in rust

Obviously, not all of them are not needed. Some of them can be easily expressed using other ones (see the Wikipedia article above for details). Some of them have very few practical use and might not need to be implemented in Rust.

Though, I strongly agree with the original concern of the issue : having as the only available implementation of scan the one we currently have is quite terrible : more often than not, the "end scanning when F returns None" behavior is unrevelant for the current use case and sometimes it's even getting in the way of what's being done, resulting in quite unreadable code like the one described by @mydogisbox or even worst if you're trying to achieve a filter_scan().

PS : I'm not sure how "seeded" and "first-seeded" relate to "inclusive" and "exclusive" scanning/folding. It seems to me that they're different concepts, and the Wikipedia page seems to disagree with me but in a very unclear way, so I chose not to confuse the two. They might be the same thing, IDK

@mydogisbox
Copy link
Author

@Mayt38 great points!

@ghost
Copy link

ghost commented Nov 27, 2020

I was trying to implement a function which accepts an iterator and reduces adjacent repeating elements into only one element. Say we have [1, 1, 2, 3, 4, 4, 5], and I want the output to be [1, 2, 3, 4, 5]. I found that I had to resort to the nested Option solution. I really think that it is a great point that we should have scan not fusing the iterator. The current implementation violates the Single Responsibility Rule and really leads to confusion in that it behaves different from the common scan's as in other languages.

@Johannesd3
Copy link

Besides the fact that the current scan should actually be called scan_filter, I don't like that the closure takes a mutable reference. Aren't iterators all about avoiding mutability?

The proposed scan (without _while) would be equivalent to

let mut state = init;
iter.map(move |x| f(state, x));

This is at least as simple and pretty as scan (map and its closure both take one argument while scan and its closure require two arguments which might get ugly), and more importantly, it doesn't hide the fact that there is some kind of mutability. Maybe that's a reason why a "simple" scan never made it to the standard library. However, after introducing map_while, the current scan could also be replaced in this way. In my opinion, scan becomes quite useless.

Let's consider Haskell's scanl. It is very similar to folding but it preserves all intermediate accumulator states and creates a new list of it. Implementing this in Rust gets more complicated.

First try:

let mut state = init;
iter.map(move |x| {
   state = f(state, x);
   state
}))

...but this does only work for copy types and it differs from Haskell's scanl: The first item of scanl's resulting list is always the init element (so that it has one more item than the original list).

My next try, now passing the state to f by reference:

let mut state = init;
iter.map(move |x| {
   let new_state = f(&state, x);
   std::mem::replace(&mut state, new_state)
})

...but now the last item is missing and we cannot simply get it out of the closure's state. Unless anyone else has a better idea, you would have to create your own iterator struct to emulate the behaviour of Haskell.

Haskell is a well-thought-out language and I think that scanl is useful, in contrary to Rust's current scan. So if you consider deprecating and replacing scan, you might as well take these thoughts into consideration.

@mydogisbox
Copy link
Author

@Johannesd3 I didn't even think about how odd it is to have the state being mutable. Good point.

@RustyYato
Copy link
Contributor

RustyYato commented Dec 29, 2020

Aren't iterators all about avoiding mutability?

I disagree with this, iterators are about being declarative instead of imperative, mutability is an orthogonal concern. Rust doesn't shy away from mutability like in more functional languages, because sometimes mutability is required for performance or making the algorithm easier to understand. Limiting mutability artificially makes apis harder to use in Rust.

For example: Vec::retain could give you &mut T, but it doesn't, it gives you a &T. This makes it painful to use. So painful, that other Vec-like abstractions changed their retain to give a &mut T (SmallVec::retain). This will eventually be "fixed" by drain_filter, which does give you a &mut T, whenever that stabilizes. Because of this, I don't believe chasing after immutability, when it is easy to allow mutability is useful.

All that said, I also agree that the current implementation of Scan is pretty niche and not fit for std.

Let's consider Haskell's scanl. It is very similar to folding but it preserves all intermediate accumulator states and creates a new list of it. Implementing this in Rust gets more complicated.

scan is useless in Rust precisely because Rist has mutability, and is useful in Haskell precisely because it lacks mutability. In Rust map can do pretty much everything that scanl does in Haskell, so it doesn't seem useful to have an additional scan functionality. Mutability isn't bad, unconstrained mutability is bad, and Rust provides the tools to constrain mutability without making it useless.

scanl in Rust
-- from prelude
scanl :: (b -> a -> b) -> b -> [a] -> [b]

is equivalent to

mut_scanl :: (a -> State b ()) -> b -> [a] -> [b]
mut_scanl mut_f b as = scanl f b as
    where f b a = next_b
            where ((), next_b) = runState (mut_f a) b

is equivalent to the following Rust

fn scanl<B, F, I>(f: F, mut b: B, iter: I) -> impl Iterator<Item = B>
where
    I: Iterator,
     // in Haskell everything is `Clone`, so this is equivalent
    B: Clone,
    // 1. may as well provide `FnMut` because we can
    // 2. `&mut B` is equivalent to `State B ()` in Haskell
    F: FnMut(I::Item, &mut B),
{
    iter.map(move |a| {
        f(a, &mut b);
        b.clone()
    })
}

We could take it a step further:

fn scanl<B, F, I>(f: F, mut b: B, iter: I) -> impl Iterator<Item = C>
where
    I: Iterator,
    F: FnMut(I::Item, &mut B) -> C,
{
    iter.map(move |a| f(a, &mut b))
}

But what does this offer over map? Since map already gives you &mut _ access to you surrounding scope, this scanl gives you no additional power.

I can see a use case for the first implementation of scanl in Rust: When you want to partition your state and return a portion of it, but even that first version of scanl seems super niche.

We transform this back to Haskell too

mut_scanl :: (a -> State b c) -> b -> [a] -> [c]
mut_scanl mut_f b = join . map pick_c . scanl f (b, Nothing)
    where f (b, _) x = (next_b, Just next_c)
            where (next_c, next_b) = runState (mut_f x) b
          pick_c (_, c) = maybeToList c

(my Haskell isn't great, I'm just trying so the actual implementations aren't the focus, just the signatures)

@Johannesd3
Copy link

Johannesd3 commented Dec 29, 2020

I wasn't claiming that scanl in Rust would be super useful. But I think it is at least more useful than the current scan.

One example: Creating an iterator that yields factorials using num_bigint.

Using scan:

iter::once(one()).chain((1..).scan(one(), |state: &mut BigUint, x: usize| {
    *state *= x;
    Some(state.clone())
}))

Using map makes this not worse in my opinion:

let mut fac = one();

(0..).map(move |n: usize| { 
    if n > 0 { 
        fac *= n;
    }
    fac.clone()
})

And using a Haskell-like myscan:

myscan(1.., one(), |state: &BigUint, x: usize| x * state)

Link to playground, containg the implementation of myscan.

But the last one has two disadvantages: 1. The next factorial is calculated one step too early. 2. It is noticeably slower, but I am not sure why.

So at least there seems to be a reason why the way with the mutable reference was chosen. (However, I am curious what exactly makes it so much slower, even though it works without clone)

EDIT: Wrong link

@mydogisbox
Copy link
Author

@RustyYato the more general scan is useful in rust when you want to chain iterator functions like (actual example from my code):

    font.layout(todo, *scale, point)
        .enumerate()
        .scan(0., |row_offset, (index, item)| {
            let item_offset = item
                .pixel_bounding_box()
                .map_or_else(|| item.position().x, |x| x.max.x as f32);
            if index == 0 {
                Some(Some(0))
            } else if item_offset - *row_offset <= (WIDTH as f32) {
                Some(None)
            } else {
                *row_offset = *row_offset + item_offset;
                Some(Some(index))
            }
        })
        .flatten()
        .chain(once(todo.len()))
        .collect::<Vec<usize>>()
        .windows(2)
        .map(|indices| &todo[indices[0]..indices[1]])
        .collect::<Vec<&'a str>>()

In this example using map with a mutable collection would be much less readable than using scan due to how it changes the scope of what is needed to understand a particular step in the chain.

retain seems like a poor example for why mutability would be a good thing generally since its an inherently mutating operation in the first place, whereas map, scan etc are not.

I'm not against mutability, but I do think that separating mutable operations from non-mutable operations improves readability because expectations are more clear. If I wanted to both transform each element and update a collection I would just use a for loop. This is how its done in other languages I've used that have mutability controls (e.g. F#).

Whether this is done conventionally or enforced by the standard library definitions is, however, a different question. Maybe there is a case to be made that people ought to be allowed to mutate at will for performance or other reasons.

@RustyYato
Copy link
Contributor

RustyYato commented Dec 29, 2020

@Johannesd3
The perf difference is probably because it's cheaper to do *= on BigInt then clone than to multiply &BigInt by usize.

This is because the former can reuse the allocation, while the latter cannot. This means that *= on average does less reallocations, which is a perf improvement. The clone is really fast because it only needs to do one allocation because the size is known ahead of time. As opposed to &BigInt * usize, which lowers to big_int.clone() * usize, which needs to reallocate more than *= on average.

@mydogisbox

the more general scan is useful in rust when you want to chain iterator functions like (actual example from my code):

If every return of the Scan is Some(_), then you can replace it with a map, with no leakage into the outer scope like so:

code
font.layout(todo, *scale, point)
    .enumerate()
    .map({ let row_offset = 0.0; |(index, item)| {
        let item_offset = item
            .pixel_bounding_box()
            .map_or_else(|| item.position().x, |x| x.max.x as f32);
        if index == 0 {
            Some(0)
        } else if item_offset - row_offset <= (WIDTH as f32) {
            None
        } else {
            row_offset += item_offset;
            Some(index)
        }
    }})
    .flatten()
    .chain(once(todo.len()))
    .collect::<Vec<usize>>()
    .windows(2)
    .map(|indices| &todo[indices[0]..indices[1]])
    .collect::<Vec<&'a str>>()

I use this pattern when spawning threads, which is another place where you want some stuff stuffed into the closure, but not the containing scope. There are a number of ways to format this, I chose this way because it most closely mirrors scan, but normally I put the let binding on its own line.

retain seems like a poor example for why mutability would be a good thing generally since its an inherently mutating operation in the first place, whereas map, scan etc are not.

Sure, retain itself is a mutable operation, but most of the time you don't need to mutate the elements of the Vec while the retain is in progress. But it does come up, and when it does the fact that retain doesn't provide the more permissive API is frustrating.

I would argue that scan is a mutable operation, you're providing a state that is packaged with the iterator that will be mutated on each iteration. I can see map as being normally immutable.

@Johannesd3
Copy link

@RustyYato Instead of using fold to determine the n-th factorial, would it be a faster to implement it with *=? More generally: Are there cases in which fold is not the fastest option? In that case, would it be fine to sacrifice some performance of scan if it increases the consistency between fold and scan?

@RustyYato
Copy link
Contributor

There was recently an internals thread on exactly this topic here. There it was also shown that for_each can be more performant than fold, because for_each mutates in place, where as fold needs to move things around, so even there minimizing copies is enough to get a perf improvement.

I don't know how you can change scan without breaking back-compat.

@Johannesd3
Copy link

I wasn't aware of that topic, but it is quite related. I'm not sure whether to continue the disscussion here or there.

However, one interesting point from the discussion: Map still implements DoubleEndedIterator, so it does not guarantee that it is consumed in the right order. That's a reason why scan could make sense.

When I talk about "replacing" scan I think of adding a new function similar to scan with different name.

@RustyYato
Copy link
Contributor

I wasn't aware of that topic, but it is quite related. I'm not sure whether to continue the disscussion here or there.

I would just continue here

However, one interesting point from the discussion: Map still implements DoubleEndedIterator, so it does not guarantee that it is consumed in the right order. That's a reason why scan could make sense.

Another option is to add a combinator that removes the DoubleEndedIterator impl so that you can get that guarantee, but that doesn't seem worth it. But in general, I don't see how this can be a problem. In particular, either you consume the iterator where you create it, then you know when not to call rev, or you return the iterator, which usually means returning impl Iterator<Item = ???>, which prevents rev anyways. Without specialization, there isn't a way to get around this, and with specialization, there are many other ways to break traits, so you need to be careful anyways.

When I talk about "replacing" scan I think of adding a new function similar to scan with different name.

Ahh, that makes more sense.

@optozorax
Copy link

There was some suggestions about scan replacement on internals: scan is too complicated, suggestion for new iterator methods (There also was useful conversations and ideas). Here is the text from this topic:


AFAIK scan iterator method is the only iterator method that provides a mechanism to share some mutable state across iterations. But I think this method is too complicated and wide, therefore it should be replaced by more convenient and simple methods.

Let's look at scan usage example:

let a = [1, 2, 3];

let mut iter = a.iter().scan(1, |state, &x| {
    // each iteration, we'll multiply the state by the element
    *state = *state * x;

    // then, we'll yield the negation of the state
    Some(-*state)
});

assert_eq!(iter.next(), Some(-1));
assert_eq!(iter.next(), Some(-2));
assert_eq!(iter.next(), Some(-6));
assert_eq!(iter.next(), None);

Observations:

  1. We must provide an initial value of the state, we can't get it from the first iteration.
  2. state is provided by &mut reference, so we must change it by using *state = ... syntax, not by returning a value. So, we always need curly brackets. But it's a powerful tool: we can compute a return value before changing state, or after.
  3. Then we return Some() of next value. And by this we do as same as filter_map. Because of this Scan doesn't implement ExactSizeIterator, so we can't use it to allocate exact space when using the collect method.

I think:

  1. We should able to provide initial value either by telling an exact value or from first iteration.
  2. state should be changed not by mutable reference, but by function Fn(State, &Item) -> State, because it is more simple and don't require {} in simple cases.
  3. There shouldn't be filter_map or map or filter behavior inside this iterator, it should be done by the next iterator adaptors.

So, I suggest these new methods to replace scan: state_before, state_after with equal signatures.

Example:

let a = [1, 2, 3];
assert_eq!(a.iter().state_before(None,    |state, &item| state * item).eq(&[(1, 1), (2, 2), (6, 3)]));
assert_eq!(a.iter().state_before(Some(4), |state, &item| state * item).eq(&[(4, 1), (8, 2), (24, 3)]));
assert_eq!(a.iter().state_after(None,     |state, &item| state * item).eq(&[(1, 1), (1, 2), (2, 3)]));
assert_eq!(a.iter().state_after(Some(4),  |state, &item| state * item).eq(&[(4, 1), (4, 2), (8, 3)]));
  • First argument of these methods - is an initial value. If it is None, then the initial value is yielded from the first iteration. If it is Some(...), then it initialized by the inner value.
  • Second arguments of this methods - is function Fn(State, &Item) -> State. That function receives current state and current item by reference and it must return next state value.
  • The main name is state because these methods provide state across iterations.
  • Suffix before means that the state will be modified before returning the element. Suffix after means that the state will be modified after returning the element.
  • These methods return an iterator with Item = (State, OldItem) by analogy as enumerate.
  • These methods return an iterator that implements ExactSizeIterator.
  • Then, we can apply filter_map next and get the same behavior as a scan.

What do you think?


I used scan_before and scan_after in my code for competitive programming to find maximum in prefix sum array:

let max1 = r.iter().scan_after(0, |acc, x| acc + *x).map(|(x, _)| x).max().unwrap();

Mostly it's useful for something like creating or searching in prefix sum/max arrays.

@rami3l
Copy link
Member

rami3l commented Nov 12, 2021

@optozorax IMHO using integers in your example is also a bit specialized. The abstraction provided by scan should be general enough to play well with the overall ownership model.

What's annoying to me is that since scanning produces a list of (owned) partial sums, where do those ownerships come from? (In the current version of scan you need to explicitly clone, which is quite clear; in fold however, you are free from such problems since you're merging multiple ownerships into one).

@stiff
Copy link

stiff commented Oct 30, 2023

Main problem with current scan or map with external state is that the mutation of state usually returns () which forces to use multiline closure with {} block, which leads to ugly code. Simple processing like prefix sums mentioned above should be simple and elegant like iter.map_scan(0, |acc, x| acc + x), not 5 lines of curly braces and parenthesis. Otherwise it defeats the whole purpose of these helper functions, might as well just use for everywhere, it has maximum possible flexibility (and hence unreadability).

@usamoi
Copy link
Contributor

usamoi commented Jul 19, 2024

scan feels awkward to me:

  1. scan forces the closure to maintain mutable state, which always decreases code readability.
  2. scan should have good composability, but it does too many things. It should return internal states and then caller could use map to produce an iterator that returns results.

I just want an auxiliary function with a signature as simple as fold to generate prefix sums, but scan is completely unsuitable.

@ronnodas
Copy link

Prefix sums, or more generally Haskell's scanl, is easier to write using successors instead of scan but it would be nice to have it available on its own. It does seem to be relatively niche, so I have suggested it to the itertools crate here. See also another discussion at the users forum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-iterators Area: Iterators C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

10 participants