Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add aggregation functions to Mango #1254

Open
katsel opened this issue Mar 29, 2018 · 10 comments
Open

Add aggregation functions to Mango #1254

katsel opened this issue Mar 29, 2018 · 10 comments

Comments

@katsel
Copy link
Contributor

katsel commented Mar 29, 2018

The Mango query language provides CRUD operations and basic selector syntax for document retrieval.
There are no aggregation functions.
Thus, Mango does currently not have an equivalent to 'reduce' and 'rereduce' functions as provided by JavaScript MapReduce.

Expected Behavior

The _find endpoint should provide syntax and sensible keywords such as sum, average, minimum, maximum to facilitate the aggregation of values into a single field.
This change would enable users to write more powerful queries in Mango without prior knowledge of JavaScript or the MapReduce framework.

Current Behavior

None.
Mango is limited to selector/'map' syntax, so to aggregate/'reduce' data, users have to turn to JavaScript MapReduce.

Context

I'm a student in the final stages of my master's degree in computer science. I made a few non-code contributions to CouchDB. I would like to add this functionality to Mango, and make this my first code contribution to the project.

@wohali
Copy link
Member

wohali commented Apr 4, 2018

@katsel Hi! Really glad to see this pop up here. Feel free to post a pull request when you're ready.

If you want a code review prior to the code being 100% ready, just mark your PR with [WIP] and people will know that it's not in its final state.

Thank you again for taking the initiative on this work!

@janl janl added the mango label Jul 14, 2018
@janl
Copy link
Member

janl commented Jul 14, 2018

I believe #1323 could make use of this

@wohali wohali added the roadmap label Aug 7, 2018
@wohali
Copy link
Member

wohali commented Aug 7, 2018

This is the destination of the original roadmap ticket, janl/couchdb-next#19

@glynnbird
Copy link
Contributor

I was asked by @willholley to contribute this Markdown which describes a possible aggregator syntax for supporting aggregations in Mango.


What would Mango aggregation look like?

At the moment, the Mango query language only performs data selection - a portion of a larger data set can be returned by providing a JSON query. If my data looks like this:

{
  _id: "someid",
  date: "2018-08-24",
  status: "provisional",
  invoiceAddress: {
    street: "10 Front Street",
    city: "Dallas",
    state: "Texas"
  },
  amount: 7.99,
  tax: 0,
  totalAmount: 7.99,
  customerId: "A65522",
  lineItems: [
    {
      productId: "P1",
      name: "fish",
      cost: 6.0
    },
     {
      productId: "P2",
      name: "chips",
      cost: 1.99
    }
  ]
}

and I want only the "complete" orders from a database I could perform a query.

{
  selector: {
     status : "complete"
  },
  fields: ["totalAmount", "date"]
}

which would give a paged result set in blocks of 200 records:

{
	"docs": [
	  {
	    "totalAmount": 7.99,
	    "date": "2018-08-24"
          },
	  {
	    "totalAmount": 4.50,
	    "date": "2018-08-24"
          },
	  ...
	],
	"bookmark": "g1AAAAA6eJzLYWBgYMpgSmHgKy5JLCrJTq2MT8lPzkzJBYqzVqUmFSWCJDlgkgjhLADXERDn",
	"execution_stats": {
		"total_keys_examined": 0,
		"total_docs_examined": 10,
		"total_quorum_docs_examined": 0,
		"results_returned": 8,
		"execution_time_ms": 2.75
	},
	"warning": "no matching index found, create an index to optimize query time"
}

But what I really want is a grand total of the totalAmount fields, grouped by date but Mango only supports selection not aggregation. I would have to page through all the results and perform the aggregation in my code.

This document imagines that Mango magically does support aggregation. Everything that follows is fictional syntax in my imaginary world.

--- start of imaginary world ---

My first Mango aggregation

I can use the new "aggregator" object in a Mango query. My query still has a selector, because I don't want to aggregate ALL the documents, only the "complete" orders as before:

{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"]
  }
}

Note:

  • I no longer need to supply a fields object at the top level. Mango knows the fields it needs from those specified in the aggregator object.
  • the aggregator.operation field can be one of sum, count or stats.
  • aggregator.of is an array - we'll see how this works later.

If I'm using the count aggregation, I needn't supplied an aggregation.of array:

{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "count"
  }
}

A grand total example

Using an aggreator.operation of sum allows me to get a grand total of a field:

{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"]
  }
}

or fields:

{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"]
  }
}

or by emptying the "selector", all documents are aggregated:

{
  selector: { },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"]
  }
}

Grouping

In this example I am performing a more complex selector and introduce grouping:

{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date"]
  }
}

Any valid Mango selector is allowed in the "selector" object. The "selector" is evaluated at index time and decides which portion of the data makes it to the index. If the selector is changed, a new index is required to calculate the result.

The optional "aggregator.group" is an array of keys by which the sum is grouped in the result set. It is an array, so I can have multi-dimensional grouping:

{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date", "invoiceAddress.state"]
  }
}

The above example groups by "invoiceAddress.state" which demonstrates selecting data from a sub-object.

How this works

CouchDB cannot perform aggregations without an index, but instead of insisting that the user perform an additional "index" step, CouchDB will create the appropriate index (if it doesn't already exist) and return the results once indexing is complete.

** this is a big leap - but, hey, this is an imaginary world **

e.g. for an aggregation query like this:

{
  selector: {
    status : "complete"
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount"],
    group: ["date"]
  }
}

we need an index that looks like this in old-school JavaScript MapReduce

map: 
	function(doc) {
	  if (doc.status === 'complete') {
	    emit([doc.date], [doc.amount])
	  }
	}
reduce:
   "_sum"

The "selector" in an aggregation is really a partial_filter_selector so that the index only contains a sub-set of the documents.

Limit the result set with range

The trouble with this approach is that the aggregation query always returns all of the aggregations e.g. if one sum per day for every day that you have data. You may only want this month's per-day aggregates. This is where aggregator.range comes in:

{
  selector: {
    "$or": {
      status : "complete",
      status: "refunded"
    }
  },
  aggregator: {
    operation: "sum",
    of: ["totalAmount", "tax"],
    group: ["date", "invoiceAddress.state"],
    range: {
      start: ["2018-08-01"],
      end: ["2018-09-01"]
    }
  }
}

The aggregator.range object can be used at query time to return a smaller proportion of the aggregated result, whether grouping or not.

---- end of imaginary world ---

@garrensmith
Copy link
Member

@glynnbird a really nice concept. This has some really great potential. Thanks for sharing.

@htejwani
Copy link

Is there any time estimate for the aggregate feature to be released?

@garrensmith
Copy link
Member

garrensmith commented Jun 26, 2019 via email

@wohali wohali moved this from Proposed for 3.x to Proposed (release independent) in Roadmap Jul 11, 2019
@cslobodan
Copy link

@glynnbird Wow, that is how feature requests should be described. This is exactly what I'm missing in Mango queries. Too bad this was left in an imaginary world...
@garrensmith This is the oldest feature request in the "Proposed (backlog)" section of Roadmap. Hope it will be worked on soon. Thanks!

@NagaPranavi9
Copy link

Is it implemented? If yes, May I know in which version of couchdb it is implemented...as I had the same requirement, so I just want to know.
Thank you

@rnewson
Copy link
Member

rnewson commented Sep 29, 2023

@NagaPranavi9 it is not implemented yet, we'll close the issue as and when that happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Roadmap
  
Proposed (backlog)
Development

No branches or pull requests

9 participants