Aggregation Pipeline In MongoDB using $bucket & $facet  (Part 4)

Hello and welcome back, readers, to the next and last part of this series. In this part, we will learn about two other important operators: $bucket and $facet.

So far in this series, we have explored some of the most commonly used aggregation stages, such as $addFields, $lookup, $match, $group, and $project. These stages are the real deal when you want to extract exactly the data you need at a given step of the pipeline.

Building on those ideas, we are going to learn another important aggregation stage, $bucket, which can reduce the number of stages you need when you want to group data into ranges.

$bucket:

As the name suggests, the $bucket stage categorizes the incoming documents into groups, called buckets, based on a specified expression and a set of bucket boundaries, and outputs one document per bucket. Each output document contains an _id field whose value is the inclusive lower bound of the bucket.

Note: $bucket only produces output documents for buckets that contain at least one input document.

The syntax for the $bucket stage is shown below:

{
  $bucket: {
      groupBy: <expression>,
      boundaries: [ <lowerbound1>, <lowerbound2>, ... ],
      default: <literal>,
      output: {
         <output1>: { <$accumulator expression> },
         ...
         <outputN>: { <$accumulator expression> }
      }
   }
}
  1. groupBy: An expression, usually a field path such as "$year_born", whose value decides which bucket each document goes into.

  2. boundaries: An array of values, in ascending order, that mark where each bucket starts. Each adjacent pair of values defines one bucket with an inclusive lower bound and an exclusive upper bound, so you need at least two values.

  3. default: An optional field that specifies the _id of an additional bucket, which collects all documents whose groupBy result does not fall into any bucket defined by boundaries. If such a document exists and no default is specified, the operation fails with an error.

  4. output: Also an optional field. It describes the fields to include in each bucket document in addition to _id, and each of them must be defined with an accumulator expression. If you omit output, each bucket document contains only _id and a count field.
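
To make the boundary rules concrete, here is a minimal sketch in plain JavaScript of how a single value could be matched to a bucket. The function name assignBucket is made up for this illustration and is not part of MongoDB; the sketch only mirrors the rules described above (inclusive lower bound, exclusive upper bound, optional default).

```javascript
// Plain-JavaScript emulation of how $bucket picks a bucket for one value.
// `assignBucket` is a hypothetical helper, not a MongoDB API.
function assignBucket(value, boundaries, defaultId) {
  // Each adjacent pair [boundaries[i], boundaries[i + 1]) forms one bucket.
  for (let i = 0; i < boundaries.length - 1; i++) {
    if (value >= boundaries[i] && value < boundaries[i + 1]) {
      return boundaries[i]; // the inclusive lower bound becomes the bucket _id
    }
  }
  // The value is outside every bucket: fall back to the default, if any.
  if (defaultId === undefined) {
    throw new Error("value " + value + " is outside all buckets and no default was given");
  }
  return defaultId;
}

const sampleBoundaries = [1940, 1950, 1960, 1970, 1980];
console.log(assignBucket(1968, sampleBoundaries, "Other")); // 1960
console.log(assignBucket(1980, sampleBoundaries, "Other")); // Other (1980 is the exclusive upper bound)
```

Note that the last value in boundaries never starts a bucket of its own; it only closes the previous one.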

Now, let us learn with the help of an example. Consider a collection named artist, as shown below:

// This is our collection

{ "_id" : 1, "last_name" : "Bernard", "first_name" : "Emil", "year_born" : 1968, "nationality" : "France" },
  { "_id" : 2, "last_name" : "Rippl-Ronai", "first_name" : "Joszef", "year_born" : 1961, "nationality" : "Hungary" },
  { "_id" : 3, "last_name" : "Ostroumova", "first_name" : "Anna", "year_born" : 1971, "nationality" : "Russia" },
  { "_id" : 4, "last_name" : "Van Gogh", "first_name" : "Vincent", "year_born" : 1953, "nationality" : "Holland" },
  { "_id" : 5, "last_name" : "Maurer", "first_name" : "Alfred", "year_born" : 1968,  "nationality" : "USA" },
  { "_id" : 6, "last_name" : "Munch", "first_name" : "Edvard", "year_born" : 1963,  "nationality" : "Norway" },
  { "_id" : 7, "last_name" : "Redon", "first_name" : "Odilon", "year_born" : 1940,  "nationality" : "France" },
  { "_id" : 8, "last_name" : "Diriks", "first_name" : "Edvard", "year_born" : 1955,  "nationality" : "Norway" }

// Now let us apply the $bucket stage to group the artists by ranges of year_born
db.artist.aggregate([
  {
    $bucket: {
      groupBy: "$year_born",                       // field to group by
      boundaries: [1940, 1950, 1960, 1970, 1980],  // five boundaries create four buckets
      default: "Other",                            // bucket _id for documents outside every range
      output: {                                    // fields to include in each bucket document
        "count": { $sum: 1 },
        "artists": {
          $push: {
            "name": { $concat: ["$first_name", " ", "$last_name"] },
            "year_born": "$year_born"
          }
        }
      }
    }
  }
])

// The result of the aggregation is shown below:
// one document per bucket, with the document count and the array of artists

{ "_id" : 1940, "count" : 1, "artists" : [ { "name" : "Odilon Redon", "year_born" : 1940 } ] }
{ "_id" : 1950, "count" : 2, "artists" : [ { "name" : "Vincent Van Gogh", "year_born" : 1953 },
                                           { "name" : "Edvard Diriks", "year_born" : 1955 } ] }
{ "_id" : 1960, "count" : 4, "artists" : [ { "name" : "Emil Bernard", "year_born" : 1968 },
                                           { "name" : "Joszef Rippl-Ronai", "year_born" : 1961 },
                                           { "name" : "Alfred Maurer", "year_born" : 1968 },
                                           { "name" : "Edvard Munch", "year_born" : 1963 } ] }
{ "_id" : 1970, "count" : 1, "artists" : [ { "name" : "Anna Ostroumova", "year_born" : 1971 } ] }

Now, let's understand the result of this aggregation:

  1. The _id of each output document is the inclusive lower bound of its bucket, not the year_born value of any single artist.

  2. The number of buckets is one less than the number of values in the boundaries array, because each adjacent pair of values forms one bucket. Only buckets that contain at least one document appear in the output.

  3. In the $bucket stage, the buckets are half-open ranges:

    [1940,1950) with inclusive lower bound 1940 and exclusive upper bound 1950.

    [1950,1960) with inclusive lower bound 1950 and exclusive upper bound 1960.

    [1960,1970) with inclusive lower bound 1960 and exclusive upper bound 1970.

    [1970,1980) with inclusive lower bound 1970 and exclusive upper bound 1980.

  4. The output field defines what each bucket document contains. Here, count is the number of documents that fall into the bucket, and artists is an array with the name and year_born of each of them. For example, the bucket with _id 1940 has a count of 1, and its artists array holds a single object.
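
The relationship between the number of boundaries and the number of buckets is easy to check with a small plain-JavaScript sketch. The helper bucketCounts below is hypothetical (it is not a MongoDB API) and only imitates a $bucket stage whose output is { count: { $sum: 1 } }, using the year_born values from our collection.

```javascript
// Hypothetical helper: counts values per bucket the way $bucket with
// output { count: { $sum: 1 } } would. This is not a MongoDB API.
function bucketCounts(values, boundaries, defaultId) {
  const counts = {};
  for (const value of values) {
    let id = defaultId; // used when the value is outside every bucket
    for (let i = 0; i < boundaries.length - 1; i++) {
      if (value >= boundaries[i] && value < boundaries[i + 1]) {
        id = boundaries[i];
        break;
      }
    }
    counts[id] = (counts[id] || 0) + 1; // empty buckets never get a key
  }
  return counts;
}

// year_born values of the eight artists in the collection
const yearsBorn = [1968, 1961, 1971, 1953, 1968, 1963, 1940, 1955];

// Five boundaries give four buckets: 1940 -> 1, 1950 -> 2, 1960 -> 4, 1970 -> 1
const fourBuckets = bucketCounts(yearsBorn, [1940, 1950, 1960, 1970, 1980], "Other");
console.log(fourBuckets);

// Four boundaries give only three buckets, so 1971 lands in the default bucket
const threeBuckets = bucketCounts(yearsBorn, [1940, 1950, 1960, 1970], "Other");
console.log(threeBuckets);
```

The second call shows why the last boundary matters: without 1980 there is no [1970,1980) bucket, and the artist born in 1971 is counted under "Other".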

In summary, the $bucket stage is most useful when you want to group documents into ranges defined by a set of boundaries.

Most of the time a single grouping is enough. But what if you want to group the same data in two different ways, using two different sets of buckets? That is where a superhero comes to the rescue: the $facet stage.

Let us learn a little more about this superhero operator.

$facet:

The $facet stage runs multiple aggregation sub-pipelines within a single stage on the same set of input documents. Each sub-pipeline has its own field in the output document, where its results are stored as an array of documents.

Think of $facet as multi-mould casting: you pour the same set of input documents into several pipelines at once, without needing to retrieve the input documents multiple times.

The common syntax for $facet is shown below:

{ $facet:
    {
      <outputField1>: [ <stage1>, <stage2>, ... ], // first sub-pipeline
      <outputField2>: [ <stage1>, <stage2>, ... ], // second sub-pipeline
      ...
    }
}
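
To get a feel for the shape of the result, here is a rough plain-JavaScript analogy. The facet function and the two sub-pipelines (french and total) are invented for this sketch and are not MongoDB APIs; the point is only that every sub-pipeline receives the same input and that the stage emits a single document with one array per output field.

```javascript
// Hypothetical stand-in for $facet: every sub-pipeline is a function that
// receives the same input array. None of this is a MongoDB API.
function facet(docs, subPipelines) {
  const result = {};
  for (const [field, pipeline] of Object.entries(subPipelines)) {
    result[field] = pipeline(docs); // same input for every sub-pipeline
  }
  return [result]; // a single output document, one array per field
}

// A few fields from three of the artist documents
const artists = [
  { first_name: "Emil", nationality: "France", age: 58 },
  { first_name: "Odilon", nationality: "France", age: 84 },
  { first_name: "Edvard", nationality: "Norway", age: 69 },
];

const facetOutput = facet(artists, {
  french: (docs) => docs.filter((d) => d.nationality === "France"),
  total: (docs) => [{ count: docs.length }],
});

console.log(JSON.stringify(facetOutput, null, 2));
```

In real MongoDB each sub-pipeline is an array of stages rather than a function, but the output has the same overall shape.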

Let us apply $facet to our collection. This time we will create two sets of buckets: the first based on year_born and the second based on age group. For this, each document now also has an age field:

{ "_id" : 1, "last_name" : "Bernard", "first_name" : "Emil", "year_born" : 1968, "nationality" : "France", "age":58 },
  { "_id" : 2, "last_name" : "Rippl-Ronai", "first_name" : "Joszef", "year_born" : 1961, "nationality" : "Hungary","age":65 },
  { "_id" : 3, "last_name" : "Ostroumova", "first_name" : "Anna", "year_born" : 1971, "nationality" : "Russia", "age": 55 },
  { "_id" : 4, "last_name" : "Van Gogh", "first_name" : "Vincent", "year_born" : 1953, "nationality" : "Holland", "age":71 },
  { "_id" : 5, "last_name" : "Maurer", "first_name" : "Alfred", "year_born" : 1968,  "nationality" : "USA","age":58 },
  { "_id" : 6, "last_name" : "Munch", "first_name" : "Edvard", "year_born" : 1963,  "nationality" : "Norway","age":53 },
  { "_id" : 7, "last_name" : "Redon", "first_name" : "Odilon", "year_born" : 1940,  "nationality" : "France","age":84 },
  { "_id" : 8, "last_name" : "Diriks", "first_name" : "Edvard", "year_born" : 1955,  "nationality" : "Norway","age":69 }


db.artist.aggregate([
  {
    $facet: {
      // first sub-pipeline: buckets by year of birth
      "year_born": [
        {
          $bucket: {
            groupBy: "$year_born",                       // field to group by
            boundaries: [1940, 1950, 1960, 1970, 1980],  // five boundaries create four buckets
            default: "Other",                            // bucket _id for documents outside every range
            output: {                                    // fields to include in each bucket document
              "count": { $sum: 1 },
              "artists": {
                $push: {
                  "name": { $concat: ["$first_name", " ", "$last_name"] },
                  "year_born": "$year_born"
                }
              }
            }
          }
        }
      ],
      // second sub-pipeline: buckets by age
      "age": [
        {
          $bucket: {
            groupBy: "$age",                     // field to group by
            boundaries: [50, 60, 70, 80, 90],    // five boundaries create four buckets
            default: "Unknown",                  // bucket _id for documents outside every range
            output: {                            // fields to include in each bucket document
              "count": { $sum: 1 },
              "artists": {
                $push: {
                  "name": { $concat: ["$first_name", " ", "$last_name"] },
                  "age": "$age"
                }
              }
            }
          }
        }
      ]
    }
  }
])

The output of the two buckets is shown below:

// The result is a single document with one array per sub-pipeline,
// and each array holds the count and the artists of every bucket
{
  "year_born": [
    { "_id" : 1940, "count" : 1, "artists" : [ { "name" : "Odilon Redon", "year_born" : 1940 } ] },
    { "_id" : 1950, "count" : 2, "artists" : [ { "name" : "Vincent Van Gogh", "year_born" : 1953 },
                                               { "name" : "Edvard Diriks", "year_born" : 1955 } ] },
    { "_id" : 1960, "count" : 4, "artists" : [ { "name" : "Emil Bernard", "year_born" : 1968 },
                                               { "name" : "Joszef Rippl-Ronai", "year_born" : 1961 },
                                               { "name" : "Alfred Maurer", "year_born" : 1968 },
                                               { "name" : "Edvard Munch", "year_born" : 1963 } ] },
    { "_id" : 1970, "count" : 1, "artists" : [ { "name" : "Anna Ostroumova", "year_born" : 1971 } ] }
  ],
  "age": [
    { "_id" : 50, "count" : 4, "artists" : [ { "name" : "Edvard Munch", "age" : 53 },
                                             { "name" : "Anna Ostroumova", "age" : 55 },
                                             { "name" : "Emil Bernard", "age" : 58 },
                                             { "name" : "Alfred Maurer", "age" : 58 } ] },
    { "_id" : 60, "count" : 2, "artists" : [ { "name" : "Joszef Rippl-Ronai", "age" : 65 },
                                             { "name" : "Edvard Diriks", "age" : 69 } ] },
    { "_id" : 70, "count" : 1, "artists" : [ { "name" : "Vincent Van Gogh", "age" : 71 } ] },
    { "_id" : 80, "count" : 1, "artists" : [ { "name" : "Odilon Redon", "age" : 84 } ] }
  ]
}

From the above aggregation, we can observe that the results of the two $bucket stages are returned as two arrays inside a single document. The buckets based on year of birth are in the "year_born" array, while the results of the second $bucket stage are in the "age" array.
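
One quick way to convince yourself that both sub-pipelines really worked on the same eight documents is to add up the counts in each array. The sketch below does this in plain JavaScript. The totalPerFacet helper is hypothetical, and the counts are copied by hand from the output above rather than read from a live database.

```javascript
// Bucket counts copied from the $facet output above (artists arrays omitted).
const facetResult = {
  year_born: [
    { _id: 1940, count: 1 },
    { _id: 1950, count: 2 },
    { _id: 1960, count: 4 },
    { _id: 1970, count: 1 },
  ],
  age: [
    { _id: 50, count: 4 },
    { _id: 60, count: 2 },
    { _id: 70, count: 1 },
    { _id: 80, count: 1 },
  ],
};

// Hypothetical helper: sums the bucket counts of every facet field.
function totalPerFacet(result) {
  const totals = {};
  for (const [field, buckets] of Object.entries(result)) {
    totals[field] = buckets.reduce((sum, bucket) => sum + bucket.count, 0);
  }
  return totals;
}

const totals = totalPerFacet(facetResult);
console.log(totals); // both totals are 8, one per artist in the collection
```

Both totals match the size of the collection, so every artist lands in exactly one bucket of each facet.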

Conclusion:

In summary, we now have an overview of the $bucket stage and how to use it effectively in data processing. With $bucket, we can group documents into buckets defined by boundary values. $bucket is a powerful tool on its own, but when we need several groupings of the same data, the superhero $facet stage steps in. It simplifies our workflow by running multiple sub-pipelines inside a single stage, all on the same set of input documents.

I hope you liked this article and appreciate my work. I learned this from the internet as well, so do check the link below, and stay tuned for more from this series.
