Tuesday, November 24, 2020

Is Querying A MongoDB View Optimised?

Views in MongoDB appear to database users like read-only collections, ready to be queried in the same way normal collections are. A View is defined by an Aggregation pipeline and when a query is issued on a View, using find(), there is the potential for the execution of the View to be optimised by MongoDB in the same way as MongoDB would optimise any aggregation pipeline that is executed.

In reality, most applications will not issue a find() without specifying a query filter as an argument. This begs the question: When issuing a find() with a query filter against a View (backed by an aggregation pipeline), how is the combination optimised, and can indexes be leveraged effectively?

In the rest of this post, I will explore this further and answer this question.


Source Collection Data

The data I am using for the investigation is a music based data-set sourced from the Discogs website, imported from Discog's XML data dump, using an XML MongoDB import utility.

The resulting releases collection, representing the albums and singles released by all artists, has over 1.5 million documents in it. I've defined various obvious indexes for the collection in anticipation of wanting to run finds and aggregations efficiently against it. Below is a screenshot showing some of the data in this collection, illustrating each document's typical shape...


As you can see, the releases collection contains fields for the artist, the title of the release, the year of the release and the music genres & styles associated with the release.

Let's now look at using two different Views, with different degrees of complexity, against this same collection, to see if and how these Views are optimised at runtime, when a find() is issued...


Using A View Which Filters Out Some Records & Fields

So let's create a View which only shows music released since the start of the year 2000, concatenates the array of one or more styles into a new 'style' string field and then excludes the 'styles' and '_id' fields from the result.

 var pipeline = [

    {$match: {'year': {'$gte': 2000}}},

    {$set: {'style': {

        $reduce: {

            input: '$styles',

            initialValue: '',

            in: {$concat: ['$$value', '$$this', '. ']}

        }

    }}},

    {$unset: ['styles']},

    {$unset: ['_id']},

 ];

 db.createView('millennium_releases_view', 'releases', pipeline);

Below is an example of the shape of result documents, when the View is queried for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on the View, using the following command...

 db.millennium_releases_view.find().explain();

...the resulting explain plan shows the database runs the following steps in the order shown:

  1. MATCH using INDEX SCAN hitting an index for the Year field
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

It's good to see here that the 'year greater than or equal' clause in the aggregation pipeline defined for the View is being run as the first step and is targeting an index to avoid a 'full table scan'. However, what happens when I include a query filter when issuing a find() against the View, to only show releases for a specific artist?

 db.millennium_releases_view.find({'artist': 'Fugazi'}).explain();

This time the resulting explain plan shows the following steps executed:

  1. MATCH using INDEX SCAN hitting a compound index composed of both the Artist & Year fields
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

This is great news, because when I am specifying a query filter for the find() on this View, the optimiser is converting the regular find() filter syntax into an aggregation match expression and pushing it to the existing $match stage at the start of the pipeline. As a result, the optimum compound index of (artist, year) is being used, to entirely satisfy the find's 'artist=Fugazi' expression combined with the View's 'year>=2000' expression,.

Does this mean a find() with a query filter will always be pushed to the top of the View's aggregation pipeline, at runtime?

Well actually, no. Let's see why, in this second example...


Using A View Which Rolls Up Some Data

This time let's create a View which groups releases (albums & singles) for each artist by the style associated with the release. For example, if an artist has five albums categorised with the style 'Stoner Rock' and 7 albums categorised by 'Post Rock', the resulting View will contain 2 documents for the artist, one for each of the two styles. This is the command for creating this View:

 var pipeline = [

    {$unwind: {path: '$styles'}}, 

    {$group: {

        _id: {artist: '$artist', style: '$styles'}, 

        titles: {'$push': '$title'},

    }}, 

    {$set: {'artist': '$_id.artist'}},

    {$set: {'style': '$_id.style'}}, 

    {$unset: ['_id']},

 ];

 db.createView('styles_view', 'releases', pipeline);

Below is an example of the shape of result documents from querying this new second View, for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on this View, using the following command...

 db.styles_view.find().explain();

    ...the resulting explain plan shows the database runs the following steps in the order shown:

    1. COLLECTION_SCAN with PROJECTION of Artist, Styles & Title fields only
    2. UNWIND of Styles array field producing a record for each array element
    3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
    4. SET Artist string field to the first of element of the group's id
    5. SET Style string field to the second of element of the group's id
    6. UNSET _id field which was created by the group stage

    As expected here, because the aggregation pipeline defined for the View does not contain a $match, the first step will result in a 'full table scan', where all the documents in the collection are inspected, and then the required fields only, are projected out.

    What happens this time when I include a query filter for the find() run against the View, to only show results for a specific artist, using the following command to explain?

     db.styles_view.find({'artist': 'Fugazi'}).explain();

      This time, the resulting explain plan shows the following ordered steps executed:

      1. COLLECTION_SCAN with PROJECTION of ArtistStyles & Title fields only
      2. UNWIND of Styles array field producing a record for each array element
      3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
      4. SET Artist string field to the first of element of the group's id
      5. MATCH on Artist filed (no index used)
      6. SET Style string field to the second of element of the group's id
      7. UNSET _id field which was created by the group stage

      Here the new $match generated by MongoDB to capture the find() expression run against the View, is included in the executed aggregation pipeline, but the $match cannot be pushed all the way up to the first step of the pipeline. This is to be expected...

      Essentially what happens when a find() with filter is run on a View is as follows. The filter expression is initially placed in a new $match stage appended to the end of the aggregation pipeline. Then the normal aggregation pipeline runtime optimiser kicks in and attempts to move the newly added $match step as near to the top of the pipeline as possible. However, the $group stage (and related $set on artist, in this case), acts as a barrier. The $group operator stage completely changes the shape of documents and effectively drops any existing fields that preceded it. The optimiser has no way of knowing that a filter on an artist field, being applied to the outcome of a View, is definitively referring to a field called artist that existed in the original source collection used by the View. Instead, for all it knows, the expression on artist could be referring to some other intermediate aggregation pipeline field of similar name. In the example above, even if we don't use $set in the View's pipeline to set a new field called artist, the new $match expression is still blocked by $group and so is only executed straight after $group (a scenario which I also tested).

      So even though I only want to see the results for one artist which relates to only a few 10s of documents in the database, the find() which applies a filter on the View, will result in the total data set of 1.5 million documents being 'full table scanned', adding considerable latency to the response.

      If I wasn't querying the View and instead running my own hand-crafted aggregation pipeline directly against the source collection, to achieve the same functional outcome, my pipeline could be composed of the following stages where I explicitly include the match on artist as the first stage:

       var pipeline = [

          {$match: {'artist': 'Fugazi'}},

          {$unwind: {path: '$styles'}}, 

          {$group: {

              _id: {artist: '$artist', style: '$styles'}, 

              titles: {'$push': '$title'},

          }}, 

          {$set: {'artist': '$_id.artist'}},

          {$set: {'style': '$_id.style'}}, 

          {$unset: ['_id']},

       ];

       db.releases.aggregate(pipeline);

      Then when I ask for the explain plan...

       db.releases.explain().aggregate(pipeline);

        ...I see that the following steps are executed:

        1. MATCH using INDEX_SCAN on the Artist field with PROJECTION of ArtistStyles & Title fields only
        2. UNWIND of Styles array field producing a record for each array element
        3. GROUP on Artist + Style fields, adding each release title to a new Titles array field
        4. SET Artist string field from the first of element in the group id
        5. SET Style string field from the second of element in the group id
        6. UNSET _id field which was created by the group stage

        This time an index will be leveraged so that only the few 10s of records, corresponding to the desired artist, are retrieved, ready for unwinding and grouping. The aggregation does not attempt to grab 1.5 million records. This is only possible because, as the developer of the aggregation pipeline logic, I have extra knowledge which the MongoDB runtime does not have. Specifically, I know that the $match on the artist field should actually be applied to the field named artist in the View's source collection and not to the result of the $group stage.


        Wrapping Up

        What these findings show for Views is that at runtime, when MongoDB receives a find() containing query filter expressions, these expressions are dynamically appended to the end of the View's aggregation pipeline, before the resulting composite pipeline is executed. Then, as is the case if you are just issuing a regular aggregation against a normal collection, MongoDB's aggregation pipeline runtime optimiser attempts to re-order the pipeline on the fly, without changing its functional behaviour, to be more efficient. These runtime optimisations include attempting to push any $match stages as near to the start of the pipeline as possible, to help promote maximum use of indexes when executed. However, stages like the $group stage, which completely transform the shape of documents, mean that the optimiser cannot move a $match ahead of such stages, without risking changing the functional behaviour and ultimately the resulting output. 

        In practice, where Views are used to filter a subset of records and/or a subset of fields. the system should be able to fully optimise the find() run against the collection, pushing query filter expressions to the first step of the executed aggregation pipeline, to best leverage indexes. Only in places where there is a loss of fidelity (e.g when using a $group stage), will it be the case that the find() query filter cannot be placed earlier in the pipeline being executed against the View. 



        Song for today: Runaway Return by Fugazi

        No comments: