Tuesday, November 24, 2020

Is Querying A MongoDB View Optimised?

Views in MongoDB appear to database users like read-only collections, ready to be queried in the same way normal collections are. A View is defined by an Aggregation pipeline and when a query is issued on a View, using find(), there is the potential for the execution of the View to be optimised by MongoDB in the same way as MongoDB would optimise any aggregation pipeline that is executed.

In reality, most applications will not issue a find() without specifying a query filter as an argument. This begs the question: When issuing a find() with a query filter against a View (backed by an aggregation pipeline), how is the combination optimised, and can indexes be leveraged effectively?

In the rest of this post, I will explore this further and answer this question.


Source Collection Data

The data I am using for the investigation is a music based data-set sourced from the Discogs website, imported from Discog's XML data dump, using an XML MongoDB import utility.

The resulting releases collection, representing the albums and singles released by all artists, has over 1.5 million documents in it. I've defined various obvious indexes for the collection in anticipation of wanting to run finds and aggregations efficiently against it. Below is a screenshot showing some of the data in this collection, illustrating each document's typical shape...


As you can see, the releases collection contains fields for the artist, the title of the release, the year of the release and the music genres & styles associated with the release.

Let's now look at using two different Views, with different degrees of complexity, against this same collection, to see if and how these Views are optimised at runtime, when a find() is issued...


Using A View Which Filters Out Some Records & Fields

So let's create a View which only shows music released since the start of the year 2000, concatenates the array of one or more styles into a new 'style' string field and then excludes the 'styles' and '_id' fields from the result.

 var pipeline = [

    {$match: {'year': {'$gte': 2000}}},

    {$set: {'style': {

        $reduce: {

            input: '$styles',

            initialValue: '',

            in: {$concat: ['$$value', '$$this', '. ']}

        }

    }}},

    {$unset: ['styles']},

    {$unset: ['_id']},

 ];

 db.createView('millennium_releases_view', 'releases', pipeline);

Below is an example of the shape of result documents, when the View is queried for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on the View, using the following command...

 db.millennium_releases_view.find().explain();

...the resulting explain plan shows the database runs the following steps in the order shown:

  1. MATCH using INDEX SCAN hitting an index for the Year field
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

It's good to see here that the 'year greater than or equal' clause in the aggregation pipeline defined for the View is being run as the first step and is targeting an index to avoid a 'full table scan'. However, what happens when I include a query filter when issuing a find() against the View, to only show releases for a specific artist?

 db.millennium_releases_view.find({'artist': 'Fugazi'}).explain();

This time the resulting explain plan shows the following steps executed:

  1. MATCH using INDEX SCAN hitting a compound index composed of both the Artist & Year fields
  2. SET new Style string field to concatenate values from existing Styles array field
  3. UNSET Styles array field
  4. UNSET _id field

This is great news, because when I am specifying a query filter for the find() on this View, the optimiser is converting the regular find() filter syntax into an aggregation match expression and pushing it to the existing $match stage at the start of the pipeline. As a result, the optimum compound index of (artist, year) is being used, to entirely satisfy the find's 'artist=Fugazi' expression combined with the View's 'year>=2000' expression,.

Does this mean a find() with a query filter will always be pushed to the top of the View's aggregation pipeline, at runtime?

Well actually, no. Let's see why, in this second example...


Using A View Which Rolls Up Some Data

This time let's create a View which groups releases (albums & singles) for each artist by the style associated with the release. For example, if an artist has five albums categorised with the style 'Stoner Rock' and 7 albums categorised by 'Post Rock', the resulting View will contain 2 documents for the artist, one for each of the two styles. This is the command for creating this View:

 var pipeline = [

    {$unwind: {path: '$styles'}}, 

    {$group: {

        _id: {artist: '$artist', style: '$styles'}, 

        titles: {'$push': '$title'},

    }}, 

    {$set: {'artist': '$_id.artist'}},

    {$set: {'style': '$_id.style'}}, 

    {$unset: ['_id']},

 ];

 db.createView('styles_view', 'releases', pipeline);

Below is an example of the shape of result documents from querying this new second View, for a specific artist:

If I ask MongoDB to provide the explain plan for an 'empty' query on this View, using the following command...

 db.styles_view.find().explain();

    ...the resulting explain plan shows the database runs the following steps in the order shown:

    1. COLLECTION_SCAN with PROJECTION of Artist, Styles & Title fields only
    2. UNWIND of Styles array field producing a record for each array element
    3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
    4. SET Artist string field to the first of element of the group's id
    5. SET Style string field to the second of element of the group's id
    6. UNSET _id field which was created by the group stage

    As expected here, because the aggregation pipeline defined for the View does not contain a $match, the first step will result in a 'full table scan', where all the documents in the collection are inspected, and then the required fields only, are projected out.

    What happens this time when I include a query filter for the find() run against the View, to only show results for a specific artist, using the following command to explain?

     db.styles_view.find({'artist': 'Fugazi'}).explain();

      This time, the resulting explain plan shows the following ordered steps executed:

      1. COLLECTION_SCAN with PROJECTION of ArtistStyles & Title fields only
      2. UNWIND of Styles array field producing a record for each array element
      3. GROUP on Artist + Style fields, adding each associated release title to a new Titles array field
      4. SET Artist string field to the first of element of the group's id
      5. MATCH on Artist filed (no index used)
      6. SET Style string field to the second of element of the group's id
      7. UNSET _id field which was created by the group stage

      Here the new $match generated by MongoDB to capture the find() expression run against the View, is included in the executed aggregation pipeline, but the $match cannot be pushed all the way up to the first step of the pipeline. This is to be expected...

      Essentially what happens when a find() with filter is run on a View is as follows. The filter expression is initially placed in a new $match stage appended to the end of the aggregation pipeline. Then the normal aggregation pipeline runtime optimiser kicks in and attempts to move the newly added $match step as near to the top of the pipeline as possible. However, the $group stage (and related $set on artist, in this case), acts as a barrier. The $group operator stage completely changes the shape of documents and effectively drops any existing fields that preceded it. The optimiser has no way of knowing that a filter on an artist field, being applied to the outcome of a View, is definitively referring to a field called artist that existed in the original source collection used by the View. Instead, for all it knows, the expression on artist could be referring to some other intermediate aggregation pipeline field of similar name. In the example above, even if we don't use $set in the View's pipeline to set a new field called artist, the new $match expression is still blocked by $group and so is only executed straight after $group (a scenario which I also tested).

      So even though I only want to see the results for one artist which relates to only a few 10s of documents in the database, the find() which applies a filter on the View, will result in the total data set of 1.5 million documents being 'full table scanned', adding considerable latency to the response.

      If I wasn't querying the View and instead running my own hand-crafted aggregation pipeline directly against the source collection, to achieve the same functional outcome, my pipeline could be composed of the following stages where I explicitly include the match on artist as the first stage:

       var pipeline = [

          {$match: {'artist': 'Fugazi'}},

          {$unwind: {path: '$styles'}}, 

          {$group: {

              _id: {artist: '$artist', style: '$styles'}, 

              titles: {'$push': '$title'},

          }}, 

          {$set: {'artist': '$_id.artist'}},

          {$set: {'style': '$_id.style'}}, 

          {$unset: ['_id']},

       ];

       db.releases.aggregate(pipeline);

      Then when I ask for the explain plan...

       db.releases.explain().aggregate(pipeline);

        ...I see that the following steps are executed:

        1. MATCH using INDEX_SCAN on the Artist field with PROJECTION of ArtistStyles & Title fields only
        2. UNWIND of Styles array field producing a record for each array element
        3. GROUP on Artist + Style fields, adding each release title to a new Titles array field
        4. SET Artist string field from the first of element in the group id
        5. SET Style string field from the second of element in the group id
        6. UNSET _id field which was created by the group stage

        This time an index will be leveraged so that only the few 10s of records, corresponding to the desired artist, are retrieved, ready for unwinding and grouping. The aggregation does not attempt to grab 1.5 million records. This is only possible because, as the developer of the aggregation pipeline logic, I have extra knowledge which the MongoDB runtime does not have. Specifically, I know that the $match on the artist field should actually be applied to the field named artist in the View's source collection and not to the result of the $group stage.


        Wrapping Up

        What these findings show for Views is that at runtime, when MongoDB receives a find() containing query filter expressions, these expressions are dynamically appended to the end of the View's aggregation pipeline, before the resulting composite pipeline is executed. Then, as is the case if you are just issuing a regular aggregation against a normal collection, MongoDB's aggregation pipeline runtime optimiser attempts to re-order the pipeline on the fly, without changing its functional behaviour, to be more efficient. These runtime optimisations include attempting to push any $match stages as near to the start of the pipeline as possible, to help promote maximum use of indexes when executed. However, stages like the $group stage, which completely transform the shape of documents, mean that the optimiser cannot move a $match ahead of such stages, without risking changing the functional behaviour and ultimately the resulting output. 

        In practice, where Views are used to filter a subset of records and/or a subset of fields. the system should be able to fully optimise the find() run against the collection, pushing query filter expressions to the first step of the executed aggregation pipeline, to best leverage indexes. Only in places where there is a loss of fidelity (e.g when using a $group stage), will it be the case that the find() query filter cannot be placed earlier in the pipeline being executed against the View. 



        Song for today: Runaway Return by Fugazi

        Sunday, October 4, 2020

        Rust & MongoDB - Perfect Bedfellows

        I've been learning Rust over the last month or so and I'm really enjoying it. It's a really elegant and flexible programming language despite being the most strongly typed and compile-time strict programming language I've ever used (bearing in mind I used to be a professional C & C++ developer way back in the day). 

        I'd recently read the really good and commonly referenced blog post Creating a REST API in Rust with warp, which shows how to create a simple example Groceries stock management REST API service, and which uses an in-memory HashMap as its backing store. As part of my learning I thought I'd have a go at porting this to use MongoDB as its data store instead, using the fairly new MongoDB Rust Driver.

        It turns out that this was really easy to do, also due to how well engineered the new MongoDB Rust Driver turned out to be, with its rich yet easy to use API. 

        You can see my resulting MongoDB version of this sample Groceries application, in the Github project rust-groceries-mongo-api I created. Check out that project link to view the source code showing how MongoDB was integrated with for the Groceries REST API and how to test the application using a REST client.

        What was even more surprising was how easy it was to integrate MongoDB's flexible data model with a programming language as strict as Rust, and I encountered no friction between the two at all. In fact, this was even easier to achieve by leveraging the option of using the driver team's additional contribution of BSON translation to the open source Rust Serde framework, which makes it easy to serialize/deserialize Rust data structures to/from other formats (e.g. JSON, Avro and now BSON).

        I plan to blog again in the future, in more detail, about how to combine Rust's strict typing and MongoDB's flexible schema, especially when the data model and consuming microservices inevitably change over time.


        Song for today: Dissolution by Cloud Nothings

        Sunday, May 3, 2020

        Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline


        Introduction
        I recently received some example bank payments data in a CSV file which had been exported from a relational database with that database's default export settings. After using mongoimport to import this data 'as-is', into a MongoDB database, I noticed that there was a particularly gnarly date string field in each record. For example:
        • 23-NOV-20 22.57.36.827000000
        Why do I say gnarly? Well if you lived through Y2K you should be horrified by the 'year' field shown above. How would you know from the data, without any context, what century this applies to? Is it 1920? Is it 2020? Is it 2120? There's no way of knowing from just the exported data alone. Also, there is no indication of which time zone this applies to. Is it British Summer Time? Is it Eastern Daylight Time? Who knows? Also the month element appears to be an abbreviation of a month expressed in a specific spoken language. Which spoken language?

        I needed to get this into a proper Date type in MongoDB so I could then easily index it, perform date range queries natively, perform sort by date natively, etc.. My usual tool of choice for this is MongoDB's Aggregation pipeline to generate a new collection from the existing collection with the 'date' string fields converted to proper date type fields. To perform the string to date conversion, the usual operator of choice to use is $dateFromString (introduced in MongoDB 3.6). 

        However, $dateFromString [rightly] expects an input string which isn't missing crucial date related text, indicating things like the century or timezone. Also, the $dateFromString operator contains no format specifiers to indicate that the text 'NOV' maps to the 11th month of a year in a specific spoken language.

        Therefore, armed with the extra context of knowing this exported data refers to dates in the 21st century (the '2000s') with a UTC 'time zone' and in the English language (only inferred by asking the owner of the data), I had to perform some additional string manipulation in the aggregation pipeline before using $dateFromString to generate a true and accurate date type. The rest of this blog post shows how I achieved this for date strings like '23-NOV-20 22.57.36.827000000'.


        Converting Incomplete Date Strings to Date Types Example

        In the Mongo Shell targeting a running MongoDB test database, run the following code to insert 12 sample 'payment' records, with example 'bad date string' fields for testing each month of a sample year.

        use test;
        db.rawpayments.insert([
          {'account_id': '010101', 'pymntdate': '01-JAN-20 01.01.01.123000000', 'amount': 1.01},
          {'account_id': '020202', 'pymntdate': '02-FEB-20 02.02.02.456000000', 'amount': 2.02},
          {'account_id': '030303', 'pymntdate': '03-MAR-20 03.03.03.789000000', 'amount': 3.03},
          {'account_id': '040404', 'pymntdate': '04-APR-20 04.04.04.012000000', 'amount': 4.04},
          {'account_id': '050505', 'pymntdate': '05-MAY-20 05.05.05.345000000', 'amount': 5.05},
          {'account_id': '060606', 'pymntdate': '06-JUN-20 06.06.06.678000000', 'amount': 6.06},
          {'account_id': '070707', 'pymntdate': '07-JUL-20 07.07.07.901000000', 'amount': 7.07},
          {'account_id': '080808', 'pymntdate': '08-AUG-20 08.08.08.234000000', 'amount': 8.08},
          {'account_id': '090909', 'pymntdate': '09-SEP-20 09.09.09.567000000', 'amount': 9.09},
          {'account_id': '101010', 'pymntdate': '10-OCT-20 10.10.10.890000000', 'amount': 10.10},
          {'account_id': '111111', 'pymntdate': '11-NOV-20 11.11.11.111000000', 'amount': 11.11},
          {'account_id': '121212', 'pymntdate': '12-DEC-20 12.12.12.999000000', 'amount': 12.12}
        ]);

        Then execute the following Aggregation pipeline to copy the contents of the 'rawpayments' collection, populated above, into a new collection named 'payments', but with the 'pymntdate' field values converted from string types to date types.

        db.rawpayments.aggregate([
          {$set: {
            pymntdate: {
              $dateFromString: {format: '%d-%m-%Y %H.%M.%S.%L', dateString:
                {$concat: [
                  {$substrCP: ['$pymntdate', 0, 3]},  // USE FIRST 3 CHARS IN DATE STRING
                  {$switch: {branches: [  // REPLACE MONTH 3 CHARS IN DATE STRING WITH 2 DIGIT MONTH
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JAN']}, then: '01'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'FEB']}, then: '02'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAR']}, then: '03'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'APR']}, then: '04'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAY']}, then: '05'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUN']}, then: '06'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUL']}, then: '07'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'AUG']}, then: '08'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'SEP']}, then: '09'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'OCT']}, then: '10'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'NOV']}, then: '11'},
                    {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'DEC']}, then: '12'},
                   ], default: 'ERROR'}},
                  '-20',  // ADD HYPEHN + HARDCODED CENTURY 2 DIGITS
                  {$substrCP: ['$pymntdate', 7, 15]}  // USE REMAINING PART OF DATE STRING UP UNTIL THE 3 MILLISECOND DIGITS (IGNORE REMAINING 6 NANOSECOND CHARS)
                ]
              }}
            },  
          }},
          {$out: 'payments'}
        ]);

        In this pipeline, the string '23-NOV-20 22.57.36.827000000' will be converted to 'ISODate("2020-11-23T22:57:36.827Z")' by concatenating the following four elements of text together before passing it to the $dateFromString operator to convert to a date:
        1. '23-' (from the input string)
        2. '11' (replacing 'NOV')
        3. '-20' (hard-coded hyphen + century)
        4. '20 22.57.36.827' (the rest of input string apart from last 6 nanosecond digits)
        Note: A $set stage is used in this pipeline, which is a type of stage first introduced in MongoDB 4.2. $set is an alias for $addFields, so if using an earlier version of MongoDB, replace $set with $addFields in the pipeline.

        To see what the converted records look like, containing new date types, query the new collection:

        db.payments.find({}, {_id:0});

        Which will show the following results:

        { "account_id" : "010101", "pymntdate" : ISODate("2020-01-01T01:01:01.123Z"), "amount" : 1.01 }
        { "account_id" : "020202", "pymntdate" : ISODate("2020-02-02T02:02:02.456Z"), "amount" : 2.02 }
        { "account_id" : "030303", "pymntdate" : ISODate("2020-03-03T03:03:03.789Z"), "amount" : 3.03 }
        { "account_id" : "040404", "pymntdate" : ISODate("2020-04-04T04:04:04.012Z"), "amount" : 4.04 }
        { "account_id" : "050505", "pymntdate" : ISODate("2020-05-05T05:05:05.345Z"), "amount" : 5.05 }
        { "account_id" : "060606", "pymntdate" : ISODate("2020-06-06T06:06:06.678Z"), "amount" : 6.06 }
        { "account_id" : "070707", "pymntdate" : ISODate("2020-07-07T07:07:07.901Z"), "amount" : 7.07 }
        { "account_id" : "080808", "pymntdate" : ISODate("2020-08-08T08:08:08.234Z"), "amount" : 8.08 }
        { "account_id" : "090909", "pymntdate" : ISODate("2020-09-09T09:09:09.567Z"), "amount" : 9.09 }
        { "account_id" : "101010", "pymntdate" : ISODate("2020-10-10T10:10:10.890Z"), "amount" : 10.1 }
        { "account_id" : "111111", "pymntdate" : ISODate("2020-11-11T11:11:11.111Z"), "amount" : 11.11 }
        { "account_id" : "121212", "pymntdate" : ISODate("2020-12-12T12:12:12.999Z"), "amount" : 12.12 }



        Song for today: For Everything by The Murder Capital