Sunday, May 3, 2020

Converting Gnarly Date Strings to Proper Date Types Using a MongoDB Aggregation Pipeline


Introduction
I recently received some example bank payments data in a CSV file which had been exported from a relational database with that database's default export settings. After using mongoimport to import this data 'as-is', into a MongoDB database, I noticed that there was a particularly gnarly date string field in each record. For example:
  • 23-NOV-20 22.57.36.827000000
Why do I say gnarly? Well if you lived through Y2K you should be horrified by the 'year' field shown above. How would you know from the data, without any context, what century this applies to? Is it 1920? Is it 2020? Is it 2120? There's no way of knowing from just the exported data alone. Also, there is no indication of which time zone this applies to. Is it British Summer Time? Is it Eastern Daylight Time? Who knows? Also the month element appears to be an abbreviation of a month expressed in a specific spoken language. Which spoken language?

I needed to get this into a proper Date type in MongoDB so I could then easily index it, perform date range queries natively, perform sort by date natively, etc.. My usual tool of choice for this is MongoDB's Aggregation pipeline to generate a new collection from the existing collection with the 'date' string fields converted to proper date type fields. To perform the string to date conversion, the usual operator of choice to use is $dateFromString (introduced in MongoDB 3.6). 

However, $dateFromString [rightly] expects an input string which isn't missing crucial date related text, indicating things like the century or timezone. Also, the $dateFromString operator contains no format specifiers to indicate that the text 'NOV' maps to the 11th month of a year in a specific spoken language.

Therefore, armed with the extra context of knowing this exported data refers to dates in the 21st century (the '2000s') with a UTC 'time zone' and in the English language (only inferred by asking the owner of the data), I had to perform some additional string manipulation in the aggregation pipeline before using $dateFromString to generate a true and accurate date type. The rest of this blog post shows how I achieved this for date strings like '23-NOV-20 22.57.36.827000000'.


Converting Incomplete Date Strings to Date Types Example

In the Mongo Shell targeting a running MongoDB test database, run the following code to insert 12 sample 'payment' records, with example 'bad date string' fields for testing each month of a sample year.

use test;
db.rawpayments.insert([
  {'account_id': '010101', 'pymntdate': '01-JAN-20 01.01.01.123000000', 'amount': 1.01},
  {'account_id': '020202', 'pymntdate': '02-FEB-20 02.02.02.456000000', 'amount': 2.02},
  {'account_id': '030303', 'pymntdate': '03-MAR-20 03.03.03.789000000', 'amount': 3.03},
  {'account_id': '040404', 'pymntdate': '04-APR-20 04.04.04.012000000', 'amount': 4.04},
  {'account_id': '050505', 'pymntdate': '05-MAY-20 05.05.05.345000000', 'amount': 5.05},
  {'account_id': '060606', 'pymntdate': '06-JUN-20 06.06.06.678000000', 'amount': 6.06},
  {'account_id': '070707', 'pymntdate': '07-JUL-20 07.07.07.901000000', 'amount': 7.07},
  {'account_id': '080808', 'pymntdate': '08-AUG-20 08.08.08.234000000', 'amount': 8.08},
  {'account_id': '090909', 'pymntdate': '09-SEP-20 09.09.09.567000000', 'amount': 9.09},
  {'account_id': '101010', 'pymntdate': '10-OCT-20 10.10.10.890000000', 'amount': 10.10},
  {'account_id': '111111', 'pymntdate': '11-NOV-20 11.11.11.111000000', 'amount': 11.11},
  {'account_id': '121212', 'pymntdate': '12-DEC-20 12.12.12.999000000', 'amount': 12.12}
]);

Then execute the following Aggregation pipeline to copy the contents of the 'rawpayments' collection, populated above, into a new collection named 'payments', but with the 'pymntdate' field values converted from string types to date types.

db.rawpayments.aggregate([
  {$set: {
    pymntdate: {
      $dateFromString: {format: '%d-%m-%Y %H.%M.%S.%L', dateString:
        {$concat: [
          {$substrCP: ['$pymntdate', 0, 3]},  // USE FIRST 3 CHARS IN DATE STRING
          {$switch: {branches: [  // REPLACE MONTH 3 CHARS IN DATE STRING WITH 2 DIGIT MONTH
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JAN']}, then: '01'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'FEB']}, then: '02'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAR']}, then: '03'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'APR']}, then: '04'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'MAY']}, then: '05'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUN']}, then: '06'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'JUL']}, then: '07'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'AUG']}, then: '08'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'SEP']}, then: '09'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'OCT']}, then: '10'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'NOV']}, then: '11'},
            {case: {$eq: [{$substrCP: ['$pymntdate', 3, 3]}, 'DEC']}, then: '12'},
           ], default: 'ERROR'}},
          '-20',  // ADD HYPEHN + HARDCODED CENTURY 2 DIGITS
          {$substrCP: ['$pymntdate', 7, 15]}  // USE REMAINING PART OF DATE STRING UP UNTIL THE 3 MILLISECOND DIGITS (IGNORE REMAINING 6 NANOSECOND CHARS)
        ]
      }}
    },  
  }},
  {$out: 'payments'}
]);

In this pipeline, the string '23-NOV-20 22.57.36.827000000' will be converted to 'ISODate("2020-11-23T22:57:36.827Z")' by concatenating the following four elements of text together before passing it to the $dateFromString operator to convert to a date:
  1. '23-' (from the input string)
  2. '11' (replacing 'NOV')
  3. '-20' (hard-coded hyphen + century)
  4. '20 22.57.36.827' (the rest of input string apart from last 6 nanosecond digits)
Note: A $set stage is used in this pipeline, which is a type of stage first introduced in MongoDB 4.2. $set is an alias for $addFields, so if using an earlier version of MongoDB, replace $set with $addFields in the pipeline.

To see what the converted records look like, containing new date types, query the new collection:

db.payments.find({}, {_id:0});

Which will show the following results:

{ "account_id" : "010101", "pymntdate" : ISODate("2020-01-01T01:01:01.123Z"), "amount" : 1.01 }
{ "account_id" : "020202", "pymntdate" : ISODate("2020-02-02T02:02:02.456Z"), "amount" : 2.02 }
{ "account_id" : "030303", "pymntdate" : ISODate("2020-03-03T03:03:03.789Z"), "amount" : 3.03 }
{ "account_id" : "040404", "pymntdate" : ISODate("2020-04-04T04:04:04.012Z"), "amount" : 4.04 }
{ "account_id" : "050505", "pymntdate" : ISODate("2020-05-05T05:05:05.345Z"), "amount" : 5.05 }
{ "account_id" : "060606", "pymntdate" : ISODate("2020-06-06T06:06:06.678Z"), "amount" : 6.06 }
{ "account_id" : "070707", "pymntdate" : ISODate("2020-07-07T07:07:07.901Z"), "amount" : 7.07 }
{ "account_id" : "080808", "pymntdate" : ISODate("2020-08-08T08:08:08.234Z"), "amount" : 8.08 }
{ "account_id" : "090909", "pymntdate" : ISODate("2020-09-09T09:09:09.567Z"), "amount" : 9.09 }
{ "account_id" : "101010", "pymntdate" : ISODate("2020-10-10T10:10:10.890Z"), "amount" : 10.1 }
{ "account_id" : "111111", "pymntdate" : ISODate("2020-11-11T11:11:11.111Z"), "amount" : 11.11 }
{ "account_id" : "121212", "pymntdate" : ISODate("2020-12-12T12:12:12.999Z"), "amount" : 12.12 }



Song for today: For Everything by The Murder Capital

No comments: