We have a high-load service that uses MongoDB as its primary data store. We need to deploy new code and change the schema of a MongoDB collection. Naturally, the service cannot be taken down for maintenance. Before the migration, the users collection looks like this:

{
    "_id": ObjectID("5ad711459ecc372328a910b4"),
    "login": "ariel",
    "schema_version": 1,
    "name": "Ariel Whiteaker",
    "phone": "+45751700142",
    "last_update": ISODate("2018-04-18T14:35:01.250Z"),
    "created": ISODate("2018-04-18T14:35:01.250Z")
}

We would like to split the single name field into separate first and last names, and we also want the ability to store multiple contacts. So the new schema will be the following:

{
    "_id": ObjectID("5ad711459ecc372328a910b4"),
    "login": "ariel",
    "schema_version": 2,
    "first_name": "Ariel",
    "last_name": "Whiteaker",
    "contacts": [
        {
            "type": "phone",
            "value": "+45751700142"
        }
    ],
    "last_update": ISODate("2018-04-18T14:35:01.250Z"),
    "created": ISODate("2018-04-18T14:35:01.250Z")
}

Let’s look at possible solutions to this problem.

The easiest way is not to migrate at all, and to store multiple versions of the document structure instead. The benefit of versioning your document structure is that no extra processing is needed on the database side, since a document is only rewritten when it changes. The collection will contain documents with different structures at the same time, so we have the following restrictions:

  1. We have to use more complex queries to search for documents. For example, we need $or to find every “Ariel Whiteaker”: db.users.find({"$or": [{"name": "Ariel Whiteaker"}, {"first_name": "Ariel", "last_name": "Whiteaker"}]}). This is not very convenient, especially when there are more than two versions.
  2. You will have to maintain a separate index for each field. We should use partial (or sparse) indexes to optimize memory usage and performance: by indexing only a subset of the documents in a collection, partial indexes have lower storage requirements and reduced costs for index creation and maintenance (see the sketch after this list).
  3. The application has to be able to work with every version of the document. Of course, when we save a document we upgrade its structure to the latest version, but this process can stretch over a long time.
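As a sketch of the second point, here is how per-version partial indexes might be created with pymongo. The index names, and the assumption that the schema_version field reliably distinguishes the two layouts, are mine, not part of the original setup:

from pymongo import ASCENDING, MongoClient

db = MongoClient().migration_db

# Version-1 documents keep the single "name" field.
db.users.create_index(
    [('name', ASCENDING)],
    name='name_v1',
    partialFilterExpression={'schema_version': 1},
)
# Version-2 documents use "first_name"/"last_name" instead.
db.users.create_index(
    [('first_name', ASCENDING), ('last_name', ASCENDING)],
    name='name_v2',
    partialFilterExpression={'schema_version': 2},
)

This way each query version hits an index that only covers the documents it can actually match.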

Here is some example code:

from datetime import datetime

from schematics.models import Model
from schematics.types import StringType, IntType, DateTimeType, ListType, DictType
from schematics.contrib.mongo import ObjectIdType
from pymongo import MongoClient
from bson import ObjectId

client = MongoClient()
db = client.migration_db

class User(Model):
    SCHEMA_VERSION = 2

    _id = ObjectIdType()
    login = StringType(required=True)
    schema_version = IntType(required=True)
    first_name = StringType()
    last_name = StringType()
    contacts = ListType(DictType(StringType))
    last_update = DateTimeType(default=datetime.utcnow)
    created = DateTimeType(default=datetime.utcnow)

    @classmethod
    def fetch(cls, _id):
        _id = _id if isinstance(_id, ObjectId) else ObjectId(_id)
        data = db.users.find_one({'_id': _id})
        if data['schema_version'] == 1:
            # Upgrade a version-1 document to the current structure on read.
            if data.get('phone'):
                phone = data.pop('phone')
                data['contacts'] = [{'type': 'phone', 'value': phone}]
            if data.get('name'):
                name = data.pop('name')
                # Split on the first space only, so multi-word last names survive.
                first_name, last_name = name.split(' ', 1)
                data['first_name'] = first_name
                data['last_name'] = last_name

        data['schema_version'] = cls.SCHEMA_VERSION
        return cls(data)

    def save(self):
        # Documents are always written back in the latest schema version.
        document = self.to_native()
        document['last_update'] = datetime.utcnow()
        db.users.replace_one(
            {'_id': self._id},
            document,
            upsert=True,
        )

I realise this is, of course, a simplified example, but the general idea should apply in many situations. I used schematics (https://github.com/schematics/schematics) to implement the model. The fetch method performs all the magic of the data migration.
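For illustration, usage might look like the sketch below; the concrete _id is taken from the example document above:

# Reading a version-1 document transparently upgrades it in memory.
user = User.fetch('5ad711459ecc372328a910b4')
print(user.first_name, user.last_name)  # Ariel Whiteaker

# Saving writes the document back in the version-2 structure.
user.save()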

To implement the second approach, we pull the documents out of MongoDB and build replacement documents for a bulk write. But first we update the code so that the business logic can read both the new and the old document structures, and saves documents in the new format only. This minimizes the probability that recent changes will be lost during the migration. In short, the code is able to work with two versions, while the migration converts every document to a single form using bulk writes. The main disadvantage of this approach is that it can take a very long time, but it removes the limitations of the previous approach. The code illustrating this method is minimal, containing only the migration itself.

from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError
from pymongo import MongoClient

client = MongoClient()
db = client.migration_db


def migration():
    # NOTE: paginating with skip/limit over a collection that is being
    # rewritten is a simplification; in production you would typically
    # paginate by _id instead.
    skip = 0
    limit = 10_000
    total = db.users.count_documents({})
    while skip < total:
        cursor = db.users.find(skip=skip, limit=limit)
        skip += limit
        bulk_requests = []
        for doc in cursor:
            replace = False
            if "name" in doc:
                name = doc.pop("name")
                # Split on the first space only, so multi-word last names survive.
                first_name, last_name = name.split(' ', 1)
                doc['first_name'] = first_name
                doc['last_name'] = last_name
                replace = True
            if "phone" in doc:
                phone = doc.pop('phone')
                doc['contacts'] = [{'type': 'phone', 'value': phone}]
                replace = True
            if replace:
                doc['schema_version'] = 2
                bulk_requests.append(ReplaceOne({'_id': doc['_id']}, doc))

        if bulk_requests:
            try:
                # Unordered bulk writes keep going past individual failures.
                db.users.bulk_write(bulk_requests, ordered=False)
            except BulkWriteError as bwe:
                print(bwe.details)
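A hypothetical way to run the migration and sanity-check the result; the verification query is my own addition:

if __name__ == '__main__':
    migration()
    # After the run, no document should still carry the old fields.
    remaining = db.users.count_documents({'name': {'$exists': True}})
    print(f'documents left to migrate: {remaining}')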

Conclusion

The schema modification process depends on the application and its limitations. The benefit of versioning your document structure is that no extra processing is needed on the database side, since a document is only rewritten when it changes; you don't have to run a long-lived script to update all the documents. The second approach, in contrast, might take a long time before all the documents in a collection are updated, depending on the collection's size and load. But migrating all the data removes the limitations of the versioning approach.


Peter Pavlov

Full-stack Developer, Python expert

LinkedIn: yogip