On Improving Data Consistency: A Retrospective

Let’s pick a scenario (using a real-life example). You are tasked with building a social media app. Your users will create posts and share them with the public. This content will receive interactions from other users. Your users are interested in how well the content is performing. How do you ensure this information is as consistent and correct as possible?

To answer this, let’s look back at some things I did wrong when trying to figure it out early on.

Building a Social Media App: The Beginning

In one of my previous lives, I was tasked with developing a social media application that was supposed to allow our users to carry out the following operations:

Create a post.
Like a post.
Make comments on a post.

To build this MVP, we used Node.js, TypeScript and MongoDB as our database, among others. We tracked certain stats in this application, such as the number of posts a user has made, the number of likes a post has and the number of comments on a post. This information was represented in our database as follows:

We had a collection that kept information about a user in our system (we will call it “user_collection”).
We had another collection (we will call it “post_collection”) that kept information about posts created in our system.
We also had a collection (we will call it “like_collection”) that kept information about the posts liked and the users who liked them.

The representation (in a simplified manner) looked like this:

user_collection: {
  _id: string,
  fullname: string,
  [...]
  stats: {
    number_of_posts: {
      type: number,
      default: 0
    },
    number_of_posts_liked: {
      type: number,
      default: 0
    }
  }
}

post_collection: {
  _id: string,
  owner_id: {
    type: string,
    refs: <user_collection>
  }
  [...]
  stats: {
    number_of_comments: {
      type: number,
      default: 0
    },
    number_of_likes: {
      type: number,
      default: 0
    }
  }
}

like_collection: {
  _id: string,
  user_id: {
    type: string,
    refs: <user_collection>
  },
  post_id: {
    type: string,
    refs: <post_collection>
  },
}

So what happens when a user carries out any of these operations? Let’s say a user creates a post. What we did was:

// post.ts
await post_repository.create({ ...post_data });
await user_repository.updateOne({ _id: user_id }, { $inc: { 'stats.number_of_posts': 1 } });

And if a user likes a post, our code looks like this:

// like.ts
await like_reposistory.create({ user_id, post_id });
await post_repository.updateOne({ video_id }, { $inc: { 'stats.number_of_likes': 1 } });
await user_repository.updateOne({ _id: user_id }, { $inc: { 'stats.number_of_posts_liked': 1 } });

When a post is created or liked, we create a record in the database and keep count, too. So if you create a post, we increment the “number_of_posts” count in the “user_collection” by 1, and so on.

Enter Inconsistent Data

All was going well until we realized that the stats reported ("number_of_comments," “number_of_likes”) did not match the records that were in the database. What went wrong?

Let’s take a look at our code again:

// like.ts
await like_reposistory.create({ user_id, post_id });
await post_repository.updateOne({ video_id }, { $inc: { 'stats.number_of_likes': 1 } });
await user_repository.updateOne({ _id: user_id }, { $inc: { 'stats.number_of_posts_liked': 1 } });

There is a big chance that when a user carries out a “like” operation, one of these might fail. What if multiple users are trying to like a post at the same time? Network failure? There is no guarantee that this block of code will execute successfully.

At this point, you might be thinking, “Transactions!" And you may be right. But in our case, I don’t think it was a good solution. Say a popular influencer on our app makes a new post. In minutes (or even seconds), the post starts getting thousands of likes and comments. Imagine the concurrency issues that will appear.

Without considering transactions (because of the issues we forecasted), we moved the count operation code to an update trigger.

Triggers Happy? Still More Inconsistent Data

Database triggers (in MongoDB) allow you to execute server-side logic whenever a document is added, updated or removed in a linked MongoDB cluster. It uses change streams to listen for changes to documents in a collection and passes database events to their associated trigger functions.

We moved our code around and it looked like this with this implementation:

import { ChangeStream, ChangeStreamDocument } from 'mongodb';
import LikeModel from '@schemas/like';
import LockModel from '@schemas/lock';

LikeModel.watch([], { fullDocument: 'updateLookup' }).on('change', async (data) => {
  try {
    const uniqueId = `${data.operationType}-${(data._id as any)._data}-like`;
    await LockModel.create({ uniqueId });

    if (data.operationType === 'insert') {
      /** a whole lot of other boilerplate code */
      await like_reposistory.create({ user_id, post_id });
      await post_repository.updateOne({ video_id }, { $inc: { 'stats.number_of_likes': 1 } });
      await user_repository.updateOne({ _id: user_id }, { $inc: { 'stats.number_of_posts_liked': 1 } });
    }
    // some other logic too to check for when a document is deleted
  } catch (error) {
    // ...
  }
}) as ChangeStream

The solution failed as much as the reason for the failure of our previous solution. We still ended up getting inconsistent counts with respect to the number of records in the database.

The Reliable Source?

Having gone through all of this, we were still unable to keep the count up-to-date. And a lot of times, we had to go into the database, run a query and update the counter to match the records in the database. It wasn’t ideal, and it took a lot of time to debug issues that we shouldn’t have been dealing with in the first place. So, we did something very simple — count the records.

“A man with one watch always knows the time. A man with two watches is never sure."

Counting the records turned out to be the most reliable way to achieve this consistency. At the time, it was okay for us. We didn’t have a gazillion records, and upfront, there was no performance issue with counting these records on the fly. Getting information about those metrics was as simple as:

// for getting the number of likes for a post.
await like_repository.countDocument({ post_id });

What Did I Learn?

It is redundant to store the count of database records, but whether or not you should do it depends on your situation. (In our case, it wasn’t right.)

Unless you have known performance problems, calculate the counts and totals on the fly in your application and don’t store them. (This approach will work as a start.)
Database normalization rules say that you shouldn’t have any values in your database that you can programmatically construct based on other values in your database.
Of course, it’s a different situation if you have a gazillion records. In such a case, you will typically want to revisit your requirements.
One possible solution to this problem might be the implementation of sharded counters, but this was an MVP. For simplicity's sake, counting the records proved to be a faster approach.

Resources