Subset Pattern: Optimizing MongoDB Working Sets

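Large documents that carry infrequently used data inflate the working set, the data and indexes an application touches regularly. When the working set exceeds available RAM, MongoDB has to read from disk more often and performance degrades.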

The Subset Pattern addresses this by splitting frequently accessed data into a main collection and infrequently accessed data into a secondary collection.

graph LR
    subgraph dataAccess["🔄 Data Access"]
        query["🔍 Query"]
        mainData["📊 Main Data"]
        secondaryData["📚 Secondary Data"]
    end
    query --> |"Frequent Access"| mainData
    query --> |"Occasional Access"| secondaryData

Split Data

For example, a product document may contain both product information and every review. Moving older reviews into a separate collection leaves only the most recent reviews in the product document, which reduces the working set size.

graph LR
    subgraph originalData["📄 Original Document"]
        productInfo["👜 Product Info"]
        allReviews["📝 All Reviews"]
    end
    subgraph splitData["🔀 Split Data"]
        subgraph mainCollection["📊 Main Collection"]
            productInfoSplit["🏷️ Product Info"]
            recentReviews["📝 Recent Reviews"]
        end
        subgraph secondaryCollection["📚 Secondary Collection"]
            oldReviews["📜 Old Reviews"]
        end
    end
    originalData --> |"Split"| splitData
    productInfo --> productInfoSplit
    allReviews --> recentReviews
    allReviews --> oldReviews

    style originalData fill:#eeeeee,stroke:#333,stroke-width:2px
    style splitData fill:#eeeeee,stroke:#333,stroke-width:2px
    style mainCollection fill:#dbf0fe,stroke:#333,stroke-width:2px
    style secondaryCollection fill:#e6f3ff,stroke:#333,stroke-width:2px
    style productInfo fill:#ffb3ba,stroke:#333,stroke-width:2px
    style allReviews fill:#ffdfba,stroke:#333,stroke-width:2px
    style productInfoSplit fill:#baffc9,stroke:#333,stroke-width:2px
    style recentReviews fill:#bae1ff,stroke:#333,stroke-width:2px
    style oldReviews fill:#ffffba,stroke:#333,stroke-width:2px
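
To put rough numbers on the split, here is a back-of-the-envelope estimate. Every figure in it (product count, bytes per product, bytes per review, reviews per product) is a hypothetical assumption rather than a measurement, and the sketch ignores indexes and treats every product as hot.

```javascript
// Rough working-set estimate. All numbers below are hypothetical.
function estimateWorkingSetBytes({ productCount, productInfoBytes, reviewBytes, embeddedReviews }) {
  return productCount * (productInfoBytes + reviewBytes * embeddedReviews);
}

const catalog = { productCount: 100000, productInfoBytes: 2048, reviewBytes: 512 };

// Before: every review embedded (assume 200 per product).
const before = estimateWorkingSetBytes({ ...catalog, embeddedReviews: 200 });
// After: only the 10 most recent reviews embedded.
const after = estimateWorkingSetBytes({ ...catalog, embeddedReviews: 10 });

const toGiB = (bytes) => (bytes / 1024 ** 3).toFixed(2);
console.log(`before: ${toGiB(before)} GiB, after: ${toGiB(after)} GiB`);
// before: 9.73 GiB, after: 0.67 GiB
```

Under these assumptions the product collection shrinks by more than an order of magnitude, which is the difference between spilling to disk and fitting in memory on a modest server.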

Before

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Super Widget",
  "reviews": [
    // Potentially hundreds of reviews
  ]
}

After

Product Collection:

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Super Widget",
  "reviews": [
    {
      "review_id": 123,
      "review_text": "Great widget."
    },
    {
      "review_id": 456,
      "review_text": "Awesome widget."
    }
  ]
}

Review Collection:

[
  {
    "review_id": 786,
    "product_id": ObjectId("507f1f77bcf86cd799439011"),
    "review_text": "Amazing widget."
  },
  {
    "review_id": 789,
    "product_id": ObjectId("507f1f77bcf86cd799439011"),
    "review_text": "Fantastic widget."
  }
]
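
The Before/After shapes can be produced by a small migration helper. The following is a minimal sketch in plain JavaScript, not a definitive implementation: `splitProduct` and its `limit` parameter are hypothetical names, a plain string stands in for the ObjectId, and the `reviews` array is assumed to be ordered oldest to newest.

```javascript
// Split one product document into the two shapes used by the Subset Pattern.
// `limit` (how many reviews stay embedded) is an assumed parameter.
function splitProduct(product, limit = 2) {
  const { reviews = [], ...productInfo } = product;

  // Assumes `reviews` is ordered oldest to newest,
  // so the last `limit` entries are the most recent.
  const cut = Math.max(reviews.length - limit, 0);
  const recent = reviews.slice(cut);
  const older = reviews.slice(0, cut);

  return {
    // Goes to the product collection.
    productDoc: { ...productInfo, reviews: recent },
    // Goes to the review collection, each linked back to its product.
    reviewDocs: older.map((review) => ({ ...review, product_id: product._id })),
  };
}

const original = {
  _id: '507f1f77bcf86cd799439011', // plain string stands in for an ObjectId
  name: 'Super Widget',
  reviews: [
    { review_id: 786, review_text: 'Amazing widget.' },
    { review_id: 789, review_text: 'Fantastic widget.' },
    { review_id: 123, review_text: 'Great widget.' },
    { review_id: 456, review_text: 'Awesome widget.' },
  ],
};

const { productDoc, reviewDocs } = splitProduct(original);
console.log(productDoc.reviews.map((r) => r.review_id)); // [ 123, 456 ]
console.log(reviewDocs.map((r) => r.review_id)); // [ 786, 789 ]
```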

Mongoose Schema

If you're using Mongoose with MongoDB, you can define separate schemas for the main collection and the secondary collection.

const mongoose = require('mongoose');

// Review Schema: embedded subdocument for the recent reviews kept in each product
const reviewSchema = new mongoose.Schema({
  author: String,
  text: String,
  rating: Number,
  createdAt: { type: Date, default: Date.now }
});

// Product Schema
const productSchema = new mongoose.Schema({
  name: String,
  description: String,
  price: Number,
  recentReviews: [reviewSchema]
});

// Separate Review Schema: standalone review documents linked to a product by productId
const separateReviewSchema = new mongoose.Schema({
  productId: { type: mongoose.Schema.Types.ObjectId, ref: 'Product' },
  author: String,
  text: String,
  rating: Number,
  createdAt: { type: Date, default: Date.now }
});

const Product = mongoose.model('Product', productSchema);
const SeparateReview = mongoose.model('SeparateReview', separateReviewSchema);

Add and Get Reviews

// Add a new review: store it in the review collection and keep
// only the 10 newest reviews embedded in the product document.
async function addReview(productId, reviewData) {
  const product = await Product.findById(productId);
  if (!product) throw new Error('Product not found');

  const newReview = new SeparateReview({
    productId: product._id,
    ...reviewData
  });
  await newReview.save();

  // Reuse the saved timestamp so both copies of the review agree.
  product.recentReviews.push({ ...reviewData, createdAt: newReview.createdAt });
  if (product.recentReviews.length > 10) {
    product.recentReviews.shift(); // drop the oldest embedded review
  }
  await product.save();
}

// Get all reviews for a product, newest first.
// addReview writes every review to SeparateReview, so that collection is
// already complete; merging in recentReviews would return duplicates.
async function getAllReviews(productId) {
  return SeparateReview.find({ productId }).sort({ createdAt: -1 });
}
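
The read-modify-write in `addReview` can lose an update when two reviews for the same product arrive at once. One possible alternative is to let MongoDB append and trim in a single update with `$push`, `$each`, and `$slice`. The sketch below only builds the update document, so it runs without a database; `buildRecentReviewsUpdate` is a hypothetical helper name, and the commented usage line assumes the `Product` model from the schema section.

```javascript
// Build an update that appends one review and trims the embedded array
// to the newest `limit` entries in a single atomic operation.
function buildRecentReviewsUpdate(review, limit = 10) {
  return {
    $push: {
      recentReviews: {
        $each: [review], // $slice is only valid together with $each
        $slice: -limit, // a negative $slice keeps the last `limit` elements
      },
    },
  };
}

const update = buildRecentReviewsUpdate({ author: 'Ana', text: 'Great widget.', rating: 5 });
console.log(JSON.stringify(update, null, 2));

// Hypothetical usage with the Product model defined earlier:
// await Product.updateOne({ _id: productId }, update);
```

Because the trim happens on the server, no application code has to load the product first, at the cost of skipping Mongoose document middleware that `save()` would run.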

More Use Cases

  • Social Media Platforms:
    • User Posts: Maintain recent posts in user document, older posts in a separate collection.
    • Friends List: Store active friends in user document, inactive friends separately.
  • Content Streaming Platforms:
    • Video Comments: Keep recent comments with video data, older comments in a separate collection.
    • Watch History: Store recent history in user profile, older history in a separate collection.
  • IoT Systems:
    • Device Logs: Maintain recent logs with device data, archive older logs separately.
    • Sensor Data: Keep recent readings in device document, historical data in a separate collection.
  • Financial Systems:
    • Transaction History: Store recent transactions in account document, older transactions separately.
    • Portfolio Performance: Keep current holdings and recent performance in main document, historical data separately.

Considerations

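  • Data Consistency: When recent items are stored in both the main document and the secondary collection, every write must update both copies or they drift apart.
  • Query Complexity: Reads that need the full history require a second query against the secondary collection.
  • Data Growth: The secondary collection keeps growing, so plan its indexes and an archival strategy.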
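
The consistency concern can be checked mechanically. Here is a minimal sketch of such a check in plain JavaScript, assuming reviews carry the `createdAt` and `text` fields from the Mongoose schemas and that the pair identifies a review; `isSubsetConsistent` is a hypothetical helper, and `limit` is expected to be a positive number.

```javascript
// Returns true when the embedded subset matches the newest `limit` reviews
// in the full collection. Reviews are compared by `createdAt` and `text`,
// an assumed identity for this sketch.
function isSubsetConsistent(recentReviews, allReviews, limit = 10) {
  const key = (r) => `${new Date(r.createdAt).getTime()}|${r.text}`;

  // Sort a copy oldest to newest, then keep the last `limit` entries.
  const expected = [...allReviews]
    .sort((a, b) => new Date(a.createdAt) - new Date(b.createdAt))
    .slice(-limit)
    .map(key);
  const actual = recentReviews.map(key);

  return expected.length === actual.length && expected.every((k, i) => k === actual[i]);
}

const a = { text: 'Amazing widget.', createdAt: '2024-01-01' };
const b = { text: 'Great widget.', createdAt: '2024-01-02' };
const c = { text: 'Awesome widget.', createdAt: '2024-01-03' };

console.log(isSubsetConsistent([b, c], [c, a, b], 2)); // true
console.log(isSubsetConsistent([a, b], [c, a, b], 2)); // false
```

A check like this could run in a periodic job or a test suite to detect products whose embedded reviews have drifted from the review collection.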

Summary

The Subset Pattern is a powerful technique for optimizing MongoDB performance by reducing working set size. It involves splitting large documents into frequently and infrequently accessed data, storing them in separate collections. Key benefits include:

  • Improved query performance
  • Reduced memory usage
  • Better scalability for large datasets

However, it comes with trade-offs such as increased complexity in data management and potential for additional queries.

This post is licensed under CC BY 4.0 by the author.