MongoDB aggregation vs client-side processing

Keywords: client-side processing, aggregation, MongoDB. Updated: 2023-09-26

I have a blogs collection with roughly the following schema:

{ 
    title: { name: "My First Blog Post",
             postDate: "01-28-11" },
    content: "Here is my super long post ...",
    comments: [ { text: "This post sucks!"
              , name: "seanhess"
              , postDate: "01-28-14"}
            , { text: "I know! I wish it were longer"
              , name: "bob"
              , postDate: "01-28-11"}
            ] 
}

There are three queries I mainly want to run:

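1. Give me all the comments made by bob only.
2. Find all the comments made on the same day the post was made, i.e. comments.postDate = title.postDate.
3. Find all the comments made by bob on the same day the post was made.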

My questions are as follows:

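1. These three will be very frequent queries, so is it a good idea to use the aggregation framework?
2. For the third query, I could simply run something like db.blogs.find({"comments.name": "bob"}, {"comments.name": 1, "comments.postDate": 1, "title.postDate": 1}) and then do client-side post-processing, looping over the returned results. Is this a good idea? Note that this could return a few thousand documents.
3. I would be glad if you could suggest some ways to do the third query.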
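For reference, the client-side post-processing described in the second question might look roughly like the sketch below. It is plain JavaScript over documents shaped like the projected find() results. The helper name sameDayCommentsBy and the sample documents are made up for illustration, and dates are compared as raw strings, which assumes they are stored in one consistent format.

```javascript
// Hypothetical helper: keep only the comments written by `name` whose
// postDate equals the postDate of the post itself.
// Assumption: dates are stored as strings in one consistent format,
// so plain string equality is enough.
function sameDayCommentsBy(docs, name) {
  const results = [];
  for (const doc of docs) {
    const postDate = doc.title.postDate;
    for (const comment of doc.comments || []) {
      if (comment.name === name && comment.postDate === postDate) {
        results.push({ postId: doc._id, comment: comment });
      }
    }
  }
  return results;
}

// Illustrative documents, shaped like what the projected find() returns.
// Note that find() returns every comment of a matching post, not just bob's.
const docs = [
  {
    _id: 1,
    title: { postDate: "01-28-11" },
    comments: [
      { name: "seanhess", postDate: "01-28-14" },
      { name: "bob", postDate: "01-28-11" }
    ]
  },
  {
    _id: 2,
    title: { postDate: "02-01-11" },
    comments: [{ name: "bob", postDate: "02-03-11" }]
  }
];

const matches = sameDayCommentsBy(docs, "bob");
console.log(matches);
```

Every matching post travels over the network with all of its comments, and the filtering loop runs once per returned document, which is the cost the question is worried about.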

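It's generally best practice to "break up" multiple questions into several separate questions, if only because the answer to one may help you understand another.

I'm also not very keen on answering anything where there is no example showing what you have already tried. But that said, and at the risk of "shooting myself in the foot", these are reasonable questions from a design point of view, so I'll answer.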

Point 1: Comments by "bob"

A standard $unwind, then filter the result. Use $match first so that you don't process documents you don't need.

db.collection.aggregate([
    // Match to "narrow down" the documents.
    { "$match": { "comments.name": "bob" }},
    // Unwind the array
    { "$unwind": "$comments" },
    // Match and "filter" just the "bob" comments
    { "$match": { "comments.name": "bob" }},
    // Possibly wind back the array
    { "$group": {
       "_id": "$_id",
       "title": { "$first": "$title" },
       "content": { "$first": "$content" },
       "comments": { "$push": "$comments" }
    }}
])

Point 2: All comments on the same day

db.collection.aggregate([
    // Try and match posts within a date or range
    // { "$match": { "title.postDate": new Date( /* something */ ) }},
    // Unwind the array
    { "$unwind": "$comments" },
    // Aha! Project out the same day. Not the time-stamp.
    { "$project": {
        "title": 1,
        "content": 1,
        "comments": 1,
        "same": { "$eq": [
            {
                "year"   : { "$year":  "$title.postDate" },
                "month"  : { "$month": "$title.postDate" },
                "day": { "$dayOfMonth": "$title.postDate" }
            },
            {
                "year"   : { "$year": "$comments.postDate" },
                "month"  : { "$month": "$comments.postDate" },
                "day": { "$dayOfMonth": "$comments.postDate" }
            }
        ]}
    }},
    // Match the things on the "same" field
    { "$match": { "same": true } },
    // Possibly wind back the array
    { "$group": {
       "_id": "$_id",
       "title": { "$first": "$title" },
       "content": { "$first": "$content" },
       "comments": { "$push": "$comments" }
    }}
])
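The date operators used in this pipeline ($year, $month, $dayOfMonth) work on BSON date values, so the pipeline assumes postDate is stored as a date rather than as a string like "01-28-11". A minimal sketch of converting such strings before inserting the documents, assuming the format is MM-DD-YY and that two-digit years mean 20YY (parsePostDate is a hypothetical helper, not part of any driver):

```javascript
// Hypothetical helper: parse an "MM-DD-YY" string into a Date at UTC midnight.
// Assumption: two-digit years mean 20YY.
function parsePostDate(text) {
  const match = /^(\d{2})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) {
    throw new Error("Unexpected date format: " + text);
  }
  // match[0] is the whole string; the three capture groups follow it.
  const [, month, day, year] = match.map(Number);
  // JavaScript months are zero-based, hence month - 1.
  return new Date(Date.UTC(2000 + year, month - 1, day));
}

// Example: a comment whose postDate is a real date rather than a string.
const comment = {
  text: "I know! I wish it were longer",
  name: "bob",
  postDate: parsePostDate("01-28-11")
};
console.log(comment.postDate.toISOString()); // 2011-01-28T00:00:00.000Z
```

UTC midnight is used because the date operators extract their parts in UTC unless told otherwise, so the year, month and day come back exactly as written in the string.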

Point 3: "bob" on the same day

db.collection.aggregate([
    // Try and match posts within a date or range
    // { "$match": { "title.postDate": new Date( /* something */ ) }},
    // Unwind the array
    { "$unwind": "$comments" },
    // Aha! Project out the same day. Not the time-stamp.
    { "$project": {
        "title": 1,
        "content": 1,
        "comments": 1,
        "same": { "$eq": [
            {
                "year"   : { "$year":  "$title.postDate" },
                "month"  : { "$month": "$title.postDate" },
                "day": { "$dayOfMonth": "$title.postDate" }
            },
            {
                "year"   : { "$year": "$comments.postDate" },
                "month"  : { "$month": "$comments.postDate" },
                "day": { "$dayOfMonth": "$comments.postDate" }
            }
        ]}
    }},
    // Match the things on the "same" field
    { "$match": { "same": true, "comments.name": "bob" } },
    // Possibly wind back the array
    { "$group": {
       "_id": "$_id",
       "title": { "$first": "$title" },
       "content": { "$first": "$content" },
       "comments": { "$push": "$comments" }
    }}
])

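Outcome

Quite honestly, and especially if you are using an index to feed the initial $match stage of these operations, it should be very clear that this will "run rings" around trying to iterate over the data in code.

At the very least, this reduces the records returned "over the wire", so there is less network traffic. And of course there is less (or nothing) to post-process once the query results have been received.

As a general rule, database server hardware tends to be an order of magnitude more powerful than "application server" hardware. So the general expectation is that anything executed on the server will run faster.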

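Is aggregation the right thing to use: "Yes", and by a long way. As of MongoDB 2.6 you even get the results back as a cursor.
How can you do the queries you want: as shown above, quite simply. In real-world code we never "hard code" these pipelines; we build them dynamically. So adding conditions and attributes should be as simple as any of your normal data manipulation code.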
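As a rough illustration of building a pipeline dynamically, the sketch below assembles the three pipelines shown earlier from a pair of options. The function buildCommentsPipeline and its option names (name, sameDay) are hypothetical, chosen only for this example. Unlike the third pipeline above, it also adds the initial $match on the commenter's name whenever a name is given.

```javascript
// Hypothetical helper: assemble an aggregation pipeline from options.
//   name    - if set, keep only comments by this commenter
//   sameDay - if true, keep only comments posted on the post's own day
function buildCommentsPipeline(options = {}) {
  const { name, sameDay = false } = options;
  const pipeline = [];
  const commentMatch = {};

  if (name !== undefined) {
    // Narrow down the documents before unwinding.
    pipeline.push({ $match: { "comments.name": name } });
    commentMatch["comments.name"] = name;
  }

  pipeline.push({ $unwind: "$comments" });

  if (sameDay) {
    // Compare calendar days only, not full timestamps.
    const dayOf = (field) => ({
      year: { $year: field },
      month: { $month: field },
      day: { $dayOfMonth: field }
    });
    pipeline.push({
      $project: {
        title: 1,
        content: 1,
        comments: 1,
        same: { $eq: [dayOf("$title.postDate"), dayOf("$comments.postDate")] }
      }
    });
    commentMatch.same = true;
  }

  if (Object.keys(commentMatch).length > 0) {
    pipeline.push({ $match: commentMatch });
  }

  // Wind the surviving comments back into one array per post.
  pipeline.push({
    $group: {
      _id: "$_id",
      title: { $first: "$title" },
      content: { $first: "$content" },
      comments: { $push: "$comments" }
    }
  });
  return pipeline;
}

// In the shell this would be passed straight to aggregate(), e.g.
// db.collection.aggregate(buildCommentsPipeline({ name: "bob", sameDay: true }))
console.log(JSON.stringify(buildCommentsPipeline({ name: "bob", sameDay: true }), null, 2));
```

Keeping the conditions in one commentMatch object means a new filter only needs one extra key, rather than another hand-written copy of the pipeline.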

So I usually wouldn't answer this style of question. But say thanks! Please?