
MongoDB: How to deal with big collection

0 votes
355 views

Recently, I have been working on a project with a collection of posts (comments). The collection holds around 20 million posts, and each document contains information about one post (id, URL, content, author, likes, dislikes).

Yesterday, I tried to count the number of posts per user; that is, I would like a list of authors with the number of posts each one has written.

To this end, I used some simple Java code... After around 10 hours, I had results for only 180 users, whereas there are more than 500,000 users in my collection. So, with my code I would need months :)

Can you please recommend some tips to improve the efficiency of my queries?

posted Mar 16, 2016 by anonymous


1 Answer

0 votes

Can you provide some more information about your current schema and the Java code you have written?
Can you also tell us whether you have indexes, and if so, what they are?

My first guess would be that you are looping over the complete collection in code instead of using a query / aggregation.
You may also need to adjust your data model to fit your query, if that is needed and possible.

Also, this is an example of what might work for you:

db.comments.aggregate([ {$group: {_id:"$author",total:{$sum:1}}},{ $sort: { total: -1 } } ])

The sort is optional, and if you can, you should do a $match first to make your working set smaller.
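As a rough sketch of the "match first" advice, the pipeline could be built as plain data with the optional $match stage placed ahead of $group. The helper name `postsPerAuthorPipeline` and the `likes >= 10` filter are assumptions made up for illustration; `allowDiskUse` is an extra option not mentioned above that may help when the grouping stage exceeds the in-memory limit.

```javascript
// Hypothetical helper (not part of MongoDB): builds the pipeline as plain data
// so it can be inspected before it is passed to db.comments.aggregate(...).
function postsPerAuthorPipeline(filter) {
  const stages = [];
  // An optional $match goes first so the later stages see fewer documents.
  if (filter && Object.keys(filter).length > 0) {
    stages.push({ $match: filter });
  }
  // One output document per distinct author, counting that author's posts.
  stages.push({ $group: { _id: "$author", total: { $sum: 1 } } });
  // Optional: most active authors first.
  stages.push({ $sort: { total: -1 } });
  return stages;
}

// Assumed filter, for illustration only: posts with at least 10 likes.
const pipeline = postsPerAuthorPipeline({ likes: { $gte: 10 } });

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.comments.aggregate(pipeline, { allowDiskUse: true }).forEach(printjson);
}
```

Calling `postsPerAuthorPipeline()` with no filter gives the same two stages as the one-liner above, so the whole collection is counted.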

answer Apr 6, 2016 by Manikandan J
Similar Questions
0 votes

My collection looks like this, and I want to update a nested object to {"env":"qa"} using the Java driver:

{ "_id" : ObjectId("5b052eeff9290437b217b1ed"), "app" : "hike", "group" : [ { "env" : "prod" }, { "env" : "test" } ]}
{ "_id" : ObjectId("5b052f36f9290437b217b1ee"), "app" : "viber", "group" : [ { "env" : "prod" }, { "env" : "test" } ]}

+1 vote

With the non-blocking asynchronous Mongo Java/Scala driver, it is possible to define a wait time and a wait queue size for operations that cannot be executed immediately on a free connection. When these values are set, the Mongo driver makes threads wait for an available connection.

This behavior is very dangerous for an application written with non-blocking asynchronous IO in mind. These applications use a very limited number of threads (= number of cores). Blocking one of these threads can block the whole application.

What would be the recommended approach for this kind of application? Should we set all these wait settings to 0 and handle MongoWaitQueueFullException with retries in the application? Should the driver call an application callback when a connection is free?

0 votes

MongoDB 3.2.8, WiredTiger storage engine.
A single collection is larger than 500 GB. Does it need to be split?

Which parameters should be adjusted?

+1 vote

I have a DB with several (quite big) collections, on Mongo 3.2.5 and WiredTiger with snappy compression on. I would like to change the compression setting on one small collection I have to test it uncompressed. Is there a possibility to do that on a secondary without having to resync the whole DB?

...