This is quite a good article on how Reddit does view counting.
The problem has the following requirements:
- Counts must be real time or near-real time. No daily or hourly aggregates.
- Each user must only be counted once within a short time window.
- The displayed count must be within a few percentage points of the actual tally.
- The system must be able to run at production scale and process events within a few seconds of their occurrence. (Remark: Reddit is No. 8 worldwide by visit count.)
Such a problem is known as the "count-distinct problem", also called cardinality estimation (https://en.wikipedia.org/wiki/Count-distinct_problem).
A naive implementation would store each post's set of unique viewer IDs in an in-memory hash table keyed by post ID.
However, this is impractical: several popular posts have over a million unique viewers, and the memory and CPU cost of such a solution would be prohibitive.
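For concreteness, here is a minimal sketch of that naive design (function and variable names are mine, not from the article). It works, but memory grows linearly with the number of unique viewers per post, which is exactly why it fails at Reddit's scale:

```python
from collections import defaultdict

# Naive exact counting: one in-memory set of viewer IDs per post.
# Memory usage grows linearly with unique viewers, so a post with
# a million viewers costs tens of megabytes just for this one counter.
view_sets = defaultdict(set)

def record_view(post_id: str, user_id: str) -> None:
    view_sets[post_id].add(user_id)

def view_count(post_id: str) -> int:
    return len(view_sets[post_id])

record_view("post_1", "alice")
record_view("post_1", "bob")
record_view("post_1", "alice")  # duplicate view, counted once
print(view_count("post_1"))  # prints 2
```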
It turns out the engineers at Reddit solved it by combining two algorithms, chosen based on scale:
1) Linear probabilistic counting
2) HyperLogLog(HLL)-based counting
Both algorithms use clever hashing tricks so that they need extremely little memory to do the counting (e.g., counting about one million IDs in roughly 12 KB of space).
I will not go through the details of the magic box, but if you are interested you can read this article: "Big Data Counting: How To Count A Billion Distinct Objects Using Only 1.5KB Of Memory".
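To give a flavor of the first technique, here is a toy sketch of linear probabilistic counting (my own illustrative code, not Reddit's implementation). Each ID is hashed into a fixed-size bitmap, and the cardinality is estimated from the fraction of bits still zero, so duplicates are absorbed for free and memory stays constant:

```python
import hashlib
import math

class LinearCounter:
    """Linear probabilistic counting: hash each ID to a bit in a
    fixed-size bitmap, then estimate the number of distinct items
    from the fraction of bits that remain zero."""

    def __init__(self, num_bits: int = 1 << 16):
        self.m = num_bits
        self.bitmap = bytearray(num_bits // 8)  # fixed 8 KB for 65536 bits

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big") % self.m
        self.bitmap[h // 8] |= 1 << (h % 8)

    def count(self) -> int:
        zero_bits = sum(8 - bin(b).count("1") for b in self.bitmap)
        if zero_bits == 0:
            return self.m  # bitmap saturated; estimate is no longer reliable
        # Maximum-likelihood estimate: n ~= -m * ln(V), V = zero fraction
        return round(-self.m * math.log(zero_bits / self.m))

lc = LinearCounter()
for i in range(10_000):
    lc.add(f"user-{i}")
    lc.add(f"user-{i}")  # duplicate views change nothing
print(lc.count())  # close to 10000, using only 8 KB of state
```

The estimate stays within a few percent as long as the bitmap is not too full; that load-factor limit is why it suits small-to-medium cardinalities.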
Some people have demoed the different counting techniques by counting the number of distinct words in all of Shakespeare's works with three approaches.
The results are below:

| Counting technique | Bytes used | Count | Error |
|---|---|---|---|
| HashSet (exact) | 10447016 | 67801 | 0% |
| Linear probabilistic counting | 3384 | 67080 | 1% |
| HyperLogLog | 512 | 70002 | 3% |
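To illustrate how HyperLogLog reaches such a tiny footprint, here is a simplified toy implementation (my own sketch; real libraries add bias corrections and sparse encodings). It keeps only the maximum "leading-zero rank" seen in each of 2^p registers and combines them into an estimate:

```python
import hashlib
import math

class SimpleHLL:
    """Toy HyperLogLog: 2^p registers, each holding the largest
    leading-zero rank observed among the hashes routed to it."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p  # 16384 registers at p=14 (~12-16 KB when packed)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            est = self.m * math.log(self.m / zeros)
        return round(est)

hll = SimpleHLL(p=14)
for i in range(50_000):
    hll.add(f"id-{i}")
print(hll.count())  # within a few percent of 50000
```

Notice the state is just the register array, independent of how many items are added, which matches the small byte counts in the table above.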
Next time you encounter a unique-item count estimation problem, consider using the linear probabilistic counting and HyperLogLog techniques to solve it.