[Article sharing] How Reddit do View Counting?

View Counting at Reddit

This is quite a good article talking about how Reddit do view counting.

 

 

The problem has these following requirement:

  • Counts must be real time or near-real time. No daily or hourly aggregates.
  • Each user must only be counted once within a short time window.
  • The displayed count must be within a few percentage points of the actual tally.
  • The system must be able to run at production scale and process events within a few seconds of their occurrence. (Remarks: Reddit is No.8 in global visit count)

Actually such problem is categorized as “cardinality estimation problem“. (https://en.wikipedia.org/wiki/Count-distinct_problem)

 

 

A naive implementation of this solution would be to store the unique user set as a hash table in memory, with the post ID as the key.
However, it is not practical because several popular posts have over one million unique viewers and the memory & CPU usage for such solution would be too costly.

Turn out the engineers in Reddit solve it by using a combination of two algorithms for different scaling level:
1) Linear probabilistic counting
2) HyperLogLog(HLL)-based counting

Both of such algorithms used some tricky magic so that they can use extremely little memory to do the counting. (e.g: count 1M IDs using just 12 KB space)
I would not go through the details of the magic box but if you are interested you can see this article: Big Data Counting: How to count a billion distinct objects using only 1.5KB of Memory .

 

Some people had done a demo of the different counting techniques by counting the number of distinct words in all of Shakespeare’s works using three different counting techniques.

And the result is below:

Counter Bytes Used Count Error
HashSet 10447016 67801 0%
Linear probabilistic counting 3384 67080 1%
HyperLogLog 512 70002 3%

 

 

Next time if you encountered unique item counting estimation problem, you may consider using the “Linear probabilistic counting” and “HyperLogLog” techniques to help you solve the problem.

Leave a Reply

Your email address will not be published.