Hi Ravi, thanks for sharing this insightful blog.
One question:
Computing global top-k elements using local top-k elements might not always be correct.
E.g., suppose we are interested in the top 2 elements and the local frequency counts are as below:
Local1 = A:8, B:7, C:6
Local2 = A:3, B:2, C:6
Top 2 from local1 is A, B and from local2 is C, A.
Merging these, the global top 2 will be A:11, B:7.
But it should be C:12, A:11.
Please let me know what exactly I am missing here. Thanks.
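Here is a quick Python sketch of what I mean (the counts and the merge logic are just my toy illustration, not anything from the blog):

```python
from collections import Counter

# Toy local frequency counts from my example above.
local1 = Counter({"A": 8, "B": 7, "C": 6})
local2 = Counter({"A": 3, "B": 2, "C": 6})

K = 2

# Keep only the local top-2 from each bucket, then merge.
merged_top = Counter(dict(local1.most_common(K))) + Counter(dict(local2.most_common(K)))
print(merged_top.most_common(K))   # [('A', 11), ('B', 7)]

# Merging the full counts gives the true answer.
true_counts = local1 + local2
print(true_counts.most_common(K))  # [('C', 12), ('A', 11)]
```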
Thanks Nitesh. In order to get the global top k (k-global), one will need to calculate the local top-k (k-local) elements. Note that k-local has to be greater than or equal to k-global. If you calculate only the top 1 (less than 2) from each local bucket, then yes, we will have incorrect results.
I still don't think it's clear to me. Can you take the same example and explain how it will work in the case of an MR job for top K?
This case shouldn't even come up, as the 'local' partitions, prior to topK, are already reduced by key => one partition per key. So 'A' can be sourced from either local1 or local2, not both.
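A minimal sketch of that idea, assuming a simple hash partitioner (the partitioning function, data, and counts here are illustrative, not the actual pipeline):

```python
from collections import Counter

events = ["A"] * 11 + ["B"] * 9 + ["C"] * 12
NUM_PARTITIONS = 2
K = 2

# Shuffle step: partition by key, so every occurrence of a key
# lands in exactly one partition.
partitions = [Counter() for _ in range(NUM_PARTITIONS)]
for key in events:
    partitions[hash(key) % NUM_PARTITIONS][key] += 1

# Each partition now holds the *total* count for its keys, so the
# local top-K lists have disjoint keys and merging them is exact.
merged = Counter()
for part in partitions:
    merged.update(dict(part.most_common(K)))

print(merged.most_common(K))  # [('C', 12), ('A', 11)], the true global top 2
```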
I think the major problem with the approach is this:
Let's say I have a CMS for every 1-minute window, with K = 2.
1:00pm - 1:01pm, total frequency: A:8, B:7, C:6, top K = A:8, B:7
1:01pm - 1:02pm, total frequency: A:3, B:2, C:6, top K = C:6, A:3
When a user wants to see the top K for 1:00pm - 1:02pm, we merge the top-K lists from each minute:
A:8, B:7 + C:6, A:3 = A:11, B:7, C:6 = A:11, B:7
But in fact it should be C:12, A:11. How do we solve this problem? Rely on the slow path? It may have the same problem if we ask for the top K for 1:00pm - 3:02pm; it is never accurate.
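A small Python sketch of the problem (toy data matching my example above, not the blog's code):

```python
from collections import Counter

K = 2

# Per-minute total frequencies (what the CMS approximates).
minute_counts = [
    Counter({"A": 8, "B": 7, "C": 6}),  # 1:00pm - 1:01pm
    Counter({"A": 3, "B": 2, "C": 6}),  # 1:01pm - 1:02pm
]

# If we only store the top-K list per minute, C:6 is dropped from
# the first minute, so the range query loses part of C's true total.
merged_from_topk = Counter()
for counts in minute_counts:
    merged_from_topk.update(dict(counts.most_common(K)))
print(merged_from_topk.most_common(K))  # [('A', 11), ('B', 7)]  <- wrong

# Merging the full per-minute counts gives the true answer.
true_merged = sum(minute_counts, Counter())
print(true_merged.most_common(K))       # [('C', 12), ('A', 11)]
```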
How is the data partitioner + map-reduce approach helping us accelerate the process of coming up with more accurate (than count-min sketch) top-k lists?
The data partitioner + map reducer is slower but gives us a more accurate result set.
Why slow? Data is spilled to disk and is not kept in memory.
Why accurate? No loss of information (which does occur with the count-min sketch data structure).
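A rough sketch of that accuracy point, using a toy count-min sketch (the width/depth and hashing here are arbitrary choices for illustration, not the blog's implementation):

```python
import hashlib
from collections import Counter

# Toy count-min sketch: a fixed-size counter grid, so hash
# collisions can only *over*estimate a key's true count.
WIDTH, DEPTH = 4, 3

def cms_index(key, row):
    digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
    return int(digest, 16) % WIDTH

table = [[0] * WIDTH for _ in range(DEPTH)]
exact = Counter()  # the "slow path": exact counts, no information loss

events = ["A"] * 11 + ["B"] * 9 + ["C"] * 12 + list("DEFGHIJ") * 3
for key in events:
    exact[key] += 1
    for row in range(DEPTH):
        table[row][cms_index(key, row)] += 1

def cms_estimate(key):
    return min(table[row][cms_index(key, row)] for row in range(DEPTH))

for key in "ABC":
    print(key, "exact:", exact[key], "cms:", cms_estimate(key))
# With such a small table, some estimates can come out too high;
# that is the information loss. The exact counts never drift.
```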
Sorry for the delayed response.
cc @priyam.
Hey Ravi, I like the way you are converting Mikhail Smarshchok's YouTube videos into blog posts. You should put an appreciation note mentioning his efforts somewhere in the blog.
Hi Puneet. That is a good point. I think I do put a reference to the channel in the introduction of each blog. I might have missed some blogs.