9 Comments
May 10, 2022·edited May 10, 2022Liked by Ravi Tandon

Hi Ravi, thanks for sharing this insightful blog.

One question:

Computing global top k elements using local top k elements might not be always correct.

eg.

Suppose we are interested in top 2 elements and local frequency count is as below,

Local1 = A:8, B:7, C:6

Local2 = A:3, B:2, C:6

Top 2 from local1 is A, B and from local2 is C,A

Global top 2 will be A:11, B:7

But it should be C:12, A:11.

Please let me know what exactly i am missing here. Thanks

Expand full comment

i think the major problem of the approach is

let say i have a CMS for every 1 min , K = 2

1:00pm - 1:01, total frequency: A:8, B:7, C:6, top K = A:8, B:7

1:01pm - 1:02, total frequency: A:3, B:2, C:6, top K = C:6, A:3

when user wanna see topK from 1:00pm - 1:02pm, we merge every topk list from each min

A:8, B:7 + C:6, A:3 = A:11, B:7,C:6 = A:11, B:7

but in fact, it should be C12, A11, how do we solve this problem? relie on slow path? it may have the same problem if we ask top K for 1:00pm - 3:02pm, it is never accuracy

Expand full comment

How is the data partitioner + map reduce approach helping us to accelerate the process of coming with more accurate (than count min sketch) top k lists?

Expand full comment

Hey Ravi, I like the way you are converting Mikhail Smarshchok's youtube videos to blog posts. You should be putting an appreciation note mentioning his efforts in the blog somewhere.

Expand full comment