1:00pm - 1:01, total frequency: A:8, B:7, C:6, top K = A:8, B:7
1:01pm - 1:02, total frequency: A:3, B:2, C:6, top K = C:6, A:3
when user wanna see topK from 1:00pm - 1:02pm, we merge every topk list from each min
A:8, B:7 + C:6, A:3 = A:11, B:7,C:6 = A:11, B:7
but in fact, it should be C12, A11, how do we solve this problem? relie on slow path? it may have the same problem if we ask top K for 1:00pm - 3:02pm, it is never accuracy
How is the data partitioner + map reduce approach helping us to accelerate the process of coming with more accurate (than count min sketch) top k lists?
Hey Ravi, I like the way you are converting Mikhail Smarshchok's youtube videos to blog posts. You should be putting an appreciation note mentioning his efforts in the blog somewhere.
Interview Preparation: Design A System To Get TopK Elements At Scale
Hi Ravi, thanks for sharing this insightful blog.
One question:
Computing global top k elements using local top k elements might not be always correct.
eg.
Suppose we are interested in top 2 elements and local frequency count is as below,
Local1 = A:8, B:7, C:6
Local2 = A:3, B:2, C:6
Top 2 from local1 is A, B and from local2 is C,A
Global top 2 will be A:11, B:7
But it should be C:12, A:11.
Please let me know what exactly i am missing here. Thanks
i think the major problem of the approach is
let say i have a CMS for every 1 min , K = 2
1:00pm - 1:01, total frequency: A:8, B:7, C:6, top K = A:8, B:7
1:01pm - 1:02, total frequency: A:3, B:2, C:6, top K = C:6, A:3
when user wanna see topK from 1:00pm - 1:02pm, we merge every topk list from each min
A:8, B:7 + C:6, A:3 = A:11, B:7,C:6 = A:11, B:7
but in fact, it should be C12, A11, how do we solve this problem? relie on slow path? it may have the same problem if we ask top K for 1:00pm - 3:02pm, it is never accuracy
How is the data partitioner + map reduce approach helping us to accelerate the process of coming with more accurate (than count min sketch) top k lists?
Hey Ravi, I like the way you are converting Mikhail Smarshchok's youtube videos to blog posts. You should be putting an appreciation note mentioning his efforts in the blog somewhere.