Hi Ravi, thanks for sharing this insightful blog.
One question:
Computing global top-k elements using local top-k elements might not always be correct.
E.g., suppose we are interested in the top 2 elements and the local frequency counts are as below:
Local1 = A:8, B:7, C:6
Local2 = A:3, B:2, C:6
Top 2 from local1 is A, B and from local2 is C, A.
Merging these, the global top 2 will be A:11, B:7.
But it should be C:12, A:11.
Please let me know what exactly I am missing here. Thanks.
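Here is a quick Python sketch of what I mean (the counts and the merge logic are just my toy illustration, not anything from the blog):

```python
from collections import Counter

# Toy local frequency counts from my example above.
local1 = Counter({"A": 8, "B": 7, "C": 6})
local2 = Counter({"A": 3, "B": 2, "C": 6})

K = 2

# Keep only the local top-2 from each bucket, then merge.
merged_top = Counter(dict(local1.most_common(K))) + Counter(dict(local2.most_common(K)))
print(merged_top.most_common(K))   # [('A', 11), ('B', 7)]

# Merging the full counts gives the true answer.
true_counts = local1 + local2
print(true_counts.most_common(K))  # [('C', 12), ('A', 11)]
```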
Thanks Nitesh. In order to get the global top k (k-global), one will need to calculate the local top-k (k-local) elements. Note that k-local has to be greater than or equal to k-global. If you calculate only the top 1 (less than 2) from each local bucket, then yes, we will have incorrect results.
I still don't think it's clear to me. Can you take the same example and explain how it will work in the case of an MR job for top K?
This case shouldn't even come up, as the 'local' partitions, prior to topK, are already reduced by key => one partition per key. So 'A' can be sourced from either local1 or local2, not both.
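A minimal sketch of that idea, assuming a simple hash partitioner (the partitioning function, data, and counts here are illustrative, not the actual pipeline):

```python
from collections import Counter

events = ["A"] * 11 + ["B"] * 9 + ["C"] * 12
NUM_PARTITIONS = 2
K = 2

# Shuffle step: partition by key, so every occurrence of a key
# lands in exactly one partition.
partitions = [Counter() for _ in range(NUM_PARTITIONS)]
for key in events:
    partitions[hash(key) % NUM_PARTITIONS][key] += 1

# Each partition now holds the *total* count for its keys, so the
# local top-K lists have disjoint keys and merging them is exact.
merged = Counter()
for part in partitions:
    merged.update(dict(part.most_common(K)))

print(merged.most_common(K))  # [('C', 12), ('A', 11)], the true global top 2
```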
I think the major problem with the approach is this:
Let's say I have a CMS for every 1-minute window, with K = 2.
1:00pm - 1:01pm, total frequency: A:8, B:7, C:6, top K = A:8, B:7
1:01pm - 1:02pm, total frequency: A:3, B:2, C:6, top K = C:6, A:3
When a user wants to see the top K for 1:00pm - 1:02pm, we merge the top-K lists from each minute:
A:8, B:7 + C:6, A:3 = A:11, B:7, C:6 = A:11, B:7
But in fact it should be C:12, A:11. How do we solve this problem? Rely on the slow path? It may have the same problem if we ask for the top K for 1:00pm - 3:02pm; it is never accurate.
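A small Python sketch of the problem (toy data matching my example above, not the blog's code):

```python
from collections import Counter

K = 2

# Per-minute total frequencies (what the CMS approximates).
minute_counts = [
    Counter({"A": 8, "B": 7, "C": 6}),  # 1:00pm - 1:01pm
    Counter({"A": 3, "B": 2, "C": 6}),  # 1:01pm - 1:02pm
]

# If we only store the top-K list per minute, C:6 is dropped from
# the first minute, so the range query loses part of C's true total.
merged_from_topk = Counter()
for counts in minute_counts:
    merged_from_topk.update(dict(counts.most_common(K)))
print(merged_from_topk.most_common(K))  # [('A', 11), ('B', 7)]  <- wrong

# Merging the full per-minute counts gives the true answer.
true_merged = sum(minute_counts, Counter())
print(true_merged.most_common(K))       # [('C', 12), ('A', 11)]
```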
How is the data partitioner + map-reduce approach helping us accelerate the process of coming up with more accurate (than count-min sketch) top-k lists?
The data partitioner + map reducer is slower but gives us a more accurate result set.
Why slow? Data is spilled to disk and is not kept in memory.
Why accurate? No loss of information (which does occur with the count-min sketch data structure).
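A rough sketch of that accuracy point, using a toy count-min sketch (the width/depth and hashing here are arbitrary choices for illustration, not the blog's implementation):

```python
import hashlib
from collections import Counter

# Toy count-min sketch: a fixed-size counter grid, so hash
# collisions can only *over*estimate a key's true count.
WIDTH, DEPTH = 4, 3

def cms_index(key, row):
    digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
    return int(digest, 16) % WIDTH

table = [[0] * WIDTH for _ in range(DEPTH)]
exact = Counter()  # the "slow path": exact counts, no information loss

events = ["A"] * 11 + ["B"] * 9 + ["C"] * 12 + list("DEFGHIJ") * 3
for key in events:
    exact[key] += 1
    for row in range(DEPTH):
        table[row][cms_index(key, row)] += 1

def cms_estimate(key):
    return min(table[row][cms_index(key, row)] for row in range(DEPTH))

for key in "ABC":
    print(key, "exact:", exact[key], "cms:", cms_estimate(key))
# With such a small table, some estimates can come out too high;
# that is the information loss. The exact counts never drift.
```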
Sorry for the delayed response.
cc @priyam.
Hey Ravi, I like the way you are converting Mikhail Smarshchok's YouTube videos into blog posts. You should put an appreciation note mentioning his efforts somewhere in the blog.
Hi Puneet. That is a good point. I think I do put a reference to the channel in the introduction of each blog. I might have missed some blogs.