COMPSCI 753 - Lecture Notes - Stream-Sampling
Motivation
Since we cannot store alllll the stream, we store a uniform sample of the stream to answer queries
Goal
- Choose small number to keep in memory out of large number of elements
- Sample single elements from the stream uniformly at random
- Same probability of the sample to be chosen
- Draw sample over the stream, not the universe
Two sampling problems
- P1: Sample a fixed proportion of elements in the stream AND the sample size is m/10 given the stream of m elements
- P2: Maintain a random sample of fixed size over an infinite stream && The sample size is fixed. s for at any time k
Fixed proportion
Scenario
- Search engine receives a stream of tuples (user, query, time)
- Question: How many users run the same query in a single day?
- Memory: We have space to store 1/10 of the stream
Naive solution:
- Generate a random integer in (0…9) for each query
- Store the query if the integer is 0; otherwise, discard.
Problem for naive solution:
- If each user send x queries once and d queries twice (x+2d)
- Question becomes what the fraction of queries by an average user are duplicated
- Because we need to categorizes the sample, the sample size becomes a problem
Answer would be: d/(x+d) (True solution)
But we cannot approximately answer the question
How to come up with approximate solution
Out of x query (singleton), we query 10% of them
- P = x/10
For 2d duplicate query. Case 1: Both {qi’, qi’} sampled. Case 2: First of {qi’, qi’} sampled. Case 3: Second of {qi’, qi’} sampled.
- p = d/100 + 9d/100 + 9d/100
Overall:
- (d/100) / (x/10 + d/100 + 18d/100) = d / 10x+19d
Above is the approximate solution according to the sampling scheme
Solution: Sample Users
We use the hash function and hash the users (not queries) into 10 buckets uniformly
Then pick user ID in the first bucket.
Generalized solution: Fixed proportion
Stream tuples with keys
- Key is some subset of each tuples components
- e.g. tuple is (user, serach, time) -> key is user
- Choice of key depends on applications
Task: Get a sample of a/b fraction of the stream (e.g. 30%)
- Hash each tuple’s key uniformly into b buckets (e.g. 10 buckets)
- Pick the tuple if its hash value is in the first a bucket (e.g., the first 3 buckets)
Fixed size
All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stating additionally.