Motivation

Since we cannot store alllll the stream, we store a uniform sample of the stream to answer queries

Goal

  • Choose small number to keep in memory out of large number of elements
  • Sample single elements from the stream uniformly at random
  • Same probability of the sample to be chosen
  • Draw sample over the stream, not the universe

Two sampling problems

  • P1: Sample a fixed proportion of elements in the stream AND the sample size is m/10 given the stream of m elements
  • P2: Maintain a random sample of fixed size over an infinite stream && The sample size is fixed. s for at any time k

Fixed proportion

Scenario

  • Search engine receives a stream of tuples (user, query, time)
  • Question: How many users run the same query in a single day?
  • Memory: We have space to store 1/10 of the stream

Naive solution:

  • Generate a random integer in (0…9) for each query
  • Store the query if the integer is 0; otherwise, discard.

Problem for naive solution:

  • If each user send x queries once and d queries twice (x+2d)
  • Question becomes what the fraction of queries by an average user are duplicated
  • Because we need to categorizes the sample, the sample size becomes a problem

Answer would be: d/(x+d) (True solution)

But we cannot approximately answer the question

How to come up with approximate solution
  • Out of x query (singleton), we query 10% of them

    • P = x/10
  • For 2d duplicate query. Case 1: Both {qi’, qi’} sampled. Case 2: First of {qi’, qi’} sampled. Case 3: Second of {qi’, qi’} sampled.

    • p = d/100 + 9d/100 + 9d/100
  • Overall:

    • (d/100) / (x/10 + d/100 + 18d/100) = d / 10x+19d

Above is the approximate solution according to the sampling scheme

Solution: Sample Users

We use the hash function and hash the users (not queries) into 10 buckets uniformly
Then pick user ID in the first bucket.

Generalized solution: Fixed proportion

  • Stream tuples with keys

    • Key is some subset of each tuples components
    • e.g. tuple is (user, serach, time) -> key is user
    • Choice of key depends on applications
  • Task: Get a sample of a/b fraction of the stream (e.g. 30%)

    • Hash each tuple’s key uniformly into b buckets (e.g. 10 buckets)
    • Pick the tuple if its hash value is in the first a bucket (e.g., the first 3 buckets)

Fixed size