A logic test from a job selection process:
Web search engines A and B each crawl a random subset of the same size of the Web. Some of the pages crawled are duplicates – exact textual copies of each other at different URLs. Assume that duplicates are distributed uniformly amongst the pages crawled by A and B. Further, assume that a duplicate is a page that has exactly two copies – no pages have more than two copies. A indexes pages without duplicate elimination whereas B indexes only one copy of each duplicate page. The two random subsets have the same size before duplicate elimination. If, 45% of A’s indexed URLs are present in B’s index, while 50% of B’s indexed URLs are present in A’s index, what fraction of the Web consists of pages that do not have a duplicate?
Solution
Be:
- C_X the number of elements of X
- X = { 1, 2, 3, 2 } -> C_X = 4
- D_X the number of elements of X with a duplicate in X:
- X = { 1, 2, 3, 2 } -> D_X = 1
- N_X the number of elements of X without a duplicate in X
- X = { 1, 2, 3, 2 } -> N_X = 2
- XY the union of X and Y
-
X1 the set X before eliminating duplicates
-
X2 the set X after eliminating duplicates
Data:
- Web search engines A and B each crawl a random subset of the same size of the Web
- C_A1 + C_B1 = C_W
- The two random subsets have the same size before duplicate elimination
- C_A1 = C_B1
- Assume that duplicates are distributed uniformly amongst the pages crawled by A and B
- D_A1 = D_B1
- Further, assume that a duplicate is a page that has exactly two copies – no pages have more than two copies
- C_A1 = 2 * D_A1 + D_AB + N_A1
- C_B1 = 2 * D_B1 + D_AB + N_B1
- A indexes pages without duplicate elimination whereas B indexes only one copy of each duplicate page
- C_A2 = C_A1
- C_B2 = C_B1 – D_B1
- 45% of A’s indexed URLs are present in B’s index, while 50% of B’s indexed URLs are present in A’s index
- D_AB (in B2) = 45/100 * C_A2
- D_AB (in A2) = 50/100 * C_B2
- N_AB / C_W = ?
Example (without taking into account Data:6):
- A1 = 1 1 3 3 5 5 7 9 11 13 15
-
B1 = 2 2 4 4 6 6 7 9 10 12 14
-
A2 = A1
-
B2 = 2 4 6 7 9 10 12 14
Solution:
-
N_AB = N_A1 + N_B1 — by definition.
-
N_A1 = C_A1 – 2 * D_A1 – D_AB — because of Data:4.
-
N_B1 = C_B1 – 2 * D_B1 – D_AB — because of Data:4.
-
N_B1 = C_A1 – 2 * D_A1 – D_AB — because of Solution:3 + Data:2 + Data:3
-
N_AB = 2 * C_A1 – 4 * D_A1 – 2 * D_AB — because of Solution:1 + Solution:2 + Solution:4
-
C_B2 = C_A1 – D_A1 — because of Data:5 + Data:2 + Data:3
-
45/100 * C_A2 = 50/100 * C_B2 — because of Data:6
-
45/100 * C_A1 = 50/100 * ( C_B1 – D_B1 ) — because of Solution:7 + Data:5
-
45/100 * C_A1 = 50/100 * ( C_A1 – D_A1 ) — because of Data:2 + Data:3
-
D_A1 = 10/100 * C_A1 — because of Solution:9
-
N_AB = 200/100 * C_A1 – 4 * 10/100 * C_A1 – 2 * 45/100 * C_A1 — because of Solution:5 + Solution:10 + Data:6 + Data:5
-
N_AB = 70/100 * C_A1 — because of Solution:11
-
N_AB = 70/100 * 50/100 * C_W — because of Solution:12 + Data:1 + Data:2
-
N_AB / C_W === 35% — because of Solution:13