A Near-Optimal Parallel Algorithm for Joining Binary Relations

We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/\rho})$ where $m$ is the total size of the input relations, $p$ is the number of machines, $\rho$ is the join's fractional edge covering number, and $\tilde{O}(.)$ hides a polylogarithmic factor. The load matches a known lower bound up to a polylogarithmic factor. At the core of the proposed algorithm is a new theorem (which we name the "isolated cartesian product theorem") that provides fresh insight into the problem's mathematical structure. Our result implies that the subgraph enumeration problem, where the goal is to report all the occurrences of a constant-sized subgraph pattern, can be settled optimally (up to a polylogarithmic factor) in the MPC model.


Introduction
Understanding the hardness of joins has been a central topic in database theory. Traditional efforts have focused on discovering fast algorithms for processing joins in the random access machine (RAM) model (see [1, 5, 16-18, 21, 22] and the references therein). Nowadays, massively parallel systems such as Hadoop [8] and Spark [2] have become the mainstream architecture for analytical tasks on gigantic volumes of data. Direct adaptations of RAM algorithms, which are designed to reduce CPU time, rarely give satisfactory performance on that architecture. In systems like Hadoop and Spark, it is crucial to minimize communication across the participating machines, because the overhead of message exchanging usually overwhelms the CPU calculation cost. This has motivated a line of research, which includes this work, aiming to understand the communication complexity of join evaluation in parallel systems.

1.1. Problem Definition. A relation is a set R of tuples over the same set U of attributes. We say that the scheme of R is U, and write this fact as scheme(R) = U. R is unary or binary if |scheme(R)| = 1 or 2, respectively. A value x ∈ dom appears in R if there exist a tuple u ∈ R and an attribute X ∈ U such that u(X) = x; we will also use the expression that x is "a value on the attribute X in R".
A join query (sometimes abbreviated as a "join" or a "query") is a set Q of relations. Define attset(Q) = ⋃_{R∈Q} scheme(R). The result of the query, denoted as Join(Q), is the following relation over attset(Q):

Join(Q) = { tuple u over attset(Q) | ∀R ∈ Q, u[scheme(R)] ∈ R }.
Q is • simple if no distinct R, S ∈ Q satisfy scheme(R) = scheme(S); • binary if every R ∈ Q is binary. Our objective is to design algorithms for answering simple binary queries.
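To make the semantics above concrete, here is a small brute-force sketch in Python (our own illustration, not an algorithm from this paper). The encoding of relations as lists of attribute-to-value dicts is an assumption made for the example.

```python
from itertools import product

def join(query):
    """Brute-force Join(Q): all tuples u over attset(Q) such that
    u[scheme(R)] appears in R for every relation R in the query.
    A relation is a non-empty list of dicts mapping attribute -> value."""
    attset = sorted({X for R in query for X in R[0]})
    # Candidate values for each attribute: those appearing in some relation.
    dom = {X: sorted({u[X] for R in query for u in R if X in u}) for X in attset}
    out = []
    for combo in product(*(dom[X] for X in attset)):
        u = dict(zip(attset, combo))
        # u projected onto scheme(R) must match some tuple of R, for every R.
        if all(any(all(u[X] == v[X] for X in v) for v in R) for R in query):
            out.append(u)
    return out
```

For instance, the triangle query over schemes {A, B}, {B, C}, {A, C} produces exactly the tuples consistent with all three relations.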
The integer

m = Σ_{R∈Q} |R|    (1.1)

is the input size of Q. We study the evaluation of Q in the massively parallel computation (MPC) model, where the input relations are distributed across p machines at the beginning, and an algorithm runs in rounds, in each of which the machines exchange messages. The load of a round is the largest number of words received by a machine in this round; that is, if machine i ∈ [1, p] receives x_i words, the load is max_{i=1}^{p} x_i. The performance of an algorithm is measured by two metrics: (i) the number of rounds, and (ii) the load of the algorithm, defined as the total load of all rounds. CPU computation is free. We will be interested only in algorithms finishing in a constant number of rounds. The load of such an algorithm is asymptotically the same as the maximum load of its individual rounds.
The number p of machines is assumed to be significantly less than m, which in this paper means p^3 ≤ m. For a randomized algorithm, when we say that its load is at most L, we mean that its load is bounded by L with probability at least 1 − 1/p^c, where c can be set to an arbitrarily large constant. The notation Õ(.) hides a factor that is polylogarithmic in m and p.

1.2. Previous Results.
Early work on join processing in the MPC model aimed to design algorithms performing only one round. Afrati and Ullman [3] explained how to answer a query Q with load O(m/p^{1/|Q|}). Later, by refining their prior work in [6], Koutris, Beame, and Suciu [13] described an algorithm that can guarantee a load of Õ(m/p^{1/ψ}), where ψ is the query's fractional edge quasi-packing number. To follow our discussion in Section 1, the reader does not need the formal definition of ψ (which will be given in Section 2); it suffices to understand that ψ is a positive constant which can vary significantly depending on Q. In [13], the authors also proved that any one-round algorithm must incur a load of Ω(m/p^{1/ψ}), under certain assumptions on the statistics available to the algorithm.
Departing from the one-round restriction, subsequent research has focused on algorithms performing multiple, albeit still a constant number of, rounds. The community already knows [13] that any constant-round algorithm must incur a load of Ω(m/p^{1/ρ}) answering a query, where ρ is the query's fractional edge covering number. As far as Section 1 is concerned, the reader does not need to worry about the definition of ρ (which will appear in Section 2); it suffices to remember two facts: • Like ψ, ρ is a positive constant which can vary significantly depending on the query Q. • On the same Q, ρ never exceeds ψ, but can be much smaller than ψ (more details in Section 2). The second bullet indicates that m/p^{1/ρ} can be far less than m/p^{1/ψ}, suggesting that we may hope to significantly reduce the load by going beyond only one round. Matching the lower bound Ω(m/p^{1/ρ}) with a concrete algorithm has been shown possible for several special query classes, including star joins [3], cycle joins [13], clique joins [13], line joins [3, 13], Loomis-Whitney joins [13], etc. The simple binary join defined in Section 1.1 captures cycle, clique, and line joins as special cases. Guaranteeing a load of O(m/p^{1/ρ}) for arbitrary simple binary queries is still open.
1.3. Our Contributions. The paper's main algorithmic contribution is to settle any simple binary join Q under the MPC model with load Õ(m/p^{1/ρ}) in a constant number of rounds (Theorem 6.2). The load is optimal up to a polylogarithmic factor. Our algorithm owes its efficiency to a new understanding of the problem's mathematical structure, which we outline next using a running example whose attributes are A, ..., L.

Set λ = Θ(p^{1/(2ρ)}), where ρ is the fractional edge covering number of Q (Section 2). A value x ∈ dom is heavy if at least m/λ tuples in an input relation R ∈ Q carry x on the same attribute. The number of heavy values is O(λ). A value x ∈ dom is light if x appears in at least one relation R ∈ Q but is not heavy. A tuple in the join result may take a heavy or light value on each of the 12 attributes A, ..., L. As there are O(λ) choices on each attribute (i.e., either a light value or one of the O(λ) heavy values), there are t = O(λ^{12}) "choice combinations" over all the attributes; we will refer to each combination as a configuration. Our plan is to partition the set of p servers into t subsets of sizes p_1, p_2, ..., p_t with Σ_{i=1}^{t} p_i = p, and then dedicate p_i servers (1 ≤ i ≤ t) to computing the result tuples of the i-th configuration. This can be done in parallel for all O(λ^{12}) configurations. The challenge is to compute the query on each configuration with a load O(m/p^{1/ρ}), given that only p_i (which can be far less than p) servers are available for that subtask.

(1) Simplify the residual query. Since the black attributes have had their values fixed in the configuration, they can be deleted from the residual query, after which some relations in Q become unary or even disappear. The relation R_{A,D}, for example, can be regarded as a unary relation over {A} where every tuple is "piggybacked" the value d on D. Let us denote this unary relation as R_{A}|d, which is illustrated in Figure 1c.

(2) Compute a cartesian product. The residual query can now be further simplified into a join query which includes (i) the relation R_{X} for every isolated attribute X, and (ii) the relation R_{X,Y} for every solid edge in Figure 1c.
As mentioned earlier, we plan to use only a small subset of the p servers to compute each simplified query. It turns out that the load of our strategy depends heavily on the cartesian product of the unary relations R_{X} (one for every isolated attribute X, i.e., R_{G}, R_{H}, and R_{L} in our example) in a configuration. Ideally, if the cartesian product of every configuration is small, we can prove a load of Õ(m/p^{1/ρ}) easily. Unfortunately, this is not true: in the worst case, the cartesian products of various configurations can differ dramatically.
Our isolated cartesian product theorem (Theorem 5.1) shows that the cartesian product size is small when averaged over all the possible configurations. This property allows us to allocate a different number of machines to process each configuration in parallel while ensuring that the total number of machines required will not exceed p. The theorem is of independent interest and may be useful for developing join algorithms under other computation models (e.g., the external memory model [4]; see Section 7).
1.4. An Application: Subgraph Enumeration. The joins studied in this paper bear close relevance to the subgraph enumeration problem, where the goal is to find all occurrences of a pattern subgraph G′ = (V′, E′) in a graph G = (V, E). This problem is NP-hard [7] if the size of G′ is unconstrained, but is polynomial-time solvable when G′ has only a constant number of vertices. In the MPC model, the edges of G are evenly distributed onto the p machines at the beginning, whereas an algorithm must produce every occurrence on at least one machine in the end. The following facts are folklore regarding a constant-size G′: • Every constant-round subgraph enumeration algorithm must incur a load of Ω(|E|/p^{1/ρ}), where ρ is the fractional edge covering number (Section 2) of G′. • The subgraph enumeration problem can be converted to a simple binary join with input size O(|E|) and the same fractional edge covering number ρ. Given a constant-size G′, our join algorithm (Theorem 6.2) solves subgraph enumeration with load Õ(|E|/p^{1/ρ}), which is optimal up to a polylogarithmic factor.

Table 1: Frequently used notations
W(e): weight of an edge e ∈ E (Sec 2)
ρ (or τ): fractional edge covering (or packing) number of G (Sec 2)
R_X(η): relation on attribute X after semi-join reduction (Sec 5.2)
R_e(η): relation on e ∈ E after semi-join reduction (Sec 5.2)
Q_isolated(η): query on the isolated attributes after semi-join reduction (5.5)
Q_light(η): query on the light edges after semi-join reduction (5.6)
Q′(η): reduced query under η (5.7)
W_I: total weight of all the vertices in I under fractional edge packing W (5.10)
J: non-empty subset of I (Sec 5.4)
Q_J(η): query on the isolated attributes in J after semi-join reduction (5.14)
W_J: total weight of all the vertices in J under fractional edge packing W (5.15)
1.5. Remarks. This paper is an extension of [12] and [20]. Ketsman and Suciu [12] were the first to discover a constant-round algorithm to solve simple binary joins with an asymptotically optimal load. Tao [20] introduced a preliminary version of the isolated cartesian product theorem and applied it to simplify the algorithm of [12]. The current work features a more powerful version of the isolated cartesian product theorem (see the remark in Section 5.5). Table 1 lists the symbols that will be frequently used.

Hypergraphs and the AGM Bound
We define a hypergraph G as a pair (V, E), where V is a finite set of vertices and E is a set of non-empty subsets of V, called the edges. An edge e is unary or binary if |e| = 1 or 2, respectively. G is binary if all its edges are binary.
Given a vertex X ∈ V and an edge e ∈ E, we say that X and e are incident to each other if X ∈ e. Two distinct vertices X, Y ∈ V are adjacent if there is an e ∈ E containing X and Y . All hypergraphs discussed in this paper have the property that every vertex is incident to at least one edge.
Fractional Edge Coverings and Packings. Let G = (V, E) be a hypergraph and W be a function mapping E to real values in [0, 1]. We call W(e) the weight of edge e and Σ_{e∈E} W(e) the total weight of W. Given a vertex X ∈ V, we refer to Σ_{e∈E: X∈e} W(e) (i.e., the sum of the weights of all the edges incident to X) as the weight of X.
W is a fractional edge covering of G if the weight of every vertex X ∈ V is at least 1. The fractional edge covering number of G, denoted as ρ(G), equals the smallest total weight of all the fractional edge coverings. W is a fractional edge packing if the weight of every vertex X ∈ V is at most 1. The fractional edge packing number of G, denoted as τ(G), equals the largest total weight of all the fractional edge packings. A fractional edge packing W is tight if it is simultaneously also a fractional edge covering; likewise, a fractional edge covering W is tight if it is simultaneously also a fractional edge packing. Note that in a tight fractional edge covering/packing, the weight of every vertex must be exactly 1.
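As a sanity check on these definitions, the following Python snippet (our own illustration) verifies whether a weight function is a fractional edge covering and/or packing. The triangle example reflects the well-known fact that a 3-cycle has ρ = τ = 3/2, achieved by the tight assignment of 1/2 per edge.

```python
def vertex_weights(edges, W):
    """Weight of each vertex: sum of W(e) over the edges e incident to it."""
    wt = {}
    for e, w in zip(edges, W):
        for X in e:
            wt[X] = wt.get(X, 0.0) + w
    return wt

def is_covering(edges, W):
    """Fractional edge covering: every vertex has weight >= 1."""
    return all(w >= 1 for w in vertex_weights(edges, W).values())

def is_packing(edges, W):
    """Fractional edge packing: every vertex has weight <= 1."""
    return all(w <= 1 for w in vertex_weights(edges, W).values())

# A 3-cycle with weight 1/2 on every edge is tight: simultaneously a
# covering and a packing, with total weight 3/2 = rho = tau.
triangle = [('A', 'B'), ('B', 'C'), ('C', 'A')]
W = [0.5, 0.5, 0.5]
```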
Binary hypergraphs have several interesting properties:

Lemma 2.1. Let G = (V, E) be a binary hypergraph. Then:
• ρ(G) + τ(G) = |V| and ρ(G) ≥ τ(G), where the equality holds if and only if G admits a tight fractional edge packing (a.k.a. tight fractional edge covering).
• G admits a fractional edge packing W of total weight τ(G) such that (1) the weight of every vertex X ∈ V is either 0 or 1, and (2) if Z is the set of vertices whose weights under W are 0, then τ(G) = (|V| − |Z|)/2 and ρ(G) = (|V| + |Z|)/2.

Proof. The first bullet is proved in Theorem 2.2.7 of [19]. The fractional edge packing W in Theorem 2.1.5 of [19] satisfies Property (1). Let Z be the set of vertices whose weights under W are 0; the total weight of W is then (|V| − |Z|)/2 = τ(G). Plugging this into ρ(G) + τ(G) = |V| yields ρ(G) = (|V| + |Z|)/2. Hence, Property (2) follows.
Example. Suppose that G is the binary hypergraph in Figure 1a. It has a fractional edge covering number ρ(G) = 6.5, achieved by an appropriate weight function.

Hypergraph of a Join Query and the AGM Bound. Every join Q defines a hypergraph G = (V, E) where V = attset(Q) and E = {scheme(R) | R ∈ Q}. When Q is simple, for each edge e ∈ E we denote by R_e the input relation R ∈ Q with e = scheme(R). The following result is known as the AGM bound:

Lemma 2.2 [5]. Let Q be a simple binary join and W be any fractional edge covering of the hypergraph G = (V, E) defined by Q. Then |Join(Q)| ≤ Π_{e∈E} |R_e|^{W(e)}.

The fractional edge covering number of Q equals ρ(G) and, similarly, the fractional edge packing number of Q equals τ(G).
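A quick numeric illustration of the AGM bound (our own sketch): for the triangle query with schemes {A, B}, {B, C}, {C, A} and the covering that puts weight 1/2 on every edge, the bound is the square root of the product of the three relation sizes.

```python
def agm_bound(sizes, covering):
    """Right-hand side of the AGM bound: product of |R_e|^{W(e)}."""
    bound = 1.0
    for n, w in zip(sizes, covering):
        bound *= n ** w
    return bound

def triangle_join_size(R_ab, R_bc, R_ca):
    """Result size of the triangle join, by brute force over R_ab and R_bc."""
    return sum(1 for (a, b) in R_ab for (b2, c) in R_bc
               if b2 == b and (c, a) in R_ca)

# A tiny instance: two result tuples, matching the AGM bound sqrt(2*2*1) = 2.
R_ab = {(1, 2), (1, 3)}
R_bc = {(2, 4), (3, 4)}
R_ca = {(4, 1)}
```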
Remark on the Fractional Edge Quasi-Packing Number. Although the technical development in the subsequent sections does not rely on the "fractional edge quasi-packing number", we provide a full definition of the concept here because it enables the reader to better distinguish our solution from the one-round algorithm of [13] (reviewed in Section 1.2). Consider a hypergraph G = (V, E). For each subset U ⊆ V, let G \ U be the hypergraph obtained by removing U from all the edges of E; formally, G \ U = (V \ U, E \ U) where E \ U = {e \ U | e ∈ E, e \ U ≠ ∅}. The fractional edge quasi-packing number of G is ψ(G) = max_{U⊆V} τ(G \ U), where τ(G \ U) is the fractional edge packing number of G \ U.
If G is the hypergraph defined by a query Q, ψ(G) is said to be the query's fractional edge quasi-packing number. It is evident from the above discussion that, when G is a clique or a cycle, the load Õ(m/p^{1/ρ(G)}) of our algorithm improves the load Õ(m/p^{1/ψ(G)}) of [13] by a polynomial factor.

Fundamental MPC Algorithms
This section discusses several building-block routines in the MPC model that will be useful later. Cartesian Products. Suppose that R and S are relations with disjoint schemes. Their cartesian product, denoted as R × S, is a relation over scheme(R) ∪ scheme(S) that consists of all the tuples u over scheme(R) ∪ scheme(S) such that u[scheme(R)] ∈ R and u[scheme(S)] ∈ S.
The lemma below gives a deterministic algorithm for computing the cartesian product:

Lemma 3.1. Let Q = {R_1, R_2, ..., R_t} be a set of t ≥ 2 relations with mutually disjoint schemes, where the tuples of each R_i (i ∈ [1, t]) have been labeled with ids 1, 2, ..., |R_i|, respectively. We can deterministically compute Join(Q) = R_1 × R_2 × ... × R_t in one round using p machines with load

O( max_{non-empty S ⊆ [1,t]} ( Π_{i∈S} |R_i| / p )^{1/|S|} ).

Alternatively, if we assume |R_1| ≥ |R_2| ≥ ... ≥ |R_t|, the load can be written as

O( max_{s∈[1,t]} ( Π_{i=1}^{s} |R_i| / p )^{1/s} ).    (3.2)

Proof. For each s ∈ [1, t], define Q_s = {R_1, ..., R_s} and L_s = (Π_{i=1}^{s} |R_i| / p)^{1/s}. Let t′ be the largest s ∈ [1, t] such that |R_s| ≥ L_s; such an s exists because |R_1| ≥ L_1 = |R_1|/p.
Next, we will explain how to obtain Join(Q_{t′}) with load O(L_{t′}). If t′ < t, this implies that Join(Q) can be obtained with load O(L_{t′} + L_{t′+1}), because R_{t′+1}, ..., R_t can be broadcast to all the machines with an extra load O(L_{t′+1} · (t − t′)) = O(L_{t′+1}).
Align the machines into a t′-dimensional p_1 × p_2 × ... × p_{t′} grid where Π_{i=1}^{t′} p_i = p. Each machine can be uniquely identified as a t′-dimensional point (x_1, ..., x_{t′}) in the grid, where x_i ∈ [1, p_i] for each i ∈ [1, t′]. For each R_i, we send its tuple with id j ∈ [1, |R_i|] to all the machines whose coordinates on dimension i equal (j mod p_i) + 1. Hence, a machine receives O(|R_i|/p_i) = O(L_{t′}) tuples from R_i, and the overall load is O(L_{t′} · t′) = O(L_{t′}). For each combination of u_1, u_2, ..., u_{t′} where u_i ∈ R_i, some machine has received all of u_1, ..., u_{t′}. Therefore, the algorithm is able to produce the entire Join(Q_{t′}).
The load in (3.2) matches a lower bound stated in Section 4.1.5 of [14]. The algorithm in the above proof generalizes an algorithm in [10] for computing the cartesian product of t = 2 relations. The randomized hypercube algorithm of [6] incurs a load higher than (3.2) by a logarithmic factor and can fail with a small probability.
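The id-based routing in the proof above can be simulated directly. The sketch below (our own, with illustrative relation sizes) checks that every combination of tuple ids meets on some machine and that each machine receives |R_i|/p_i tuples from each relation.

```python
from itertools import product

def grid_assign(sizes, grid):
    """Route tuple ids of each R_i over a multi-dimensional machine grid:
    the tuple with id j of R_i goes to every machine whose coordinate
    on dimension i equals (j mod p_i) + 1, as in the proof of Lemma 3.1."""
    t = len(sizes)
    machines = {c: [set() for _ in range(t)]
                for c in product(*(range(1, p + 1) for p in grid))}
    for i, (n, p_i) in enumerate(zip(sizes, grid)):
        for j in range(1, n + 1):
            target = (j % p_i) + 1
            for c, parts in machines.items():
                if c[i] == target:
                    parts[i].add(j)
    return machines
```

With sizes (4, 6) and a 2 × 3 grid (p = 6), every machine stores 2 ids per relation, and every pair of ids is covered by some machine.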
Composition by Cartesian Product. If we already know how to solve queries Q_1 and Q_2 separately, we can compute the cartesian product of their results efficiently:

Lemma 3.2. Suppose that attset(Q_1) ∩ attset(Q_2) = ∅ and that the input sizes of Q_1 and Q_2 are at most m. Further suppose that, for some constants t_1 and t_2:
• with probability at least 1 − δ_1, we can compute in one round Join(Q_1) with load Õ(m/p_1^{1/t_1}) using p_1 machines;
• with probability at least 1 − δ_2, we can compute in one round Join(Q_2) with load Õ(m/p_2^{1/t_2}) using p_2 machines.
Then, with probability at least 1 − δ_1 − δ_2, we can compute Join(Q_1) × Join(Q_2) in one round with load Õ(max{m/p_1^{1/t_1}, m/p_2^{1/t_2}}) using p_1 p_2 machines.

Proof. Let A_1 and A_2 be the algorithms for Q_1 and Q_2, respectively. If a tuple u ∈ Join(Q_1) is produced by A_1 on the i-th (i ∈ [1, p_1]) machine, we call u an i-tuple. Similarly, if a tuple v ∈ Join(Q_2) is produced by A_2 on the j-th (j ∈ [1, p_2]) machine, we call v a j-tuple.
Arrange the p 1 p 2 machines into a matrix where each row has p 1 machines and each column has p 2 machines (note that the number of rows is p 2 while the number of columns is p 1 ). For each row, we run A 1 using the p 1 machines on that row to compute Join(Q 1 ); this creates p 2 instances of A 1 (one per row). If A 1 is randomized, we instruct all those instances to take the same random choices. 2 This ensures: • with probability at least 1 − δ 1 , all the instances succeed simultaneously; • for each i ∈ [1, p 1 ], all the machines at the i-th column produce exactly the same set of i-tuples.
The load incurred is Õ(m/p_1^{1/t_1}). Likewise, for each column, we run A_2 using the p_2 machines on that column to compute Join(Q_2). With probability at least 1 − δ_2, for each j ∈ [1, p_2], all the machines at the j-th row produce exactly the same set of j-tuples. The load is Õ(m/p_2^{1/t_2}). Therefore, it holds with probability at least 1 − δ_1 − δ_2 that, for each pair (i, j), some machine has produced all the i- and j-tuples. Hence, every tuple of Join(Q_1) × Join(Q_2) appears on a machine. The overall load is the larger between Õ(m/p_1^{1/t_1}) and Õ(m/p_2^{1/t_2}).

Skew-Free Queries. It is possible to solve a join query Q on binary relations in a single round with a small load if no value appears too often. To explain, denote by m the input size of Q; set k = |attset(Q)|, and list out the attributes in attset(Q) as X_1, ..., X_k. For each i ∈ [1, k], let p_i be a positive integer referred to as the share of X_i. A relation R ∈ Q with scheme {X_i, X_j} is skew-free if every value x ∈ dom fulfills both conditions below:
• |{u ∈ R | u(X_i) = x}| ≤ m/p_i;
• |{u ∈ R | u(X_j) = x}| ≤ m/p_j.
Define share(R) = p_i · p_j. If every R ∈ Q is skew-free, Q is skew-free. We know:

2 The random choices of an algorithm can be modeled as a sequence of random bits. Once the sequence is fixed, a randomized algorithm becomes deterministic. An easy way to "instruct" all instances of A_1 to make the same random choices is to ask all the participating machines to pre-agree on the random-bit sequence. For example, one machine can generate all the random bits and send them to the other machines. Such communication happens before receiving Q and hence does not contribute to the query's load. The above approach works for a single Q (which suffices for proving Lemma 3.2). There is a standard technique [15] to extend the approach to work for any number of queries. The main idea is to have the machines pre-agree on a sufficiently large number of random-bit sequences.
Given a query, a machine randomly picks a specific random-bit sequence and broadcasts the sequence's id (note: only the id, not the sequence itself) to all machines. As shown in [15], such an id can be encoded in Õ(1) words.

Lemma 3.3 [6]. With probability at least 1 − 1/p^c, where p = Π_{i=1}^{k} p_i and c ≥ 1 can be set to an arbitrarily large constant, a skew-free query Q with input size m can be answered in one round with load Õ(m/ min_{R∈Q} share(R)) using p machines.
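The one-round algorithm behind Lemma 3.3 routes tuples over a grid with one dimension per attribute. Here is a minimal routing sketch (our own illustration; a real implementation would use carefully chosen hash functions rather than Python's built-in `hash`):

```python
from itertools import product

def hypercube_route(tuples, shares):
    """Route each tuple over a machine grid with one dimension per
    attribute. A tuple over scheme {Xi, Xj} fixes dimensions i and j by
    hashing its values, and is replicated along every other dimension.
    tuples: list of (scheme, dict) pairs; shares: dict attr -> share."""
    attrs = sorted(shares)
    machines = {c: [] for c in product(*(range(shares[X]) for X in attrs))}
    for scheme, u in tuples:
        fixed = {X: hash(u[X]) % shares[X] for X in scheme}
        for c in machines:
            if all(c[attrs.index(X)] == fixed[X] for X in scheme):
                machines[c].append((scheme, u))
    return machines
```

Two tuples that share the join value B = 5 agree on the B dimension, so some machine receives both and can output their join locally.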

A Taxonomy of the Join Result
Given a simple binary join Q, we will present a method to partition Join(Q) based on the value frequencies in the relations of Q. Denote by G = (V, E) the hypergraph defined by Q and by m the input size of Q.
Heavy and Light Values. Fix an arbitrary integer λ ∈ [1, m]. A value x ∈ dom is
• heavy if |{u ∈ R | u(X) = x}| ≥ m/λ for some relation R ∈ Q and some attribute X ∈ scheme(R);
• light if x is not heavy, but appears in at least one relation R ∈ Q.
It is easy to see that each attribute has at most λ heavy values. Hence, the total number of heavy values is at most λ · |attset(Q)| = O(λ). We will refer to λ as the heavy parameter.

Residual Relations/Queries. Let H be a subset of attset(Q) and η a configuration of H (formalized below). Consider an edge e ∈ E; define e′ = e \ H. We say that e is active on H if e′ ≠ ∅, i.e., e has at least one attribute outside H. An active e defines a residual relation under η, denoted as R_{e′}(η), which
• is over e′, and
• consists of every tuple v that is the projection (on e′) of some tuple w ∈ R_e "consistent" with η, namely:
  – w(X) = η(X) for every X ∈ e ∩ H;
  – w(Y) is light for every Y ∈ e′.

Configurations. Let H be an arbitrary (possibly empty) subset of attset(Q). A configuration of H is a tuple η over H such that η(X) is a heavy value for every X ∈ H. Denote by config(Q, H) the set of all configurations of H; since each attribute has at most λ heavy values, |config(Q, H)| ≤ λ^{|H|}.
The residual query under η is Q(η) = {R_{e′}(η) | e ∈ E, e active on H}. For each configuration η ∈ config(Q, H), denote by m_η the total size of all the relations in Q(η). We have the following bound, where k = |attset(Q)|:

Σ_{η∈config(Q,H)} m_η ≤ λ^{k−2} · m.

Proof. Let e be an edge in E and fix an arbitrary tuple u ∈ R_e. Tuple u contributes 1 to the term m_η only if η(X) = u(X) for every attribute X ∈ e ∩ H. How many such configurations η can there be? As these configurations must have the same value on every attribute in e ∩ H, they can differ only in the attributes of H \ e. Since each attribute has at most λ heavy values, we conclude that the number of those configurations η is at most λ^{|H\e|}. |H \ e| is at most k − 2 because |H| ≤ k and e has two attributes. The lemma thus follows.
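The heavy/light dichotomy and the λ^{|H|} bound on the configuration count are easy to mirror in code (an illustrative sketch of our own; the relation encoding is an assumption):

```python
from itertools import product

def heavy_values(query, lam):
    """Values carried by at least m/lam tuples on some attribute of some
    relation. query: dict mapping a scheme (tuple of attrs) to a list of
    value-tuples; m is the total input size."""
    m = sum(len(R) for R in query.values())
    heavy = set()
    for scheme, R in query.items():
        for pos in range(len(scheme)):
            counts = {}
            for u in R:
                counts[u[pos]] = counts.get(u[pos], 0) + 1
            heavy |= {x for x, c in counts.items() if c >= m / lam}
    return heavy

def configurations(heavy, H):
    """All tuples over the attribute subset H taking heavy values only;
    there are at most |heavy|^|H| of them."""
    return [dict(zip(H, combo))
            for combo in product(sorted(heavy), repeat=len(H))]
```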

A Join Computation Framework
Answering a simple binary join Q amounts to producing the right-hand side of (4.2). Due to symmetry, it suffices to explain how to do so for an arbitrary subset H ⊆ attset(Q), i.e., the computation of the result tuples under all configurations η ∈ config(Q, H). At a high level, our strategy (illustrated in Section 1.3) works as follows. Let G = (V, E) be the hypergraph defined by Q. We will remove the vertices in H from G, which disconnects G into connected components (CCs). We divide the CCs into two groups: (i) the set of CCs each involving at least 2 vertices, and (ii) the set of all other CCs, namely those containing only 1 vertex. We will process the CCs in Group 1 together using Lemma 3.3, process the CCs in Group 2 together using Lemma 3.1, and then compute the cartesian product between Groups 1 and 2 using Lemma 3.2.
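The removal-and-grouping step just described can be sketched as follows (our own illustration): delete the attributes of H from the hypergraph, compute the connected components of what remains, and split them by size.

```python
def split_components(edges, H):
    """Remove the attributes in H, find the connected components of the
    remaining vertices, and split them into Group 1 (>= 2 vertices) and
    Group 2 (singletons)."""
    adj = {}
    for e in edges:
        rest = [X for X in e if X not in H]
        for X in rest:
            adj.setdefault(X, set()).update(set(rest) - {X})
    seen, group1, group2 = set(), [], []
    for X in sorted(adj):
        if X in seen:
            continue
        stack, comp = [X], set()
        while stack:          # depth-first search over one component
            Y = stack.pop()
            if Y not in comp:
                comp.add(Y)
                stack.extend(adj[Y] - comp)
        seen |= comp
        (group1 if len(comp) >= 2 else group2).append(tuple(sorted(comp)))
    return group1, group2
```

On a path A-B-C-D-E, removing {C} leaves two Group-1 components, while removing {B, D} leaves three singletons in Group 2.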
Sections 5.1 and 5.2 will formalize the strategy into a processing framework. Sections 5.3 and 5.4 will then establish two important properties of this framework, which are the key to its efficient implementation in Section 6.

5.1. Removing the Configuration Attributes. Define L = attset(Q) \ H. Call an edge e ∈ E a cross edge if it contains one attribute in H and one in L, and a light edge if both its attributes belong to L. An attribute X ∈ L is a border attribute if it appears in at least one cross edge, and is isolated if it is not adjacent to any other attribute in L; denote by I the set of isolated attributes. Figure 2 shows the subgraph of G induced by L, where a unary edge is represented by a box and a binary edge by a segment. The isolated vertices are G, H, and L.

5.2. Semi-Join Reduction.
Recall from Section 4 that every configuration η of H defines a residual query Q(η). Next, we will simplify Q(η) into a join Q′(η) with the same result.
Observe that the hypergraph defined by Q(η) is always G′ = (L, E′), regardless of η. Consider a border attribute X ∈ L and a cross edge e of G = (V, E) incident to X. As explained in Section 4, the input relation R_e ∈ Q defines a unary residual relation R_{e′}(η) ∈ Q(η) with scheme e′ = e \ H = {X}. We define:

R_X(η) = ⋂_{cross edge e ∈ E s.t. X ∈ e} R_{e′}(η).    (5.4)

Recall that every light edge e = {X, Y} in G′ defines a residual relation R_e(η) with scheme e. We define R′_e(η) as the relation over e that contains every tuple u ∈ R_e(η) satisfying:
• u(X) ∈ R_X(η) if X is a border attribute;
• u(Y) ∈ R_Y(η) if Y is a border attribute.
Note that if neither X nor Y is a border attribute, then R′_e(η) = R_e(η). Every vertex X ∈ I must be a border attribute and, thus, must now be associated with R_X(η). We can legally define:

Q_isolated(η) = {R_X(η) | X ∈ I}    (5.5)
Q_light(η) = {R′_e(η) | light edge e ∈ E′}    (5.6)
Q′(η) = Q_isolated(η) ∪ Q_light(η)    (5.7)

where all the relation names follow those in Section 1.3.

5.3. The Isolated Cartesian Product Theorem. As shown in (5.5), Q_isolated(η) contains |I| unary relations, one for each isolated attribute in I. Hence, Join(Q_isolated(η)) is the cartesian product of all those relations. The size of Join(Q_isolated(η)) has a crucial impact on the efficiency of our join strategy because, as shown in Lemma 3.1, the load for computing a cartesian product depends on the cartesian product's size. To prove that our strategy is efficient, we want to argue that

Σ_{η∈config(Q,H)} |Join(Q_isolated(η))|    (5.9)

is low, namely, the cartesian products of all the configurations η ∈ config(Q, H) have a small size overall.
It is easy to place an upper bound of λ^{|H|} · m^{|I|} on (5.9). As each relation (trivially) has size at most m, we have |Join(Q_isolated(η))| ≤ m^{|I|}. Given that H has at most λ^{|H|} different configurations, (5.9) is at most λ^{|H|} · m^{|I|}. Unfortunately, this bound is not enough to establish the claimed performance of our MPC algorithm (to be presented in Section 6). For that purpose, we will need to prove a tighter upper bound on (5.9); this is where the isolated cartesian product theorem (described next) comes in.
Given an arbitrary fractional edge packing W of the hypergraph G, we define

W_I = Σ_{Y∈I} (weight of Y under W).    (5.10)

Recall that the weight of a vertex Y under W is the sum of W(e) over all the edges e ∈ E containing Y.

Theorem 5.1 (isolated cartesian product theorem). Let W be an arbitrary fractional edge packing of G. Then:

Σ_{η∈config(Q,H)} |Join(Q_isolated(η))| ≤ λ^{|H|−W_I} · m^{|I|}.    (5.11)

Theorem 5.1 is in the strongest form when W_I is maximized. Later, in Section 5.5, we will choose a specific W that yields a bound sufficient for us to prove the efficiency claim on our join algorithm.
Proof of Theorem 5.1. We will construct a set Q* of relations such that |Join(Q*)| is at least the left-hand side of (5.11). Then, we will show that the hypergraph of Q* has a fractional edge covering which (by the AGM bound; Lemma 2.2) implies an upper bound on |Join(Q*)| matching the right-hand side of (5.11).
Initially, set Q * to ∅. For every cross edge e ∈ E incident to a vertex in I, add to Q * a relation R * e = R e . For every X ∈ H, add a unary relation R * {X} to Q * which consists of all the heavy values on X; note that R * {X} has at most λ tuples. Finally, for every Y ∈ I, add a unary relation R * {Y } to Q * which contains all the heavy and light values on Y .
Define G * = (V * , E * ) as the hypergraph defined by Q * . Note that V * = I ∪ H, while E * consists of all the cross edges in G incident to a vertex in I, |H| unary edges {X} for every X ∈ H, and |I| unary edges {Y } for every Y ∈ I.
Example (cont.). Figure 3 shows the hypergraph of the Q * constructed. As before, a box and a segment represent a unary and a binary edge, respectively. Recall that H = {D, E, F, K} and I = {G, H, L}.
Take a tuple u from the left-hand side of (5.12), and set η = u[H]. Based on the definition of Q_isolated(η), it is easy to verify that u[e] ∈ R_e for every cross edge e ∈ E incident to a vertex in I; hence, u[e] ∈ R*_e. Furthermore, u(X) ∈ R*_{X} for every X ∈ H because u(X) = η(X) is a heavy value. Finally, obviously u(Y) ∈ R*_{Y} for every Y ∈ I. All these facts together ensure that u ∈ Join(Q*). It follows that the left-hand side of (5.11) is at most |Join(Q*)|. Next, we bound |Join(Q*)| from above by constructing a desired function W* from the fractional edge packing W in Theorem 5.1.
For every cross edge e ∈ E incident to a vertex in I, set W*(e) = W(e). Every edge in E incident to a vertex Y ∈ I must be a cross edge. Hence, Σ_{binary e∈E*: Y∈e} W*(e) is precisely the weight of Y under W.
Next, we will ensure that each attribute Y ∈ I has a weight of 1 under W*. Since W is a fractional edge packing of G, it must hold that Σ_{binary e∈E*: Y∈e} W(e) ≤ 1. This permits us to assign the following weight to the unary edge {Y}:

W*({Y}) = 1 − Σ_{binary e∈E*: Y∈e} W(e).

Finally, in a similar way, we make sure that each attribute X ∈ H has a weight of 1 under W* by assigning:

W*({X}) = 1 − Σ_{binary e∈E*: X∈e} W*(e).

This finishes the design of W*, which is now a tight fractional edge covering of G*. The AGM bound in Lemma 2.2 tells us that

|Join(Q*)| ≤ Π_{e∈E*} |R*_e|^{W*(e)} ≤ m^{W_I} · m^{|I|−W_I} · λ^{|H|−W_I} = λ^{|H|−W_I} · m^{|I|},

where the second inequality uses |R*_e| ≤ m for every binary edge and every unary edge {Y} with Y ∈ I, |R*_{X}| ≤ λ for every X ∈ H, and the fact that every binary edge in E* is incident to exactly one vertex of I and one vertex of H. This completes the proof of Theorem 5.1.

5.4. A Subset Extension of Theorem 5.1. Remember that Q_isolated(η) contains a relation R_X(η) (defined in (5.4)) for every attribute X ∈ I. Given a non-empty subset J ⊆ I, define

Q_J(η) = {R_X(η) | X ∈ J}.    (5.14)

Note that Join(Q_J(η)) is the cartesian product of the relations in Q_J(η).
Take an arbitrary fractional edge packing W of the hypergraph G. Define

W_J = Σ_{X∈J} (weight of X under W).    (5.15)

We now present a general version of the isolated cartesian product theorem: for any non-empty J ⊆ I,

Σ_{η∈config(Q,H)} |Join(Q_J(η))| ≤ λ^{|H|−W_J} · m^{|J|},    (5.16)

where λ is the heavy parameter (see Section 4), config(Q, H) is the set of configurations of H (Section 4), Q_J(η) is defined in (5.14), and W_J is defined in (5.15).
Proof. We will prove the theorem by reducing it to Theorem 5.1. Define J̄ = I \ J, and let Q̃ be the set of relations constructed as follows: discard from Q every relation whose scheme contains an attribute in J̄; then Q̃ consists of the relations remaining in Q.
Denote by G̃ = (Ṽ, Ẽ) the hypergraph defined by Q̃. Set H̃ = H ∩ attset(Q̃) and L̃ = attset(Q̃) \ H̃. J is precisely the set of isolated attributes decided by Q̃ and H̃.3 Define a function W̃ : Ẽ → [0, 1] by setting W̃(e) = W(e) for every e ∈ Ẽ; W̃ is a fractional edge packing of G̃. Because every edge e ∈ E containing an attribute in J is preserved in Ẽ,4 we have W̃_J = W_J. Applying Theorem 5.1 to Q̃ gives:

Σ_{η̃∈config(Q̃,H̃)} |Join(Q̃_isolated(η̃))| ≤ λ^{|H̃|−W̃_J} · m^{|J|} = λ^{|H̃|−W_J} · m^{|J|}.    (5.17)

Every configuration η ∈ config(Q, H) determines the configuration η̃ = η[H̃] ∈ config(Q̃, H̃), under which Join(Q_J(η)) = Join(Q̃_isolated(η̃)); moreover, at most λ^{|H|−|H̃|} configurations η share the same η̃. Hence, the left-hand side of (5.16) is at most λ^{|H|−|H̃|} times the left-hand side of (5.17), which, combined with (5.17), yields (5.16).

3 Let Ĩ be the set of isolated attributes after removing H̃ from G̃. We want to prove J = Ĩ. It is easy to show J ⊆ Ĩ. To prove Ĩ ⊆ J, suppose that there is an attribute X such that X ∈ Ĩ but X ∉ J. As X appears in G̃, we know X ∉ I. Hence, G must contain an edge {X, Y} with Y ∉ H. This means Y ∉ I, because of which the edge {X, Y} is disjoint from J̄ and thus must belong to G̃. But this contradicts the fact X ∈ Ĩ.

4 Suppose that there is an edge e = {X, Y} such that X ∈ J and yet e ∉ Ẽ. It means that Y ∈ J̄ ⊆ I. But then e is incident to two attributes in I, which is impossible.

Overall, the load of our algorithm is Õ(p^{1/ρ} + p^2 + m/p^{1/ρ}). This brings us to our second main result:

Theorem 6.2. Given a simple binary join query with input size m ≥ p^3 and a fractional edge covering number ρ, we can answer it in the MPC model using p machines in constant rounds with load Õ(m/p^{1/ρ}), subject to a failure probability of at most 1/p^c, where c can be set to an arbitrarily large constant.

Concluding Remarks
This paper has introduced an algorithm for computing a natural join over binary relations under the MPC model. Our algorithm performs a constant number of rounds and incurs a load of Õ(m/p^{1/ρ}), where m is the total size of the input relations, p is the number of machines, and ρ is the fractional edge covering number of the query. The load matches a known lower bound up to a polylogarithmic factor. Our techniques heavily rely on a new finding, which we refer to as the isolated cartesian product theorem, on the join problem's mathematical structure.
We conclude the paper with two remarks:
• The assumption p^3 ≤ m can be relaxed to p ≤ m^{1−ε} for an arbitrarily small constant ε > 0.
Recall that our algorithm incurs a load of Õ(p^{1/ρ} + p^2 + m/p^{1/ρ}), where the terms Õ(p^{1/ρ}) and Õ(p^2) are both due to the computation of statistics (in preprocessing and Step 2, respectively). In turn, these statistics are needed to allocate machines for subproblems. By using the machine-allocation techniques in [10], we can avoid most of the statistics communication and reduce the load to Õ(p + m/p^{1/ρ}).
• In the external memory (EM) model [4], we have a machine equipped with M words of internal memory and an unbounded disk that has been formatted into blocks of size B words. An I/O either reads a block of B words from the disk into memory, or overwrites a block on the disk with B words from memory. A join query Q is considered solved if every tuple u ∈ Join(Q) has been generated in memory at least once. The challenge is to design an algorithm that achieves this purpose with as few I/Os as possible. There exists a reduction [13] that can be used to convert an MPC algorithm into an EM counterpart. Applying the reduction to our algorithm gives an EM algorithm that solves Q with Õ(m^ρ / (B · M^{ρ−1})) I/Os, provided that M ≥ m^c for some positive constant c < 1 that depends on Q. The I/O complexity can be shown to be optimal up to a polylogarithmic factor using the lower-bound arguments in [11, 18]. We suspect that the constraint M ≥ m^c can be removed by adapting the isolated cartesian product theorem to the EM model.