Recherche d'une paire de vecteurs de bits sans chevauchement

17

Je vous donne une liste de n vecteurs bit de largeur k . Votre objectif est de renvoyer deux vecteurs de bits de la liste qui n'ont pas de 1 en commun, ou bien de signaler qu'aucune telle paire n'existe.

Par exemple, si je vous donne [00110,01100,11000] la seule solution est {00110,11000} . Alternativement, l'entrée [111,011,110,101] n'a pas de solution. Et toute liste qui contient le vecteur de bits tout à zéro 000...0 et un autre élément e a une solution triviale {e,000...0} .

Voici un exemple légèrement plus difficile, sans solution (chaque ligne est un vecteur de bits, les carrés noirs sont 1s et les carrés blancs sont 0s):

■ ■ ■ ■ □ □ □ □ □ □ □ □ □
■ □ □ □ ■ ■ ■ □ □ □ □ □ □ 
■ □ □ □ □ □ □ ■ ■ ■ □ □ □
■ □ □ □ □ □ □ □ □ □ ■ ■ ■
□ ■ □ □ □ ■ □ □ □ ■ ■ □ □
□ ■ □ □ ■ □ □ □ ■ □ □ □ ■
□ ■ □ □ □ □ ■ ■ □ □ □ ■ □ <-- All row pairs share a black square
□ □ ■ □ □ □ ■ □ ■ □ ■ □ □
□ □ ■ □ □ ■ □ ■ □ □ □ □ ■
□ □ ■ □ ■ □ □ □ □ ■ □ ■ □
□ □ □ ■ ■ □ □ ■ □ □ ■ □ □
□ □ □ ■ □ □ ■ □ □ ■ □ □ ■
□ □ □ ■ □ ■ □ □ ■ □ □ ■ □

Dans quelle mesure peut-on trouver deux vecteurs binaires non superposés ou s’il n’existe pas?

L'algorithme naïf, où vous comparez simplement chaque paire possible, est O(n2k) . Est-il possible de faire mieux?

Craig Gidney
la source
Une réduction possible: Vous avez un graphe G avec un sommet pour chaque vecteur et une arête entre deux sommets si les deux vecteurs correspondants ont un 1 en commun. Vous voulez savoir si le diamètre du graphique est 2 . Mais il semble difficile d'aller plus vite que O(n2k) .
François
@ FrançoisGodi Tout composant graphique connecté avec trois nœuds et une arête manquante a un diamètre d'au moins deux. Avec une représentation de liste d'adjacence, il faut du temps O(V) pour vérifier cela.
Craig Gidney
@Strilanc Bien sûr, s'il n'y a pas de solution, le graphique est complet (plus clair que diamètre = 1, vous avez raison), mais le calcul de la représentation de la liste d'adjacence peut être long.
François
est-il kplus petit que la largeur des mots de votre machine?
Raphael
1
@TomvanderZanden On dirait que cela violerait les invariants sur lesquels s'appuie probablement la structure de données. En particulier, cette égalité doit être transitive. J'ai déjà pensé à utiliser un trie et je ne vois pas comment éviter une explosion de facteur 2 à chaque fois que le masque de bits de la requête a un 0.
Craig Gidney

Réponses:

10

Warmup: bitvectors aléatoires

En guise d'échauffement, nous pouvons commencer avec le cas où chaque bitvector est choisi de manière uniforme au hasard. Il s'avère ensuite que le problème peut être résolu en O(n1.6min(k,lgn)) temps (plus précisément, le 1.6 peut être remplacé par lg3 ).

Nous considérerons la variante à deux ensembles suivante du problème:

Étant donné les ensembles S,T{0,1}k de vecteurs binaires, déterminer où il existe une paire non superposée sS,tT .

La technique de base pour résoudre ce problème consiste à diviser pour mieux régner. Voici un algorithme de temps utilisant la division et la conquête:O(n1.6k)

  1. Divisez et T en fonction de la première position de bit. En d'autres termes, forme S 0 = { s S : s 0 = 0 } , S 1 = { s S : s 0 = 1 } , T 0 = { t T : t 0 = 0 } , T 1 = { t T : 0STS0={sS:s0=0}S1={sS:s0=1}T0={tT:t0=0}T1={tT:t0=1} .

  2. Recherchez maintenant récursivement une paire non superposée de , de S 0 , T 1 et de T 1 , S 0 . Si un appel récursif trouve une paire non chevauchante, sortez-la, sinon sortez "Aucune paire chevauchante existe".S0,T0S0,T1T1,S0

Puisque tous les vecteurs binaires sont choisis au hasard, nous pouvons nous attendre et | T b | | T | / 2 . Ainsi, nous avons trois appels récursifs et nous avons réduit la taille du problème d'un facteur deux (la taille des deux ensembles est réduite d'un facteur deux). Après la séparation de lg min ( | S | , | T | ) , l'un des deux ensembles est réduit à la taille 1 et le problème peut être résolu en temps linéaire. Nous obtenons une relation de récurrence le long des lignes de|Sb||S|/2|Tb||T|/2lgmin(|S|,|T|) , dont la solution est T ( n ) = O ( n 1,6 k )T(n)=3T(n/2)+O(nk)T(n)=O(n1.6k) . En tenant compte du temps d'exécution plus précisément dans le cas des deux ensembles, nous voyons que le temps d'exécution est .O(min(|S|,|T|)0.6max(|S|,|T|)k)

Ceci peut être encore amélioré, en notant que si , alors la probabilité qu'il existe une paire non chevauchante est exponentiellement petite. En particulier, si x , y sont deux vecteurs aléatoires, la probabilité qu'ils ne chevauche pas ( 3 / quatre ) k . Si | S | = | T | = n , il existe n 2 de ces paires, donc par une union liée, la probabilité qu'il existe une paire non chevauchante est au plus n 2 ( 3k2.5lgn+100x,y(3/4)k|S|=|T|=nn2 . Lorsque k 2,5 lg n + 100 , ceci est1 / deux cent . Donc, en tant qu'étape de prétraitement, si k 2,5 lg n + 100 , alors nous pouvons immédiatement retourner "Aucune paire non chevauchante existe" (la probabilité que cela soit incorrect est négligeable), sinon nous exécutons l'algorithme ci-dessus.n2(3/4)kk2.5lgn+1001/2100k2.5lgn+100

On obtient ainsi un temps de fonctionnement de (ou O ( min ( | S | , | T | ) 0,6 max ( | S | , | T | ) min ( k , lg n) ) ) pour la variante à deux ensembles proposée ci-dessus), pour le cas spécial où les vecteurs binaires sont choisis uniformément au hasard.O(n1.6min(k,lgn))O(min(|S|,|T|)0.6max(|S|,|T|)min(k,lgn))

Of course, this is not a worst-case analysis. Random bitvectors are considerably easier than the worst case -- but let's treat it as a warmup, to get some ideas that perhaps we can apply to the general case.

Lessons from the warmup

We can learn a few lessons from the warmup above. First, divide-and-conquer (splitting on a bit position) seems helpful. Second, you want to split on a bit position with as many 1's in that position as possible; the more 0's there are, the less reduction in subproblem size you get.

Third, this suggests that the problem gets harder as the density of 1's gets smaller -- if there are very few 1's among the bitvectors (they are mostly 0's), the problem looks quite hard, as each split reduces the size of the subproblems a little bit. So, define the density Δ to be the fraction of bits that are 1 (i.e., out of all nk bits), and the density of bit position i to be the fraction of bitvectors that are 1 at position i.

Handling very low density

As a next step, we might wonder what happens if the density is extremely small. It turns out that if the density in every bit position is smaller than 1/k, we're guaranteed that a non-overlapping pair exists: there is a (non-constructive) existence argument showing that some non-overlapping pair must exist. This doesn't help us find it, but at least we know it exists.

Why is this the case? Let's say that a pair of bitvectors x,y is covered by bit position i if xi=yi=1. Note that every pair of overlapping bitvectors must be covered by some bit position. Now, if we fix a particular bit position i, the number of pairs that can be covered by that bit position is at most (nΔ(i))2<n2/k. Summing across all k of the bit positions, we find that the total number of pairs that are covered by some bit position is <n2. This means there must exist some pair that's not covered by any bit position, which implies that this pair is non-overlapping. So if the density is sufficiently low in every bit position, then a non-overlapping pair surely exists.

However, I'm at a loss to identify a fast algorithm to find such a non-overlapping pair, in these regime, even though one is guaranteed to exist. I don't immediately see any techniques that would yield a running time that has a sub-quadratic dependence on n. So, this is a nice special case to focus on, if you want to spend some time thinking about this problem.

Towards a general-case algorithm

In the general case, a natural heuristic seems to be: pick the bit position i with the most number of 1's (i.e., with the highest density), and split on it. In other words:

  1. Find a bit position i that maximizes Δ(i).

  2. Split S and T based upon bit position i. In other words, form S0={sS:si=0}, S1={sS:si=1}, T0={tT:ti=0}, T1={tT:ti=1}.

  3. Now recursively look for a non-overlapping pair from S0,T0, from S0,T1, and from T1,S0. If any recursive call finds a non-overlapping pair, output it, otherwise output "No overlapping pair exists".

The challenge is to analyze its performance in the worst case.

Let's assume that as a pre-processing step we first compute the density of every bit position. Also, if Δ(i)<1/k for every i, assume that the pre-processing step outputs "An overlapping pair exists" (I realize that this doesn't exhibit an example of an overlapping pair, but let's set that aside as a separate challenge). All this can be done in O(nk) time. The density information can be maintained efficiently as we do recursive calls; it won't be the dominant contributor to running time.

What will the running time of this procedure be? I'm not sure, but here are a few observations that might help. Each level of recursion reduces the problem size by about n/k bitvectors (e.g., from n bitvectors to nn/k bitvectors). Therefore, the recursion can only go about k levels deep. However, I'm not immediately sure how to count the number of leaves in the recursion tree (there are a lot less than 3k leaves), so I'm not sure what running time this should lead to.

D.W.
la source
ad low density: this seems to be some kind of pigeon-hole argument. Maybe if we use your general idea (split w.r.t. the column with the most ones), we get better bounds because the (S1,T1)-case (we don't recurse to) already gets rid of "most" ones?
Raphael
The total number of ones may be a useful parameter. You have already shown a lower bound we can use for cutting off the tree; can we show upper bounds, too? For example, if there are more than ck ones, we have at least c overlaps.
Raphael
By the way, how do you propose we do the first split; arbitrarily? Why not just split the whole input set w.r.t. some column i? We only need to recurse in the 0-case (there is no solution among those that share a one at i). In expectation, that gives via T(n)=T(n/2)+O(nk) a bound of O(nk) (if k fixed). For a general bound, you have shown that we can (assuming the lower-bound-cutoff you propose) that we get rid of at least n/k elements with every split, which seems to imply an O(nk) worst-case bound. Or am I missing something?
Raphael
Ah, that's wrong, of course, since it does not consider 0-1-mismatches. That's what I get for trying to think before breakfast, I guess.
Raphael
@Raphael, there are two issues: (a) the vectors might be mostly zeros, so you can't count on getting a 50-50 split; the recurrence would be something more like T(n)=T((nn/k)k)+O(nk), (b) more importantly, it's not enough to just recurse on the 0-subset; you also need to examine pairings between a vector from the 0-subset and a vector from the 1-subset, so there's an additional recursion or two to do. (I think? I hope I got that right.)
D.W.
8

Faster solution when nk, using matrix multiplication

Suppose that n=k. Our goal is to do better than an O(n2k)=O(n3) running time.

We can think of the bitvectors and bit positions as nodes in a graph. There is an edge between a bitvector node and a bit position node when the bitvector has a 1 in that position. The resulting graph is bipartite (with the bitvector-representing nodes on one side and the bitposition-representing nodes on the other), and has n+k=2n nodes.

Given the adjacency matrix M of a graph, we can tell if there is a two-hop path between two vertices by squaring M and checking if the resulting matrix has an "edge" between those two vertices (i.e. the edge's entry in the squared matrix is non-zero). For our purposes, a zero entry in the squared adjacency matrix corresponds to a non-overlapping pair of bitvectors (i.e. a solution). A lack of any zeroes means there's no solution.

Squaring an n x n matrix can be done in O(nω) time, where ω is known to be under 2.373 and conjectured to be 2.

So the algorithm is:

  • Convert the bitvectors and bit positions into a bipartite graph with n+k nodes and at most nk edges. This takes O(nk) time.
  • Compute the adjacency matrix of the graph. This takes O((n+k)2) time and space.
  • Square the adjacency matrix. This takes O((n+k)ω) time.
  • Search the bitvector section of the adjacency matrix for zero entries. This takes O(n2) time.

The most expensive step is squaring the adjacency matrix. If n=k then the overall algorithm takes O((n+k)ω)=O(nω) time, which is better than the naive O(n3) time.

This solution is also faster when k grows not-too-much-slower and not-too-much-faster than n. As long as kΩ(nω2) and kO(n2ω1), then (n+k)ω is better than n2k. For w2.373 that translates to n0.731kn1.373 (asymptotically). If w limits to 2, then the bounds widen towards nϵkn2ϵ.

Craig Gidney
la source
1. This is also better than the naive solution if k=Ω(n) but k=o(n1.457). 2. If kn, a heuristic could be: pick a random subset of n bit positions, restrict to those bit positions and use matrix multiplication to enumerate all pairs that don't overlap in those n bit positions; for each such pair, check if it solves the original problem. If there aren't many pairs that don't overlap in those n bit positions, this provides a speedup over the naive algorithm. However I don't know a good upper bound on the number of such pairs.
D.W.
4

This is equivalent to finding a bit vector which is a subset of the complement of another vector; ie its 1's occur only where 0's occur in the other.

If k (or the number of 1's) is small, you can get O(n2k) time by simply generating all the subsets of the complement of each bitvector and putting them in a trie (using backtracking). If a bitvector is found in the trie (we can check each before complement-subset insertion) then we have a non-overlapping pair.

If the number of 1's or 0's is bounded to an even lower number than k, then the exponent can be replaced by that. The subset-indexing can be on either each vector or its complement, so long as probing uses the opposite.

There's also a scheme for superset-finding in a trie that only stores each vector only once, but does bit-skipping during probes for what I believe is similar aggregate complexity; ie it has o(k) insertion but o(2k) searches.

KWillets
la source
thanks. The complexity of your solution is n2(1p)k, where p is the probability of 1's in the bitvector. A couple of implementation details: though this is a slight improvement, there's no need to compute and store the complements in the trie. Just following the complementary branches when checking for a non-overlapping match is enough. And, taking the 0's directly as wildcards, no special wildcard is needed, either.
Mauro Lacy
2

Represent the bit vectors as an n×k matrix M. Take i and j between 1 and n.

(MMT)ij=lMilMjl.

(MMT)ij, the dot product of the ith and jth vector, is non-zero if, and only if, vectors i and j share a common 1. So, to find a solution, compute MMT and return the position of a zero entry, if such an entry exists.

Complexity

Using naive multiplication, this requires O(n2k) arithmetic operations. If n=k, it takes O(n2.37) operations using the utterly impractical Coppersmith-Winograd algorithm, or O(n2.8) using the Strassen algorithm. If k=O(n0.302), then the problem may be solved using n2+o(1) operations.

Ben
la source
How is this different from Strilanc's answer?
D.W.
1
@D.W. Using an n-by-k matrix instead of an (n+k)-by-(n+k) matrix is an improvement. Also it mentions a way to cut off the factor of k when k << n, so that might be useful.
Craig Gidney