The Hackerlab at regexps.com

Bitsets

up: libhackerlab
next: Hashing
prev: Arrays

The Hackerlab C library provides interfaces for managing several kinds of bitset.

The most familiar bitsets are "flat bitsets" -- packed arrays of bits. See Flat Bitsets.

"Bitset trees" are a space and time efficient representation for very large bitsets which contain long sub-sequences of all 0 or all 1 bits. These sets are stored in a tree with ordinary bitsets for leaf nodes. As an important optimization, homogenous subtrees (those which are all 1 or 0 ) are represented by one word -- regardless of how many bits they contain. See Bitset Trees.

"Shared bitset trees" are bitset trees augmented with copy-on-write semantics. See Shared Bitset Trees.


Flat Bitsets

up: Bitsets
next: Bitset Trees


#include <hackerlab/bitsets/bitset.h>

A bitset (or flat bitset ) is a packed array whose elements are each 1 bit. A bitset can be considered a finite, ordered set of numbered entities.

The type bitset_subset is a bitset small enough to fit in a variable of type void * . (See Portability Assumptions in the Hackerlab C Library.)

The type bitset is a pointer to bitset_subset . Bitset functions generally operate on a pair of values: the size (measured in bits, an integer of type bit_t ) and the set (a pointer of type bitset_subset * ).

Some functions operate on an arbitrary subsequence of a bitset. These functions have names that end in the suffix _range , and their arguments include a starting bit (from ) and an ending bit (to ). When a function operates on a range of bits, the range is specified as a half-open interval. For example,

     bitset_clear_range (bits, from, to)

changes the bits numbered:

     from ... to - 1

and the bit numbered to is unaffected. This permits bit addresses to be used similarly to pointers: a range of n bits beginning at bit pos is:

     pos, pos + n

Bitsets are ordinarily allocated in sizes rounded up to a whole number of bitset_subset values. Operations on bitsets, however, are precise: they modify only the bits in a bitset and no others. For that reason, it is safe to operate on a bitset with N elements as if it had only M elements (with M < N ). Bits M..N-1 will be unaffected.
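
The half-open convention can be illustrated with a small stand-alone sketch. It is not the library's implementation: the names sk_bit_t, sk_subset, sk_clear_range and sk_is_member are hypothetical stand-ins, and the loop clears one bit at a time for clarity rather than speed.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical stand-ins for bit_t and bitset_subset. */
typedef long sk_bit_t;
typedef unsigned long sk_subset;

#define SK_BITS_PER_SUBSET ((sk_bit_t)(CHAR_BIT * sizeof (sk_subset)))

/* Clear bits from .. to-1 of b.  Bit `to' and every bit outside the
 * half-open interval are left alone.
 */
static void
sk_clear_range (sk_subset * b, sk_bit_t from, sk_bit_t to)
{
  sk_bit_t n;

  assert (0 <= from && from <= to);
  for (n = from; n < to; ++n)
    b[n / SK_BITS_PER_SUBSET] &= ~((sk_subset)1 << (n % SK_BITS_PER_SUBSET));
}

/* Return bit n of b (either 0 or 1). */
static int
sk_is_member (const sk_subset * b, sk_bit_t n)
{
  return (int)((b[n / SK_BITS_PER_SUBSET] >> (n % SK_BITS_PER_SUBSET)) & 1);
}
```

With a set of all ones, sk_clear_range (set, 3, 7) clears bits 3, 4, 5 and 6 and leaves bit 7 set. An empty range such as (10, 10) changes nothing.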


Bitset Types and Macros

up: Flat Bitsets
next: Allocating and Freeing Bitsets

Type bit_t

typedef long bit_t;

Values of type bit_t represent a bit address within a bitset. The address of the first bit is 0 .

bit_t is a signed integer type. Some functions which return a bit address use the value -1 to indicate no such bit .



Type bitset_subset

typedef unsigned long bitset_subset;

A fragment of a bitset. A bitset is an array of these.



Type bitset

typedef bitset_subset * bitset;

A packed array of bits.



Macro bitset_which_subset

#define bitset_which_subset(N)  ((N) / bits_per_subset)

A macro useful for finding the subset index within a bitset of a particular bitset element. The subset containing bit N of bitset B is:

     B[bitset_which_subset(N)]



Macro bitset_subset_offset

#define bitset_subset_offset(N) \
       (((N) / bits_per_subset) * bits_per_subset)

A macro useful for finding a subset offset within a bitset of the subset containing a bitset element. Bit 0 (the low order bit) of the subset containing bit N is the bit whose bit address is bitset_subset_offset(N) .



Macro bitset_which_bit

#define bitset_which_bit(N)  ((N) & bitset_subset_mask)

The bit number of a bitset index within its subset.
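
The three addressing macros fit together as in the following sketch. The definitions of sk_bits_per_subset and sk_subset_mask are assumptions (the text names bits_per_subset and bitset_subset_mask but does not show them), and all sk_ names are hypothetical. The mask form is only valid when the number of bits per subset is a power of two.

```c
#include <assert.h>
#include <limits.h>

typedef long sk_bit_t;
typedef unsigned long sk_subset;

/* Assumed values; not taken from the library headers. */
#define sk_bits_per_subset   ((sk_bit_t)(CHAR_BIT * sizeof (sk_subset)))
#define sk_subset_mask       (sk_bits_per_subset - 1)

/* Same shapes as the macros documented above. */
#define sk_which_subset(N)   ((N) / sk_bits_per_subset)
#define sk_subset_offset(N)  (((N) / sk_bits_per_subset) * sk_bits_per_subset)
#define sk_which_bit(N)      ((N) & sk_subset_mask)

/* Set bit n of b: pick the subset, then the bit within it. */
static void
sk_adjoin (sk_subset * b, sk_bit_t n)
{
  b[sk_which_subset (n)] |= (sk_subset)1 << sk_which_bit (n);
}
```

For any bit address n, sk_subset_offset (n) + sk_which_bit (n) equals n.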



Macro bitset_numb_subsets

#define bitset_numb_subsets(N)

As a function:

     bit_t bitset_numb_subsets (bit_t n);

Return the number of values of type bitset_subset necessary to represent a bitset with n elements. Because this is a macro, it can be used in a declaration:

     {
       // declare a bitset that can hold 12 elements:
       //
       bitset_subset options_set [bitset_numb_subsets(12)];
     }



Macro sizeof_bitset

#define sizeof_bitset(N)

As a function:

     size_t sizeof_bitset (bit_t n);

Return the size, in bytes, of a bitset large enough to hold n elements:

     // allocate a bitset that can hold 12 elements:
     //
     options_set = (bitset)must_malloc (sizeof_bitset (12));
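
One plausible way to compute these sizes is to round the bit count up to a whole number of subsets. The sketch below assumes that approach. The sk_ names are hypothetical and the macro bodies are not taken from the library headers.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned long sk_subset;

#define sk_bits_per_subset   (CHAR_BIT * sizeof (sk_subset))

/* Number of subsets needed to hold N bits, rounded up. */
#define sk_numb_subsets(N) \
  (((N) + sk_bits_per_subset - 1) / sk_bits_per_subset)

/* Size in bytes of a bitset large enough to hold N bits. */
#define sk_sizeof_bitset(N)  (sk_numb_subsets (N) * sizeof (sk_subset))

/* Both macros are constant expressions, so they can size an array. */
static sk_subset sk_options_set[sk_numb_subsets (12)];
```

A 12-element set fits in one subset, while a set with one element more than a subset holds needs two.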




Allocating and Freeing Bitsets

up: Flat Bitsets
next: Tests on Bitsets
prev: Bitset Types and Macros

Function bitset_alloc

bitset bitset_alloc (alloc_limits limits, bit_t size);

Allocate an empty bitset large enough to hold size members.

Allocation is performed by lim_malloc using limits . See Allocation With Limitations.



Function bitset_free

void bitset_free (alloc_limits limits, bitset a);

Free the bitset a .

Deallocation is performed by lim_free using limits . See Allocation With Limitations.



Function bitset_dup

bitset bitset_dup (alloc_limits limits, bit_t size, bitset a);

Allocate a copy of bitset a which has size members.

Allocation is performed by lim_malloc using limits . See Allocation With Limitations.




Tests on Bitsets

up: Flat Bitsets
next: Set Operations
prev: Allocating and Freeing Bitsets

Function bitset_is_member

int bitset_is_member (bitset b, bit_t n);

Return bit n of bitset b (either 0 or 1 ).



Function bitset_is_equal

int bitset_is_equal (bit_t size, bitset a, bitset b);

Compare two bitsets for equality. Both bitsets have size members. Return 1 if they are equal, 0 otherwise.



Function bitset_is_subset

int bitset_is_subset (bit_t size, bitset a, bitset b);

Return 1 if bitset b is a subset of bitset a , 0 otherwise.



Function bitset_is_empty

int bitset_is_empty (bit_t size, bitset a);

Return 1 if bitset a has no members.



Function bitset_is_empty_range

int bitset_is_empty_range (bitset a, bit_t from, bit_t to);

Return 1 if bitset a has no members in the closed interval [from ... to-1] .



Function bitset_is_full

int bitset_is_full (bit_t size, bitset a);

Return 1 if bitset a is filled (all ones).



Function bitset_is_full_range

int bitset_is_full_range (bitset a, bit_t from, bit_t to);

Return 1 if bitset a is filled in the closed interval [from .. to-1] .




Set Operations

up: Flat Bitsets
next: Population Size
prev: Tests on Bitsets

Function bitset_adjoin

void bitset_adjoin (bitset b, bit_t n);

Set bit n of bitset b to 1 .



Function bitset_remove

void bitset_remove (bitset b, bit_t n);

Set bit n of bitset b to 0 .



Function bitset_toggle

void bitset_toggle (bitset b, bit_t n);

If bit n of bitset b is 1 , set it to 0 ; if 0 , set it to 1 .



Function bitset_clear

void bitset_clear (bit_t size, bitset b);

Clear all bits of bitset b .



Function bitset_clear_range

void bitset_clear_range (bitset b, bit_t from, bit_t to);

Clear the bits from .. to - 1 of bitset b .



Function bitset_fill

void bitset_fill (bit_t size, bitset b);

Set all bits of bitset b .



Function bitset_fill_range

void bitset_fill_range (bitset b, bit_t from, bit_t to);

Set the bits from .. to - 1 of bitset b .



Function bitset_complement

void bitset_complement (bit_t size, bitset b);

Toggle all bits in bitset b .



Function bitset_assign

void bitset_assign (bit_t size, bitset a, bitset b);

Copy bitset b to bitset a .



Function bitset_union

void bitset_union (bit_t size, bitset a, bitset b);

Add all elements of bitset b to bitset a .



Function bitset_intersection

void bitset_intersection (bit_t size, bitset a, bitset b);

Remove from bitset a all elements that are not also in bitset b .



Function bitset_difference

void bitset_difference (bit_t size, bitset a, bitset b);

Remove from bitset a all elements that are in bitset b .



Function bitset_revdifference

void bitset_revdifference (bit_t size, bitset a, bitset b);

Set all bits in a that are set in bitset b but not in a ; clear all other bits of bitset a .

This is similar to bitset_difference (size, b, a) except that the result is assigned to bitset a instead of bitset b .



Function bitset_xor

void bitset_xor (bit_t size, bitset a, bitset b);

Toggle all bits in bitset a that are set in bitset b .
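
The difference, reverse difference and xor operations differ only in the word-wise expression assigned to a . The sketch below makes this concrete. It is an illustration with hypothetical sk_ names, and it assumes the set size is a whole number of words, whereas the library functions take a size in bits and leave bits beyond it untouched.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long sk_subset;

/* a := a - b  (remove from a everything that is in b) */
static void
sk_difference (size_t nwords, sk_subset * a, const sk_subset * b)
{
  size_t i;
  for (i = 0; i < nwords; ++i)
    a[i] &= ~b[i];
}

/* a := b - a  (same computation as b - a, but the result lands in a) */
static void
sk_revdifference (size_t nwords, sk_subset * a, const sk_subset * b)
{
  size_t i;
  for (i = 0; i < nwords; ++i)
    a[i] = b[i] & ~a[i];
}

/* a := a xor b  (toggle in a every bit that is set in b) */
static void
sk_xor (size_t nwords, sk_subset * a, const sk_subset * b)
{
  size_t i;
  for (i = 0; i < nwords; ++i)
    a[i] ^= b[i];
}
```

With a = {2, 3} and b = {1, 3} (as bit numbers), the difference leaves {2} in a , the reverse difference leaves {1}, and xor leaves {1, 2}.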




Population Size

up: Flat Bitsets
next: First Bit Set -- First Bit Clear
prev: Set Operations

Function bitset_population

bit_t bitset_population (bit_t size, bitset a);

Return the number of bits set in bitset a .



Function bitset_population_range

bit_t bitset_population_range (bitset a, bit_t from, bit_t to);

Return the number of bits set in bitset a counting only members in the closed interval [from .. to-1] .
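
A range population count can be sketched as a loop over the half-open interval. The sk_ names are hypothetical, and a production version would more likely count whole words at a time; this form only illustrates which bits are included.

```c
#include <assert.h>
#include <limits.h>

typedef long sk_bit_t;
typedef unsigned long sk_subset;

#define SK_BITS_PER_SUBSET ((sk_bit_t)(CHAR_BIT * sizeof (sk_subset)))

/* Count the bits set among bits from .. to-1 of a. */
static sk_bit_t
sk_population_range (const sk_subset * a, sk_bit_t from, sk_bit_t to)
{
  sk_bit_t n;
  sk_bit_t count = 0;

  assert (0 <= from && from <= to);
  for (n = from; n < to; ++n)
    count += (sk_bit_t)((a[n / SK_BITS_PER_SUBSET]
                         >> (n % SK_BITS_PER_SUBSET)) & 1);
  return count;
}
```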




First Bit Set -- First Bit Clear

up: Flat Bitsets
prev: Population Size

Function bitset_ffs

bit_t bitset_ffs (bit_t size, bitset b);

Return the (bit) address of the first bit set in bitset b . Return -1 if no bit is set.



Function bitset_ffs_range

bit_t bitset_ffs_range (bitset b, bit_t from, bit_t to);

Return the (bit) address of the first bit set in bitset b in the range from .. to-1 . Return -1 if no bit is set.



Function bitset_ffc

bit_t bitset_ffc (bit_t size, bitset b);

Return the (bit) address of the first bit clear in bitset b . Return -1 if no bit is clear.



Function bitset_ffc_range

bit_t bitset_ffc_range (bitset b, bit_t from, bit_t to);

Return the (bit) address of the first bit clear in bitset b in the range from .. to-1 . Return -1 if no bit is clear.
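
The search functions can be pictured as a scan that returns an address or -1 . The sketch below uses hypothetical sk_ names and a simple bit-by-bit loop. It assumes the returned address is counted from bit 0 of the set rather than from the start of the range.

```c
#include <assert.h>
#include <limits.h>

typedef long sk_bit_t;
typedef unsigned long sk_subset;

#define SK_BITS_PER_SUBSET ((sk_bit_t)(CHAR_BIT * sizeof (sk_subset)))

static int
sk_bit (const sk_subset * b, sk_bit_t n)
{
  return (int)((b[n / SK_BITS_PER_SUBSET] >> (n % SK_BITS_PER_SUBSET)) & 1);
}

/* Address of the first set bit in from .. to-1, or -1 if there is none. */
static sk_bit_t
sk_ffs_range (const sk_subset * b, sk_bit_t from, sk_bit_t to)
{
  sk_bit_t n;
  for (n = from; n < to; ++n)
    if (sk_bit (b, n))
      return n;
  return -1;
}

/* Address of the first clear bit in from .. to-1, or -1 if there is none. */
static sk_bit_t
sk_ffc_range (const sk_subset * b, sk_bit_t from, sk_bit_t to)
{
  sk_bit_t n;
  for (n = from; n < to; ++n)
    if (!sk_bit (b, n))
      return n;
  return -1;
}
```

Because bit_t is signed, a caller can test the result with a simple comparison against 0 before using it as an address.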




Bitset Trees

up: Bitsets
next: Shared Bitset Trees
prev: Flat Bitsets


#include <hackerlab/bitsets/bitset-tree.h>

A bitset tree is a sparse representation for a large bitset which is mostly homogenous (i.e., containing large runs which are all 0 or all 1 ). A bitset tree saves time and memory by compressing the representation of homogenous subsets.

It is important to note that if your bitsets do not contain large, contiguous, homogenous (all 0 or all 1 ) subsets, then bitset trees are of no help -- they add space and time overhead and offer no advantage. If your sets do have homogenous subsets, bitset trees can lead to significant space and time savings.

Usually, you will not want to use bitset trees directly. Instead, you will want to use "shared bitset trees". (See Shared Bitset Trees.) Nevertheless, even if you will use shared bitset trees, it is important to understand how bitset trees work.

The Bitset Tree Data Structure

This section describes the internal representation of bitset trees. The public interface to bitset trees hides most of the details of this representation -- but understanding it will help you predict the performance of programs which use bitset trees, and design bitset tree rules. (See Bitset Tree Rules.)

Bitset trees are defined recursively: a bitset tree is an array of (pointers to) bitset trees. The leaf nodes of a bitset tree are (pointers to) ordinary bitsets. (See Flat Bitsets.)

A bitset tree represents a single logical bitset which is the concatenation of its sub-trees. Each subtree at a given distance from the root has the same number of members and the same depth and branching structure.

As an optimization, if a particular subtree is all 0s, that subtree may be replaced by a null pointer. If a particular subtree is all 1s, it may be replaced by the pointer (bits_tree)-1 . Whenever practical, functions which operate on bitset trees use this optimization automatically. For example, bits_fill_range will store a subtree of all 0s or all 1s efficiently. In some cases, the optimization is not practical. For example, bits_adjoin does not attempt to optimize the subtrees it modifies.

The function bits_tree_compact recursively searches a bitset tree, replacing homogenous sub-trees with 0 or -1 .
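
The sentinel-pointer optimization can be sketched with a tree that has a single level of branching. This is an illustration only: the sk_ names, the fanout of 8 and the 4-word leaves are arbitrary assumptions, and the real trees nest to whatever depth the rules array describes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned long sk_subset;

#define SK_BITS_PER_SUBSET (CHAR_BIT * sizeof (sk_subset))
#define SK_LEAF_WORDS      4
#define SK_LEAF_BITS       (SK_LEAF_WORDS * SK_BITS_PER_SUBSET)
#define SK_FANOUT          8

/* Sentinel for an all-ones sub-tree; a null pointer means all zeros. */
#define SK_ALL_ONES        ((sk_subset *)-1)

struct sk_tree
{
  sk_subset * leaf[SK_FANOUT];
};

/* Membership test: sentinels answer without touching any leaf storage. */
static int
sk_tree_is_member (const struct sk_tree * t, size_t n)
{
  const sk_subset * leaf;
  size_t bit;

  assert (n < SK_FANOUT * SK_LEAF_BITS);
  leaf = t->leaf[n / SK_LEAF_BITS];
  if (leaf == NULL)
    return 0;
  if (leaf == SK_ALL_ONES)
    return 1;
  bit = n % SK_LEAF_BITS;
  return (int)((leaf[bit / SK_BITS_PER_SUBSET]
                >> (bit % SK_BITS_PER_SUBSET)) & 1);
}

/* Return the sentinel that can stand in for a homogeneous leaf, or the
 * leaf itself if it is mixed.  Releasing the storage is left to the caller.
 */
static sk_subset *
sk_compact_leaf (sk_subset * leaf)
{
  size_t i;
  int all0 = 1;
  int all1 = 1;

  if (leaf == NULL || leaf == SK_ALL_ONES)
    return leaf;
  for (i = 0; i < SK_LEAF_WORDS; ++i)
    {
      if (leaf[i] != 0)
        all0 = 0;
      if (leaf[i] != ~(sk_subset)0)
        all1 = 0;
    }
  if (all0)
    return NULL;
  return all1 ? SK_ALL_ONES : leaf;
}
```

In this layout a homogeneous region of SK_LEAF_BITS bits costs one pointer in the parent node, which is the saving the text describes.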

Bitset Tree Rules

The branching structure of a bitset tree is determined by an array of structures of type struct bits_tree_rule :

Type struct bits_tree_rule

struct bits_tree_rule;


struct bits_tree_rule
{
  int fanout;
  size_t subset_size;
  size_t subset_shift;
  size_t subset_mask;
};


An array of struct bits_tree_rule values determines the branching structure of a bitset tree. Nodes at distance N from the root of the tree are defined by the N th element of the array. (See also The Bitset Tree Data Structure.)

fanout is the number of sub-trees a node has. It is 0 for leaf nodes.

subset_size is the number of bits in each sub-tree. For leaf nodes, this should be the number of bits in the leaf node. For optimal performance, subset_size should be a multiple of sizeof (bitset_subset) .

Given a non-leaf bitset tree T , and a bit address within that tree, B , the subset containing that bit is:

             T[B / subset_size]

The relative address of the bit within that subtree is:

             B % subset_size

The fields subset_shift and subset_mask are not used for leaf nodes. For non-leaf nodes, if subset_size is not a power of two, the fields subset_shift and subset_mask should be 0 . Otherwise, they should be set as follows:

             subset_shift = log2(subset_size)
             subset_mask = subset_size - 1
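
The address arithmetic above can be sketched as a pair of helpers. struct sk_rule mirrors the fields of struct bits_tree_rule but is a hypothetical copy, and the decision to use the shift and mask whenever subset_mask is nonzero is an assumption about how such fields would typically be used, not a description of the library's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the fields of struct bits_tree_rule. */
struct sk_rule
{
  int fanout;
  size_t subset_size;
  size_t subset_shift;
  size_t subset_mask;
};

/* Index of the sub-tree holding bit b in a non-leaf node governed by r. */
static size_t
sk_subtree_index (const struct sk_rule * r, size_t b)
{
  if (r->subset_mask != 0)
    return b >> r->subset_shift;        /* power-of-two fast path */
  return b / r->subset_size;
}

/* Address of bit b relative to the start of that sub-tree. */
static size_t
sk_subtree_offset (const struct sk_rule * r, size_t b)
{
  if (r->subset_mask != 0)
    return b & r->subset_mask;          /* power-of-two fast path */
  return b % r->subset_size;
}
```

For a rule with subset_size 1<<16 , bit 200000 falls in sub-tree 3 at relative address 3392, whichever path is taken.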

Here is an example for a bitset containing 1<<21 elements. In this example, the root node of the tree has 32 sub-trees; each second and third level tree has 16 sub-trees; leaf nodes have 256 bits (32 * 16 * 16 * 256 == 1<<21 ):

     struct bits_tree_rule bitset_rule[] =
     {
       {32, 1<<16, 16, 0xffff}, // root has 32 subtrees
       {16, 1<<12, 12, 0xfff},  // level 1 nodes have 16 subtrees
       {16, 256, 8, 0xff},      // level 2 nodes have 16 subtrees
       {0, 256, 0, 0}           // leaf nodes have 256 bits
     };

Bitset trees of the same size (1<<21 bits) could be represented other ways. For example:

     struct bits_tree_rule bitset_rule[] =
     {
       {16, 1<<17, 17, 0x1ffff}, // root has 16 subtrees
       {2, 1<<16, 16, 0xffff},   // level 1 nodes have 2 subtrees
       {16, 1<<12, 12, 0xfff},   // level 2 nodes have 16 subtrees
       {16, 256, 8, 0xff},       // level 3 nodes have 16 subtrees
       {0, 256, 0, 0}            // leaf nodes have 256 bits
     };
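
When writing a rules array by hand, it can help to check that each level's sub-trees add up to the subset_size of the level above. The following checker is a sketch with hypothetical names (struct sk_rule , sk_rule_total_bits ) and is not part of the library. It encodes the pattern seen in the two examples: fanout times subset_size at one level equals subset_size at the parent level, and a leaf's subset_size equals its parent's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the fields of struct bits_tree_rule. */
struct sk_rule
{
  int fanout;
  size_t subset_size;
  size_t subset_shift;
  size_t subset_mask;
};

/* Return the total number of bits the rules describe, or 0 if the
 * levels do not add up.
 */
static size_t
sk_rule_total_bits (const struct sk_rule * rule)
{
  size_t i;

  if (rule[0].fanout == 0)
    return rule[0].subset_size;         /* the tree is a single leaf */

  for (i = 0; rule[i].fanout != 0; ++i)
    {
      const struct sk_rule * child = &rule[i + 1];
      size_t child_bits = (child->fanout == 0
                           ? child->subset_size
                           : (size_t)child->fanout * child->subset_size);
      if (child_bits != rule[i].subset_size)
        return 0;
    }
  return (size_t)rule[0].fanout * rule[0].subset_size;
}

/* The two example structures from the text. */
static const struct sk_rule sk_rule_a[] =
{
  {32, 1<<16, 16, 0xffff}, {16, 1<<12, 12, 0xfff},
  {16, 256, 8, 0xff}, {0, 256, 0, 0}
};

static const struct sk_rule sk_rule_b[] =
{
  {16, 1<<17, 17, 0x1ffff}, {2, 1<<16, 16, 0xffff},
  {16, 1<<12, 12, 0xfff}, {16, 256, 8, 0xff}, {0, 256, 0, 0}
};
```

Both example arrays describe 1<<21 bits under this check.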

Some care is necessary when choosing the values for an array of struct bits_tree_rule . For a given bitset size, a deeper bitset tree (more elements in the rules array) means that the worst-case cost of accessing or modifying a single bit is raised. On the other hand, homogenous sub-trees (at any depth) are (often) replaced by a 0 or -1 pointer saving both space and time -- a deeper tree may offer more opportunities for that optimization. The best branching structure depends on the particular sets your program uses and the particular access pattern of your program; experimentation with different branching structures may be necessary.



Allocating Bitset Trees

Function bits_tree_alloc

bits_tree bits_tree_alloc (alloc_limits lim,
                           struct bits_tree_rule * rule);

Allocate a new bitset tree.

lim describes allocation limits that apply to this bitset tree. For more information about allocation limits, see Allocation With Limitations.

rule describes the branching structure of the bitset tree. See Bitset Tree Rules.

The new set is initialized to all 0 s.

If allocation fails, 0 is returned.



Function bits_tree_free

void bits_tree_free (alloc_limits lim,
                     struct bits_tree_rule * rule,
                     bits_tree b);

Free all storage associated with a bitset tree.

lim describes allocation limits that apply to this bitset tree. For more information about allocation limits, see Allocation With Limitations.

rule describes the branching structure of the bitset tree. See Bitset Tree Rules.



Function bits_tree_compact

bit_t bits_tree_compact (alloc_limits lim,
                         struct bits_tree_rule * rule,
                         bits_tree a);

Optimize a bitset tree by compacting homogenous sub-trees. See The Bitset Tree Data Structure.

lim describes allocation limits that apply to this bitset tree. For more information about allocation limits, see Allocation With Limitations.

rule describes the branching structure of the bitset tree. See Bitset Tree Rules.

The population size of the set is returned.



Function bits_tree_dup

bits_tree bits_tree_dup (alloc_limits lim,
                         struct bits_tree_rule * rule,
                         bits_tree a);

Allocate a new copy of bitset tree a .

lim describes allocation limits that apply to this bitset tree. For more information about allocation limits, see Allocation With Limitations.

rule describes the branching structure of the bitset tree. See Bitset Tree Rules.

If allocation fails, 0 is returned.



Operations on Bitset Trees

Each of the operations defined for flat bitsets has a corresponding operation for bitset trees. See Flat Bitsets.

The bitset tree operations are:

    int bits_tree_is_member (alloc_limits lim,
                      struct bits_tree_rule * rule,
                      bits_tree b,
                      int n);
    int bits_tree_is_equal (alloc_limits lim,
                     struct bits_tree_rule * rule,
                     bits_tree a, bits_tree b);
    int bits_tree_is_subset (alloc_limits lim,
                      struct bits_tree_rule * rule,
                      bits_tree a,
                      bits_tree b);
    int bits_tree_is_empty (alloc_limits lim,
                     struct bits_tree_rule * rule,
                     bits_tree a);
    int bits_tree_is_full (alloc_limits lim,
                    struct bits_tree_rule * rule,
                    bits_tree a);
    int bits_tree_is_empty_range (alloc_limits lim,
                           struct bits_tree_rule * rule,
                           bits_tree a,
                           int from,
                           int to);
    int bits_tree_is_full_range (alloc_limits lim,
                          struct bits_tree_rule * rule,
                          bits_tree a,
                          int from,
                          int to);
    int bits_tree_adjoin (alloc_limits lim,
                   struct bits_tree_rule * rule,
                   bits_tree b,
                   int n);
    int bits_tree_remove (alloc_limits lim,
                   struct bits_tree_rule * rule,
                   bits_tree b, int n);
    int bits_tree_toggle (alloc_limits lim,
                   struct bits_tree_rule * rule,
                   bits_tree b,
                   int n);
    void bits_tree_clear (alloc_limits lim,
                   struct bits_tree_rule * rule,
                   bits_tree b);
    void bits_tree_fill (alloc_limits lim,
                  struct bits_tree_rule * rule,
                  bits_tree b);
    int bits_tree_clear_range (alloc_limits lim,
                        struct bits_tree_rule * rule,
                        bits_tree b,
                        int from,
                        int to);
    int bits_tree_fill_range (alloc_limits lim,
                       struct bits_tree_rule * rule,
                       bits_tree b,
                       int from,
                       int to);
    void bits_tree_complement (alloc_limits lim,
                        struct bits_tree_rule * rule,
                        bits_tree b);
    int bits_tree_assign (alloc_limits lim,
                   struct bits_tree_rule * rule,
                   bits_tree a,
                   bits_tree b);
    int bits_tree_union (alloc_limits lim,
                  struct bits_tree_rule * rule,
                  bits_tree a,
                  bits_tree b);
    int bits_tree_intersection (alloc_limits lim,
                         struct bits_tree_rule * rule,
                         bits_tree a,
                         bits_tree b);
    int bits_tree_difference (alloc_limits lim,
                       struct bits_tree_rule * rule,
                       bits_tree a,
                       bits_tree b);
    int bits_tree_revdifference (alloc_limits lim,
                          struct bits_tree_rule * rule,
                          bits_tree a,
                          bits_tree b);
    int bits_tree_xor (alloc_limits lim,
                struct bits_tree_rule * rule,
                bits_tree a,
                bits_tree b);
    int bits_tree_population (alloc_limits lim,
                       struct bits_tree_rule * rule,
                       bits_tree a);
    int bits_tree_population_range (alloc_limits lim,
                             struct bits_tree_rule * rule,
                             bits_tree a, int from, int to);
    int bits_tree_ffs (alloc_limits lim,
                struct bits_tree_rule * rule,
                bits_tree b);
    int bits_tree_ffc (alloc_limits lim,
                struct bits_tree_rule * rule,
                bits_tree b);
    int bits_tree_ffs_range (alloc_limits lim,
                      struct bits_tree_rule * rule,
                      bits_tree b,
                      int from,
                      int to);
    int bits_tree_ffc_range (alloc_limits lim,
                      struct bits_tree_rule * rule,
                      bits_tree b,
                      int from,
                      int to);

Each function performs the same operation as the corresponding bitset_ function (replace bits_tree_ with bitset_ ). For that reason, the bits_tree_ functions are not individually documented; for documentation, see Flat Bitsets.

Each bits_tree function takes two initial parameters, lim and rule .

lim describes allocation limits that apply to the bitset tree(s) being operated upon. For more information about allocation limits, see Allocation With Limitations.

rule describes the branching structure of the bitset trees. See Bitset Tree Rules.

These functions:

   bits_tree_adjoin
   bits_tree_remove
   bits_tree_toggle
   bits_tree_clear
   bits_tree_fill
   bits_tree_clear_range
   bits_tree_fill_range
   bits_tree_complement
   bits_tree_assign
   bits_tree_union
   bits_tree_intersection
   bits_tree_difference
   bits_tree_revdifference
   bits_tree_xor

return a value of type int . All of them will sometimes allocate memory.

If an allocation fails, these functions return -1 and have an indeterminate side effect on the set being operated upon.

If all allocations succeed, they return 0 (and have the intended side effect on the set being operated upon).


Shared Bitset Trees

up: Bitsets
next: Unicode Character Bitsets
prev: Bitset Trees


#include <hackerlab/bitsets/bits.h>

Shared bitset trees are ordinary bitset trees with two differences:

1. The allocation limits that apply to a shared bitset tree are recorded when the tree is created and do not need to be passed as parameters to every bitset operation. (For more information about allocation limits, see Allocation With Limitations.)

2. When a shared bitset tree is copied, very little data is actually duplicated: the old and the new bitset tree initially share state. Instead, copying takes place when (and if) either bitset is later modified.
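
The copy-on-write idea in point 2 can be sketched with a reference-counted handle. This is a generic illustration, not the library's data structure: every sk_ name is hypothetical, and the sketch shares one flat array per set, whereas a shared bitset tree can presumably share individual sub-trees.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define SK_WORDS         4
#define SK_BITS_PER_WORD (CHAR_BIT * sizeof (unsigned long))

/* Storage that several handles may share. */
struct sk_shared
{
  int refs;
  unsigned long words[SK_WORDS];
};

struct sk_bits
{
  struct sk_shared * rep;
};

/* Create an empty set.  Returns 0, or -1 if allocation fails. */
static int
sk_bits_init (struct sk_bits * b)
{
  b->rep = calloc (1, sizeof *b->rep);
  if (!b->rep)
    return -1;
  b->rep->refs = 1;
  return 0;
}

/* Cheap copy: share the storage and bump the count. */
static void
sk_bits_dup (struct sk_bits * dst, const struct sk_bits * src)
{
  dst->rep = src->rep;
  dst->rep->refs++;
}

/* Set bit n, copying the storage first if it is shared. */
static int
sk_bits_adjoin (struct sk_bits * b, size_t n)
{
  assert (n < SK_WORDS * SK_BITS_PER_WORD);
  if (b->rep->refs > 1)
    {
      struct sk_shared * copy = malloc (sizeof *copy);
      if (!copy)
        return -1;
      memcpy (copy->words, b->rep->words, sizeof copy->words);
      copy->refs = 1;
      b->rep->refs--;
      b->rep = copy;
    }
  b->rep->words[n / SK_BITS_PER_WORD] |= 1UL << (n % SK_BITS_PER_WORD);
  return 0;
}

/* Drop one reference; release the storage with the last one. */
static void
sk_bits_free (struct sk_bits * b)
{
  if (--b->rep->refs == 0)
    free (b->rep);
  b->rep = NULL;
}
```

After sk_bits_dup the two handles point at the same storage. The first sk_bits_adjoin on either handle gives that handle a private copy and leaves the other unchanged.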

Before reading this section, it is a good idea to first understand the material in Bitset Trees.

Allocating Shared Bitset Trees

Function bits_alloc

bits bits_alloc (alloc_limits lim, struct bits_tree_rule * rule);

Create a new shared bitset tree, subject to allocation limits lim , using branching structure rule .

For more information about allocation limits, see Allocation With Limitations.

For more information about branching structure rules, see Bitset Tree Rules.



Function bits_free

void bits_free (bits b);

Free a previously allocated shared bitset tree.



Function bits_dup

bits bits_dup (bits a);

Copy a shared bitset tree.

This operation is inexpensive -- most data is shared between the two trees until one of the two is modified.

If set a was created with no allocation limits, and allocation fails, this function does not return.

If set a was created with allocation limits, and allocation fails, this function returns 0 .



Function bits_compact

void bits_compact (bits a);

Optimize a shared bitset tree by compacting homogenous sub-trees. See The Bitset Tree Data Structure.



Operations on Shared Bitset Trees

Each of the operations defined for flat bitsets has a corresponding operation for shared bitset trees. See Flat Bitsets.

The shared bitset tree operations are:

    int bits_is_member (bits b, int n);
    int bits_is_equal (bits a, bits b);
    int bits_is_subset (bits a, bits b);
    int bits_is_empty (bits a);
    int bits_is_full (bits a);
    int bits_is_empty_range (bits a, int from, int to);
    int bits_is_full_range (bits a, int from, int to);
    int bits_adjoin (bits b, int n);
    int bits_remove (bits b, int n);
    int bits_toggle (bits b, int n);
    int bits_clear (bits b);
    int bits_fill (bits b);
    int bits_clear_range (bits b, int from, int to);
    int bits_fill_range (bits b, int from, int to);
    int bits_complement (bits b);
    int bits_assign (bits a, bits b);
    int bits_union (bits a, bits b);
    int bits_intersection (bits a, bits b);
    int bits_difference (bits a, bits b);
    int bits_revdifference (bits a, bits b);
    int bits_xor (bits a, bits b);
    int bits_population (bits a);
    int bits_population_range (bits a, int from, int to);
    int bits_ffs (bits b);
    int bits_ffc (bits b);
    int bits_ffs_range (bits b, int from, int to);
    int bits_ffc_range (bits b, int from, int to);

Each function performs the same operation as the corresponding bitset_ function (replace bits_ with bitset_ .) For documentation, see Flat Bitsets. For that reason, the bits_ functions are not individually documented.

These functions:

   bits_adjoin
   bits_remove
   bits_toggle
   bits_clear
   bits_fill
   bits_clear_range
   bits_fill_range
   bits_complement
   bits_assign
   bits_union
   bits_intersection
   bits_difference
   bits_revdifference
   bits_xor

return a value of type int . All of them will sometimes allocate memory.

If no allocation limit is being used, and an allocation fails, these functions do not return.

If an allocation limit is being used, and an allocation fails, these functions return -1 and have an indeterminate side effect on the set being operated upon.

If allocation succeeds, they return 0 (and have the intended side effect on the set being operated upon).


Unicode Character Bitsets

up: Bitsets
prev: Shared Bitset Trees

Variable uni_bits_tree_rule

struct bits_tree_rule uni_bits_tree_rule[];

uni_bits_tree_rule defines a bitset tree branching structure suitable for representing sparse sets of Unicode code points. (See Bitset Tree Rules.)

Each set has 1 << 21 elements.

This tree structure has been tuned to efficiently represent sets corresponding to each of the Unicode General Categories of characters. (See Unicode Category Bitsets.)



libhackerlab: The Hackerlab C Library