Sign in to follow this  
rozz666

Criticize my implementation of BWT

Recommended Posts

rozz666    896
As the title says. I've implemented the forward Burrows-Wheeler transform. First it makes an ordered set of input values (alphabet). The for each sequence (in lexicographical order) prefix of length prefix_len it creates a sorted array (suffix array) of indices to suffixes of the input beginning with prefix. Then, it outputs the symbols preceding the beginnings of the suffixes. If an index points to the beginning of the input, the last element is returned instead and term is set to the number of already returned elements. I used this approach to save memory, since a full suffix array is 4 times bigger than the text itself. Waiting for you comments :-)
#ifndef __BWT_H
#define __BWT_H

#include <set>
#include <vector>
#include <deque>
#include <iterator>
#include <algorithm>

namespace bwt {

namespace detail {

// [first1, last1) prefix, [first2, last2) bwt input sequence, empty - lowest element
template <typename RanIt1, typename RanIt2>
bool _prefix_equal(RanIt1 first1, RanIt1 last1, RanIt2 first2, RanIt2 last2,
                   typename std::iterator_traits<RanIt1>::value_type empty)
{
    for(; first1 != last1 && first2 != last2; ++first1, ++first2) {

        if (*first1 != *first2) return false;
    }

    if (first1 == last1) return true;

    for(; first1 != last1; ++first1) {

        if (*first1 != empty) return false;
    }

    return true;
};

template <typename RanIt>
class _suffix_less
    : public std::binary_function<typename std::iterator_traits<RanIt>::difference_type,
                                  typename std::iterator_traits<RanIt>::difference_type,
                                  bool>  {
public:

    _suffix_less(RanIt first, RanIt last) : first(first), last(last) { }

    bool operator()(typename std::iterator_traits<RanIt>::difference_type left,
                    typename std::iterator_traits<RanIt>::difference_type right) const
    {
        RanIt first1(first + left), first2(first + right);

        for (; first1 != last && first2 != last; ++first1, ++first2) {

            if (*first1 < *first2) return true;
            else if (*first2 < *first1) return false;
        }

        return (first2 != last);
    }

private:

    RanIt first, last;
};

}

template <typename RanIt, typename OutIt>
void bwt(RanIt first, RanIt last, OutIt dest, typename std::iterator_traits<OutIt>::difference_type& term, typename std::iterator_traits<RanIt>::difference_type prefix_len = 1)
{
    typedef std::iterator_traits<RanIt>::difference_type difference_type;
    typedef std::iterator_traits<RanIt>::value_type value_type;

    std::set<value_type> values(first, last);
    std::vector<difference_type> indices;

    typedef std::vector<value_type> prefix_t;
    prefix_t prefix(prefix_len);

    std::fill_n(prefix.begin(), prefix_len, *values.begin());

    prefix_t::reverse_iterator prefix_it;

    difference_type _term = 0;

    do {

        for (RanIt it = first; it != last; ++it) {

            if (detail::_prefix_equal(prefix.begin(), prefix.end(), it, last, *values.begin()))
                indices.push_back(it - first);
        }

        std::sort(indices.begin(), indices.end(), detail::_suffix_less<RanIt>(first, last));

        for (std::vector<difference_type>::const_iterator it = indices.begin(); it != indices.end(); ++it) {

            if (*it == difference_type(0)) {

                *dest++ = *(last - 1);

                term = _term;
            }
            else {

                *dest++ = *(first + (*it - 1));
                ++_term;
            }
        }

        indices.clear();

        for (prefix_it = prefix.rbegin(); prefix_it != prefix.rend(); ++prefix_it) {

            std::set<value_type>::const_iterator next = ++values.find(*prefix_it);

            if (next == values.end()) {

                *prefix_it = *values.begin();
            }
            else {

                *prefix_it = *next;

                break;
            }
        }
    }
    while (prefix_it != prefix.rend());
};

}

#endif /* __BWT_H */



Share this post


Link to post
Share on other sites
implicit    504
The problem you're likely to run into is that the sorting might take a great deal of time, especially in degenerate cases. For instance consider sorting a megabyte of zeros, this would become an N^2 * log N operation or roughly the equivalent of sorting a terrabyte of data. Now this may seem like an extreme case, and it usually doesn't get nearly this bad, but remember that the whole point of block sorting and any associated prior transformations is to exploit these common prefixes.
Thankfully there are some quite clever specialized block sorting algorithms out there. This is the method I tried last time I played with BWT but to be honest I don't really know whether it's the most suitable algorithm, it was simply the first research paper I managed to find that I could actually decipher.

Share this post


Link to post
Share on other sites
Zahlman    1682
You can put standard library algorithms to better use, and use a class to fake "template typedefs". Something like:


// Don't use leading underscores on your #define symbols! These identifiers
// are reserved!
#ifndef BWT_H
#define BWT_H

#include <set>
#include <vector>
#include <deque>
#include <iterator>
#include <algorithm>

namespace bwt {

template <typename RanIt1, typename RanIt2, typename OutIt>
struct detail {
typedef typename std::iterator_traits<RanIt1>::value_type v1_type;
typedef typename std::iterator_traits<RanIt2>::value_type v2_type;
typedef typename std::iterator_traits<OutIt>::value_type vo_type;
typedef typename std::iterator_traits<RanIt1>::difference_type d1_type;
typedef typename std::iterator_traits<RanIt2>::difference_type d2_type;
typedef typename std::iterator_traits<OutIt>::difference_type do_type;

// Applying the same technique as _suffix_less to "bind" arguments
struct _prefix_equal: public std::unary_function<RanIt2, bool> {
_prefix_equal(RanIt1 first1, RanIt1 last1, RanIt2 last2, v1_type empty) : first(first1), last(last1), empty(empty) {}

bool operator()(RanIt2 first2) {
d1_type d1 = std::distance(first1, last1);
d2_type d2 = std::distance(first2, last2);

return (d1 < d2)
? (std::equal(first1, last1, first2))
: ((std::equal(first2, last2, first1)) && (std::count(first1 + d2, last1, empty) == d1 - d2));
};

private:
RanIt1 first1, last1;
RanIt2 last2;
v1_type empty;
};

struct _suffix_less: public std::binary_function<d1_type, d1_type, bool> {
_suffix_less(RanIt1 first, RanIt1 last) : first(first), last(last) {}

bool operator()(d1_type left, d1_type right) const {
// check the documentation and make sure this is exactly what you want.
// You might need to invert the sense of it or something...
return std::lexicographical_compare(first + left, last, first + right, last);
}
private:
RanIt1 first, last;
};

void _bwt(RanIt1 first, RanIt1 last, OutIt dest, do_type& term, d1_type prefix_len = 1) {
std::set<v1_type> values(first, last);

typedef std::vector<v1_type> prefix_t;
prefix_t prefix(prefix_len);

std::fill_n(prefix.begin(), prefix_len, *values.begin());

d1_type _term = 0;

prefix_t::reverse_iterator prefix_it;
do {
// declare the indices in here, then we don't have to re-clear the
// vector each time; it simply gets recreated.
// Instead of converting back and forth, just store iterators in the
// 'indices' vector, adapting _suffix_less() to match.
std::vector<RanIt1> indices;

// and then, our ugly for loop becomes a copy_if invocation:
std::copy_if(first, last, std::back_inserter(indices), _prefix_equal(prefix.begin(), prefix.end(), last, *values.begin()));

std::sort(indices.begin(), indices.end(), _suffix_less(first, last));

for (std::vector<RanIt1>::const_iterator it = indices.begin(); it != indices.end(); ++it) {
if (*it == first) {
*dest++ = *(last - 1);
term = _term;
} else {
*dest++ = *(*it - 1); // I think? o_O
++_term;
}
}

for (prefix_it = prefix.rbegin(); prefix_it != prefix.rend(); ++prefix_it) {
std::set<value_type>::const_iterator next = ++values.find(*prefix_it);

if (next == values.end()) {
*prefix_it = *values.begin();
} else {
*prefix_it = *next;
break;
}
}
}
while (prefix_it != prefix.rend());
}
};

template <typename RanIt, typename OutIt>
void bwt(RanIt first, RanIt last, OutIt dest, typename std::iterator_traits<OutIt>::difference_type& term, typename std::iterator_traits<RanIt>::difference_type prefix_len = 1) {
return detail<RanIt, RanIt, OutIt>::_bwt(first, last, dest, term, prefix_len);
}
// note the second template parameter is actually a dummy there :)

}

#endif /* BWT_H */



Share this post


Link to post
Share on other sites
rozz666    896
Thanks for the article. The sorting is indeed very slow. My previous implementation was for binary alphabet (output of Kautz-Zeckendorf encoding, http://www.stringology.org/event/2006/p21.html) and I used one of the method they described: comparing one symbol at a time and ternary quicksort ($ 00000 11111). I knew that there must be some way to exploit the fact that these are suffixes of the same string, but I never had time to think about it. Now I'll have a look at what Sadakane and Larsson came with.

This is my first generic attempt at BWT, so I'm not sure I used all iterators and traits correctly. Is there anything wrong apart from sorting?

Share this post


Link to post
Share on other sites
rozz666    896
Zahlman:

Thanks for the tips. I see I still have to learn a lot, espiecially about orginizing code :-)

Good to know that __xx marcos are reserved. I ALWAYS used tham as guards :-)

As for _prefix_equal::operator(). I thought about using std::count but it's slower than what I wrote. It's one of the bottlenecks. The second is std::sort(), but I'm going to use a different algorithm for that.

As for std::vector in the loop vs. outside and clear(), usually is memory freed when I call clear()? Because if not, I'm saving reallocations.

Edit: As for storing indices instead of iterators, it's because of space complexity. Suffiz array eats a lot of memory, so using iterators may make it worse. I know std::vector ::iterator has usually only a pointer, but whatg if I use std::vector<bool>::iterator or my container which stores k-bit values with k=2,3, etc. It's iterator will take more than 4 bytes (it's not written yet :-) )

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this