• Advertisement
Sign in to follow this  

Need Help Creating Suffix Arrays In Java

This topic is 4485 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

ok, i have looked into the problem of converting c code into java a little more and have found the simplest things that i need to do, as i really cant do this by following the c code. the first thing i need to do is create an empty suffix array of the length of string that will be passed (keeping in mind that this string may in fact be a large document as one). i then need to create a suffix array from this string i then need to create a check which ensures that it is in order, that is what the qsort is doing. once this has been done, i need to store it and output it to the screen. Now has anyone got experience with created suffix arrays from strings at all? keeping in mind that it needs to be an efficient way as the string may be a file that is a couple of megabytes? if you know of the code to create it or even a site that has it or even just the classes and methods i need, it would be a great help so that i can finally finish this off today. thanks Danny =)

Share this post


Link to post
Share on other sites
Advertisement
this is how the output of the suffix array should look once sorted if the input string was abracadabra:

11 #
10 A#
7 ABRA#
0 ABRACADABRA#
3 ACADABRA#
5 ADABRA#
8 BRA#
1 BRACADABRA#
4 CADABRA#
6 DABRA#
9 RA#
2 RACADABRA#

thanks

Share this post


Link to post
Share on other sites
To create the suffix array, you could do something like this (substring is the important function here):

String[] suffixArray = new String[ theString.length() ];

for(int count=0; count < suffixArray.length; ++count)
suffixArray[ count ] = new String( theString.substring(count) );


Then to sort it you could use Arrays.sort

Storing and outputting to the screen is trivial, so you can probably pick that up from any Java resource.

Share this post


Link to post
Share on other sites
here is what i got so far, it creates the suffix array but the indexes are not sorted within sarray[]. Does anyone have a logical / mathematical brain and knows the best way to sort it? the output below should look like the output in the previous post,
thanks


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.text.ParseException;

public class SuffixArray
{
static String string1;

public static void main(String[] args) throws IOException
{

String string2;
int[] sarray1, sarray2;
int string_size1, string_size2, measure;


BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

System.out.println("Enter string1: ");
string1 = br.readLine();
string_size1 = string1.length();

sarray1 = SA_make_sarray(string1, string_size1);
printSuffixArray(sarray1);

}

// Builds the suffix array in memory
public static int[] SA_make_sarray(String string, int stringSize)
{
int[] sarray;

sarray = new int[stringSize];

SA_init_sarray(sarray, stringSize);

return sarray;
}

//Initializes the suffix array's index.
public static void SA_init_sarray(int[] sarray, int size)
{
for(int p = 0; p < size; p++)
{
sarray

= p;
System.out.println("Size = " + sarray.length);
}
}

public static void printSuffixArray(int[] sarray)
{
for(int p = 0; p < sarray.length; p++)
{
System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray

, string1.length()));
}
}

}




output:
Enter string1: abracadabra$
Index: 0, Suffix: abracadabra$
Index: 1, Suffix: bracadabra$
Index: 2, Suffix: racadabra$
Index: 3, Suffix: acadabra$
Index: 4, Suffix: cadabra$
Index: 5, Suffix: adabra$
Index: 6, Suffix: dabra$
Index: 7, Suffix: abra$
Index: 8, Suffix: bra$
Index: 9, Suffix: ra$
Index: 10, Suffix: a$
Index: 11, Suffix: $
Press any key to continue...

thanks

Share this post


Link to post
Share on other sites
thanks rug but im pretty sure that it might be best if i stick with sorting the actual indexes rather than storing the substrings because the substrings will get very large and also there will be a large number of them so i think that i may have problems. If i am wrong and anyone knows this to be the case that rugs method is better than the one i have followed so far plz let me know but i really need help with sorting the indexes within sarray[] so that its ordered,
thanks

Share this post


Link to post
Share on other sites
i need:
sarray[0] = 11
sarray[1] = 10
sarray[2] = 7
sarray[3] = 0
sarray[4] = 3
sarray[5] = 5
sarray[6] = 8
sarray[7] = 1
sarray[8] = 4
sarray[9] = 6
sarray[10] = 9
sarray[11] = 2

thanks

Share this post


Link to post
Share on other sites
Does this do the job?


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.text.ParseException;
import java.util.Arrays;
import java.util.Comparator;

public class SuffixArray
{
static String string1;
static CustomComparator comp;


static class CustomComparator implements Comparator {

public int compare(Object x, Object y) {
return ( string1.substring( ((Integer)x).intValue() ).compareTo( string1.substring( ((Integer)y).intValue() ) ) );
}

public boolean equals() { return false; }
}


public static void main(String[] args) throws IOException
{
comp = new CustomComparator();
String string2;
Integer[] sarray1, sarray2;
int string_size1, string_size2, measure;


BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

System.out.println("Enter string1: ");
string1 = br.readLine();
string_size1 = string1.length();

sarray1 = SA_make_sarray(string1, string_size1);
sortSuffixArray(sarray1);
printSuffixArray(sarray1);

}

// Builds the suffix array in memory
public static Integer[] SA_make_sarray(String string, int stringSize)
{
Integer[] sarray;

sarray = new Integer[stringSize];

SA_init_sarray(sarray, stringSize);

return sarray;
}

//Initializes the suffix array's index.
public static void SA_init_sarray(Integer[] sarray, int size)
{
for(int p = 0; p < size; p++)
{
sarray

= new Integer(p);
}
System.out.println("Size = " + sarray.length);
}

public static void printSuffixArray(Integer[] sarray)
{
for(int p = 0; p < sarray.length; p++)
{
System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray

.intValue(), string1.length()));
}
}


public static void sortSuffixArray(Integer[] sarray) {
Arrays.sort( sarray, comp );
}


}



It's a bit of a hack, unfortunately. What I did was create a custom Comparator to compare two strings based on their substring indexes, and passed that Comparator to Arrays.sort. The actual sorting is performed by Arrays.sort, the custom part is the comparison between the two strings based on their substring indices... if you follow me.

Hope that helps.

Share this post


Link to post
Share on other sites
Efficient sorting of suffix-array is far from trivial, and may be quite space consuming as well. And a brute force sort of the entire array is not an option with data sizes much beyond 64 kB or so.
Unfortunately most, if not all, documents on the subject seems to be academic reports rather than easily followed tutorials. Here is one of the better documents, which I even managed to understand myself =)

At least I thought the whole h-order thing was kind of neat when I first read that document..

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.

Share this post


Link to post
Share on other sites
Quote:
Original post by Anonymous Poster
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.
When did he say that?
Unless language implements "copy-on-write" for substrings then storing the full set of them as separate objects is *very* inefficient compared to using indicies into the original string (4 GB vs 400 kB for a 64 kB string).
In C this could've been done directly with pointers of course..

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
Java's String class is immutable. I am fairly sure it is specifically designed so that when you take a substring, internally it just keeps track of the starting index and length, or some similiar scheme. So the substrings already aren't taking up much space, and sorting the array doesn't actually move any data around besides the references. So generating all the suffixes and then sorting them is already efficient in java.

If this isn't the case, then you can always implement your own class that does this.

Share this post


Link to post
Share on other sites
Meh.

I have a SuffixArray data structure all lined out. It design is very similar to The Rug's, except I was using the lovely 5.0 generics.

The problem being that it doesn't quite compile. If I were more of a Java person, I'd probably see it right away; it's something to do with my usage of generics in the sorting method.

Anyway here's what I had...sorry I couldn't help further...


import java.util.Comparator;
import java.util.Arrays;

public class SuffixArray
{
private String orig_content;
private int[] suffix_array;
private boolean is_sorted = false;

public SuffixArray(String input)
{
orig_content = input;
suffix_array = new int[orig_content.length() + 1];
for (int ct = 0; ct <= orig_content.length(); ct++)
suffix_array[ct] = ct;
}

// Here is where all the magic happens :)
public void Sort()
{
if (is_sorted)
return;
Arrays.sort<Integer>(suffix_array, new SuffixArrayComparator()); // it doesn't like this...
is_sorted = true;
return;
}

// Here is what the magic calls on to be able to happen :)
private class SuffixArrayComparator implements Comparator<Integer>
{
public int compare(Integer o1, Integer o2)
{
return orig_content.substring(suffix_array[o1]).compareTo(orig_content.substring(suffix_array[o2]));
}
public boolean equals(Object obj)
{
return orig_content.substring(suffix_array[o1]).equals(orig_content.substring(suffix_array[o2]));
}
}

public boolean isSorted()
{
return is_sorted;
}

public String Suffix(int suffix_num) throws ArrayIndexOutOfBoundsException
{
return orig_content.substring(suffix_array[suffix_num]);
}
}


Share this post


Link to post
Share on other sites
To avoid creating multiple strings, you could instead create an array of Integer objects, then sort it using an instance of a Comparator class which stores the original String and uses it to make the comparisons:


// This is probably nicer in Java 1.5.
class SuffixComparator implements Comparator {
private String backing;
public SuffixComparator(String backing) { this.backing = backing; }
public int compare(Object o1, Object o2) {
int i1 = ((Integer)(o1)).intValue();
int i2 = ((Integer)(o2)).intValue();
return backing.substring(i1).compareTo(backing.substring(i2));
}
public boolean equals(Object o) {
try {
return ((SuffixComparator)(o)).backing.equals(backing);
} catch (ClassCastException e) {
return false;
}
}
}

// later:
Integer[] values = new Integer(myString.length());
for (int i = 0; i < values.length; ++i) {
values = new Integer(i);
}
Arrays.sort(values, new SuffixComparator(myString));


This still carries quite a bit of overhead. You might want to do it "manually" with an int array, possibly cribbing a std::sort implementation or something. You could still use the illustrated substring trick for comparisons initially (it makes a bunch of temporary String objects, but they won't really accumulate in memory) and then replace that with a low-level "strcmp" re-implementation later once you have that much working.

Share this post


Link to post
Share on other sites
i have created the following guys:


import java.util.Comparator;
import java.util.Arrays;

public class SuffixArray
{
static String string1;

public static void mainMethod(String args)
{

String string2;
Integer[] sarray1, sarray2;

int string_size1, string_size2, measure;

string1 = args;
string_size1 = string1.length();

sarray1 = SA_make_sarray(string1, string_size1);

sortSuffixArray(sarray1);
printSuffixArray(sarray1);

}

// Builds the suffix array in memory
public static Integer[] SA_make_sarray(String string, int stringSize)
{
Integer[] sarray;

sarray = new Integer[stringSize];

SA_init_sarray(sarray, stringSize);

return sarray;
}

//Initializes the suffix array's index.
public static void SA_init_sarray(Integer[] sarray, int size)
{
for(int p = 0; p < size; p++)
{
sarray

= new Integer(p);
}
}

public static void printSuffixArray(Integer[] sarray)
{
for(int p = 0; p < sarray.length; p++)
{
System.out.println("Index: " + p + ", SuffixIndex: " + sarray

+ " Suffix: " + string1.substring(sarray

.intValue(), string1.length()));
//System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray

.intValue(), string1.length()));
}
}

public static void sortSuffixArray(Integer[] sarray)
{
Arrays.sort(sarray, new IndexSorter(string1));
}

public static class IndexSorter implements Comparator<Integer>
{
String testString;
int l;

public IndexSorter(String astring)
{
testString = astring;
l = testString.length();
}

public int compare(Integer int1, Integer int2)
{
return (testString.substring(int1.intValue(), l)).compareTo(testString.substring(int2.intValue(), l));
}

public boolean equals(Object o)
{
return this == o;

}

}

}




which gives the following output, which is correct:
Enter Filename1: abra.txt
Index: 0, SuffixIndex: 11 Suffix: $
Index: 1, SuffixIndex: 10 Suffix: a$
Index: 2, SuffixIndex: 7 Suffix: abra$
Index: 3, SuffixIndex: 0 Suffix: abracadabra$
Index: 4, SuffixIndex: 3 Suffix: acadabra$
Index: 5, SuffixIndex: 5 Suffix: adabra$
Index: 6, SuffixIndex: 8 Suffix: bra$
Index: 7, SuffixIndex: 1 Suffix: bracadabra$
Index: 8, SuffixIndex: 4 Suffix: cadabra$
Index: 9, SuffixIndex: 6 Suffix: dabra$
Index: 10, SuffixIndex: 9 Suffix: ra$
Index: 11, SuffixIndex: 2 Suffix: racadabra$
Length of file: 12 characters
Press any key to continue...

but when i have an input file that looks like the following:
a
b
c
effectively a\nb\nc\n my program doesnt see the line wrap and instead prints out that the substring is in fact "abc"

so i guess this is the way that i read in the string before i actually pass the string to the above class

the code i use to read in is here:



import java.io.File;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.FileNotFoundException;

public class getTextFromFile
{

public static String getContents(File aFile)
{
StringBuffer contents = new StringBuffer();

//declared here only to make visible to finally clause
BufferedReader input = null;
try {
//use buffering
//this implementation reads one line at a time
input = new BufferedReader( new FileReader(aFile) );
String line = null; //not declared within while loop
while (( line = input.readLine()) != null){
contents.append(line);
}
}
catch (FileNotFoundException ex) {
ex.printStackTrace();
}
catch (IOException ex){
ex.printStackTrace();
}
finally {
try {
if (input!= null) {
//flush and close both "input" and its underlying FileReader
input.close();
}
}
catch (IOException ex) {
ex.printStackTrace();
}
}
return contents.toString();
}
}



now what i am doing is passing the read class a filename, it is reading the file and then storing its contents as a string. this string is then being passed to the SuffixArray class to do the rest of the work.
as my method of reading in is obviously incorrect and i guess u guys have all had and solved this problem, can anyone tell me what my reader class should really look like.
thanks guys =)

Share this post


Link to post
Share on other sites
Quote:
Original post by Anonymous Poster
Java's String class is immutable. I am fairly sure it is specifically designed so that when you take a substring, internally it just keeps track of the starting index and length, or some similiar scheme. So the substrings already aren't taking up much space, and sorting the array doesn't actually move any data around besides the references. So generating all the suffixes and then sorting them is already efficient in java.
According to the documentation substring creates a copy, so the implementation would have to be a bit more complicated than that. It is possible, but hardly worth risking if Java doesn't provide any guarantees.
Quote:
Original post by Anonymous Poster
If this isn't the case, then you can always implement your own class that does this.
Hence implementing it with an integer array..

A normal string sort would have O(N^2 log N) worst-case performance (with a regular O(N log N) sort, which results in about 30 billion character comparisons for an empty 64 kB string.
So try to use a specialized algorithm if you're going to process data of any significant length..

Share this post


Link to post
Share on other sites
Can you say Homework problem?

<Google "quicksort algorithm">
<retuns C source code...hmm, I need Java, cause that's what homwork problem is>
<Try to convert C to Java...can't do it, must get help>

Share this post


Link to post
Share on other sites
i have done the following to solve \n problem:

while (( line = input.readLine()) != null){

contents.append(line);
contents.append(System.getProperty("line.separator"));
}

and with relation to the above, it's not like that at all, it's not homework, it's my research, and i am not concerned with how the toolkit is made, i am simply concerned with the results of experiments. i have been told to follow a set route which is programmed in c and as i dont know c i thought i would ask for a little help? is that ok? this above is only a very small part of what i have done.
i am trying my best to get most of it done on my own (which i have) but unfortunately we all need help off more experienced people at some time, that's the joy of these forums!
anyway, thanks to everyone who has helped, that should be it for now =)

Share this post


Link to post
Share on other sites
Quote:
Original post by BeerNutts
Can you say Homework problem?

<Google "quicksort algorithm">
<retuns C source code...hmm, I need Java, cause that's what homwork problem is>
<Try to convert C to Java...can't do it, must get help>
Nah, this is much too hard to be a homework problem if the file (as the OP claims) is a couple of megabytes large, I'd take days for the naive qsort implementation to finish.
I suspect that it's a part of a burrows-wheeler encoder for data compression.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement