Jump to content
  • Advertisement
Sign in to follow this  
ug02070

Need Help Creating Suffix Arrays In Java

This topic is 4759 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

ok, i have looked into the problem of converting c code into java a little more and have found the simplest things that i need to do, as i really cant do this by following the c code. the first thing i need to do is create an empty suffix array of the length of string that will be passed (keeping in mind that this string may in fact be a large document as one). i then need to create a suffix array from this string i then need to create a check which ensures that it is in order, that is what the qsort is doing. once this has been done, i need to store it and output it to the screen. Now has anyone got experience with created suffix arrays from strings at all? keeping in mind that it needs to be an efficient way as the string may be a file that is a couple of megabytes? if you know of the code to create it or even a site that has it or even just the classes and methods i need, it would be a great help so that i can finally finish this off today. thanks Danny =)

Share this post


Link to post
Share on other sites
Advertisement
this is how the output of the suffix array should look once sorted if the input string was abracadabra:

11 #
10 A#
7 ABRA#
0 ABRACADABRA#
3 ACADABRA#
5 ADABRA#
8 BRA#
1 BRACADABRA#
4 CADABRA#
6 DABRA#
9 RA#
2 RACADABRA#

thanks

Share this post


Link to post
Share on other sites
To create the suffix array, you could do something like this (substring is the important function here):

String[] suffixArray = new String[ theString.length() ];

for(int count=0; count < suffixArray.length; ++count)
suffixArray[ count ] = new String( theString.substring(count) );


Then to sort it you could use Arrays.sort

Storing and outputting to the screen is trivial, so you can probably pick that up from any Java resource.

Share this post


Link to post
Share on other sites
here is what i got so far, it creates the suffix array but the indexes are not sorted within sarray[]. Does anyone have a logical / mathematical brain and knows the best way to sort it? the output below should look like the output in the previous post,
thanks


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.text.ParseException;

public class SuffixArray
{
static String string1;

public static void main(String[] args) throws IOException
{

String string2;
int[] sarray1, sarray2;
int string_size1, string_size2, measure;


BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

System.out.println("Enter string1: ");
string1 = br.readLine();
string_size1 = string1.length();

sarray1 = SA_make_sarray(string1, string_size1);
printSuffixArray(sarray1);

}

// Builds the suffix array in memory
public static int[] SA_make_sarray(String string, int stringSize)
{
int[] sarray;

sarray = new int[stringSize];

SA_init_sarray(sarray, stringSize);

return sarray;
}

//Initializes the suffix array's index.
public static void SA_init_sarray(int[] sarray, int size)
{
for(int p = 0; p < size; p++)
{
sarray

= p;
System.out.println("Size = " + sarray.length);
}
}

public static void printSuffixArray(int[] sarray)
{
for(int p = 0; p < sarray.length; p++)
{
System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray

, string1.length()));
}
}

}




output:
Enter string1: abracadabra$
Index: 0, Suffix: abracadabra$
Index: 1, Suffix: bracadabra$
Index: 2, Suffix: racadabra$
Index: 3, Suffix: acadabra$
Index: 4, Suffix: cadabra$
Index: 5, Suffix: adabra$
Index: 6, Suffix: dabra$
Index: 7, Suffix: abra$
Index: 8, Suffix: bra$
Index: 9, Suffix: ra$
Index: 10, Suffix: a$
Index: 11, Suffix: $
Press any key to continue...

thanks

Share this post


Link to post
Share on other sites
thanks rug but im pretty sure that it might be best if i stick with sorting the actual indexes rather than storing the substrings because the substrings will get very large and also there will be a large number of them so i think that i may have problems. If i am wrong and anyone knows this to be the case that rugs method is better than the one i have followed so far plz let me know but i really need help with sorting the indexes within sarray[] so that its ordered,
thanks

Share this post


Link to post
Share on other sites
i need:
sarray[0] = 11
sarray[1] = 10
sarray[2] = 7
sarray[3] = 0
sarray[4] = 3
sarray[5] = 5
sarray[6] = 8
sarray[7] = 1
sarray[8] = 4
sarray[9] = 6
sarray[10] = 9
sarray[11] = 2

thanks

Share this post


Link to post
Share on other sites
Does this do the job?


import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.text.ParseException;
import java.util.Arrays;
import java.util.Comparator;

public class SuffixArray
{
static String string1;
static CustomComparator comp;


static class CustomComparator implements Comparator {

public int compare(Object x, Object y) {
return ( string1.substring( ((Integer)x).intValue() ).compareTo( string1.substring( ((Integer)y).intValue() ) ) );
}

public boolean equals() { return false; }
}


public static void main(String[] args) throws IOException
{
comp = new CustomComparator();
String string2;
Integer[] sarray1, sarray2;
int string_size1, string_size2, measure;


BufferedReader br = new BufferedReader(new InputStreamReader(System.in));

System.out.println("Enter string1: ");
string1 = br.readLine();
string_size1 = string1.length();

sarray1 = SA_make_sarray(string1, string_size1);
sortSuffixArray(sarray1);
printSuffixArray(sarray1);

}

// Builds the suffix array in memory
public static Integer[] SA_make_sarray(String string, int stringSize)
{
Integer[] sarray;

sarray = new Integer[stringSize];

SA_init_sarray(sarray, stringSize);

return sarray;
}

//Initializes the suffix array's index.
public static void SA_init_sarray(Integer[] sarray, int size)
{
for(int p = 0; p < size; p++)
{
sarray

= new Integer(p);
}
System.out.println("Size = " + sarray.length);
}

public static void printSuffixArray(Integer[] sarray)
{
for(int p = 0; p < sarray.length; p++)
{
System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray

.intValue(), string1.length()));
}
}


public static void sortSuffixArray(Integer[] sarray) {
Arrays.sort( sarray, comp );
}


}



It's a bit of a hack, unfortunately. What I did was create a custom Comparator to compare two strings based on their substring indexes, and passed that Comparator to Arrays.sort. The actual sorting is performed by Arrays.sort, the custom part is the comparison between the two strings based on their substring indices... if you follow me.

Hope that helps.

Share this post


Link to post
Share on other sites
Efficient sorting of suffix-array is far from trivial, and may be quite space consuming as well. And a brute force sort of the entire array is not an option with data sizes much beyond 64 kB or so.
Unfortunately most, if not all, documents on the subject seems to be academic reports rather than easily followed tutorials. Here is one of the better documents, which I even managed to understand myself =)

At least I thought the whole h-order thing was kind of neat when I first read that document..

Share this post


Link to post
Share on other sites
Guest Anonymous Poster
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.

Share this post


Link to post
Share on other sites
Quote:
Original post by Anonymous Poster
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.
When did he say that?
Unless language implements "copy-on-write" for substrings then storing the full set of them as separate objects is *very* inefficient compared to using indicies into the original string (4 GB vs 400 kB for a 64 kB string).
In C this could've been done directly with pointers of course..

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!