Need Help Creating Suffix Arrays In Java

Started by
16 comments, last by doynax 18 years, 6 months ago
ok, i have looked into the problem of converting c code into java a little more and have found the simplest things that i need to do, as i really cant do this by following the c code. the first thing i need to do is create an empty suffix array of the length of string that will be passed (keeping in mind that this string may in fact be a large document as one). i then need to create a suffix array from this string i then need to create a check which ensures that it is in order, that is what the qsort is doing. once this has been done, i need to store it and output it to the screen. Now has anyone got experience with created suffix arrays from strings at all? keeping in mind that it needs to be an efficient way as the string may be a file that is a couple of megabytes? if you know of the code to create it or even a site that has it or even just the classes and methods i need, it would be a great help so that i can finally finish this off today. thanks Danny =)
Advertisement
this is how the output of the suffix array should look once sorted if the input string was abracadabra:

11 #
10 A#
7 ABRA#
0 ABRACADABRA#
3 ACADABRA#
5 ADABRA#
8 BRA#
1 BRACADABRA#
4 CADABRA#
6 DABRA#
9 RA#
2 RACADABRA#

thanks
To create the suffix array, you could do something like this (substring is the important function here):
String[] suffixArray = new String[ theString.length() ];for(int count=0; count < suffixArray.length; ++count)    suffixArray[ count ] = new String( theString.substring(count) );


Then to sort it you could use Arrays.sort

Storing and outputting to the screen is trivial, so you can probably pick that up from any Java resource.
the rug - funpowered.com
here is what i got so far, it creates the suffix array but the indexes are not sorted within sarray[]. Does anyone have a logical / mathematical brain and knows the best way to sort it? the output below should look like the output in the previous post,
thanks

import java.io.BufferedReader;import java.io.InputStreamReader;import java.io.IOException;import java.text.ParseException;public class SuffixArray{    static String string1;    public static void main(String[] args) throws IOException    {    	 	    String string2; 	    int[] sarray1, sarray2;	    int string_size1, string_size2, measure;    	                BufferedReader br = new BufferedReader(new InputStreamReader(System.in));		    System.out.println("Enter string1: ");	    string1 = br.readLine();	    string_size1 = string1.length();		    sarray1 = SA_make_sarray(string1, string_size1);	    printSuffixArray(sarray1);    }        // Builds the suffix array in memory	public static int[] SA_make_sarray(String string, int stringSize)	{		int[] sarray;				sarray = new int[stringSize];				SA_init_sarray(sarray, stringSize);				return sarray;	}		//Initializes the suffix array's index. 	public static void SA_init_sarray(int[] sarray, int size)	{		for(int p = 0; p < size; p++)		{			sarray = p;			System.out.println("Size = " + sarray.length);		}		}		public static void printSuffixArray(int[] sarray)	{		for(int p = 0; p < sarray.length; p++)		{			System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray, string1.length()));		}	}}


output:
Enter string1: abracadabra$
Index: 0, Suffix: abracadabra$
Index: 1, Suffix: bracadabra$
Index: 2, Suffix: racadabra$
Index: 3, Suffix: acadabra$
Index: 4, Suffix: cadabra$
Index: 5, Suffix: adabra$
Index: 6, Suffix: dabra$
Index: 7, Suffix: abra$
Index: 8, Suffix: bra$
Index: 9, Suffix: ra$
Index: 10, Suffix: a$
Index: 11, Suffix: $
Press any key to continue...

thanks
thanks rug but im pretty sure that it might be best if i stick with sorting the actual indexes rather than storing the substrings because the substrings will get very large and also there will be a large number of them so i think that i may have problems. If i am wrong and anyone knows this to be the case that rugs method is better than the one i have followed so far plz let me know but i really need help with sorting the indexes within sarray[] so that its ordered,
thanks
i need:
sarray[0] = 11
sarray[1] = 10
sarray[2] = 7
sarray[3] = 0
sarray[4] = 3
sarray[5] = 5
sarray[6] = 8
sarray[7] = 1
sarray[8] = 4
sarray[9] = 6
sarray[10] = 9
sarray[11] = 2

thanks
Does this do the job?

import java.io.BufferedReader;import java.io.InputStreamReader;import java.io.IOException;import java.text.ParseException;import java.util.Arrays;import java.util.Comparator;public class SuffixArray{    static String string1;    static CustomComparator comp;	static class CustomComparator implements Comparator {		public int compare(Object x, Object y) {			return ( string1.substring( ((Integer)x).intValue() ).compareTo( string1.substring( ((Integer)y).intValue() ) ) );		}		public boolean equals() { return false; }	} 	public static void main(String[] args) throws IOException   	 { 	    comp = new CustomComparator();	    String string2; 	    Integer[] sarray1, sarray2;	    int string_size1, string_size2, measure;    	                    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));		    System.out.println("Enter string1: ");	    string1 = br.readLine();	    string_size1 = string1.length();		    sarray1 = SA_make_sarray(string1, string_size1);	    sortSuffixArray(sarray1);	    printSuffixArray(sarray1);    }        // Builds the suffix array in memory	public static Integer[] SA_make_sarray(String string, int stringSize)	{		Integer[] sarray;				sarray = new Integer[stringSize];				SA_init_sarray(sarray, stringSize);				return sarray;	}		//Initializes the suffix array's index. 	public static void SA_init_sarray(Integer[] sarray, int size)	{		for(int p = 0; p < size; p++)		{			sarray = new Integer(p);		}	 		System.out.println("Size = " + sarray.length);	}		public static void printSuffixArray(Integer[] sarray)	{		for(int p = 0; p < sarray.length; p++)		{			System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray.intValue(), string1.length()));		}	}	public static void sortSuffixArray(Integer[] sarray) {	Arrays.sort( sarray, comp );	}}


It's a bit of a hack, unfortunately. What I did was create a custom Comparator to compare two strings based on their substring indexes, and passed that Comparator to Arrays.sort. The actual sorting is performed by Arrays.sort, the custom part is the comparison between the two strings based on their substring indices... if you follow me.

Hope that helps.
the rug - funpowered.com
Efficient sorting of suffix-array is far from trivial, and may be quite space consuming as well. And a brute force sort of the entire array is not an option with data sizes much beyond 64 kB or so.
Unfortunately most, if not all, documents on the subject seems to be academic reports rather than easily followed tutorials. Here is one of the better documents, which I even managed to understand myself =)

At least I thought the whole h-order thing was kind of neat when I first read that document..
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.
Quote:Original post by Anonymous Poster
You really should make some effort to understand the language before you start trying to make statements about how efficient things are. In java, Strings are objects, and all object variables are really references to objects (really they are just like C++ pointers, with some extra run-time checking). When the array of objects is sorted, all it is doing is rearranging the pointers. It is not copying the data.
When did he say that?
Unless language implements "copy-on-write" for substrings then storing the full set of them as separate objects is *very* inefficient compared to using indicies into the original string (4 GB vs 400 kB for a 64 kB string).
In C this could've been done directly with pointers of course..

This topic is closed to new replies.

Advertisement