Need Help Creating Suffix Arrays In Java

Started by
16 comments, last by doynax 18 years, 6 months ago
Java's String class is immutable. I am fairly sure it is specifically designed so that when you take a substring, internally it just keeps track of the starting index and length, or some similiar scheme. So the substrings already aren't taking up much space, and sorting the array doesn't actually move any data around besides the references. So generating all the suffixes and then sorting them is already efficient in java.

If this isn't the case, then you can always implement your own class that does this.
Advertisement
Meh.

I have a SuffixArray data structure all lined out. It design is very similar to The Rug's, except I was using the lovely 5.0 generics.

The problem being that it doesn't quite compile. If I were more of a Java person, I'd probably see it right away; it's something to do with my usage of generics in the sorting method.

Anyway here's what I had...sorry I couldn't help further...

import java.util.Comparator;import java.util.Arrays;public class SuffixArray{    private String orig_content;    private int[] suffix_array;    private boolean is_sorted = false;    public SuffixArray(String input)    {        orig_content = input;        suffix_array = new int[orig_content.length() + 1];        for (int ct = 0; ct <= orig_content.length(); ct++)            suffix_array[ct] = ct;    }    // Here is where all the magic happens :)    public void Sort()    {        if (is_sorted)            return;        Arrays.sort<Integer>(suffix_array, new SuffixArrayComparator()); // it doesn't like this...        is_sorted = true;        return;    }    // Here is what the magic calls on to be able to happen :)    private class SuffixArrayComparator implements Comparator<Integer>    {        public int compare(Integer o1, Integer o2)        {            return orig_content.substring(suffix_array[o1]).compareTo(orig_content.substring(suffix_array[o2]));        }        public boolean equals(Object obj)        {            return orig_content.substring(suffix_array[o1]).equals(orig_content.substring(suffix_array[o2]));        }    }    public boolean isSorted()    {        return is_sorted;    }    public String Suffix(int suffix_num) throws ArrayIndexOutOfBoundsException    {        return orig_content.substring(suffix_array[suffix_num]);    }}
{[JohnE, Chief Architect and Senior Programmer, Twilight Dragon Media{[+++{GCC/MinGW}+++{Code::Blocks IDE}+++{wxWidgets Cross-Platform Native UI Framework}+++
To avoid creating multiple strings, you could instead create an array of Integer objects, then sort it using an instance of a Comparator class which stores the original String and uses it to make the comparisons:

// This is probably nicer in Java 1.5.class SuffixComparator implements Comparator {  private String backing;  public SuffixComparator(String backing) { this.backing = backing; }  public int compare(Object o1, Object o2) {    int i1 = ((Integer)(o1)).intValue();    int i2 = ((Integer)(o2)).intValue();    return backing.substring(i1).compareTo(backing.substring(i2));  }  public boolean equals(Object o) {    try {      return ((SuffixComparator)(o)).backing.equals(backing);    } catch (ClassCastException e) {       return false;    }  }}// later:Integer[] values = new Integer(myString.length());for (int i = 0; i < values.length; ++i) {  values = new Integer(i);}Arrays.sort(values, new SuffixComparator(myString));


This still carries quite a bit of overhead. You might want to do it "manually" with an int array, possibly cribbing a std::sort implementation or something. You could still use the illustrated substring trick for comparisons initially (it makes a bunch of temporary String objects, but they won't really accumulate in memory) and then replace that with a low-level "strcmp" re-implementation later once you have that much working.
i have created the following guys:

import java.util.Comparator;import java.util.Arrays;public class SuffixArray{    static String string1;    public static void mainMethod(String args)    {    	 	    String string2; 	    Integer[] sarray1, sarray2;	    	    	    int string_size1, string_size2, measure;		    string1 = args;	    string_size1 = string1.length();		    sarray1 = SA_make_sarray(string1, string_size1);	    	    sortSuffixArray(sarray1);	    printSuffixArray(sarray1);    }        // Builds the suffix array in memory	public static Integer[] SA_make_sarray(String string, int stringSize)	{		Integer[] sarray;				sarray = new Integer[stringSize];						SA_init_sarray(sarray, stringSize);				return sarray;	}		//Initializes the suffix array's index. 	public static void SA_init_sarray(Integer[] sarray, int size)	{		for(int p = 0; p < size; p++)		{			sarray = new Integer(p);		}		}		public static void printSuffixArray(Integer[] sarray)	{		for(int p = 0; p < sarray.length; p++)		{			System.out.println("Index: " + p + ", SuffixIndex: " + sarray + " Suffix: " + string1.substring(sarray.intValue(), string1.length()));			//System.out.println("Index: " + p + ", Suffix: " + string1.substring(sarray.intValue(), string1.length()));		}	}		public static void sortSuffixArray(Integer[] sarray)	{		Arrays.sort(sarray, new IndexSorter(string1));		}		public static class IndexSorter implements Comparator<Integer>	{		String testString;		int l;				public IndexSorter(String astring)		{			testString = astring;			l = testString.length();		}				public int compare(Integer int1, Integer int2)		{			return (testString.substring(int1.intValue(), l)).compareTo(testString.substring(int2.intValue(), l));		}				public boolean equals(Object o)		{			return this == o;					}			}}


which gives the following output, which is correct:
Enter Filename1: abra.txt
Index: 0, SuffixIndex: 11 Suffix: $
Index: 1, SuffixIndex: 10 Suffix: a$
Index: 2, SuffixIndex: 7 Suffix: abra$
Index: 3, SuffixIndex: 0 Suffix: abracadabra$
Index: 4, SuffixIndex: 3 Suffix: acadabra$
Index: 5, SuffixIndex: 5 Suffix: adabra$
Index: 6, SuffixIndex: 8 Suffix: bra$
Index: 7, SuffixIndex: 1 Suffix: bracadabra$
Index: 8, SuffixIndex: 4 Suffix: cadabra$
Index: 9, SuffixIndex: 6 Suffix: dabra$
Index: 10, SuffixIndex: 9 Suffix: ra$
Index: 11, SuffixIndex: 2 Suffix: racadabra$
Length of file: 12 characters
Press any key to continue...

but when i have an input file that looks like the following:
a
b
c
effectively a\nb\nc\n my program doesnt see the line wrap and instead prints out that the substring is in fact "abc"

so i guess this is the way that i read in the string before i actually pass the string to the above class

the code i use to read in is here:


import java.io.File;import java.io.BufferedReader;import java.io.FileReader;import java.io.IOException;import java.io.FileNotFoundException;public class getTextFromFile{  	  	public static String getContents(File aFile)  	{		StringBuffer contents = new StringBuffer();		    //declared here only to make visible to finally clause	    BufferedReader input = null;	    try {	      //use buffering	      //this implementation reads one line at a time	      input = new BufferedReader( new FileReader(aFile) );	      String line = null; //not declared within while loop	      while (( line = input.readLine()) != null){	        contents.append(line);	      }	    }	    catch (FileNotFoundException ex) {	      ex.printStackTrace();	    }	    catch (IOException ex){	      ex.printStackTrace();	    }	    finally {	      try {	        if (input!= null) {	          //flush and close both "input" and its underlying FileReader	          input.close();	        }	      }	      catch (IOException ex) {	        ex.printStackTrace();	      }	    }	    return contents.toString();  	}}


now what i am doing is passing the read class a filename, it is reading the file and then storing its contents as a string. this string is then being passed to the SuffixArray class to do the rest of the work.
as my method of reading in is obviously incorrect and i guess u guys have all had and solved this problem, can anyone tell me what my reader class should really look like.
thanks guys =)
Quote:Original post by Anonymous Poster
Java's String class is immutable. I am fairly sure it is specifically designed so that when you take a substring, internally it just keeps track of the starting index and length, or some similiar scheme. So the substrings already aren't taking up much space, and sorting the array doesn't actually move any data around besides the references. So generating all the suffixes and then sorting them is already efficient in java.
According to the documentation substring creates a copy, so the implementation would have to be a bit more complicated than that. It is possible, but hardly worth risking if Java doesn't provide any guarantees.
Quote:Original post by Anonymous Poster
If this isn't the case, then you can always implement your own class that does this.
Hence implementing it with an integer array..

A normal string sort would have O(N^2 log N) worst-case performance (with a regular O(N log N) sort, which results in about 30 billion character comparisons for an empty 64 kB string.
So try to use a specialized algorithm if you're going to process data of any significant length..
Can you say Homework problem?

<Google "quicksort algorithm">
<retuns C source code...hmm, I need Java, cause that's what homwork problem is>
<Try to convert C to Java...can't do it, must get help>

My Gamedev Journal: 2D Game Making, the Easy Way

---(Old Blog, still has good info): 2dGameMaking
-----
"No one ever posts on that message board; it's too crowded." - Yoga Berra (sorta)

i have done the following to solve \n problem:

while (( line = input.readLine()) != null){

contents.append(line);
contents.append(System.getProperty("line.separator"));
}

and with relation to the above, it's not like that at all, it's not homework, it's my research, and i am not concerned with how the toolkit is made, i am simply concerned with the results of experiments. i have been told to follow a set route which is programmed in c and as i dont know c i thought i would ask for a little help? is that ok? this above is only a very small part of what i have done.
i am trying my best to get most of it done on my own (which i have) but unfortunately we all need help off more experienced people at some time, that's the joy of these forums!
anyway, thanks to everyone who has helped, that should be it for now =)
Quote:Original post by BeerNutts
Can you say Homework problem?

<Google "quicksort algorithm">
<retuns C source code...hmm, I need Java, cause that's what homwork problem is>
<Try to convert C to Java...can't do it, must get help>
Nah, this is much too hard to be a homework problem if the file (as the OP claims) is a couple of megabytes large, I'd take days for the naive qsort implementation to finish.
I suspect that it's a part of a burrows-wheeler encoder for data compression.

This topic is closed to new replies.

Advertisement