• cd PaCE-pipeline/PaCE-clustering

  • Preprocessing:

This preprocessing step formats the input FASTA file and generates the corresponding input datafile for PaCE.

Run the preprocessPaCE command to preprocess the EST input file (fasta format) eg., $ PreprocessPaCE.pl {FASTA data file} {1: prune PolyA/T, 0 otherwise} > tEST.data.PaCE

If the second argument is : 0: Does not modify the input sequences. Just concatenate each sequence such that each sequence appears in one line and also converts all DNA characters to upper case. 1: In addition to the two effects produced by option ‘0’ this option trims off streaks of As and Ts occurring at either ends of the sequence. This option is not required if the input FASTA GNU nano 1.2.5 File: PaCE-README

sequences have already been stripped of poly As and Ts or if they they are required to be input. If you use this option, make sure you inspect manually some modifications to confirm there will be no negative impacts on the clustering. This inspection can be done by first running preprocessPaCE using flag 0 and then using flag 1 (both on the original file) and taking a diff of both the output files.

eg., $ PreprocessPaCE.pl ../datafiles/tEST.data 0 > ../datafiles/tEST.data.PaCE This generates another file by name “tEST.data.PaCE” in ../datafiles/ .

  • Find “n”: Find the number of sequences in the preprocessed output data file. Let us call it “n”.

This can be found by a simple unix command like: eg., grep -c “>” ../datafiles/tEST.data.PaCE

ie., n=500 for ../datafiles/tEST.data.PaCE.

  • Parameterization: Parameters to PaCE are kept in the file PaCE.cfg. Check if PaCE.cfg is there in the directory of the executable.

You typically do NOT need to change any parameters except “window”.

If the data size <=10,000 ESTs then a window size of 7 is recommended. If the data size <=30,000 and >=10,000 ESTs then a window size of 8 is recommended. Otherwise a window size of 9 is recommended.

  • Run PaCE: PaCE takes two parameters: Usage: {MPIRunCommand} PaCE {preprocessed FASTA datafile} {number of ESTs}

(P.S: Number of processors should be AT LEAST 2)

eg., $ mpirun -np 4 PaCE ../datafiles/tEST.data.PaCE 500

where 4 is the number of processors to run on.

“{MPIRunCommand}” depends on the available MPI implementation and job scheduler of the parallel cluster being used. For batch mode parallel platforms, use the specified batch submission routines like “qsub” or “llsubmit”.

  • PaCE output:

The results are of two categories: Run-time and Cluster results. A summary of these results are printed on the standard output. If the parallel platform uses batch processing which outputs the standard output to a file, then the summary will be present in that file.

(i) Run-time results:

The total run-time and the run-time in different components of the system is shown for each SLAVE processor. This indicates the total run-time for PaCE clustering.

For eg.,

Time taken by slave 1 : Load<0> + Preprocessing Phase<3> + Clustering Phase<1>= 5 secs

Here processor rank 1 took a total of 5 seconds to complete with the run-times in phases indicated separately. The numbers are truncated to integer values. Almost all the slave processors take the same amount of total run-time. Also the time to load the data file into memory indicated by Load<> first might vary with systems and as it is the time for initialization, subtracting it from the total run-time tells the actual time taken by the software.

(ii) Clustering results:

The standard output will also have something like this:

Master: #Clusters Output:= 357 #Singletons=258 Master: #Contained ESTs:= 63

This means: The total number of clusters generated is 357, out of which 258 are singletons. Also out of the n(=500) ESTs supplied, 63 ESTs are completely contained (with 100% identity) in other ESTs.

The clusters themselves are located in file estClust.n.p.PaCE where n is the number of ESTs and p is the number of slave processors. eg., estClust.500.3.PaCE PS: The number of slave processors is always one less than the total number of processors used to run PaCE.

The size distribution of these clusters (in the number of ESTs) is present in estClustSize.n.p.PaCE. eg., estClustSize.500.3.PaCE Each line is of the format “{Cluster#} {Number of Members in Cluster#}”.

The set of EST sequences which are contained in (an)other EST sequence(s) are indicated in ContainedESTs.n.PaCE. eg., ContainedESTs.500.PaCE Each line is of the format “{Contained EST sequence header} IN {Container EST sequence header}”. PS: One EST can be contained in multiple EST sequences, and only one container sequence is indicated here. Also the set of contained ESTs reported need not comprise of all the ESTs that are actually contained. More specifically, the contained ESTs reported are only based on the set for which alignments were performed by PaCE.

The set of EST sequences which are not contained in any other EST sequence in the provided data set is present in NonContainedESTs.n.PaCE. eg., NonContainedESTs.500.PaCE

Adding clone mates information :


As additional input, PaCE is designed to take as input clone mates information. This information should be present in a file which should be present in the following format:

clone mate id # FASTA header for sequences belonging to this clone mate (each in separate lines) ….

eg., Let the clone mates be in a file myCloneMates that looks like this:

CloneId1 gi|19863109| gi|19800130| CloneId2 gi|19863111| gi|19800132| …

Step 1:

Run the following command (for the above example) : $perl formatCloneMates.pl myCloneMates tEST.data.PaCE > myCloneMates.PaCE

This will generate the PaCE formatted myCloneMates.PaCE. (The numbers in each line indicates the sequence number that PaCE provides for each input sequence.)

Step 2:

The PaCE formatted clone mate file should be in the folder where the PaCE executable resides. To specify the clone mate file as input, it has to be provided in the PaCE.cfg file as:

ClonePairsFile clonematefilename

eg., ClonePairsFile myCloneMates.PaCE

This has to be done before starting to execute PaCE. PaCE will put each set of sequences linked by the same clone mate id together in one cluster in its final output.a

If you do not have any clone pairs information to provide then set:

ClonePairsFile None

PS: Both these steps should be performed for each input FASTA file (even if you intend to run for subsets of the original FASTA data file).