- Reference sequence :
- - To run BLAST with a query and references, you need to provide reference sequences. If you have user-defined amino acid or nucleotide sequences, you can upload them to the system. Otherwise, you can enter an accession number such as NC_000000. For more information, please refer to the NCBI site.
- Input File Format :
- - The input file is an assembled contig in FASTA format. AGORA accepts only one assembled query contig. For more information about FASTA, please refer to the NCBI site.
- Type :
- - Select Chloroplast or Mitochondrion for your organellar genome
- Genetic Code
- - This code is used for running tBLASTn. The available codes are: Standard; Vertebrate Mitochondrial; Yeast Mitochondrial; Mold Mitochondrial; Invertebrate Mitochondrial; Ciliate Nuclear; Echinoderm Mitochondrial; Euplotid Nuclear; Bacteria and Archaea; Alternative Yeast Nuclear; Ascidian Mitochondrial; Flatworm Mitochondrial; Blepharisma Macronuclear. For more information about genetic codes, please see NCBI.
- Sample output is here
- Output :
- - As the examples below show, the output files are the BLAST results for amino acid and nucleotide sequences. The references are used as the BLAST query and the query contig is used as the database. The number of matched positions is determined by the "Maximum matched sub gene's count" value.
- The blast result of amino acid
- The blast result of nucleotide
- Amino acid db sequences :
- - This file contains the amino acid database sequences. If the user uploaded user-defined sequences, this file is identical to the uploaded file. Otherwise, it is generated automatically from NCBI.
- Amino acid sequences :
- - This file contains the CDS translations that were matched by BLAST.
- Output CSV file :
- - This file provides the start and end position, direction and gene product for each gene.
- Nucleotide db sequences :
- - This file contains the nucleotide database sequences.
- Nucleotide sequences :
- - This FASTA-formatted sequence file contains the sequences matched by BLAST.
- GenBank File format :
- - This file is in GenBank format. It is used to draw the circular gene map by running OGDRAW.
- OGDRAW :
- - If all genes are matched correctly, you can see the figure. Here is an example
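The single-contig rule under Input File Format can be checked before uploading. The sketch below is a minimal illustration and is not part of AGORA; the function names and the example record are hypothetical, and it assumes a record is identified by a header line starting with ">".

```python
from typing import Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def is_single_contig(lines: Iterable[str]) -> bool:
    """Return True when the input holds exactly one FASTA record."""
    return count_fasta_records(lines) == 1


# Made-up record, for illustration only.
example = [">contig_1 hypothetical assembled contig", "ATGCGTACGTTAG", "GGCTTAACG"]
print(is_single_contig(example))
```

To check a real file, pass an open file handle instead of the list, for example `is_single_contig(open("query.fasta"))`, where the file name is a placeholder.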
Genome Search Plotter
Big data research on genomic sequence analysis has accelerated considerably with the development of next-generation sequencing. Research on genomic sequencing is currently conducted using various methods, ranging from the assembly of reads consisting of fragments to the annotation of genetic information using a database of known genome information. As a result of this development, most tools for analyzing the genetic information of new organelles require different input formats such as FASTA, GenBank (GB), and tab-separated files. These data formats must be modified to satisfy the requirements of the gene annotation system after genome assembly. In addition, the currently available tools for organelle analysis are usually developed only for specific organisms, so the need for gene prediction tools that are useful for any organism has increased. The proposed method, termed genome_search_plotter, is designed for easy analysis of genome information from related references without any file format modification. Anyone who is interested in intracellular organelles such as the nucleus, chloroplast, and mitochondria can analyze the genetic information using the assembled contig of an unknown genome and a reference model, without any modification of the data from the assembled contig.
Reference accession Number :
- To run BLAST with query contigs and a reference, reference sequences are required. For more information, please refer to the NCBI site.
Input File Format :
- The input file contains the assembled contigs and should be in FASTA format. The number of contigs is not limited, but processing may take several hours. We recommend copying the result page URL for your reference.
Maximum number of matched BLAST hit group :
- This number sets the maximum number of groups, which are formed based on the BLAST result.
Minimum number of matched sub gene's count per each contig :
- Please refer to Figure 1-B below.
Figure 1. Maximum number of hit group and minimum number of subgene's count
The provided files are "sorted sequences" and "PDF" files
Sorted by the number of subgene :
- The BLAST result file is sorted by BLAST e-value. In the sorted query sequences file, sequences are sorted by the number of matched sub-genes, and sequences that do not meet the minimum value of k are filtered out.
- The results are shown as a graph depicting matches with the reference genomes on the X-axis and query sequences uploaded by the user on the Y-axis. Each line on the plot indicates that a query contig is matched with the reference sequences.
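The filtering and sorting of query sequences described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual code: it assumes that "meeting the minimum value of k" means having at least k matched sub-genes and that sorting is in descending order of that count. The function name and the example counts are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_and_sort_contigs(subgene_counts: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Drop contigs with fewer than k matched sub-genes, then sort the rest.

    Descending order by count (ties broken by contig name) is an assumption.
    """
    kept = [(name, n) for name, n in subgene_counts.items() if n >= k]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Made-up counts, for illustration only.
counts = {"contig_A": 5, "contig_B": 1, "contig_C": 3}
print(filter_and_sort_contigs(counts, k=2))
```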
C-Hunter is a new clustering algorithm that incorporates knowledge of gene function derived from Gene Ontology together with the organization of genes on chromosomes. To use the C-Hunter program, basic data sets are needed. All data sets for eight species (AT, CE, DM, DR, EC, HS, MM and SC), the Data/Map files, GO, gene2accession and gene2go are supplied together with the C-Hunter program. If you want to use new data sets, you can download them from each website and build them again. The C-Hunter program can be compiled under Unix/Linux/Windows (Cygwin), provided the compiler supports STL.
If you are going to use the ready-made data sets, you can skip this procedure.
1. Download the go.obo text file (includes Molecular Function, Biological Process, Cellular Component) at http://www.geneontology.org/page/download-ontology
2. Download gene2accession at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
3. Download gene2go at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
4. Make the Chromosome list file (tab-separated text file)
   1st_Column = Chromosome Number
   2nd_Column = Accession
   3rd_Column = GI
5. Run "obo2scheme.py" with the Gene Ontology data from step 1
   Usage : python obo2scheme.py go.obo (file from #1)
   The output file name will be "scheme.GO.data"
6. Run "Make_GO_Map_Data.php" with gene2accession, gene2go and Chr_list
   Usage : php Make_GO_Map_Data.php gene2accession, gene2go, Chr_list, Output file name
   This conversion program makes the map files and GO data files.
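The Chromosome list file from step 4 can be written with a few lines of Python. The sketch below is a minimal illustration: the function name is hypothetical, the values are placeholders rather than real accessions or GI numbers, and it assumes the file has no header row.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_chr_list(rows: Iterable[Tuple[str, str, str]], handle: TextIO) -> None:
    """Write (chromosome number, accession, GI) rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chromosome, accession, gi in rows:
        writer.writerow([chromosome, accession, gi])


# Placeholder values only; substitute the real accession and GI of each chromosome.
rows = [("1", "ACCESSION_1", "GI_1"), ("2", "ACCESSION_2", "GI_2")]
buffer = io.StringIO()
write_chr_list(rows, buffer)
print(buffer.getvalue(), end="")
```

To produce a file on disk, pass an open file handle such as `open("Chr_list", "w", newline="")` in place of the in-memory buffer.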
------------------------------------------------------------------------------------------------------
Usage: ./C_Hunter [arg_1] [arg_2] [arg_3] [arg_4] [arg_5] [arg_6] [arg_7] [arg_8] [arg_9] [arg_10]
------------------------------------------------------------------------------------------------------
[arg_1] = GO scheme file
[arg_2] = Map file
[arg_3] = Data file
[arg_4] = Minimum number of genes in a cluster
[arg_5] = Maximum number of genes in a cluster ( 0 = No maximum size )
[arg_6] = Maximum cluster size ( 0 = No maximum size )
[arg_7] = E-value cutoff ( 0 = No consideration )
[arg_8] = Threshold of cluster overlap percent ( 0 = No consideration )
[arg_9] = Show only top clusters (T) or full clusters (F) ( Select T or F )
[arg_10] = Output file name
Before running the C-Hunter program, be sure all data sets are ready. The 1st parameter is the converted GO data from Number 5 at "Procedure for preparing data sets". The 2nd and 3rd parameters are the Data/Map files from Number 6 at "Procedure for preparing data sets". All parameters are required to run C-Hunter.
This is an example of running the C-Hunter program with the basic data set.
Min. number of genes in a cluster = 2
Max. number of genes in a cluster = 10
Max. cluster size = 0 ( Considering all possible cluster sizes )
Show only Top Clusters
E-value cutoff = 0.001
Threshold of cluster overlap percent = 50%
Output file name = SC ( SC and SC.out will be generated )
SHELL> C_Hunter.v1.2 Scheme/scheme.GO.data Data_set/SaccharomycesCerevisiae/map_list Data_set/SaccharomycesCerevisiae/data_list 2 10 0 0.001 50 t SC
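Because all ten positional arguments are required and their order matters, it can help to assemble the command from named values. The following is a sketch only: the helper function and its parameter names are hypothetical, and the executable name defaults to the one shown in the usage line.

```python
from typing import List


def build_c_hunter_args(scheme: str, map_file: str, data_file: str,
                        min_genes: int, max_genes: int, max_cluster_size: int,
                        evalue_cutoff: float, overlap_percent: int,
                        top_only: bool, output_name: str,
                        executable: str = "./C_Hunter") -> List[str]:
    """Return the argument vector in the order arg_1 .. arg_10."""
    return [
        executable, scheme, map_file, data_file,
        str(min_genes), str(max_genes), str(max_cluster_size),
        str(evalue_cutoff), str(overlap_percent),
        "T" if top_only else "F", output_name,
    ]


# Same values as the example run above.
args = build_c_hunter_args(
    "Scheme/scheme.GO.data",
    "Data_set/SaccharomycesCerevisiae/map_list",
    "Data_set/SaccharomycesCerevisiae/data_list",
    2, 10, 0, 0.001, 50, True, "SC",
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args)` once the program has been compiled and the data sets are in place.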
SClassify is a supervised protein family classification algorithm that overcomes the problems of existing supervised and unsupervised algorithms and achieves much improved accuracy. It can assign proteins to existing families in databases, and by taking into account similarities between the unclassified proteins, can assign them to new families.
The SClassify source code, including sample input and output files, can be compiled under the Unix/Linux/Windows(Cygwin) environment. The following steps will create a directory called sclassify. Detailed usage of SClassify is provided in a README file.
The program assumes that e-values between each unclassified protein and each protein in existing families and e-values between each pair of unclassified proteins have already been obtained by other software such as BLAST or SSEARCH. The following files are needed:
1. A file that lists the name of each protein in existing families along with the name of its family in a two-column tab-separated format (example file: pfam.list).
2. A file that lists the name of each unclassified protein in a one-column format (example file: test.list).
3. A file that lists the e-values between each unclassified protein and each protein in existing families in a three-column tab-separated format that gives the name of an unclassified protein, the name of a protein in an existing family, and the e-value between them. There is no need to have an e-value for each pair if some of them are missing. The file is optional (example files: blast/test_pfam.score, ssearch/test_pfam.score).
4. A file that lists the e-values between each pair of unclassified proteins in a three-column tab-separated format that gives the names of two unclassified proteins and the e-value between them. There is no need to have an e-value for each pair if some of them are missing. The file is optional (example files: blast/test_test.score, ssearch/test_test.score).
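The three-column score files (files 3 and 4) can be read into a dictionary in which missing pairs are simply absent. This is a minimal sketch, not code from SClassify: the function name, protein names and e-values are made up, and keeping the smallest e-value when a pair occurs more than once is an assumption.

```python
from typing import Dict, Iterable, Tuple


def read_scores(lines: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Parse 'name1<TAB>name2<TAB>e-value' lines into a dictionary."""
    scores: Dict[Tuple[str, str], float] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        first, second, evalue = line.split("\t")
        value = float(evalue)
        key = (first, second)
        # Assumption: if a pair is listed several times, keep the best (smallest) e-value.
        scores[key] = min(value, scores.get(key, value))
    return scores


# Made-up protein names and e-values, for illustration only.
sample = ["query1\tfamA_p1\t1e-30\n", "query1\tfamB_p7\t0.5\n"]
scores = read_scores(sample)
print(scores.get(("query1", "famA_p1")))  # pair present in the file
print(scores.get(("query2", "famA_p1")))  # pair not listed, so None
```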
./sclassify -c infile1 -u infile2 -p infile3 -n infile4 -e cutoff -o outfile
where infile1 to infile4 are the input files described above, cutoff is the e-value cutoff, and outfile is the output file.
./sclassify -c pfam.list -u test.list -p blast/test_pfam.score -n blast/test_test.score -e 0.1 -o test.out
./sclassify -c pfam.list -u test.list -p blast/test_pfam.score -e 1e-10 -o test.out
./sclassify -c pfam.list -u test.list -p ssearch/test_pfam.score -n ssearch/test_test.score -e 0.1 -o test.out
./sclassify -c pfam.list -u test.list -p ssearch/test_pfam.score -e 1e-10 -o test.out
The output file is in a two-column tab-separated format that lists the name of each protein that is classified and the name of its assigned family. A distinct name is generated for each new family, and the same name is used for all proteins that are classified to the same family.
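Since every protein assigned to the same family carries the same family name, the output can be regrouped by family with a short script. The sketch below is an illustration only; the function name and the sample protein and family names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_family(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each family name to the list of proteins assigned to it."""
    families: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        protein, family = line.split("\t")
        families[family].append(protein)
    return dict(families)


# Made-up output lines, for illustration only.
sample_output = ["p1\tfamA\n", "p2\tnew_1\n", "p3\tnew_1\n"]
print(group_by_family(sample_output))
```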
Two scripts are provided to convert the results from BLAST and from SSEARCH to a three-column tab-separated format.
1. BLAST converter
Usage: python convert_blast.py infile outfile
where infile contains the results from BLAST, and outfile is the output file.
python convert_blast.py blast/test_pfam.blast blast/test_pfam.score
python convert_blast.py blast/test_test.blast blast/test_test.score
Note: If BLAST is applied with option -m 8, then there is no need to run the python script to convert the BLAST output.
blastall -p blastp -m 8 -d pfam -i test.fasta | cut -f 1,2,11 > blast/test_pfam.score
2. SSEARCH converter
Usage: python convert_ssearch.py infile outfile
where infile contains the results from SSEARCH, and outfile is the output file.
python convert_ssearch.py ssearch/test_pfam.ssearch ssearch/test_pfam.score
python convert_ssearch.py ssearch/test_test.ssearch ssearch/test_test.score
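The `cut -f 1,2,11` step in the BLAST note above keeps the query name, the subject name and the e-value from each tabular line. A rough Python equivalent is sketched below; the function name and the sample row are hypothetical, and unlike `cut` this version skips lines with fewer than 11 fields.

```python
from typing import Iterable, Iterator


def tabular_to_score(lines: Iterable[str]) -> Iterator[str]:
    """Keep columns 1, 2 and 11 (query, subject, e-value) of each tabular line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete tabular line
        yield "\t".join((fields[0], fields[1], fields[10]))


# Made-up tabular row, for illustration only.
row = "q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-100\t380\n"
for score_line in tabular_to_score([row]):
    print(score_line)
```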
Users can freely use the software, and the accessible URL is . The main module of geneCo is implemented in Python, and the web-based user interface is built with PHP, HTML and CSS to support all browsers.