Summary: Next-generation sequencing (NGS) technologies have led to the accumulation of high-throughput sequence data from a wide range of organisms. Annotating the organellar genomes of such diverse organisms requires better-optimized tools for functional gene annotation, yet almost all existing gene annotation tools focus on the chloroplast genomes of land plants or the mitochondrial genomes of animals. We have developed a web application, AGORA, for fast, user-friendly and improved annotation of organellar genomes. Annotator for Genes of Organelle from the Reference sequence Analysis (AGORA) annotates genes using a basic local alignment search tool (BLAST)-based homology search and clustering against selected reference sequences from the NCBI database or user-uploaded data. AGORA can annotate the functional genes in almost all mitochondrial and plastid genomes of eukaryotes. Genomes containing genes with an exon–intron structure or an inverted repeat region can also be annotated. AGORA reports the start and end positions of each gene and the BLAST results against the reference sequences, and visualizes the gene map with OGDRAW. https://bigdata.dongguk.edu/gene_project/AGORA/






Reference sequence :
- Running BLAST requires reference sequences in addition to the query. If you have your own amino acid or nucleotide sequences, you can upload them. Otherwise, enter an NCBI accession number such as NC_000000. For more information, please refer to the NCBI site.
Input File Format :
- The input file is an assembled contig in FASTA format. AGORA accepts only one assembled query contig. For more information about FASTA, please refer to the NCBI site.
Type :
- Select Chloroplast or Mitochondrion for your organellar genome.
Genetic Code
- This code is used for running the tBLASTn
   Standard
   Vertebrate Mitochondrial
   Yeast Mitochondrial
   Mold Mitochondrial
   Invertebrate Mitochondrial
   Ciliate Nuclear
   Echinoderm Mitochondrial
   Euplotid Nuclear
   Bacteria and Archaea
   Alternative Yeast Nuclear
   Ascidian Mitochondrial
   Flatworm Mitochondrial
   Blepharisma Macronuclear

For more information about genetic code please see NCBI
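
The menu labels above match the NCBI translation table numbering, so the selection can be pictured as a lookup from label to table number. The sketch below builds (but does not run) a BLAST+ `tblastn` argument list with the `-db_gencode` option. How AGORA passes the genetic code internally is not documented here, so the function name and the file names in the usage note are assumptions for illustration only.

```python
# NCBI translation table numbers for the menu labels listed above.
GENETIC_CODES = {
    "Standard": 1,
    "Vertebrate Mitochondrial": 2,
    "Yeast Mitochondrial": 3,
    "Mold Mitochondrial": 4,
    "Invertebrate Mitochondrial": 5,
    "Ciliate Nuclear": 6,
    "Echinoderm Mitochondrial": 9,
    "Euplotid Nuclear": 10,
    "Bacteria and Archaea": 11,
    "Alternative Yeast Nuclear": 12,
    "Ascidian Mitochondrial": 13,
    "Flatworm Mitochondrial": 14,
    "Blepharisma Macronuclear": 15,
}


def tblastn_command(label, reference_proteins, contig_db):
    """Build (not run) a BLAST+ tblastn argument list for a genetic code label.

    Hypothetical helper: the protein references are the query and the
    assembled contig is the translated nucleotide database.
    """
    if label not in GENETIC_CODES:
        raise ValueError(f"unknown genetic code: {label!r}")
    return [
        "tblastn",
        "-query", reference_proteins,
        "-db", contig_db,
        "-db_gencode", str(GENETIC_CODES[label]),
    ]
```

For example, `tblastn_command("Vertebrate Mitochondrial", "refs.faa", "contig_db")` would end with `-db_gencode 2`; both file names are placeholders.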


Sample output is here
Output :
- As the examples below show, the output is the BLAST result for both amino acid and nucleotide sequences. The references are used as the BLAST query and the query contig is used as the database. The number of matched positions is determined by the "Maximum matched sub gene's count" setting.

The blast result of amino acid
The blast result of nucleotide

Amino acid db sequences :
- This file contains the amino acid database sequences. If you uploaded user-defined sequences, it is identical to the uploaded file. Otherwise, it is generated automatically from NCBI.
Amino acid sequences :
- This file contains the CDS translations matched by BLAST.
Output CSV file :
- This file provides the start and end position, direction and gene product for each gene.
Nucleotide db sequences :
- This file contains the nucleotide database sequences.
Nucleotide sequences :
- This FASTA-formatted file contains the sequences matched by BLAST.
GenBank File format :
- This is the GenBank-formatted file. It is used to draw the circular gene map with OGDRAW.
OGDRAW :
- If all genes are matched correctly, the gene map figure is shown. Here is an example.
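
As a rough illustration of working with the output CSV file, the sketch below reads start, end, direction and gene product columns and computes each gene's length. The header names, column order and sample rows are assumptions made up for this example; the real AGORA file may be laid out differently.

```python
import csv
import io

# Hypothetical CSV content: the real AGORA header names and column order may differ.
SAMPLE_CSV = """gene,start,end,direction,product
geneA,100,400,+,hypothetical product A
geneB,900,500,-,hypothetical product B
"""


def gene_lengths(csv_text):
    """Return {gene: length in bp}, treating start/end as 1-based and inclusive."""
    lengths = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = int(row["start"]), int(row["end"])
        # abs() handles reverse-strand rows where start is greater than end.
        lengths[row["gene"]] = abs(end - start) + 1
    return lengths
```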










Big data research on genomic sequence analysis has accelerated considerably with the development of next-generation sequencing. Genomic sequencing research now spans many methods, from assembling reads into contigs to annotating genetic information using databases of known genomes. As these methods have developed, most tools for analyzing the genetic information of new organelles have come to require different input formats, such as FASTA, GenBank (GB) and tab-separated files, so data must be converted after genome assembly to satisfy the requirements of each gene annotation system. In addition, the currently available tools for organelle analysis are usually developed only for specific organisms, which has increased the need for gene prediction tools that work for any organism. The proposed method, termed genome_search_plotter, is designed for easy analysis of genome information against related references without any file format modification. Anyone interested in intracellular organelles such as the nucleus, chloroplast, and mitochondria can analyze the genetic information using the assembled contigs of an unknown genome and a reference model, without modifying the assembled contig data. https://bigdata.dongguk.edu/gene_project/genome_search_plotter/






Reference accession number :
- Running BLAST with the query contigs requires reference sequences. For more information, please refer to the NCBI site.

Input File Format :
- The input file contains the assembled contigs in FASTA format. The number of contigs is not limited, but a large input can take several hours. We recommend copying the result page URL for later reference.

Maximum number of matched BLAST hit group :
- This number sets the maximum number of hit groups formed from the BLAST result.

Minimum number of matched sub gene's count per each contig :
- Please refer to Figure 1-B below.


Figure 1. Maximum number of hit groups and minimum number of matched sub-genes



The provided files are "sorted sequences" and "PDF" files

Sorted by the number of subgene :
- The BLAST result file is sorted by BLAST e-value. In the sorted query sequences file, sequences are sorted by the number of matched sub-genes, and sequences that do not meet the minimum value k are filtered out.

PDF file:
- The results are shown as a graph depicting matches with the reference genomes on the X-axis and query sequences uploaded by the user on the Y-axis. Each line on the plot indicates that a query contig is matched with the reference sequences.
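
The filtering and ordering behind the "sorted sequences" file can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes BLAST matches have already been reduced to (contig, sub-gene) pairs and that each distinct sub-gene counts once per contig. The function and parameter names are hypothetical.

```python
from collections import defaultdict


def rank_contigs(hits, min_subgenes):
    """Rank contigs by their number of distinct matched sub-genes.

    hits: iterable of (contig_id, subgene_id) pairs taken from BLAST matches.
    Returns [(contig_id, n_subgenes), ...] sorted by n_subgenes descending,
    keeping only contigs with at least min_subgenes distinct sub-genes.
    """
    matched = defaultdict(set)
    for contig_id, subgene_id in hits:
        matched[contig_id].add(subgene_id)
    counts = [(c, len(s)) for c, s in matched.items() if len(s) >= min_subgenes]
    # Sort by count (descending), then by contig id for a repeatable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))
```

With a minimum of k = 2, a contig that matches only one sub-gene is dropped, and the remaining contigs are listed from most to fewest matched sub-genes.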




C-Hunter is a new clustering algorithm that incorporates knowledge of gene function derived from Gene Ontology together with the organization of genes on chromosomes. Running C-Hunter requires basic data sets. Data sets for eight species (AT, CE, DM, DR, EC, HS, MM and SC), including the Data/Map files, GO, gene2accession and gene2go, are supplied with the C-Hunter program. If you want to use new data sets, you can download the source files from each website and rebuild them. C-Hunter can be compiled under Unix/Linux/Windows (Cygwin) if the compiler supports the STL.


  • tar -xzvf C_Hunter_v.1.2.tar.gz
  • cd C_Hunter_v.1.2
  • ./install

  •  
     If you are going to use the ready-made data sets, you do not need to follow this procedure.
    
     1. Download the go.obo text file (includes Molecular Function, Biological Process and Cellular Component) at http://www.geneontology.org/page/download-ontology
     2. Download gene2accession at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
     3. Download gene2go at NCBI, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
     4. Make Chromosome list file (text file by tab-separated format)
         1st_Column = Chromosome Number
         2nd_Column = Accession 
         3rd_Column = GI
     5. Run "obo2scheme.py" on the Gene Ontology data from step 1
         Usage : python obo2scheme.py go.obo (file from #1)
         The output file name will be "scheme.GO.data"
     6. Run "Make_GO_Map_Data.php" with gene2accession, gene2go and Chr_list
         Usage : php Make_GO_Map_Data.php gene2accession, gene2go, Chr_list, Output file name
         This converting program makes map files and GO data files.
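
The chromosome list file from step 4 can be written with a few lines of Python. This is a sketch under assumptions: it writes the three tab-separated columns in the order given above with no header line, and the function name is hypothetical. Whether the converter expects a header is not stated here.

```python
import csv


def write_chr_list(path, rows):
    """Write (chromosome_number, accession, gi) rows as tab-separated text.

    Assumption: no header line, one chromosome per line.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chromosome, accession, gi in rows:
            writer.writerow([chromosome, accession, gi])
```

A call such as `write_chr_list("Chr_list", rows)` takes one tuple per chromosome, with the accession and GI values taken from the NCBI record for that chromosome.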
    
    

       ------------------------------------------------------------------------------------------------------
         Usage: ./C_Hunter [arg_1] [arg_2] [arg_3] [arg_4] [arg_5] [arg_6] [arg_7] [arg_8] [arg_9] [arg_10]
       ------------------------------------------------------------------------------------------------------
    
         [arg_1] = GO scheme file
         [arg_2] = Map file
         [arg_3] = Data file
         [arg_4] = Minimum number of genes in a cluster
         [arg_5] = Maximum number of genes in a cluster ( 0 = No maximum size )
         [arg_6] = Maximum cluster size ( 0 = No maximum size )
         [arg_7] = E-value cutoff ( 0 = No consideration )
         [arg_8] = Threshold of cluster overlap percent ( 0 = No consideration )
         [arg_9] = Show only top clusters (T) or full clusters (F) ( Select T or F )
         [arg_10] = Output file name
     
      Before running C-Hunter program, be sure all data sets are ready.
      The 1st parameter is the converted GO data from step 5 of "Procedure for preparing data sets".
      The 2nd and 3rd parameters are the Map and Data files from step 6 of "Procedure for preparing data sets".
      All parameters are required to run C-Hunter.
    
      This is an example of running the C-Hunter program with the basic data set.
         Min. number of genes in a cluster = 2
         Max. number of genes in a cluster = 10
         Max. cluster size = 0 ( Considering all possible cluster sizes )
         Show only Top Clusters
         E-value cutoff = 0.001
         Threshold of cluster overlap percent = 50%
         Output file name = SC ( SC and SC.out will be generated )
    
    SHELL> C_Hunter.v1.2 Scheme/scheme.GO.data Data_set/SaccharomycesCerevisiae/map_list Data_set/SaccharomycesCerevisiae/data_list 2 10 0 0.001 50 t SC 
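
Because all ten arguments are positional, it is easy to put one in the wrong slot. The sketch below assembles the argument list in the documented order and applies a few sanity checks. The checks and the helper name are assumptions added for illustration, not rules enforced by C-Hunter itself, and the binary name is a parameter because it differs between the usage line and the example.

```python
def c_hunter_command(scheme, map_file, data_file, min_genes, max_genes,
                     max_size, evalue, overlap_percent, top_only, output,
                     binary="./C_Hunter"):
    """Return the ten-argument C_Hunter command as a list of strings."""
    # Sanity checks below are assumptions, not documented C-Hunter rules.
    if min_genes < 1:
        raise ValueError("min_genes must be at least 1")
    if max_genes != 0 and max_genes < min_genes:
        raise ValueError("max_genes must be 0 (no limit) or >= min_genes")
    if not 0 <= overlap_percent <= 100:
        raise ValueError("overlap_percent must be between 0 and 100")
    return [
        binary, scheme, map_file, data_file,
        str(min_genes), str(max_genes), str(max_size),
        str(evalue), str(overlap_percent),
        "T" if top_only else "F",
        output,
    ]
```

The resulting list can be passed to `subprocess.run` once the data sets are in place.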
    


  • Download C_Hunter.v.1.2.tar.gz without FASTA files - 2.5GB
  • Download C_Hunter.v.1.2.tar.gz with FASTA files - 1.3GB







  • SClassify is a supervised protein family classification algorithm that overcomes the problems of existing supervised and unsupervised algorithms and achieves much improved accuracy. It can assign proteins to existing families in databases, and by taking into account similarities between the unclassified proteins, can assign them to new families.


    The SClassify source code, including sample input and output files, can be compiled under the Unix/Linux/Windows(Cygwin) environment. The following steps will create a directory called sclassify. Detailed usage of SClassify is provided in a README file.

  • gunzip sclassify.tar.gz
  • tar xvf sclassify.tar
  • cd sclassify
  • ./install



  • The program assumes that e-values between each unclassified protein and each protein in existing families and e-values between each pair of unclassified proteins have already been obtained by other software such as BLAST or SSEARCH. The following files are needed:

    1. A file that lists the name of each protein in existing families along with the name of its family in a two-column tab-separated format (example file: pfam.list).

    2. A file that lists the name of each unclassified protein in a one-column format (example file: test.list).

    3. A file that lists the e-values between each unclassified protein and each protein in existing families in a three-column tab-separated format that gives the name of an unclassified protein, the name of a protein in an existing family, and the e-value between them. There is no need to have an e-value for each pair if some of them are missing. The file is optional (example files: blast/test_pfam.score, ssearch/test_pfam.score).

    4. A file that lists the e-values between each pair of unclassified proteins in a three-column tab-separated format that gives the names of two unclassified proteins and the e-value between them. There is no need to have an e-value for each pair if some of them are missing. The file is optional (example files: blast/test_test.score, ssearch/test_test.score).
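
Before running SClassify it can help to confirm that a score file really has three tab-separated columns with a numeric e-value. The following is a minimal sketch of such a check; the function name is hypothetical and the check is not part of the SClassify distribution.

```python
def parse_score_lines(lines):
    """Parse 'name_a<TAB>name_b<TAB>e-value' lines into (name_a, name_b, float) tuples."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        # float() raises ValueError if the third column is not a number.
        records.append((fields[0], fields[1], float(fields[2])))
    return records
```

An open file handle can be passed directly, for example `parse_score_lines(open("blast/test_pfam.score"))`.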

    USAGE
    ./sclassify -c infile1 -u infile2 -p infile3 -n infile4 -e cutoff -o outfile
    where infile1 to infile4 are the input files described above, cutoff is the e-value cutoff, and outfile is the output file.

    EXAMPLES
    ./sclassify -c pfam.list -u test.list -p blast/test_pfam.score -n blast/test_test.score -e 0.1 -o test.out
    ./sclassify -c pfam.list -u test.list -p blast/test_pfam.score -e 1e-10 -o test.out

    ./sclassify -c pfam.list -u test.list -p ssearch/test_pfam.score -n ssearch/test_test.score -e 0.1 -o test.out
    ./sclassify -c pfam.list -u test.list -p ssearch/test_pfam.score -e 1e-10 -o test.out

    OUTPUT
    The output file is in a two-column tab-separated format that lists the name of each protein that is classified and the name of its assigned family. A distinct name is generated for each new family, and the same name is used for all proteins that are classified to the same family.
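
Since the output has one protein per line, grouping it by family is a common first step. This sketch assumes exactly the two-column tab-separated layout described above; the function name is hypothetical.

```python
from collections import defaultdict


def families_from_output(lines):
    """Group protein names by assigned family from two-column tab-separated lines."""
    families = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        protein, family = line.split("\t")
        families[family].append(protein)
    return dict(families)
```

For instance, `families_from_output(open("test.out"))` returns a dictionary that maps each family name, existing or newly generated, to the list of proteins assigned to it.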

    SCRIPTS
    Two scripts are provided to convert the results from BLAST and from SSEARCH to a three-column tab-separated format.



    1. BLAST converter

    Usage: python convert_blast.py infile outfile
    where infile contains the results from BLAST, and outfile is the output file.

    Examples:
    python convert_blast.py blast/test_pfam.blast blast/test_pfam.score
    python convert_blast.py blast/test_test.blast blast/test_test.score

    Note: If BLAST is applied with option -m 8, then there is no need to run the python script to convert the BLAST output.

    Example:
    blastall -p blastp -m 8 -d pfam -i test.fasta | cut -f 1,2,11 > blast/test_pfam.score
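
The `cut -f 1,2,11` step works because, in `-m 8` tabular output, columns 1, 2 and 11 are the query id, subject id and e-value. Where `cut` is unavailable, the same extraction can be sketched in Python; the function name is hypothetical.

```python
def tabular_blast_to_scores(lines):
    """Yield 'query<TAB>subject<TAB>e-value' lines from BLAST -m 8 tabular rows."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        # Columns 1, 2 and 11 (1-based) are query id, subject id and e-value.
        yield "\t".join((fields[0], fields[1], fields[10]))
```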

    2. SSEARCH converter

    Usage: python convert_ssearch.py infile outfile
    where infile contains the results from SSEARCH, and outfile is the output file.

    Examples:
    python convert_ssearch.py ssearch/test_pfam.ssearch ssearch/test_pfam.score
    python convert_ssearch.py ssearch/test_test.ssearch ssearch/test_test.score



    In comparative and evolutionary genomics, a detailed comparison of common features between organisms is essential to evaluate genetic distance. However, identifying differences in matched and mismatched genes among multiple genomes is difficult using current comparative genomic approaches due to complicated methodologies or the generation of meager information from obtained results. This study describes a visualized software tool, geneCo (gene Comparison), for comparing genome structure and gene arrangements between various organisms. User data are aligned, gene information is recognized, and genome structures are compared based on user-defined GenBank files. Information regarding inversion, gain, loss, duplication, and gene rearrangement among multiple organisms being compared is provided by geneCo, which uses a web-based interface that users can easily access without any need to consider the computational environment.

    Users can freely use the software at https://bigdata.dongguk.edu/geneCo. The main module of geneCo is implemented in Python, and the web-based user interface is built with PHP, HTML and CSS to support all browsers.



    Under development