Guide to the BioCyc Database Collection
5 Discovering Functional Gene Clusters Using Genome Context Methods
This document provides an overview of the BioCyc collection of Pathway/Genome Databases.
Although its content is limited at the current time, it will expand over time to cover additional aspects of BioCyc. The information in this document pertains to all BioCyc databases (DBs), and to most other DBs created using the Pathway Tools software. More detailed information about specific members of the BioCyc family is available as follows:
The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of many organisms. BioCyc databases describe organisms with sequenced genomes. BioCyc is primarily microbial. In addition, BioCyc contains databases for humans; for important model organisms such as yeast, fly, and mouse; and for other eukaryotes whose PGDBs have been curated. One reason for collecting all these PGDBs together within BioCyc is to enable the comparative analyses that become possible when multiple PGDBs are available within one site (see Tools → Analysis → Comparative Analysis). PGDBs have been created for many other organisms, including microbes, fungi, plants (see PlantCyc.org and animals [details].
The databases (DBs) within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received.
We encourage scientists to adopt the Tier 2 and Tier 3 DBs for ongoing curation and updating.
Most microbial PGDBs within BioCyc have been generated computationally by SRI and are regenerated every 6-12 months to take advantage of improvements in our pathway prediction algorithms and in the MetaCyc pathway database. The PGDBs within BioCyc that have been provided by outside groups are updated with variable frequencies. Usually the date on which a PGDB was generated or last updated can be determined by selecting that PGDB as the current PGDB and then viewing the page at Tools → Analysis → Summary Statistics or Tools → Analysis → Reports → History of Updates.
Tier 3 PGDBs are generated using an automated computational pipeline. The genome data source for a specific PGDB can be determined by selecting that PGDB as current database and then executing Tools → Analysis → Summary Statistics. Genome data sources include:
The processing steps used to create BioCyc from an annotated genome are as follows.
Although all processing steps are overseen by our staff, no human curation is applied to the Tier 3 PGDBs.
The specific processing steps used to generate Tier 3 PGDBs are as follows.
Operationally, BioCyc uses the term “ortholog” not in the strict evolutionary sense defined by Fitch  but in the looser sense of genes that are likely to be counterparts of one another because they are the most closely related in a pair of organisms. Although such genes are likely to be evolutionarily related in the sense of Fitch, we do not perform a detailed evolutionary analysis to compute BioCyc orthologs.
BioCyc ortholog information is generated by running NCBI BLAST (“legacy” BLAST version 2.2.23) pair-wise between all proteomes of all PGDBs. We do not compute sequence similarity information for RNA genes or pseudo genes, therefore BioCyc does not currently contain orthology information for such genes.
We use an E-value cut-off of 0.001 when running BLAST, with all other parameters being left at their default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bi-directional “best” BLAST hits of one another, meaning that protein B is the best BLAST hit of protein A within proteome PB, and protein A is the best BLAST hit of protein B within proteome PA. In rare cases, protein A might have multiple orthologs in proteome PB, as explained below.
The “best” hit(s) of protein A in proteome PB is defined by finding the minimal E-value among all hits in proteome PB in the BLAST output. There could be hits to multiple proteins in proteome PB that share that same minimal E-value. In other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods: taking the hit with the maximum alignment length; and then taking the hit with the maximum alignment amino acid residue identity. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. In the case that ties still remain (as in the case of exact gene duplications), all ties are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, such as if multiple proteins B1, B2, etc, exist in PB, and have exactly the same regions align against protein A. BioCyc does not calculate paralogs. Although orthologs are defined for most pairs of PGDBs in BioCyc, some PGDBs lack ortholog data. A given pair of organisms might lack orthologs within BioCyc for the following possible reasons.
This aspect of BioCyc is meant to enable the discovery of new pathways, and to provide clues as to the functions of genes of unknown function. A functionally linked gene cluster, or FunGC, refers to a group of genes that are estimated by the system to operate in a common pathway, a common cellular process, or a common protein complex (note the term “cluster” does not imply physical clustering on a chromosome). There is no guarantee as to what type of functional linkages exist among the genes in a FunGC. FunGCs are predicted using comparative-genomics methods called genome-context methods [7, 6, 20, 5, 16] that search for patterns across hundreds of genomes. Details of the methods used by BioCyc are described in the next sections.
BioCyc includes FunGCs for 10 genomes as of October 2014. In later releases we plan to expand the number of genomes for which FunGCs are available.
For example, E. coli genes bioA, bioB, bioD, and bioF, which participate in the metabolic pathway for biotin biosynthesis, are all predicted by our method to form a FunGC. Another E. coli FunGC consists of genes xylF, xylH, xylA, and xylB. The xylF and xylH genes encode a xylose ABC transporter, while xylA and xylB encode a xylose isomerase and a xylulokinase, respectively, which participate in a xylose degradation pathway. The full list of E. coli FunGCs is available at http://biocyc.org/ECOLI/genome-context.
We will use E. coli gene yqeB to illustrate BioCyc FunGCs in more detail. yqeB is a gene of unknown function; it is annotated as a conserved protein with a NAD(P)-binding Rossman fold. yqeB forms a FunGC with genes xdhC (xanthine dehydrogenase, Fe-S subunit), xdhB (xanthine dehydrogenase subunit, FAD-binding domain), xdhA (xanthine dehydrogenase subunit), and paoA (aldehyde dehydrogenase, Fe-S subunit). Thus, one might infer because of its assignment to a common FunGC with these four other genes that yqeB forms a common pathway with these genes.
FunGCs are computed by a two-step process. Given all genes in a given organism,
In a moment we will consider these steps in more detail. But first we discuss the reliance of these methods on the ortholog data within BioCyc.
Computation of functional linkages between two genes G1 and G2 is based on comparative-genomics analyses of the orthologs of G1 and G2 that exist in many genomes. We will refer to a given such pair of orthologs as G1′ and G2′.
BioCyc computes orthologous genes (e.g., that G1 and G1′ are orthologs) as bidirectional best hits obtained from BLAST outputs among most (but not yet all) pairs of genomes within BioCyc. To determine whether yqeB and xdhA are functionally linked, we examine the orthologs of those two genes across a special set of 1800 BioCyc genomes that we call the reference genomes. The reference genomes are chosen to be taxonomically diverse because selection of multiple closely related species could bias the relationships that we consider during computational of functional relatedness (e.g., because gene order will be highly conserved in highly related genomes).
To estimate whether genes G1 and G2 are functionally linked we consider two factors. First is the co-occurrence of their orthologs in many genomes. Are the orthologs typically found in the same genomes? If so their pairwise functional-linkage score will be higher. But the algorithm also considers how many genomes they are found in. For example, if orthologs of G1 and G2 are found in virtually every reference genome, they would have no choice but to co-occur, thus decreasing their functional linkage score. The genome-occurrence profile for a gene is called its phylogenetic profile, and is represented as a vector of ones and zeros — a one when an ortholog of G is found in a given genome, and a zero when no ortholog of G is found in a given genome. If the phylogenetic profile vectors of G1 and G2 are similar then we infer that the genes are likely to be functionally linked. The exact phylogenetic profile method we use is called pp-mutual-info (znorm) in .
The second factor used to infer functional linkage is the spatial proximity of orthologs of G1 and G2. That is, when orthologs G1′ and G2′ do co-occur in the same genome, are G1′ and G2′ frequently close to one another on the same chromosome (e.g., separated by few genes or adjacent)? If so, this conserved genome proximity also provides evidence of functional relatedness. The genomic proximity method we use is a variation of the one called gn-lnX (znorm) in , where correlation between organisms is taken into account, with a resulting improvement in performance. Given this improvement and the fact that both methods are fused to get an additional boost in performance, the resulting score is significantly better than the best method presented in that paper. To our knowledge, this makes the fused scores used for the system, the best genome context scores in the literature.
Pairwise functional-linkage information alone is useful in providing clues regarding the function of one gene based on the known functions of other genes that it is related to. Even more useful is to find larger sets of genes that are functionally related to one another and that therefore may function in a common pathway or biological process.
The method described in this section computes Functionally Linked Gene Clusters, or FunGCs. The algorithm starts with the set of pairwise functional linkages computed in the previous step. The linkages are encoded in a graph whose nodes are the genes of a given organism, and whose edges are the inferred functional linkages among those genes. The algorithm searches the full graph to find highly interconnected subgraphs (they need not be fully interconnected), that is, sets of genes that have a high fraction of mutual functional linkages. Some genes will be removed from the set if they are not sufficiently highly interconnected to other genes in the set. The algorithm we use for computing FunGCs as highly interconnected graphs is CFinder .
When computing FunGCs we must make empirical choices regarding two thresholds. One is the strength of pairwise functional linkage that we deem sufficient for inclusion in the network. The other one is the clique size CFinder uses as starting point, which, in turn, determines the minimum size of the resulting clusters (currently 4). We may perform some adjustment of these thresholds over time to try to seek a proper balance between finding sufficient numbers of interesting new pairwise linkages and FunGCs, versus limiting the number of incorrect pairwise linkages and FunGCs. The current values were chosen to give a good trade-off on E. coli̇ We have performed extensive evaluation and tuning of our methods [6, 7], and in fact the methods currently used for BioCyc have improved somewhat over the published methods. For example, our use of CFinder is a new addition; it does not require full connectivity among genes within a cluster as did the previous method. And because highly-overlapping cliques are merged by CFinder into a bigger groups, this code results in fewer groups which overlap each other much less and are also more accurate (make better predictions of the pathways or protein complexes that we know of in the target organisms). CFinder is run on a network that was created with a very high threshold on the pairwise scores resulting in relatively few high-confidence groups being found. Inspection of the FunGC listing page for E. coli will show that the FunGCs present in BioCyc for genes of known function overlap very strongly with known pathways.
FunGC information is available on two new BioCyc page types:
This page is available for genes with at least one pairwise functional linkage that exceeds our minimum cutoff. Navigate to this page from the gene page by clicking on “Genome Context Analysis” in the right-sidebar menu of the gene page. Not every gene page will have such a page available.
The Genome-Context Analysis Page has two sections: the first section lists the FunGC(s) to which a gene gene has been assigned. The second section lists the pairwise functional linkages that have been computed for a gene. Not every functionally linked gene will appear in a FunGC.
The first section depicts the FunGC as a graph with edge color indicating the strength of pairwise functional linkages. The tabs to the right of the FunGC diagram produce different views of the genes within a FunGC, such as showing any known pathways and operon structures containing these genes.
The Functional Gene Clusters page lists all FunGCs available for an organism. Navigate to this page from the Genome-Context Analysis Page for any gene, using the link near the top (“See all FunGCs for this organism”).
As genomic information becomes available for an ever growing number of organisms, it becomes essential to have up-to-date knowledge resources for organisms with large experimental communities. For example, how can an experimentalist hope to accurately interpret gene-expression microarray results without having all available knowledge about the functions of each gene product in the organism at their fingertips? We see it as essential that experts on the biology of specific organisms begin curation of organism-specific databases describing the genome and biochemical networks of those organisms.
All of the Tier 3 BioCyc PGDBs, and some of the Tier 2 PGDBs, are available for adoption. The DBs are available under an open license agreement, meaning that they are freely available to all, they may be modified, and they may be freely redistributed.
The Pathway Tools software system on which BioCyc is based provides a package of graphical editing tools that allow biologists to create and update database entities such as genes, proteins, metabolic pathways, metabolites, operons, and transcriptional regulatory interactions. Scientists can also use Pathway Tools to publish their PGDBs on their own website, in a manner analogous to the BioCyc website (which is powered by Pathway Tools).
Suggested steps for adoption of PGDBs are as follows.
For more details on post-adoption care of a PGDB, see Post-Adoption Care of a BioCyc Database.
BioCyc is served by a network-load balancer that dispatches user requests to a collection of Linux-based computers. BioCyc web service APIs are described here .
The BioCyc website is powered by the Pathway Tools software [11, 12]. Pathway Tools runs as a long-lived web server process, with web requests handled by Franz AllegroServe and CWEST . Pathway Tools also makes use of other bioinformatics software such as BLAST, PatMatch, and MUSCLE. Users can install Pathway Tools on their own computers, where it can run as both a desktop application and as a local web server. Local installations of Pathway Tools can also be used to query and update BioCyc data via APIs in Python, Java, Lisp, Perl, and R. Pathway Tools is written in the Common Lisp language using the Allegro Common Lisp product from Franz Inc .
BioCyc databases are stored in an object-oriented database system called Ocelot , which stores its databases persistently in disk files and in MySQL databases. For graphics generation Pathway Tools relies on Allegro CLIM (Common Lisp Interface Manager) in concert with an image generation package called SKIPPY.
For more details on the architecture of Pathway Tools see .
Please see the comparison section of the MetaCyc Guide .
BioCyc has a number of advanced operations including a number of comparative genomics tools, programs for analysis of high throughput datasets such as gene-expression data, and metabolic network analysis tools. The following information resources describe BioCyc in more detail.
BioCyc is grateful for the following groups: