Guide to the BioCyc Database Collection
Contents
3 Methodology for Generation of BioCyc Tier 3 PGDBs
5 BioCyc Concepts
5.1 How are Pathway Boundaries Defined?
5.2 Super Pathways and Base Pathways
5.3 Do We Force a Pathway View of the Metabolic Network?
5.4 Reaction Direction
5.5 Reaction Balancing and Protonation State in BioCyc
6 BioCyc Computing Architecture
7 Comparison of BioCyc to Other Pathway Databases
8 Discovering Functional Gene Clusters Using Genome Context Methods
8.1 How are FunGCs Computed?
8.2 BioCyc Ortholog Data and the Reference Genomes
8.3 Computing Pairwise Functional Linkage Scores
8.4 Computation of Functionally Linked Gene Clusters
8.5 How to Use Functional Gene Cluster Information in BioCyc
9 Mechanisms for Accessing BioCyc Data
10 How to Learn More About PGDBs and BioCyc
1 Introduction
This document provides an overview of the BioCyc collection of Pathway/Genome Databases.
Although its content is limited at the current time, it will expand over time to cover additional aspects of BioCyc. The information in this document pertains to all BioCyc databases (DBs), and to most other DBs created using the Pathway Tools software. More detailed information about specific members of the BioCyc family is available as follows:
2 The BioCyc Databases
The BioCyc collection of Pathway/Genome Databases (PGDBs) provides electronic reference sources on the pathways and genomes of many organisms. BioCyc databases describe organisms with sequenced genomes. BioCyc is primarily microbial. In addition, BioCyc contains databases for humans and for model eukaryotes such as yeast, fly, mouse, and rat. One reason for collecting all these PGDBs together within BioCyc is to enable the comparative analyses that become possible when multiple PGDBs are available within one site (see Tools → Analysis → Comparative Analysis). Other groups have created PGDBs for many other organisms (see [details]), including microbes, fungi, and plants (see PlantCyc.org).
The databases (DBs) within the BioCyc collection are organized into tiers according to the amount of manual review and updating they have received.
Tier 1 PGDBs have been created through intensive manual efforts, and receive continuous updating [details of Tier 1]. The BioCyc Tier 1 DBs are:
EcoCyc — Escherichia coli K-12.
MetaCyc — experimentally elucidated enzymes and metabolic pathways from 3,400 organisms. MetaCyc does not seek to model the complete metabolic network of any one organism, but to provide a comprehensive collection of experimental pathways.
HumanCyc — Homo sapiens
YeastCyc — Saccharomyces cerevisiae
AraCyc — Arabidopsis thaliana
Tier 2 PGDBs were computationally generated by the PathoLogic program, and have undergone moderate amounts of review and updating. There are 80 PGDBs in Tier 2. [details of Tier 2]
Tier 3 PGDBs were computationally generated by the PathoLogic program, and have undergone no review or updating. There are 19985 PGDBs in Tier 3. [details of Tier 3]
We encourage scientists to contact us to contribute to the ongoing curation and updating of BioCyc databases.
Most microbial PGDBs within BioCyc have been generated computationally by SRI and are regenerated every 6-12 months to take advantage of improvements in our pathway prediction algorithms and in the MetaCyc pathway database. The PGDBs within BioCyc that have been provided by outside groups are updated with variable frequencies. Usually the date on which a PGDB was generated or last updated can be determined by selecting that PGDB as the current PGDB and then viewing the page at Tools → Analysis → Summary Statistics or Tools → Analysis → Reports → History of Updates.
3 Methodology for Generation of BioCyc Tier 3 PGDBs
Tier 3 PGDBs are generated using an automated computational pipeline. The genome data source for a specific PGDB can be determined by selecting that PGDB as the current database and then executing Tools → Analysis → Summary Statistics. Genome data sources include:
Most Tier 3 genomes are obtained from RefSeq: ftp://ftp.ncbi.nih.gov/genomes/all/GCF
Human Microbiome Project genomes are obtained from: ftp://ftp.ncbi.nih.gov/genomes/HUMAN_MICROBIOM/Bacteria
The processing steps used to create BioCyc from an annotated genome are as follows.
Each genome is converted to files in the PathoLogic format that serves as input to the PathoLogic component of the Pathway Tools software.
PathoLogic generates a BioCyc PGDB for the organism (see further details below).
A combination of automated and manual quality assurance checks is applied to the generated PGDBs to ensure they are free of errors. Some genomes are not included in BioCyc because their data fail these quality checks, e.g., some genomes have so few proteins with predicted functions that they are excluded. These checks are described in detail in the BioCyc PGDB Concepts Guide, section “Quality Checking of PGDB Data”.
Although all processing steps are overseen by our staff, no human curation is applied to the Tier 3 PGDBs.
The specific processing steps used to generate Tier 3 PGDBs are as follows.
Predict metabolic pathways using PathoLogic [13].
Predict operons using PathoLogic [22].
Predict pathway hole fillers using PathoLogic [9].
Predict transporter reactions using PathoLogic [15].
Predict protein complexes using PathoLogic.
Generate an organism-specific metabolic map diagram (cellular overview) using Pathway Tools [14].
Import from UniProt [4] predicted and experimental protein functions, gene names, protein feature data (e.g., protein phosphorylation sites, active sites, and transmembrane regions), and Gene Ontology terms. Note that UniProt lacks these data for some organisms.
Data extracted from UniProt is copyright of the UniProt Consortium and subject to the Creative Commons Attribution (CC BY 4.0) License and the disclaimers at uniprot.org/help/license. The original data is available from uniprot.org.
Import protein subcellular localizations from PSORTdb [19] for available organisms.
Import transcriptional regulatory information from RegTransBase [3] and from Tractor_DB [21] for available organisms.
Import organism metadata such as relationship to oxygen, geographic location of sample, host for pathogens, and human microbiome site from NCBI BioProject/BioSample, PATRIC, and other available sources such as GOLD.
Compute the presence of Pfam domains in proteins (they are depicted as protein features).
Compute ortholog relationships among proteins within all BioCyc PGDBs.
Create web links from each BioCyc genome and each BioCyc gene/protein to as many external bioinformatics DBs as possible, including RefSeq, UniProt, and additional DBs via the UniProt ID. The full list of DBs to which BioCyc PGDBs can include links is:
Cazy, dip, disprot, interpro, mint, panther, pdb, phosphosite, pride, prints, prodom, proteinmodelportal, prosite, pfam, smart, smr, string
4 BioCyc Ortholog Data
Operationally, BioCyc uses the term “ortholog” not in the strict evolutionary sense defined by Fitch [8] but in the looser sense of genes that are likely to be counterparts of one another because they are the most closely related in a pair of organisms. Although such genes are likely to be evolutionarily related in the sense of Fitch, we do not perform a detailed evolutionary analysis to compute BioCyc orthologs.
BioCyc ortholog information is generated by running a sequence comparison program pair-wise between all proteomes of all PGDBs. We do not compute sequence similarity information for RNA genes or pseudogenes; therefore, BioCyc does not currently contain orthology information for such genes.
In the past, we used NCBI BLAST (“legacy” BLAST version 2.2.23) for the sequence comparisons, and some ortholog information that was computed by BLAST is still retained. However, since 2021, we have switched to a much faster program called Diamond (version 2.0.4), which in its “sensitive mode” still produces comparable results. Orthologs for new PGDBs are now computed with Diamond, as are orthologs for older PGDBs when they are updated with newer versions of genome annotations.
We use an E-value cut-off of 0.001 when running Diamond (or BLAST), with all other parameters left at their default settings. We define two proteins A and B as orthologs if protein A from proteome PA and protein B from proteome PB are bi-directional “best” comparison hits of one another, meaning that protein B is the best hit of protein A within proteome PB, and protein A is the best hit of protein B within proteome PA. In rare cases, protein A might have multiple orthologs in proteome PB, as explained below.
The “best” hit(s) of protein A in proteome PB are defined by finding the minimal E-value among all hits in proteome PB in the sequence comparison output. Hits to multiple proteins in proteome PB may share that same minimal E-value; in other words, ties are possible, as in the case of exact gene duplications. We attempt to break ties using two methods applied in order: first taking the hit with the maximum alignment length, and then taking the hit with the maximum number of identical amino acid residues in the alignment. For the first method, we compare the alignment lengths among all the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum alignment length is selected. For the second method, we compare the number of identical amino acid residues in the alignments between protein A and the hits of protein A in proteome PB that share the same minimum E-value, and the protein in proteome PB with the maximum number of identical amino acid residues is selected. If ties still remain (as in the case of exact gene duplications), all tied hits are included in the final set of orthologs used by BioCyc. Thus, protein A could have multiple orthologs in PB, for example if multiple proteins B1, B2, etc., exist in PB and have exactly the same regions align against protein A.
BioCyc does not calculate paralogs.
Although orthologs are defined for most pairs of PGDBs in BioCyc, some PGDBs lack ortholog data. A given pair of organisms might lack orthologs within BioCyc for the following possible reasons.
One of the organism PGDBs might not have sequence data
The two organisms are evolutionarily distant and their genes do not have any bi-directional best comparison hits
The ortholog computation has not yet been performed
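The bidirectional-best-hit rule and its tie-breaking order can be illustrated with a minimal Python sketch. This is not the BioCyc implementation: the `Hit` record, the function names, and the example hits are hypothetical, the hits are assumed to have been parsed already from the Diamond or BLAST output, and treating the E-value cut-off as inclusive is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple

class Hit(NamedTuple):
    """One parsed comparison hit (hypothetical record layout)."""
    query: str      # protein in the query proteome
    subject: str    # protein in the target proteome
    evalue: float
    aln_len: int    # alignment length
    n_ident: int    # identical residues in the alignment

E_CUTOFF = 0.001  # E-value cut-off stated in the text

def best_hits(hits):
    """Map each query protein to its set of best subject proteins."""
    by_query = defaultdict(list)
    for h in hits:
        if h.evalue <= E_CUTOFF:  # assumption: cut-off is inclusive
            by_query[h.query].append(h)
    best = {}
    for query, hs in by_query.items():
        min_e = min(h.evalue for h in hs)
        tied = [h for h in hs if h.evalue == min_e]
        # Tie-break 1: maximum alignment length.
        max_len = max(h.aln_len for h in tied)
        tied = [h for h in tied if h.aln_len == max_len]
        # Tie-break 2: maximum number of identical residues.
        max_id = max(h.n_ident for h in tied)
        # Any hits still tied are all kept.
        best[query] = {h.subject for h in tied if h.n_ident == max_id}
    return best

def orthologs(hits_a_vs_b, hits_b_vs_a):
    """Return (A, B) pairs that are best hits of one another."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return {(a, b)
            for a, subjects in best_ab.items()
            for b in subjects
            if a in best_ba.get(b, set())}

# Made-up example: B1 and B2 are exact duplicates, B3 loses the
# alignment-length tie-break, and A3 fails the E-value cut-off.
a_vs_b = [
    Hit("A1", "B1", 1e-50, 300, 290),
    Hit("A1", "B2", 1e-50, 300, 290),
    Hit("A1", "B3", 1e-50, 280, 280),
    Hit("A2", "B3", 1e-60, 150, 140),
    Hit("A3", "B1", 0.5, 40, 12),
]
b_vs_a = [
    Hit("B1", "A1", 1e-50, 300, 290),
    Hit("B2", "A1", 1e-50, 300, 290),
    Hit("B3", "A2", 1e-60, 150, 140),
    Hit("B3", "A1", 1e-50, 280, 280),
]
pairs = orthologs(a_vs_b, b_vs_a)
```

In this example A1 receives two orthologs (B1 and B2) because the duplicates remain tied after both tie-breaking steps.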
5 BioCyc Concepts
This section introduces a number of concepts that are important to understanding BioCyc.
5.1 How are Pathway Boundaries Defined?
Pathway boundaries are defined heuristically, using the judgement of expert curators. Curators consider the following aspects of a pathway when defining its boundaries.
What boundaries were defined historically for the pathway?
When possible, we prefer to define boundaries at the 13 common currency metabolites:
Consistency with regulatory units
Consistency with metabolic units that are evolutionarily conserved
The preceding philosophy toward pathway boundary definition contrasts sharply with KEGG maps. KEGG maps are on average 4.2 times larger than BioCyc pathways because KEGG tends to group into a single map multiple biological pathways that converge on a single metabolite [Pathway05].
5.2 Super Pathways and Base Pathways
We define a super-pathway as a cluster of related pathways. Typically, a super-pathway consists of a linked set of smaller pathways that share a common metabolite. For example, the super-pathway “superpathway of phenylalanine, tyrosine, and tryptophan biosynthesis” connects several smaller pathways at the metabolite chorismate. The components of super-pathways include base pathways (pathways that are not themselves super-pathways), other super-pathways, and individual reactions that have not necessarily been assigned to base pathways. Those reactions typically serve to connect together the component pathways within a super-pathway.
Super-pathways are stored within each BioCyc PGDB – they are not computed dynamically.
5.3 Do We Force a Pathway View of the Metabolic Network?
No. Pathways comprise a level defined on top of the metabolic network. Users can compute with the metabolic (reaction) network directly, ignoring the pathway layer, if they so choose. Note also that some metabolic reactions in most PGDBs are not assigned to any metabolic pathway.
5.4 Reaction Direction
How do PGDBs handle reaction direction?
The direction in which a reaction is stored in a PGDB has no implication for the physiological directionality of that reaction. Each reaction is stored as an instance of the Reactions class that includes two slots, Left and Right. It is possible that the reaction is bidirectional; it is possible that the reaction proceeds physiologically in the left-to-right direction, and it is possible that the reaction proceeds in the right-to-left direction.
The equilibrium constant and change in Gibbs free energy stored for the reaction (if any) refer to the direction of the reaction as stored.
Currently, the best way to query the direction of a reaction is via an internal Pathway Tools Lisp function called get-rxn-direction. In the future, a field will be added to the Pathway Tools schema to record this information.
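As a hedged illustration of why the stored direction matters for thermodynamic values, the Python sketch below flips a stored reaction. The `StoredReaction` class and its field names are hypothetical and do not reflect the Pathway Tools schema beyond the Left and Right slots named above; the compound names and numbers are made up. It relies only on the standard relations that reversing a reaction inverts its equilibrium constant and negates its Gibbs free energy change.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class StoredReaction:
    """Hypothetical stand-in for a reaction with Left and Right slots."""
    left: Tuple[str, ...]
    right: Tuple[str, ...]
    keq: Optional[float] = None      # refers to left -> right as stored
    delta_g: Optional[float] = None  # kJ/mol, refers to left -> right as stored

def as_right_to_left(rxn: StoredReaction) -> StoredReaction:
    """Return the same reaction written in the opposite direction."""
    return StoredReaction(
        left=rxn.right,
        right=rxn.left,
        # Reversing a reaction inverts K and changes the sign of delta G.
        keq=None if rxn.keq is None else 1.0 / rxn.keq,
        delta_g=None if rxn.delta_g is None else -rxn.delta_g,
    )

# Illustrative values only; they are not taken from any PGDB.
stored = StoredReaction(left=("A", "B"), right=("C",), keq=4.0, delta_g=-3.4)
flipped = as_right_to_left(stored)
```

A consumer that needs values for the physiological direction would apply such a conversion only after determining that direction by other means.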
5.5 Reaction Balancing and Protonation State in BioCyc
5.5.1 Background and Motivations
This section addresses the state of reaction mass balance and protonation state of chemical compounds in the BioCyc databases. Because these issues are still evolving and are influenced to a large degree by history, we include a historical discussion of these issues.
Our long-term goal is for all reactions in BioCyc to be fully mass balanced and charge balanced, and for all chemical compounds to be properly protonated at cellular pH. Although in some cases such a treatment may yield reactions or chemical structures that look non-traditional to biochemists, we believe this approach provides the most consistent and correct treatment. In addition, it provides a treatment that will facilitate automatic generation of flux-balance models from PGDBs.
Historically, the chemical structure data within BioCyc databases has been obtained from many different sources, including textbooks, articles from the primary research literature, and downloads from certain open databases. In the early years of the project we developed programs to check the mass balance and element balance of reactions within BioCyc databases. We found that these programs were extremely valuable because identification of unbalanced reactions allowed us to identify errors in both the reaction equations and the chemical structures. However, we also found that, because of the diverse sources from which we obtained chemical structure data, the structures were protonated inconsistently. Therefore, for many years we ignored element imbalances due to hydrogen only, while correcting imbalances due to other elements.
In 2008, we began to address the problem of inconsistent protonation to facilitate automatic generation of flux-balance models. Work was completed on ensuring that reactions in the MetaCyc and EcoCyc PGDBs are completely mass-balanced. The first releases of those fully mass-balanced MetaCyc and EcoCyc DBs were version 13.0 in early 2009. In time, other BioCyc PGDBs will become mass balanced as well. For example, because we periodically regenerate the Tier 3 BioCyc PGDBs, the next time these PGDBs are generated from version 13.0 or higher of MetaCyc, they will be based on the consistently protonated compounds, and the fully mass-balanced reactions.
The following sections describe the methodology by which the protonation-state normalization and reaction mass balancing were achieved.
5.5.2 Protonation State Normalization
For a given chemical compound, there can be atoms that will bind a variable number of hydrogen atoms, depending on their chemical structure and the pH of their environment. A term for the isomers of a compound that differ in the number of hydrogens bound to these atoms is proto-isomer. A term for the atoms with variable numbers of bonded hydrogens is the proto-isomerization centers of a compound. Oxygen, sulfur, phosphorus, and nitrogen are examples of typical proto-isomerization centers.
In order to bring a greater degree of consistency to our PGDBs, we protonated (i.e., assigned the correct number of bound hydrogens to the proto-isomerization centers of a compound) the compounds of EcoCyc with a reference pH value of 7.3, using the Marvin (version 5.1.02) computational chemistry software available from ChemAxon, Ltd [1]. The pH value of 7.3 was selected based on a paper on the measurement of cytoplasmic pH of E. coli [2]. In order to easily exchange compound data between MetaCyc and EcoCyc, MetaCyc was also protonated with a reference pH value of 7.3. This step is an approximation since MetaCyc contains reactions and compounds from many organisms and many cellular compartments.
The Marvin software calculates the protonation state of a compound’s proto-isomerization centers by first determining their pKa values. The pKa values of the proto-isomerization centers of a compound are obtained by computing the partial charge distribution. This, in turn, is calculated using a numerical partial differential equation solver, which computes the distribution from the structure of the compound and the known electronegativities of the constituent atoms. Although we have worked with ChemAxon to improve the accuracy of their calculations to match the experimentally verified pKa values of many biochemically relevant compounds, this calculation is still based on an approximation technique, and will not necessarily yield fully correct pKa values for every substance.
Some caveats about our protonation of compounds:
Some compounds are present in multiple reactions that take place in various different compartments in a cell, or across membranes, where the pH might vary from our stated value of 7.3.
For any given compound, only one proto-isomer is present in our PGDBs. We do not represent the other proto-isomers, nor do we represent the proto-isomerization reactions that inter-convert the various proto-isomers of a compound.
Sometimes a pKa value for a proto-isomerization center is very close to the pH of the solution, and therefore there is approximately a 50 / 50 split between the relative abundance of the two proto-isomers of that compound in solution. The Marvin software will select the most likely proto-isomer based on a comparison of the floating point value of the relative abundance in such situations.
Our compounds might have a slightly different structure than what you will find for the same compound in an alternate chemical compound database. Please ensure that you are comparing the two compounds for the correct protonation state at a reference pH value of 7.3.
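The caveat above about a pKa lying close to the reference pH can be made concrete with the textbook Henderson-Hasselbalch relation for a single ionizable site. The Python sketch below is not how Marvin computes protonation states; it treats one site in isolation, and the function name and example pKa values are illustrative assumptions.

```python
def deprotonated_fraction(pka: float, ph: float = 7.3) -> float:
    """Fraction of a single acid site that is deprotonated at the given pH.

    Uses the Henderson-Hasselbalch relation
    [A-] / ([A-] + [HA]) = 1 / (1 + 10 ** (pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

# A site whose pKa equals the reference pH is split evenly between
# its two proto-isomers.
even_split = deprotonated_fraction(7.3)

# A site with a much lower pKa (illustrative value) is almost fully
# deprotonated at pH 7.3, so choosing one proto-isomer loses little.
mostly_deprotonated = deprotonated_fraction(4.8)

# A site with a much higher pKa (illustrative value) is almost fully
# protonated at pH 7.3.
mostly_protonated = deprotonated_fraction(10.5)
```

Only in the first situation does representing a single proto-isomer discard roughly half of the molecules in solution.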
5.5.3 Computational Reaction Balancing for Hydrogen
Once the compounds of EcoCyc and MetaCyc were protonated, all reactions that had a mass-imbalance due only to hydrogen atoms were computationally balanced. This balancing procedure added or removed instances of the proton from the appropriate side of a reaction to achieve mass-balance.
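A minimal Python sketch of this kind of hydrogen-only balancing follows. It is an illustration under stated assumptions, not the BioCyc code: the function names and data layout are hypothetical, and compounds are represented simply as element counts plus a net charge. The example uses the standard formulas of ATP, ADP, and phosphate in the ionization states commonly shown near neutral pH.

```python
from collections import Counter

def side_totals(side, formulas):
    """Sum element counts and net charge over (compound, coefficient) pairs."""
    elements, charge = Counter(), 0
    for name, coeff in side:
        formula, z = formulas[name]
        for element, n in formula.items():
            elements[element] += coeff * n
        charge += coeff * z
    return elements, charge

def balance_hydrogen(left, right, formulas, proton="H+"):
    """Balance a reaction whose only imbalance is in hydrogen.

    Existing protons are dropped first, then the net number needed is
    added to the hydrogen-deficient side, so protons can be both added
    and removed relative to the input.
    """
    left = [(n, c) for n, c in left if n != proton]
    right = [(n, c) for n, c in right if n != proton]
    l_el, _ = side_totals(left, formulas)
    r_el, _ = side_totals(right, formulas)
    diff = {el: l_el[el] - r_el[el] for el in set(l_el) | set(r_el)}
    nonzero = {el: d for el, d in diff.items() if d != 0}
    if set(nonzero) - {"H"}:
        raise ValueError(f"imbalance not due to hydrogen alone: {nonzero}")
    surplus = nonzero.get("H", 0)  # > 0 means the left side has extra H
    if surplus > 0:
        right.append((proton, surplus))
    elif surplus < 0:
        left.append((proton, -surplus))
    return left, right

# (element counts, net charge) for each compound.
FORMULAS = {
    "ATP": ({"C": 10, "H": 12, "N": 5, "O": 13, "P": 3}, -4),
    "ADP": ({"C": 10, "H": 12, "N": 5, "O": 10, "P": 2}, -3),
    "phosphate": ({"H": 1, "O": 4, "P": 1}, -2),
    "H2O": ({"H": 2, "O": 1}, 0),
    "H+": ({"H": 1}, 1),
}

new_left, new_right = balance_hydrogen(
    [("ATP", 1), ("H2O", 1)],
    [("ADP", 1), ("phosphate", 1)],
    FORMULAS,
)
```

Here one proton is added to the right-hand side. In this particular example the charges also balance afterwards, but mass balancing alone does not guarantee that in general.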
Some caveats about our computational reaction balancing:
Some reactions may show more or fewer protons than are typically depicted. This might be most evident in our EC reactions. One reason for this, beyond our computational reaction balancing, is that traditionally protons and other small, ubiquitous chemical moieties were considered auxiliary to the main function of a reaction and thus were not depicted. In general, our EC reactions may vary from the IUBMB reactions by including more or fewer protons than the original reaction.
For use in FBA models, one must be aware that we represent only one of the possibly many proto-isomers of a compound. We also do not represent the fast protonation reactions that inter-convert the proto-isomers. Thus, an FBA model that attempts to simulate the flux of hydrogen in a PGDB may be inaccurate.
As of 2009, reactions that were computationally balanced for mass are not necessarily balanced for charge.
Unbalanced reactions are due to non-trivial imbalances (i.e., imbalances not due solely to hydrogens or protons). These imbalances are usually due to omissions or errors in the structures and/or reaction composition obtained from the literature. Our curation staff are actively researching such compounds and reactions and correct the data whenever possible.
Reactions for which it is not possible to determine the balance state mainly fall into the following categories:
Reactions that have compound classes as substrates
Polymerization reactions. As of the beginning of 2009, BioCyc.org is working to extend our representation of polymerization reactions to allow for mass and charge balance.
Reactions with substrates that lack a chemical structure
Reactions with substrates that include R-groups
6 BioCyc Computing Architecture
BioCyc.org is served by a network-load balancer that dispatches user requests to a collection of Linux-based computers.
The BioCyc.org website is powered by the Pathway Tools software [11, 12]. Pathway Tools runs as a long-lived web server process, with web requests handled by Franz AllegroServe and CWEST [17]. BioCyc.org makes use of additional bioinformatics software including BLAST, PatMatch, Clustal Omega, and MSAviewer.
Users can install Pathway Tools on their own computers, where it can run as both a desktop application and as a local web server. Local installations of Pathway Tools can also be used to query and update BioCyc data via APIs in Python, Java, Lisp, Perl, and R. Pathway Tools is written in the Common Lisp language using the Allegro Common Lisp product from Franz Inc [1].
BioCyc databases are stored in an object-oriented database system called Ocelot [10], which stores its databases persistently in disk files and in MySQL databases. Data such as orthologs and SmartTables are stored in MySQL databases.
For graphics generation Pathway Tools relies on Allegro CLIM (Common Lisp Interface Manager) in concert with a locally developed JavaScript graphics engine called WG.
For more details on the architecture of Pathway Tools see [12].
BioCyc web service APIs are described here [2].
7 Comparison of BioCyc to Other Pathway Databases
Please see the comparison section of the MetaCyc Guide.
8 Discovering Functional Gene Clusters Using Genome Context Methods
This aspect of BioCyc is meant to enable the discovery of new pathways, and to provide clues as to the functions of genes of unknown function. A functionally linked gene cluster, or FunGC, refers to a group of genes that are estimated by the system to operate in a common pathway, a common cellular process, or a common protein complex (note the term “cluster” does not imply physical clustering on a chromosome). There is no guarantee as to what type of functional linkages exist among the genes in a FunGC. FunGCs are predicted using comparative-genomics methods called genome-context methods [7, 6, 20, 5, 16] that search for patterns across hundreds of genomes. Details of the methods used by BioCyc are described in the next sections.
BioCyc includes FunGCs for 10 genomes as of October 2014. In later releases we plan to expand the number of genomes for which FunGCs are available.
For example, E. coli genes bioA, bioB, bioD, and bioF, which participate in the metabolic pathway for biotin biosynthesis, are all predicted by our method to form a FunGC. Another E. coli FunGC consists of genes xylF, xylH, xylA, and xylB. The xylF and xylH genes encode a xylose ABC transporter, while xylA and xylB encode a xylose isomerase and a xylulokinase, respectively, which participate in a xylose degradation pathway. The full list of E. coli FunGCs is available at http://biocyc.org/ECOLI/genome-context.
We will use E. coli gene yqeB to illustrate BioCyc FunGCs in more detail. yqeB is a gene of unknown function; it is annotated as a conserved protein with a NAD(P)-binding Rossmann fold. yqeB forms a FunGC with genes xdhC (xanthine dehydrogenase, Fe-S subunit), xdhB (xanthine dehydrogenase subunit, FAD-binding domain), xdhA (xanthine dehydrogenase subunit), and paoA (aldehyde dehydrogenase, Fe-S subunit). Thus, because yqeB is assigned to a common FunGC with these four other genes, one might infer that it functions in a common pathway with them.
8.1 How are FunGCs Computed?
FunGCs are computed by a two-step process. Given all genes in a given organism,
Step 1: Compute a pairwise functional-linkage score between every pair of genes in the organism.
Step 2: Compute FunGCs by searching for highly connected sets of functionally linked genes from Step 1.
In a moment we will consider these steps in more detail. But first we discuss the reliance of these methods on the ortholog data within BioCyc.
8.2 BioCyc Ortholog Data and the Reference Genomes
Computation of functional linkages between two genes G1 and G2 is based on comparative-genomics analyses of the orthologs of G1 and G2 that exist in many genomes. We will refer to a given such pair of orthologs as G1′ and G2′.
BioCyc computes orthologous genes (e.g., that G1 and G1′ are orthologs) as bidirectional best hits obtained from BLAST outputs among most (but not yet all) pairs of genomes within BioCyc. To determine whether yqeB and xdhA are functionally linked, we examine the orthologs of those two genes across a special set of 1800 BioCyc genomes that we call the reference genomes. The reference genomes are chosen to be taxonomically diverse because selection of multiple closely related species could bias the relationships that we consider during computation of functional relatedness (e.g., because gene order will be highly conserved in closely related genomes).
8.3 Computing Pairwise Functional Linkage Scores
To estimate whether genes G1 and G2 are functionally linked we consider two factors. First is the co-occurrence of their orthologs in many genomes. Are the orthologs typically found in the same genomes? If so their pairwise functional-linkage score will be higher. But the algorithm also considers how many genomes they are found in. For example, if orthologs of G1 and G2 are found in virtually every reference genome, they would have no choice but to co-occur, thus decreasing their functional linkage score. The genome-occurrence profile for a gene is called its phylogenetic profile, and is represented as a vector of ones and zeros — a one when an ortholog of G is found in a given genome, and a zero when no ortholog of G is found in a given genome. If the phylogenetic profile vectors of G1 and G2 are similar then we infer that the genes are likely to be functionally linked. The exact phylogenetic profile method we use is called pp-mutual-info (znorm) in [7].
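The idea behind comparing phylogenetic profiles can be sketched with plain mutual information between two binary vectors. This Python sketch is only an approximation of the concept: it does not reproduce the pp-mutual-info (znorm) method of [7], in particular not its normalization, and the function name and the eight-genome profiles are made up for illustration.

```python
import math
from collections import Counter

def mutual_information(profile1, profile2):
    """Mutual information (in bits) between two equal-length 0/1 profiles."""
    if len(profile1) != len(profile2) or not profile1:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(profile1)
    joint = Counter(zip(profile1, profile2))
    counts1, counts2 = Counter(profile1), Counter(profile2)
    mi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a, p_b = counts1[a] / n, counts2[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# One entry per reference genome: 1 = ortholog present, 0 = absent.
g1 = [1, 1, 0, 0, 1, 1, 0, 0]
g2 = [1, 1, 0, 0, 1, 1, 0, 0]   # same pattern as g1
g3 = [1, 0, 1, 0, 1, 0, 1, 0]   # statistically independent of g1
everywhere = [1] * 8            # ortholog found in every genome

mi_same = mutual_information(g1, g2)
mi_independent = mutual_information(g1, g3)
mi_ubiquitous = mutual_information(everywhere, everywhere)
```

Two genes whose orthologs are present in every genome co-occur trivially and receive a score of zero, which matches the point made above that ubiquitous genes carry little evidence of functional linkage.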
The second factor used to infer functional linkage is the spatial proximity of orthologs of G1 and G2. That is, when orthologs G1′ and G2′ do co-occur in the same genome, are G1′ and G2′ frequently close to one another on the same chromosome (e.g., separated by few genes or adjacent)? If so, this conserved genome proximity also provides evidence of functional relatedness. The genomic proximity method we use is a variation of the one called gn-lnX (znorm) in [7], where correlation between organisms is taken into account, with a resulting improvement in performance. Given this improvement and the fact that both methods are fused to get an additional boost in performance, the resulting score is significantly better than the best method presented in that paper. To our knowledge, this makes the fused scores used by the system the best genome-context scores in the literature.
Pairwise functional-linkage information alone is useful in providing clues regarding the function of one gene based on the known functions of other genes that it is related to. Even more useful is to find larger sets of genes that are functionally related to one another and that therefore may function in a common pathway or biological process.
8.4 Computation of Functionally Linked Gene Clusters
The method described in this section computes Functionally Linked Gene Clusters, or FunGCs. The algorithm starts with the set of pairwise functional linkages computed in the previous step. The linkages are encoded in a graph whose nodes are the genes of a given organism, and whose edges are the inferred functional linkages among those genes. The algorithm searches the full graph to find highly interconnected subgraphs (they need not be fully interconnected), that is, sets of genes that have a high fraction of mutual functional linkages. Some genes will be removed from the set if they are not sufficiently highly interconnected to other genes in the set. The algorithm we use for computing FunGCs as highly interconnected graphs is CFinder [18].
When computing FunGCs we must make empirical choices regarding two thresholds. One is the strength of pairwise functional linkage that we deem sufficient for inclusion in the network. The other is the clique size that CFinder uses as its starting point, which, in turn, determines the minimum size of the resulting clusters (currently 4). We may adjust these thresholds over time to seek a proper balance between finding sufficient numbers of interesting new pairwise linkages and FunGCs, and limiting the number of incorrect pairwise linkages and FunGCs. The current values were chosen to give a good trade-off on E. coli.
We have performed extensive evaluation and tuning of our methods [6, 7], and in fact the methods currently used for BioCyc have improved somewhat over the published methods. For example, our use of CFinder is a new addition; it does not require full connectivity among genes within a cluster, as did the previous method. And because CFinder merges highly overlapping cliques into bigger groups, it produces fewer groups, which overlap each other much less and are also more accurate (they make better predictions of the pathways or protein complexes that we know of in the target organisms). CFinder is run on a network that was created with a very high threshold on the pairwise scores, resulting in relatively few high-confidence groups being found. Inspection of the FunGC listing page for E. coli will show that the FunGCs present in BioCyc for genes of known function overlap very strongly with known pathways.
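CFinder is generally described as an implementation of the clique percolation method, in which k-cliques that share k-1 nodes are merged into one group. The brute-force Python sketch below illustrates that idea on a toy network; it is not CFinder and is far too slow for real genomes. The gene names, scores, and score threshold are hypothetical; only the clique size of 4 comes from the text.

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """Merge k-cliques that share k-1 nodes (clique percolation)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Enumerate all k-cliques by brute force (fine for toy graphs only).
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: adjacent cliques end up in one component.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return [frozenset(g) for g in groups.values()]

# Hypothetical pairwise functional-linkage scores.
scores = {
    ("g1", "g2"): 0.95, ("g1", "g3"): 0.93, ("g1", "g4"): 0.91,
    ("g2", "g3"): 0.94, ("g2", "g4"): 0.92, ("g3", "g4"): 0.90,
    ("g5", "g1"): 0.91, ("g5", "g2"): 0.92, ("g5", "g3"): 0.93,
    ("g4", "g5"): 0.20,   # weak link, removed by the threshold
    ("g6", "g7"): 0.95, ("g7", "g8"): 0.95, ("g6", "g8"): 0.95,
    ("g6", "g1"): 0.90,
}
SCORE_THRESHOLD = 0.9  # hypothetical value for the first threshold
CLIQUE_SIZE = 4        # minimum cluster size given in the text

edges = [pair for pair, s in scores.items() if s >= SCORE_THRESHOLD]
clusters = k_clique_communities(edges, CLIQUE_SIZE)
```

The single resulting cluster contains g1 through g5 even though g4 and g5 are not directly linked, which shows that full connectivity is not required. Genes g6, g7, and g8 form only a triangle and therefore join no cluster of size 4.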
8.5 How to Use Functional Gene Cluster Information in BioCyc
FunGC information is available on two new BioCyc page types:
Genome-Context Analysis Page
FunGC listing page
8.5.1 Genome-Context Analysis Page
This page is available for genes with at least one pairwise functional linkage that exceeds our minimum cutoff. Navigate to this page from the gene page by clicking on “Genome Context Analysis” in the right-sidebar menu of the gene page. Not every gene will have such a page available.
The Genome-Context Analysis Page has two sections: the first section lists the FunGC(s) to which a gene has been assigned. The second section lists the pairwise functional linkages that have been computed for a gene. Not every functionally linked gene will appear in a FunGC.
The first section depicts the FunGC as a graph with edge color indicating the strength of pairwise functional linkages. The tabs to the right of the FunGC diagram produce different views of the genes within a FunGC, such as showing any known pathways and operon structures containing these genes.
8.5.2 FunGC Listing Page
The Functional Gene Clusters page lists all FunGCs available for an organism. Navigate to this page from the Genome-Context Analysis Page for any gene, using the link near the top (“See all FunGCs for this organism”).
9 Mechanisms for Accessing BioCyc Data
BioCyc data is accessible in several ways, which are described in more detail on the downloads page.
Query and visualization access is available through this BioCyc Web site
Data files for the BioCyc databases are available for download in multiple formats
These databases can be loaded into SRI’s BioWarehouse relational database system (MySQL-based) for querying.
A downloadable “software/database bundle” is available that supports querying, visualization, and analysis of BioCyc data. It also allows users to create their own Pathway/Genome Databases. The software/database bundle includes functionality not available through the Web site, and also executes faster than the Web version.
The software/database bundle also allows users to query BioCyc data via the Java, Perl, and Common Lisp languages.
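As an illustration of working with downloaded data files, the following Python sketch reads records from an attribute-value flat file. It assumes the layout commonly used by Pathway Tools `.dat` files: one `ATTRIBUTE - value` pair per line, `#` comment lines, `/` continuation lines, `^` annotation lines, and `//` as the record terminator. Consult the downloads page for the authoritative format description. The sample records and the function name are hypothetical.

```python
def parse_attribute_value(text):
    """Parse attribute-value text into a list of {attribute: [values]} dicts."""
    records, current, last_attr = [], {}, None
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        if line.strip() == "//":          # end of the current record
            if current:
                records.append(current)
            current, last_attr = {}, None
        elif line.startswith("^"):
            continue                      # value annotations are ignored in this sketch
        elif line.startswith("/") and last_attr is not None:
            # Continuation of the previous value.
            current[last_attr][-1] += "\n" + line[1:]
        else:
            attr, sep, value = line.partition(" - ")
            if not sep:
                continue                  # not an attribute line; ignore it
            current.setdefault(attr, []).append(value)
            last_attr = attr
    if current:                           # tolerate a missing final terminator
        records.append(current)
    return records

# Hypothetical excerpt in the assumed format (not real BioCyc data).
SAMPLE = """
# Hypothetical gene records
UNIQUE-ID - GENE-1
COMMON-NAME - abcA
SYNONYMS - abc1
SYNONYMS - abc2
COMMENT - First line of a long comment
/that continues on a second line.
//
UNIQUE-ID - GENE-2
COMMON-NAME - abcB
//
"""

if __name__ == "__main__":
    for record in parse_attribute_value(SAMPLE):
        print(record["UNIQUE-ID"][0], record.get("COMMON-NAME"))
```

Every attribute maps to a list because an attribute may appear more than once within a record, as `SYNONYMS` does in the sample.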
10 How to Learn More About PGDBs and BioCyc
BioCyc offers a number of advanced operations, including comparative genomics tools, programs for analyzing high-throughput datasets such as gene-expression data, and metabolic network analysis tools. The following information resources describe BioCyc in more detail.
Take the BioCyc guided tour
The Pathway Tools software [download] contains the Pathway Tools User’s Guide, a document that covers all aspects of the software in depth, including a detailed description of the database schema that underlies PGDBs.
How to download Pathway Tools and organism flat-file databases
11 Acknowledgments
BioCyc is grateful to the following groups:
ChemAxon for use of the Marvin chemoinformatics tool
References
[1] Allegro Common Lisp. http://www.franz.com/products/allegro-common-lisp/.
[2] BioCyc Web Services. https://biocyc.org/web-services.shtml.
[3] M. J. Cipriano, P. N. Novichkov, A. E. Kazakov, D. A. Rodionov, A. P. Arkin, M. S. Gelfand, and I. Dubchak. RegTransBase – a database of regulatory sequences and interactions based on literature: a resource for investigating transcriptional regulation in prokaryotes. BMC Genomics, 14:213, 2013.
[4] UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nuc Acids Res, 37(Database issue):D169–174, 2009.
[5] A.J. Enright, I. Iliopoulos, N.C. Kyrpides, and C.A. Ouzounis. Protein interaction maps for complete genomes based on gene fusion events. Nature, 402:86–90, 1999.
[6] L. Ferrer, J.M. Dale, and P. D. Karp. A systematic study of genome context methods: calibration, normalization and combination. BMC Bioinformatics, 11:493, 2010.
[7] L. Ferrer, A.G. Shearer, and P. D. Karp. Discovering novel subsystems using comparative genomics. Bioinformatics, 27:2478–85, 2011.
[8] W.M. Fitch. Distinguishing homologous from analogous proteins. Systematic Zoology, 19:99–113, 1970.
[9] M.L. Green and P. D. Karp. A Bayesian method for identifying missing enzymes in predicted metabolic pathway databases. BMC Bioinformatics, 5(1):76, 2004. http://www.biomedcentral.com/1471-2105/5/76.
[10] P. D. Karp, Vinay K. Chaudhri, and Suzanne M. Paley. A collaborative environment for authoring large knowledge bases. J Intelligent Information Systems, 13:155–94, 1999. http://www.ai.sri.com/pkarp/pubs/99jiis.pdf.
[11] P. D. Karp, P.E. Midford, R. Billington, A. Kothari, M. Krummenacker, W.K. Ong, P. Subhraveti, R. Caspi, I.M. Keseler, and S. M. Paley. Pathway Tools version 23.0 update: Software for pathway/genome informatics and systems biology. Brief Bioinform, 22:109–126, 2019. https://academic.oup.com/bib/article-abstract/22/1/109/5669859?redirectedFrom=fulltext.
[12] P. D. Karp, P.E. Midford, S.M. Paley, M. Krummenacker, R. Billington, A. Kothari, W.K. Ong, P. Subhraveti, I.M. Keseler, and R. Caspi. Pathway Tools version 23.0: Integrated software for pathway/genome informatics and systems biology [v3]. arXiv, pages 1–111, 2019. http://arxiv.org/abs/1510.03964v3.
[13] P. D. Karp, S. M. Paley, M. Krummenacker, M. Latendresse, J.M. Dale, T. Lee, P. Kaipa, F. Gilham, A. Spaulding, L. Popescu, T. Altman, I. Paulsen, I.M. Keseler, and R. Caspi. Pathway Tools version 13.0: Integrated software for pathway/genome informatics and systems biology. Brief Bioinform, 11:40–79, 2010. http://bib.oxfordjournals.org/cgi/content/abstract/bbp043.
[14] M. Latendresse and P. D. Karp. Web-based metabolic network visualization with a zooming user interface. BMC Bioinformatics, 12:176–84, 2011. http://www.biomedcentral.com/1471-2105/12/176/abstract.
[15] T.J. Lee, I. Paulsen, and P. D. Karp. Annotation-based inference of transporter function. Bioinformatics, 24:i259–67, 2008. http://bioinformatics.oxfordjournals.org/cgi/content/full/24/13/i259.
[16] R. Overbeek, M. Fonstein, M. D’Souza, G.D. Pusch, and N. Maltsev. Use of contiguity on the chromosome to predict functional coupling. In Silico Biol., 1(2):93–108, 1999.
[17] S. M. Paley and P. D. Karp. Adapting EcoCyc for use on the World Wide Web. Gene, 172:GC43–GC50, 1996.
[18] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–8, 2005.
[19] M. A. Peabody, M. R. Laird, C. Vlasschaert, R. Lo, and F. S. Brinkman. PSORTdb: expanding the bacteria and archaea protein subcellular localization database to better reflect diversity in cell envelope structures. Nuc Acids Res, 44(D1):D663–8, 2016.
[20] M. Pellegrini, E.M. Marcotte, M.J. Thompson, D. Eisenberg, and T.O. Yeates. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc National Academy of Sciences, USA, 96:4285–8, 1999.
[21] A. G. Perez, V. E. Angarica, A. T. Vasconcelos, and J. Collado-Vides. Tractor_DB (version 2.0): A database of regulatory interactions in gamma-proteobacterial genomes. Nuc Acids Res, 35(Database issue):D132–6, 2007.
[22] P. Romero and P. D. Karp. Using functional and organizational information to improve genome-wide computational prediction of transcription units on Pathway/Genome Databases. Bioinformatics, 20:709–17, 2004.