Example 1. Advanced computational and biological analysis of high quality expression and SNP/copy number arrays.
Example 2. Analysis of Single nucleotide polymorphisms.
Example 3. Proteomics analysis.
Example 4. Promoter Analysis.
Example 5. Meta-Analysis of Heterogeneous Data Sets.
Example 6. Web-based Databases.
Example 1. Advanced computational and biological analysis of high quality expression and SNP/copy number arrays of adrenocortical tumor tissues
These data come from a study headed by Dr. Michael Demeure to identify drug targets for adrenocortical cancer. The same set of 4 normal and 19 cancer tissues was analyzed on both Agilent and Affymetrix chip platforms. For expression data, our approach is as follows: (1) assess the quality of each chip and retain the best chips for analysis, (2) identify genes that show significantly altered expression in tumor tissue compared to normal tissue, and (3) analyze the regulation and function of the altered genes. A similar assessment is performed for SNP/copy number arrays. The final goal of the adrenocortical project is to find patterns of gene expression and loss of heterozygosity in cancer tissues that point to new drug targets. We have similar goals for the prostate and pancreatic cancer tissue data that we are analyzing.
Figure 1A shows two examples of quality control (QC) analysis. The red-yellow panel on the left is a QC analysis of 50 Affymetrix arrays of prostate tissue that have been compared to each other in all possible combinations. The test is to find and eliminate an odd man out: a tissue that is unlike any other, as evidenced by a row of red, which indicates a large overall difference in expression. Columns and rows of a lighter red or yellow (white for identical) reveal tissues that belong to a similar cluster. This set of tissues came from an original collection of over 100, about half of which were removed based on this test. On the right of Figure 1A, the blue-yellow QC plot shows gene expression values on an Agilent slide of adrenocortical tissue. A random pattern of shades of blue and yellow, indicative of higher or lower levels of expression, is expected, but often, due to problems in labeling and hybridization, stripes of color and faded areas are observed. Based on this picture and a number of other computational tests that utilize controls on the slide, this slide passed as having acceptable, although not the highest possible, quality.
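The all-against-all array comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Service's actual QC procedure: the distance measure (1 minus the Pearson correlation), the median-based rule, the cutoff of 0.5, and the array names and expression values are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def distance_matrix(arrays):
    """Compare every array with every other one; distance = 1 - correlation."""
    names = list(arrays)
    return {a: {b: 1.0 - pearson(arrays[a], arrays[b]) for b in names}
            for a in names}

def flag_outliers(arrays, cutoff=0.5):
    """Flag arrays whose median distance to all other arrays exceeds the cutoff.

    The cutoff is an assumed value chosen for this toy example only.
    """
    dist = distance_matrix(arrays)
    flagged = []
    for name, row in dist.items():
        others = sorted(d for other, d in row.items() if other != name)
        median = others[len(others) // 2]
        if median > cutoff:
            flagged.append(name)
    return flagged

# Made-up expression values for six genes on four hypothetical arrays.
arrays = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "B": [1.1, 2.0, 3.2, 3.9, 5.1, 6.2],
    "C": [0.9, 2.1, 2.8, 4.2, 4.9, 5.8],
    "D": [6.0, 1.0, 5.0, 2.0, 4.0, 3.0],  # unlike every other array
}

print(flag_outliers(arrays))
```

An array that would appear as a solid row of red in the heat map is the one whose distance to every other array is large, which is what the median rule tries to capture here.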
Figure 1B-D illustrates further examples of analysis. From high quality Affymetrix and Agilent tissue expression data, expression levels of fifteen cell cycle genes and several mitotic spindle genes were found to be very strongly correlated, and those of a few genes negatively correlated, with CDC2 expression across tumor tissues. The promoter sequences of these genes were analyzed and all were shown to have a consensus binding site for the E2F-1 transcription factor, extending previously published results.
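As a rough sketch of how genes correlated with CDC2 might be picked out, the Python below ranks genes by the Pearson correlation of their expression with a reference gene across samples. The expression values, the gene names GENE_NEG and GENE_FLAT, and the 0.8 threshold are hypothetical; the source does not state which correlation measure or cutoff was used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of expression values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_genes(expression, reference="CDC2", min_abs_r=0.8):
    """Return (gene, r) pairs whose |r| with the reference gene passes the cutoff.

    Results are sorted by decreasing |r|, so strong negative correlations
    rank alongside strong positive ones.
    """
    ref_values = expression[reference]
    results = []
    for gene, values in expression.items():
        if gene == reference:
            continue
        r = pearson(ref_values, values)
        if abs(r) >= min_abs_r:
            results.append((gene, r))
    return sorted(results, key=lambda item: -abs(item[1]))

# Made-up expression levels across five tumor samples.
expression = {
    "CDC2":      [2.0, 3.0, 5.0, 7.0, 8.0],
    "CDCA5":     [1.9, 3.2, 4.8, 7.1, 8.3],  # tracks CDC2 closely
    "GENE_NEG":  [8.0, 7.0, 5.0, 3.0, 2.0],  # hypothetical, moves opposite to CDC2
    "GENE_FLAT": [5.0, 4.0, 6.0, 4.0, 5.0],  # hypothetical, unrelated to CDC2
}

for gene, r in correlated_genes(expression):
    print(f"{gene}\t{r:+.3f}")
```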
Figure 1B shows how well expression of CDC2, which is the main control gene for the mitotic cell cycle, correlates with that of CDCA5 across the tumor samples, by tumor grade. Figure 1C is a heat map showing how well the expression pattern of a set of cell cycle and spindle formation genes separates the clinical samples; shades of red indicate higher, and shades of green lower, expression levels. Figure 1D shows the loss of heterozygosity in chromosome 10 based on our analysis of the adrenocortical copy number/SNP Affymetrix chips. The blue vertical lines indicate those tissue samples that have lost some or all of chromosome 10. Chromosome loss is more apparent in tissues that show the most abnormal expression of mitotic spindle proteins. These analyses demonstrate the collective capabilities of our staff in extracting important biological and clinical information from large data sets, information that can provide predictions for hypothesis building in grant applications. Our staff has also developed some novel methods for finding genes that are changing in cancer cells and tissues.
Pathway Analysis: In the first method, a pattern of significantly varying genes is input into our Biorag database, which collects currently available information on the human genome, as well as on many other genomes (this database is described further below). Figure 1E is an output of the Pathway Miner tool that shows which genes are in well-known regulatory and metabolic pathways in pancreatic cancer tissues. Nodes are genes colored by expression level in a set of microarray experiments. Edges between nodes connect genes that act in the same pathway; thicker edges represent more interactions. Clicking the mouse on the display produces a complete description of the genes or an annotated diagram of a pathway.
Drug Target Finding: In a second method of analyzing large data sets, the Bioinformatics Shared Service uses novel methods to determine which genes are suitable drug targets or predictive for disease. Often, targets are identified as products of significantly over-expressed genes in cancer tissues and cells. We also use graph theory, a computational approach, to determine which genes appear to interact, such as one gene regulating another. An example of a graph showing interaction among the cell cycle genes expressed abnormally in adrenocortical cancer tissues is shown in Figure 2A. These types of graphs can be used very effectively to predict gene interactions by combining different data sets and types of analysis (also see http://www.cytoscape.org).
Alternatively, the genes could be a synthetic lethal combination, in that the cell needs one gene or the other for viability, or needs one gene expressed in order to compensate for over-expression of another. Through data analysis, we can determine whether a cancer cell depends totally on one gene, having lost or over-expressed the other, making the remaining gene a sensitive drug target. As an example, we are helping to discover synthetic lethal targets for over-expressed cell cycle genes in adrenocortical (Figure 1B) and pancreatic cancer. We are also able to search for genes that are likely to be strongly over-expressed because of a rearrangement in the genome that fuses the gene to a strong promoter, as occurs in chronic myelogenous leukemia and in some B and T cell malignancies.
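The gene interaction graphs mentioned under Drug Target Finding can be represented very simply. The sketch below, which is an illustration rather than the Service's method, builds an undirected graph from pairs of genes reported as interacting, counts how many independent observations support each edge (comparable to edge thickness in a network display), and lists the most connected genes. The gene pairs and the names GENE_A and GENE_B are hypothetical.

```python
from collections import defaultdict

def build_graph(evidence):
    """Build an undirected graph from (gene, gene) pairs.

    graph[a][b] holds the number of observations supporting the a-b edge,
    so combining several data sets simply means concatenating their pairs.
    """
    graph = defaultdict(dict)
    for a, b in evidence:
        if a == b:
            continue  # ignore self-interactions in this sketch
        graph[a][b] = graph[a].get(b, 0) + 1
        graph[b][a] = graph[b].get(a, 0) + 1
    return graph

def hubs(graph, top=1):
    """Return the genes with the most distinct neighbors (ties broken by name)."""
    return sorted(graph, key=lambda gene: (-len(graph[gene]), gene))[:top]

# Hypothetical interaction evidence pooled from two imaginary data sets.
evidence = [
    ("CDC2", "CDCA5"),
    ("CDC2", "GENE_A"),
    ("CDCA5", "CDC2"),   # second, independent observation of the same edge
    ("GENE_A", "GENE_B"),
    ("CDC2", "GENE_B"),
]

graph = build_graph(evidence)
print(hubs(graph))              # most connected gene(s)
print(graph["CDC2"]["CDCA5"])   # number of observations supporting this edge
```

Highly connected genes in such a graph are natural starting points when looking for candidate targets, although connectivity alone does not establish that a gene is a suitable drug target.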
Example 2. Analysis of Single nucleotide polymorphisms
Researchers are beginning to analyze the influence of genetics on cancer risk and treatment. Planning these studies requires an understanding of population genetics, experimental design, and the data analysis needed. The Bioinformatics Shared Service has the expertise to assist with planning such studies, analyzing the data, and writing papers and grants.
As an example of a study of this kind, a large number (over 1,000) of blood samples from a colonic adenoma recurrence study by the Cancer Prevention and Control Program were obtained from Drs. Patricia Thompson and Gene Gerner and analyzed for the presence of sequence polymorphisms in the APC gene using a SNPStream instrument at the Arizona Research Laboratories and in the ODC gene using Sequenom data from TGen. These machines reveal, for example, that a particular DNA site might have either an A/T or G/C base pair, and tell which base pair is present in each of the two chromosomes for each sample (two G/Cs, two A/Ts or an A/T-G/C combination). For the 4-6 sites analyzed, our role was to determine which bases are present on the same piece of chromosome (i.e. are in the same phase or haplotype), from which the genotype of each sample could be found. To do the analysis, we developed a number of Perl scripts that reconfigured the SNPStream or Sequenom instrument data output. The data were used as input into an open source program (fastPHASE) that computed the desired haplotypes and genotypes. The fastPHASE output was reformatted by additional Perl scripts to produce a table that could be used for a statistical analysis testing the association of particular haplotypes with clinical outcome.
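The reformatting step described above was done with Perl scripts that are not reproduced here. As a hedged sketch of the same idea in Python, the function below turns a table of unphased genotype calls into a text layout modeled on our understanding of the basic fastPHASE input (number of individuals, number of sites, then an ID line and two allele lines per individual, with ? for missing calls). The layout should be checked against the fastPHASE documentation before use, and the sample IDs and calls are made up.

```python
def to_fastphase(genotypes):
    """Format unphased genotype calls as text in an assumed fastPHASE-style layout.

    genotypes maps a sample ID to a list of two-letter calls, one per SNP site
    (e.g. "AG"), with None marking a missing call. The order of the two
    alleles within a call carries no phase information.
    """
    site_counts = {len(calls) for calls in genotypes.values()}
    if len(site_counts) != 1:
        raise ValueError("every sample must have the same number of sites")

    lines = [str(len(genotypes)), str(site_counts.pop())]
    for sample, calls in genotypes.items():
        first = "".join(call[0] if call else "?" for call in calls)
        second = "".join(call[1] if call else "?" for call in calls)
        lines.append(f"# {sample}")
        lines.append(first)
        lines.append(second)
    return "\n".join(lines) + "\n"

# Hypothetical calls for two samples at two sites.
example = {
    "S1": ["AA", "AG"],
    "S2": ["AG", None],  # second site failed for this sample
}

print(to_fastphase(example), end="")
```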
The analysis of the ODC gene SNPs using the HaploVIEW program is shown in Figure 1F. Color indicates the degree of linkage disequilibrium (the non-random association of alleles at different SNP sites) between SNP markers across the gene, with darker purple blocks along the diagonal between adjacent SNPs indicating which SNPs are most likely in the same haplotype block (98-99% probability). This example illustrates that the Bioinformatics Shared Service has the knowledge and expertise needed to perform this type of population genetics analysis.
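Linkage disequilibrium between a pair of SNPs is commonly summarized by two statistics, D' and r², computed from phased haplotype frequencies. The Python sketch below shows the standard calculation for one pair of sites; it is not HaploVIEW's code, and the haplotype counts are invented for illustration.

```python
def linkage_disequilibrium(haplotypes, allele_a, allele_b):
    """Compute D, D' and r^2 for two biallelic sites from phased haplotypes.

    haplotypes is a list of two-letter strings: the allele at site 1 followed
    by the allele at site 2 on the same chromosome. allele_a and allele_b name
    the alleles whose association is measured.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n

    d = p_ab - p_a * p_b  # deviation from independent assortment
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0

    variance = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / variance if variance > 0 else 0.0
    return d, d_prime, r_squared

# Two made-up extremes: complete association and no association.
complete = ["AG"] * 5 + ["TC"] * 5
independent = ["AG", "AC", "TG", "TC"]

print(linkage_disequilibrium(complete, "A", "G"))
print(linkage_disequilibrium(independent, "A", "G"))
```

Values of D' and r² near 1 for adjacent SNPs correspond to the dark blocks along the diagonal of an LD plot, while values near 0 indicate that the two sites are inherited largely independently.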
The Bioinformatics Shared Service is also prepared to analyze similar data sets that include many more sequence polymorphisms extending over longer intervals of the human genome. In the above example of an APC gene analysis, the sequence analysis assumed that the 4 SNPs analyzed were in a single haplotype block inherited from one generation to the next. This assumption was reasonable because the length of the sequence was quite short. When dealing with longer regions, it is necessary to perform an additional analysis that shows which adjacent polymorphisms are inherited in the same haplotype block, as in Figure 1F. Once the block structure is known, the Service can assist with supporting sequence analysis, such as amino acid variation between blocks.
The Service is also tracking SNP data sets that are being collected and analyzed in large genome projects. The HapMap project (http://www.hapmap.org) involves hundreds of investigators in different countries who are deciphering genetic variability in three different racial groups. In some cases, genes of interest to cancer prevention have already been studied and may be found on the HapMap web site and the dbSNP site at NCBI (http://www.ncbi.nlm.nih.gov/). The Service can retrieve and analyze the data using bioinformatics tools such as HaploVIEW and thereby assist cancer center researchers in planning the collection and analysis of these kinds of data. Having the available information can greatly increase the scientific significance and reduce the cost of a population genetics study. For example, we can show when a single marker SNP, typed by simple DNA sequencing, may be used to indicate the presence of a known haplotype of a cancer-related gene, avoiding an expensive study of many SNPs using chip technology.
Example 3. Proteomics analysis
A melanoma cell line was analyzed to search for proteins induced in response to an apoptosis signal compared to a control treatment. Control and responsive proteins were labeled, immunoprecipitated to obtain a mitochondrial fraction, and run on 2D gels; spots of interest were then analyzed by mass spectrometry to identify the proteins. The Bioinformatics Shared Service provides informatics support to Arizona proteomics consortium labs through the protbase web site (http://www.protbase.org), which logs new requests for service, stores experimental results and returns data to the investigator. Bioinformatics staff analyzed the SEQUEST results (SEQUEST identifies proteins based on peptides), selected the peptides with better identification scores, and functionally profiled them using gene function (gene ontology), metabolic pathways, protein domains, and other features. In some cases, it was necessary to do a more extensive search of other mammalian genomes. This example illustrates how well the Bioinformatics Shared Service is integrated with the Proteomics Shared Service, providing information management and expert assistance in interpreting proteomics results from cancer cells and tissue samples. We are also prepared to help design and implement experiments for the identification of early peptide markers in tissue and serum samples.
Example 4. Promoter Analysis
One type of experiment requested by several investigators is sequence pattern analysis of gene promoter regions. In one example, a set of genes was found to be up-regulated following arsenic treatment, raising the question of whether common transcription factors might be involved in co-regulating these genes. Our service pays for access to a comprehensive transcription factor database called TRANSFAC. This database can be used to profile genes co-regulated by common transcription factors or transcription factors that co-regulate sets of genes. It includes a set of sequences that are recognized by each transcription factor and a scoring matrix representation of these sites. A match algorithm searches promoter sequences with the scoring matrices, ranks the candidate binding sites, and applies a cutoff threshold to find significant sites. If sequence patterns similar to the sites for a particular transcription factor are present in the promoter sequence, then that factor is predicted to regulate the gene. In many cases, the prediction may be a false positive. However, we use other tools to search for clusters of binding sites, providing a more meaningful and reliable prediction of gene regulation.
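The scoring-matrix search described above can be illustrated with a toy position weight matrix. The Python below is a simplified sketch, not the TRANSFAC match algorithm: the four example binding sites, the pseudocount, the uniform background, the relative-score formula and the 0.8 cutoff are all assumptions made for illustration.

```python
from math import log2

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds scoring matrix from aligned, equal-length binding sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def scan(sequence, pwm, cutoff=0.8):
    """Score every window of the sequence and keep those above the cutoff.

    Scores are rescaled to 0..1 between the worst and best possible match.
    Returns (start, window, relative_score) tuples, best first.
    """
    width = len(pwm)
    best = sum(max(column.values()) for column in pwm)
    worst = sum(min(column.values()) for column in pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        relative = (score - worst) / (best - worst)
        if relative >= cutoff:
            hits.append((start, window, round(relative, 3)))
    return sorted(hits, key=lambda hit: -hit[2])

# Hypothetical aligned binding sites and a made-up promoter fragment.
sites = ["TTTCGCGC", "TTTGGCGC", "TTTCCCGC", "TTTGCCGC"]
promoter = "AACGATTTCGCGCATTAGG"

for start, window, relative in scan(promoter, build_pwm(sites)):
    print(start, window, relative)
```

Lowering the cutoff finds more candidate sites at the cost of more false positives, which is why a search for clusters of sites is a useful second filter.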
Example 5. Meta-Analysis of Heterogeneous Data Sets
To analyze candidate biomarkers and therapeutic targets, it is advantageous to validate gene expression data by analysis across independent data sets, that is, by a meta-analysis. Analyzing a collection of large data sets can reveal information beyond what any individual data set provides. As an example, we are presently working with several investigators, each of whom is collecting an independent data set. The goal is to use these diverse data sets to generate gene modules and networks that describe the response of human cell lines and mice to arsenic or other toxic agents. The idea is to analyze expression changes across different microarray expression platforms and across diverse samples to identify an arsenic response network, and to search commonly regulated gene sets for meaningful relationships in terms of pathways or regulatory factors.
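One simple way to combine evidence for a gene across independent data sets is Fisher's method for combining p-values. It is shown below only as an example of a meta-analysis calculation; the source does not say which method the Service uses, the method assumes the data sets are independent, and the p-values are made up.

```python
from math import exp, log

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For an even number of
    degrees of freedom the upper tail has the closed form used here:
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    """
    if not p_values or any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")
    k = len(p_values)
    half_statistic = -sum(log(p) for p in p_values)  # equals X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_statistic / i
        total += term
    return exp(-half_statistic) * total

# Made-up p-values for one gene measured in three independent data sets.
per_dataset = [0.04, 0.10, 0.03]
print(f"{fisher_combined_p(per_dataset):.4g}")
```

A gene with modest but consistent evidence in every data set can reach a smaller combined p-value than it does in any single study, which is the main attraction of pooling evidence this way.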
Example 6. Web-based Databases
The Bioinformatics Shared Service often provides database design and programming to individual groups who want to build a secure, web-based database. Three examples of databases we have built are: first, databases for storing pathological and clinical information on a cancer tissue collection; second, databases for storing and analyzing genome information (e.g. http://azcc-microarray.arl.arizona.edu/index.php); and third, project-specific databases.
Our GI SPORE tissue database is an Oracle database that was originally funded by the GI SPORE grant to Dr. Gene Gerner to store information on GI tissues. The same database is also being used to store a large prostate cancer tissue data set and pancreatic cancer tissue data, and will eventually be expanded to include all cancer tissue types. The database is accessible from any internet location but is also protected by a high level of security. Patient privacy is protected, and all IRB and HIPAA requirements are followed. The tissue databases are also used to store data such as prostate PSA levels so that data analysis can be done. The database can be "tuned" to the data storage and analysis needs of a particular research project or grant by providing data entry and analysis pages in support of that project. The database can store a large variety of data, including images of slides, laboratory experiments, and links to large data sets for later analysis.
The second example of a database built by the Bioinformatics Service is the Biorag database and genome analysis site. The goal of this site was to provide cancer center researchers with the capability of analyzing lists of over- or under-expressed genes from microarray experiments. The database stores a large amount of information on gene function and interaction to reveal what biological changes are taking place in cancer cells or tissues. Having such information can greatly assist with defining a new drug target, for instance. The Biorag site includes updated lists of the human, mouse, Drosophila, rat and yeast genomes; computer tools to aid in finding gene homologs and examining gene ontologies; gene and protein interactions; and metabolic and regulatory pathway information. This database was originally built in response to the needs of Arizona Cancer Center researchers for a web site that they could use to analyze their gene expression data from the Genomics Shared Service. However, the web site has since been greatly expanded as needs evolved. Users of the Arizona Cancer Center Genomics Shared Service are routinely provided with a list of genes that are further analyzed using Biorag.
Examples of project-specific, web-based databases are a mouse experiment-tracking database and databases for information and data sharing, conferencing and experiment logging for the GI SPORE and pancreatic grants.