Genome Session Exercises and Problems

1) Interacting with the UCSC Browser and Database

Start at the UCSC genome informatics home page: http://genome.ucsc.edu/

The first 3 tools listed in the horizontal blue menu (Genomes - Blat - Tables) are the most frequently used. In each case, pull-down menus control the organism and assembly. Note: different organisms and assemblies have different amounts and types of associated data. Clicking the "Genomes" link takes you to the browser gateway page. Example searches and details regarding the selected assembly are shown lower on that page. Perform a basic search by entering "BRCA1" into the "position or search term" window. Wherever the text pattern is present in the database, a result is reported as a hyperlink to the genome browser. Select the first result in the "Known genes" list to go to the Genome Browser.

The information displayed in the browser window is organized into groups of tracks. Note the visual cues in the window such as strand, introns, exons and coding sequence. Different data types have different display properties; details about each type of data and aspects of its display are controlled from the lower part of the page.

Hide all the tracks. Turn on Known genes by selecting "full" from the pull-down menu and clicking "refresh". Turn on "chromosome band" to full and increase the size of the window to 1000 using the configure button and subsequent options. Control the region of the genome being displayed in the browser window using the zoom tools, the "position/search" text window and the "move start/move end" functions. Experiment with the pull-down menu for known genes to see the difference between full, dense, squish and pack. Restore the tracks to default using the "Default tracks" button.

Almost all the data displayed in the browser window is hyperlinked to additional data when the display is in full mode. Click a known gene track in the browser window to open a gene information page. Look at the different kinds of data displayed on this page. In general terms, what kinds of human cells express BRCA1 at a high level? Does D. melanogaster have a homolog? What GO terms are attached to this gene product? Return to the browser window.

Zoom in on the large exon and a couple of flanking exons by clicking in the numbered "chr17" track. This re-centers the browser and zooms in 3X. Based on the known gene view, do you think that the splice variants displayed encode different proteins?

Get the DNA sequence for this region by clicking the "DNA" text in the top blue bar. Examine the initial options and also the "extended case/color options". Configure the display so that known exons are in bold, red and upper case letters. In most cases, introns start with the sequence GT and end with the sequence AG. Based on this, can you tell if you have selected the coding strand of the BRCA1 gene? Can you see the reason for the non-consensus splice junction between the large exon and the next downstream exon? Get the sequence for the 3rd variant from the top using the gene information page. Note: The color formatting options are not available from this page. Check the splice junctions for this variant near the large exon.
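The strand check in this step can also be sketched in a few lines of Python. This is only an illustration, not a browser feature: the function names and the toy intron are invented for the example, and the sketch relies solely on the GT...AG consensus, which reads CT...AC when the displayed sequence is the opposite strand.

```python
# Toy helpers for checking intron boundaries; all names here are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def classify_intron(intron):
    """Classify an intron by its first two and last two bases."""
    s = intron.upper()
    if s.startswith("GT") and s.endswith("AG"):
        return "coding strand (GT...AG)"
    if s.startswith("CT") and s.endswith("AC"):
        return "opposite strand (CT...AC)"
    return "non-consensus"

toy_intron = "GTAAGTCCCTTTCAG"  # invented sequence, not taken from BRCA1
print(classify_intron(toy_intron))
print(classify_intron(reverse_complement(toy_intron)))
```

An intron that falls into the "non-consensus" class is worth a second look in the browser before concluding anything about the strand.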

Return to the Genomes gateway page and restore all changes you have made to default using the "Click here to reset" function.

2) Blat Searching, Custom Tracks and Table Browser Queries

Perform a BLAT search of the human genome (Mar. 2006 assembly) using the following EST sequences as query (ests.fa). You can either use copy/paste or file upload to get the sequences into the system. Note: There are different query types, sort options and output format options.

Obtain the blat results in different output formats to see how they compare. In the hyperlinked results, examine the "details" page. When finished, obtain the results in psl format. If you have trouble obtaining the results in the correct format, pre-computed results are available here: query.psl

Go to the Genomes gateway page and create a custom track using the "add custom tracks" function. You can copy/paste or file upload the psl query results. The custom track documentation is located here: http://genome.ucsc.edu/goldenPath/customTracks/custTracks.html

Access the data in the table browser and use the "summary/statistics" function to determine the total number of alignments.

Use the "filter" function to determine the number of alignments on chromosome 6.
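If you want to cross-check the table browser counts outside the browser, the psl file can be tallied directly. The sketch below assumes the standard 21-column, tab-separated psl layout with an optional "psLayout" header; the two alignment lines are invented for illustration and are not taken from query.psl.

```python
from collections import Counter

# Column order of a psl alignment line (21 tab-separated fields).
PSL_COLUMNS = (
    "matches misMatches repMatches nCount qNumInsert qBaseInsert "
    "tNumInsert tBaseInsert strand qName qSize qStart qEnd "
    "tName tSize tStart tEnd blockCount blockSizes qStarts tStarts"
).split()

def parse_psl(text):
    """Return one dict per alignment, skipping header and separator lines."""
    records = []
    for line in text.splitlines():
        cols = line.split("\t")
        if len(cols) != len(PSL_COLUMNS) or not cols[0].isdigit():
            continue  # header, separator, or blank line
        records.append(dict(zip(PSL_COLUMNS, cols)))
    return records

# Two invented alignments, joined with explicit tabs.
row_a = "250 2 0 0 0 0 1 900 + est_A 260 0 252 chr6 170899992 1000 2152 2 120,132, 0,120, 1000,2020,"
row_b = "90 1 0 0 0 0 0 0 - est_B 95 0 91 chr17 78774742 500 591 1 91, 4, 500,"
sample = "\n".join([
    "psLayout version 3",
    "\t".join(row_a.split()),
    "\t".join(row_b.split()),
])

records = parse_psl(sample)
per_chrom = Counter(r["tName"] for r in records)
print("total alignments:", len(records))
print("alignments on chr6:", per_chrom["chr6"])
```

To run it on real data, replace `sample` with the contents of your own psl file.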

Use the "intersection" function to identify the number of EST alignments that overlap with "Known Genes". How many overlaps are detected?
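Conceptually, the intersection asks whether each alignment shares at least one base with any gene on the same chromosome. A minimal sketch of that test follows. It assumes half-open [start, end) coordinates and uses invented intervals; the table browser's own intersection offers more options (for example, percentage overlap thresholds) than this toy version.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end

def count_overlapping(queries, targets):
    """Count query intervals that overlap any target on the same chromosome."""
    hits = 0
    for q_chrom, q_start, q_end in queries:
        if any(q_chrom == t_chrom and overlaps(q_start, q_end, t_start, t_end)
               for t_chrom, t_start, t_end in targets):
            hits += 1
    return hits

# Invented example intervals: (chromosome, start, end).
ests = [("chr6", 1000, 2152), ("chr17", 500, 591), ("chr6", 5000, 5200)]
genes = [("chr6", 900, 3000), ("chr17", 600, 900)]
print(count_overlapping(ests, genes))
```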

Perform a table browser query to list the custom track-known gene intersection as hyperlinks to the Genome Browser. Based on the list of hyperlinks, estimate how many different genes are represented in this set of ESTs.

Examine the results in more detail. Are some of the intersected results more informative than others? Use the filter function to identify quality alignments (alignments > 200 bp in length for example). How many alignments pass this filter? How many different known genes do they overlap? Follow the links to the browser and examine the data on the page. Do these genes have something in common?
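The length filter can be expressed as a simple threshold. In this sketch "alignment length" is taken to mean matches + misMatches + repMatches from the psl record; that is one reasonable reading, not the only one, and the three alignments are invented.

```python
def aligned_bases(matches, mismatches, rep_matches):
    """Number of aligned query bases in one psl record."""
    return matches + mismatches + rep_matches

# Invented records: (EST name, matches, misMatches, repMatches).
alignments = [
    ("est_A", 250, 2, 0),
    ("est_B", 90, 1, 0),
    ("est_C", 198, 2, 0),  # exactly 200 aligned bases, so it fails "> 200"
]

MIN_LENGTH = 200  # threshold suggested in the exercise
passing = [name for name, m, mm, rep in alignments
           if aligned_bases(m, mm, rep) > MIN_LENGTH]
print(passing)
```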

3) Cross-species Alignments and Conservation

Query the human genome browser gateway with the gene symbol FOXP2. View the gene on chromosome 7. Adjust the tracks displayed so that only chromosome band (full) and refseq (pack) are displayed.

Switch the Zebrafish, Dog and Mouse Net Alignments to Full. What portions of the gene are most highly conserved across all of these species? Compare the dog and mouse alignments. Based on what you see, which species is more closely related to human? Why? Which species diverged from human more recently?

Switch "Conservation" to full and zoom in on the 5' end of the gene. A useful sequence range is:

chr7:113,827,244-113,857,780
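To see how large this window is, the position string can be split into its parts. The sketch below assumes that positions typed into the browser are 1-based and inclusive at both ends; the function name is invented for the example.

```python
import re

def parse_position(position):
    """Split 'chrN:start-end' (commas allowed) into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError("unrecognized position: " + position)
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return chrom, start, end

chrom, start, end = parse_position("chr7:113,827,244-113,857,780")
span = end - start + 1  # both ends inclusive under the stated assumption
print(chrom, start, end, span)
```

Under that assumption the window covers 30,537 bases.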

Examine the conservation upstream from the transcription start site (TSS). How far from the TSS is the conserved block? Can you think of a reason why it is conserved? Can you find any tracks that could support this idea? Examine the details of the track. Reconfigure the view so that it includes an indicator line at 0.05.

4) Interpretation of Blast Results Using the Table Browser

Blast your protein of interest against an interesting database. For this example, human fibronectin has been blasted against the zebrafish EST database. The pre-computed results are located here: Blast Results

Convert these results to a list of IDs using Excel. Copy all the blast results scoring better than 1e-50 and paste them into Excel. Select the first column and use the "Text to Columns" function under the Data menu to separate the data based on the "|" delimiter. Repeat the operation on the accession column using the "." delimiter.

Copy the EST IDs for all blast results scoring better than 1e-50. The prepared list is available here: HitList
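The same two-step split (first on "|", then on ".") can be scripted instead of done in Excel. The sketch below makes several assumptions: each hit is a one-line summary whose first token is a "gi|...|gb|ACCESSION.VERSION|" identifier and whose last token is the E-value, and some reports print "e-145" where "1e-145" is meant. The three hit lines are invented placeholders, not real accessions.

```python
def parse_evalue(token):
    """Convert an E-value token to float, accepting 'e-145' as '1e-145'."""
    if token.lower().startswith("e"):
        token = "1" + token
    return float(token)

def accession_from_id(seq_id):
    """'gi|1|gb|AB000001.1|' -> 'AB000001' (split on '|', then on '.')."""
    fields = [f for f in seq_id.split("|") if f]
    return fields[-1].split(".")[0]

def select_hits(lines, max_evalue=1e-50):
    """Return unique accessions whose E-value is better (smaller) than the cutoff."""
    accessions = []
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if parse_evalue(tokens[-1]) < max_evalue:
            acc = accession_from_id(tokens[0])
            if acc not in accessions:
                accessions.append(acc)
    return accessions

# Invented hit lines with placeholder identifiers.
sample_lines = [
    "gi|00000001|gb|AB000001.1|  hypothetical EST one     512   e-145",
    "gi|00000002|gb|AB000002.2|  hypothetical EST two     230   3e-58",
    "gi|00000003|gb|AB000003.1|  hypothetical EST three    80   2e-12",
]
print(select_hits(sample_lines))
```

The resulting list is what gets pasted into the table browser's identifier box in the next step.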

Enter the Zebrafish Table Browser, select the mRNA and EST table and appropriate track, paste the list of identifiers in the "identifiers (names/accessions)" box and check the results using the summary/statistics function. How many ESTs pass the accession # filtering step? Obtain the results as a custom track in the table browser.

Use the table browser to examine the contents of the custom track by obtaining the results as hyperlinks to the genome browser. How many different loci are represented?

Intersect the custom track of blast hits with the RefSeq table from genes and gene prediction group. Obtain the results as hyperlinks to the genome browser. How many loci are represented now?

Invert the intersection to report refseq genes that overlap with the EST set. How many refseq genes overlap with the ESTs?

Do probes on the Zebrafish Affymetrix chip interrogate the expression of genes at these loci? If so, how many different probes for each gene?

5) More Table Browser Queries, Proteome Browser and Gene Sorter

How many known human genes are more than 1,000,000 base pairs in length? (Hints: Table Browser query, filter using a free form query and find the number using the summary/statistics function). What is the largest gene with respect to genome footprint? (Hint: Adjust filter to report only the largest.) Is there evidence that the gene product has a role in human disease? How large is the protein encoded by the gene? Use the proteome browser to see how this protein's hydrophobicity compares to other proteins in the genome. Use the gene sorter to find genes that have a similar pattern of expression, similar protein sequence and similar protein domain content. Obtain 1500 bases of sequence upstream from this gene. Do a blat search to confirm the position of the sequence you obtained. How many SNPs are found in this region?
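The logic behind the free form filter and the upstream request can be sketched as follows. The sketch assumes each gene record carries a strand plus txStart and txEnd in 0-based, half-open coordinates, so the footprint is txEnd - txStart; the three genes are invented. Note that "upstream" depends on strand: for a minus-strand gene it lies beyond txEnd.

```python
# Invented records: (name, chrom, strand, txStart, txEnd).
genes = [
    ("geneA", "chr1", "+", 1_000_000, 2_500_000),
    ("geneB", "chr2", "-", 500_000, 700_000),
    ("geneC", "chr3", "-", 4_000_000, 6_200_000),
]

def gene_length(gene):
    """Genome footprint: txEnd - txStart."""
    return gene[4] - gene[3]

def upstream_region(gene, size=1500):
    """Half-open interval covering `size` bases upstream of the transcript start."""
    name, chrom, strand, tx_start, tx_end = gene
    if strand == "+":
        return chrom, max(0, tx_start - size), tx_start
    return chrom, tx_end, tx_end + size  # minus strand: upstream is past txEnd

long_genes = [g[0] for g in genes if gene_length(g) > 1_000_000]
largest = max(genes, key=gene_length)
print(long_genes)
print(largest[0], gene_length(largest))
print(upstream_region(largest))
```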

How many known and refseq human transcripts have more than 100 exons? Which transcripts have the largest number of exons in the refseq and known categories and how many exons do they have? (Hint: Instead of obtaining hyperlinks to the genome browser as output, use the "selected fields from primary and related tables" option and select the relevant information.) How large is the protein encoded by the refseq gene with the greatest number of exons? How long is the gene? How do these data compare to those of the longest gene examined above?
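If you export exon coordinates as selected fields, the exon counts can be checked by hand. This sketch assumes the exported exonStarts and exonEnds fields are comma-separated lists with a trailing comma; the two transcripts are invented and far smaller than the real answers to this exercise.

```python
def exon_lengths(exon_starts, exon_ends):
    """Turn 'a,b,c,' style start and end lists into a list of exon lengths."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    return [end - start for start, end in zip(starts, ends)]

# Invented transcripts: name -> (exonStarts, exonEnds).
transcripts = {
    "tx1": ("100,300,700,", "200,450,900,"),
    "tx2": ("1000,", "1600,"),
}

exon_counts = {name: len(exon_lengths(starts, ends))
               for name, (starts, ends) in transcripts.items()}
most_exons = max(exon_counts, key=exon_counts.get)
print(exon_counts)
print(most_exons, sum(exon_lengths(*transcripts[most_exons])))
```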

What fraction of human refseq transcripts on chromosome 17 overlap with a CpG island? What are CpG islands? Obtain the non-overlapping refseq genes as a custom track (Hint: Control the results reported by the intersection) and visualize the results in the browser. In particular, look at:

chr17:34,104,206-34,140,224

chr17:32,701,467-32,834,869

chr17:33,688,654-33,822,056

Can you think of three reasons why a refseq transcript might fail to overlap a CpG island?
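The fraction asked for at the start of this exercise, and the inverted (non-overlapping) set, can be sketched as a partition of the transcripts. The intervals below are invented and use half-open coordinates; the sketch requires only a single shared base for an overlap, which may differ from the threshold you choose in the table browser.

```python
def overlaps_any(chrom, start, end, features):
    """True if [start, end) shares a base with any feature on the same chromosome."""
    return any(chrom == f_chrom and start < f_end and f_start < end
               for f_chrom, f_start, f_end in features)

# Invented transcripts (name, chrom, start, end) and CpG islands (chrom, start, end).
transcripts = [
    ("txA", "chr17", 100, 500),
    ("txB", "chr17", 1000, 1500),
    ("txC", "chr17", 2000, 2600),
    ("txD", "chr17", 3000, 3300),
]
islands = [("chr17", 50, 150), ("chr17", 2500, 2700)]

with_island = [t[0] for t in transcripts if overlaps_any(t[1], t[2], t[3], islands)]
without_island = [t[0] for t in transcripts if not overlaps_any(t[1], t[2], t[3], islands)]
fraction = len(with_island) / len(transcripts)
print(with_island, without_island, fraction)
```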