Genome Session Exercises and Problems
1) Interacting with the UCSC Browser and Database
Start at the UCSC genome informatics home page: http://genome.ucsc.edu/
The first three tools listed in the horizontal blue menu (Genomes - Blat - Tables) are the most frequently used. In each case, pull-down menus control the organism and assembly. Note: different organisms and assemblies have different amounts and types of associated data. Clicking the "Genomes" link takes you to the browser gateway page. Example searches and details regarding the selected assembly appear further down that page. Perform a basic search by entering "BRCA1" into the "position or search term" window. Whenever the text pattern is present in the database, a result is reported as a hyperlink to the genome browser. Select the first result in the "Known genes" list to go to the Genome Browser.
The information displayed in the browser window is organized into groups of tracks. Note the visual cues in the window such as strand, introns, exons and coding sequence. Different data types have different display properties; details about each type of data, and aspects of its display, are controlled from the lower part of the page.
Hide all the tracks. Turn on Known genes by selecting "full" from the pull-down menu and clicking "refresh". Turn on "chromosome band" to full and increase the size of the window to 1000 using the configure button and subsequent options. Control the region of the genome being displayed in the browser window using the zoom tools, the "position/search" text window and the "move start/move end" functions. Experiment with the pull-down menu for known genes to see the difference between full, dense, squish and pack. Restore the tracks to default using the "Default tracks" button.
Almost all the data displayed in the browser window is hyperlinked to additional data when the display is in full mode. Click a known gene track in the browser window to open a gene information page. Look at the different kinds of data displayed on this page. In general terms, what kinds of human cells express BRCA1 at a high level? Does D. melanogaster have a homolog? What GO terms are attached to this gene product? Return to the browser window.
Zoom in on the large exon and a couple of flanking exons by clicking in the numbered "chr17" track. This re-centers the browser and zooms in 3X. Based on the known gene view, do you think that the splice variants displayed encode different proteins?
Get the DNA sequence for this region by clicking the "DNA" text in the top blue bar. Examine the initial options and also the "extended case/color options". Configure the display so that known exons are in bold, red and upper case letters. In most cases, introns start with the sequence GT and end with the sequence AG. Based on this, can you tell if you have selected the coding strand of the BRCA1 gene? Can you see the reason for the non-consensus splice junction between the large exon and the next downstream exon? Get the sequence for the third variant from the top using the gene information page. Note: the color formatting options are not available from this page. Check the splice junctions for this variant near the large exon.
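The strand check above can also be done offline once an intron sequence has been copied out of the DNA page. The following is a minimal Python sketch, assuming the common GT...AG intron consensus; the function names and the example sequences are made up for illustration and are not taken from BRCA1.

```python
# Minimal sketch: test an intron sequence against the GT...AG consensus
# on the displayed strand and on its reverse complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (case is preserved)."""
    return seq.translate(COMPLEMENT)[::-1]


def is_consensus_intron(intron):
    """True if the intron starts with GT and ends with AG (case-insensitive)."""
    s = intron.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def guess_strand(intron):
    """Report which strand, if either, shows the GT...AG pattern."""
    if is_consensus_intron(intron):
        return "displayed strand"
    if is_consensus_intron(reverse_complement(intron)):
        return "reverse complement"
    return "non-consensus"


# Hypothetical intron sequences (illustration only).
print(guess_strand("gtaagtttccctttcag"))
print(guess_strand("ctgaaagggaaacttac"))
```

An intron that reads CT...AC on the displayed strand is GT...AG on the opposite strand, which is the cue that the gene lies on the other strand.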
Return to the Genomes gateway page and restore all changes you have made to default using the "Click here to reset" function.
2) Blat Searching, Custom Tracks and Table Browser Queries
Perform a BLAT search of the human genome (Mar. 2006 assembly) using the following EST sequences as the query (ests.fa). You can use either copy/paste or file upload to get the sequences into the system. Note: there are different query types, sort options and output format options. Obtain the BLAT results in different output formats to see how they compare. In the hyperlinked results, examine the "details" page. When finished, obtain the results in psl format. If you have trouble obtaining the results in the correct format, pre-computed results are available here: query.psl
Go to the Genomes gateway page and create a custom track using the "add custom tracks" function. You can either copy/paste the psl query results or upload them as a file. The custom track documentation is located here: http://genome.ucsc.edu/goldenPath/customTracks/custTracks.html
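A custom track can carry a "track" line ahead of the data that sets its label and display mode. The sketch below shows one way to prepend such a line to psl text before pasting it in. The helper name, the track name and the description are hypothetical; the attributes used (name, description, visibility) are described in the custom track documentation linked above, which should be checked for the full list.

```python
# Minimal sketch: prepend a custom-track "track" line to PSL text.
# make_custom_track and the example values are hypothetical.
VISIBILITY_MODES = ("hide", "dense", "full", "pack", "squish")


def make_custom_track(psl_text, name, description, visibility="pack"):
    """Return PSL text with a track line added as the first line."""
    if visibility not in VISIBILITY_MODES:
        raise ValueError(f"unknown visibility: {visibility}")
    for label, value in (("name", name), ("description", description)):
        if '"' in value:
            raise ValueError(f"{label} must not contain double quotes")
    track_line = (
        f'track name="{name}" description="{description}" '
        f"visibility={visibility}"
    )
    return track_line + "\n" + psl_text.lstrip("\n")


# Placeholder standing in for the real contents of query.psl.
example = make_custom_track("<psl rows go here>\n", "EST_blat", "BLAT hits for ests.fa")
print(example)
```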
Access the data in the table browser and use the "summary/statistics" function to determine the total number of alignments.
Use the "filter" function to determine the number of alignments on chromosome 6.
Use the "intersection" function to identify the number of EST alignments that overlap with "Known Genes". How many overlaps are detected?
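To see what an intersection computes, here is a minimal Python sketch of an any-overlap test between alignments and genes. It assumes half-open (start, end) coordinates and uses made-up intervals and gene names; the Table Browser also offers other intersection modes (for example, a minimum percent overlap) that this sketch does not cover.

```python
# Minimal sketch of an any-overlap intersection on hypothetical intervals.
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals overlap when each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def index_by_chrom(genes):
    """Group (chrom, start, end, name) gene records by chromosome."""
    by_chrom = {}
    for chrom, start, end, name in genes:
        by_chrom.setdefault(chrom, []).append((start, end, name))
    return by_chrom


def count_overlapping(alignments, genes):
    """Count alignments that overlap at least one gene on the same chromosome."""
    by_chrom = index_by_chrom(genes)
    return sum(
        1
        for chrom, start, end in alignments
        if any(overlaps(start, end, gs, ge) for gs, ge, _ in by_chrom.get(chrom, []))
    )


def genes_hit(alignments, genes):
    """Inverted view: names of genes overlapped by at least one alignment."""
    by_chrom = index_by_chrom(genes)
    return {
        name
        for chrom, start, end in alignments
        for gs, ge, name in by_chrom.get(chrom, [])
        if overlaps(start, end, gs, ge)
    }


# Hypothetical data (illustration only).
alignments = [("chr6", 1000, 1752), ("chr17", 5000, 5120), ("chr6", 20000, 20405)]
genes = [
    ("chr6", 900, 1500, "geneA"),
    ("chr6", 1700, 2500, "geneB"),
    ("chr17", 6000, 7000, "geneC"),
    ("chr6", 20300, 21000, "geneD"),
]
print(count_overlapping(alignments, genes), sorted(genes_hit(alignments, genes)))
```

Note that the number of overlapping alignments and the number of distinct genes hit are different quantities: one alignment can overlap several genes, and several alignments can fall in the same gene.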
Perform a table browser query to list the intersection of the custom track and known genes as hyperlinks to the Genome Browser. Based on the list of hyperlinks, estimate how many different genes are represented in this set of ESTs.
Examine the results in more detail. Are some of the intersected results more informative than others? Use the filter function to identify quality alignments (for example, alignments > 200 bp in length). How many alignments pass this filter? How many different known genes do they overlap? Follow the links to the browser and examine the data on the page. Do these genes have something in common?
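The counts asked for in this section can be cross-checked outside the browser by reading the psl file directly. The sketch below assumes the standard 21-column, tab-separated PSL layout and treats "alignment length" as the aligned query span (qEnd - qStart), which is only one possible definition. The three sample rows are made up; in practice the file handle would come from opening query.psl.

```python
# Minimal sketch: count PSL alignments, count per chromosome, filter by length.
import io
from collections import Counter

PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
INT_FIELDS = ("matches", "qStart", "qEnd", "tStart", "tEnd")


def read_psl(handle):
    """Parse PSL data rows, skipping header lines and blank lines."""
    rows = []
    for line in handle:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(PSL_FIELDS) or not parts[0].isdigit():
            continue  # not a data row
        row = dict(zip(PSL_FIELDS, parts))
        for name in INT_FIELDS:
            row[name] = int(row[name])
        rows.append(row)
    return rows


# Made-up rows, written with spaces here and converted to tabs below.
SAMPLE = """\
psLayout version 3
250 2 0 0 0 0 1 500 + est1 300 0 252 chr6 170899992 1000 1752 2 100,152, 0,100, 1000,1600,
120 0 0 0 0 0 0 0 - est2 150 10 130 chr17 78774742 5000 5120 1 120, 20, 5000,
400 5 0 0 0 0 0 0 + est3 410 0 405 chr6 170899992 20000 20405 1 405, 0, 20000,
"""
tab_text = "\n".join("\t".join(line.split()) for line in SAMPLE.splitlines())

rows = read_psl(io.StringIO(tab_text))
per_chrom = Counter(row["tName"] for row in rows)
long_rows = [row for row in rows if row["qEnd"] - row["qStart"] > 200]

print("total alignments:", len(rows))
print("alignments on chr6:", per_chrom["chr6"])
print("alignments longer than 200 bp:", len(long_rows))
```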
3) Cross-species Alignments and Conservation
Query the human genome browser gateway with the gene symbol FOXP2. View the gene on chromosome 7. Adjust the tracks displayed so that only chromosome band (full) and RefSeq (pack) are displayed.
Switch the Zebrafish, Dog and Mouse Net Alignments to Full. What portions of the gene are most highly conserved across all of these species? Compare the dog and mouse alignments. Based on what you see, which species is more closely related to human? Why? Which species diverged from human more recently?
Switch "Conservation" to full and zoom in on the 5' end of the gene. A useful sequence range is:
chr7:113,827,244-113,857,780
Examine the conservation upstream from the transcription start site (TSS). How far from the TSS is the conserved block? Can you think of a reason why it is conserved? Can you find any tracks that could support this idea? Examine the details of the track. Reconfigure the view so that it includes an indicator line at 0.05.
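When estimating distances such as the one from the TSS to the conserved block, it can help to turn browser position strings into numbers. This is a minimal Python sketch, assuming positions are written as chrom:start-end with optional thousands separators and that both ends are included in the range; the function names are hypothetical.

```python
# Minimal sketch: parse a browser position string and measure its span.
import re

POSITION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Split 'chr7:113,827,244-113,857,780' into (chrom, start, end)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a position string: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("start is greater than end")
    return chrom, start, end


def span_length(start, end):
    """Number of bases in the range, counting both ends."""
    return end - start + 1


chrom, start, end = parse_position("chr7:113,827,244-113,857,780")
print(chrom, start, end, span_length(start, end))
```

The distance between two features is then a subtraction of their coordinates, taking the gene's strand into account when deciding which side is upstream.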
4) Interpretation of Blast Results Using the Table Browser
Blast your protein of interest against an interesting database. For this example, human fibronectin has been blasted against the zebrafish EST database. The pre-computed results are located here: Blast Results
Convert these results to a list of IDs using Excel. Copy all the blast results scoring better than 1e-50 and paste them into Excel. Select the first column and use the "Text to Columns" function under the Data menu to separate the data based on the "|" delimiter. Repeat the operation on the accession column using the "." delimiter.
Copy the EST IDs for all blast results scoring better than 1e-50. The prepared list is available here: HitList
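The same splitting can be scripted instead of done in Excel. The following is a minimal Python sketch that assumes each hit line starts with an identifier of the form gi|number|db|accession.version| and ends with the E-value; the real layout of the Blast Results page may differ, and the sample lines and accessions are made up. Summary lines from some BLAST versions print E-values such as "e-142" with no leading digit, so the sketch handles that case.

```python
# Minimal sketch: extract accessions for BLAST hits below an E-value cutoff.
def parse_evalue(text):
    """Convert an E-value string to float, accepting forms like 'e-142'."""
    text = text.strip()
    if text.lower().startswith("e"):
        text = "1" + text
    return float(text)


def accession_from_id(hit_id):
    """'gi|1|gb|AA000001.1|' -> 'AA000001' (split on '|', then on '.')."""
    fields = hit_id.split("|")
    if len(fields) < 4:
        raise ValueError(f"unexpected identifier: {hit_id!r}")
    return fields[3].split(".")[0]


def hit_list(lines, cutoff=1e-50):
    """Return unique accessions, in input order, with E-value below cutoff."""
    accessions = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank or malformed line
        if parse_evalue(parts[-1]) < cutoff:
            accession = accession_from_id(parts[0])
            if accession not in accessions:
                accessions.append(accession)
    return accessions


# Hypothetical hit lines: identifier, bit score, E-value.
sample_hits = [
    "gi|0000001|gb|AA000001.1| 510 e-142",
    "gi|0000002|gb|AA000002.2| 210 3e-52",
    "gi|0000003|gb|AA000003.1| 95 2e-18",
    "gi|0000001|gb|AA000001.1| 480 e-130",
]
print("\n".join(hit_list(sample_hits)))
```

The printed list, one accession per line, has the shape expected by an identifier paste box.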
Enter the Zebrafish Table Browser, select the mRNA and EST table and appropriate track, paste the list of identifiers in the "identifiers (names/accessions)" box and check the results using the summary/statistics function. How many ESTs pass the accession number filtering step? Obtain the results as a custom track in the table browser.
Use the table browser to examine the contents of the custom track by obtaining the results as hyperlinks to the genome browser. How many different loci are represented?
Intersect the custom track of blast hits with the RefSeq table from the genes and gene prediction group. Obtain the results as hyperlinks to the genome browser. How many loci are represented now?
Invert the intersection to report RefSeq genes that overlap with the EST set. How many RefSeq genes overlap with the ESTs?
Do probes on the Zebrafish Affymetrix chip interrogate the expression of genes at these loci? If so, how many different probes are there for each gene?
5) More Table Browser Queries, Proteome Browser and Gene Sorter
How many known human genes are more than 1,000,000 base pairs in length? (Hints: use a Table Browser query, filter using a free-form query and find the number using the summary/statistics function.) What is the largest gene with respect to genome footprint? (Hint: adjust the filter to report only the largest.) Is there evidence that the gene product has a role in human disease? How large is the protein encoded by the gene? Use the proteome browser to see how this protein's hydrophobicity compares to other proteins in the genome. Use the gene sorter to find genes that have a similar pattern of expression, similar protein sequence and similar protein domain content. Obtain 1500 bases of sequence upstream from this gene. Do a BLAT search to confirm the position of the sequence you obtained. How many SNPs are found in this region?
How many known and RefSeq human transcripts have more than 100 exons? Which transcripts have the largest number of exons in the RefSeq and known categories, and how many exons do they have? (Hint: instead of obtaining hyperlinks to the genome browser as output, use the "selected fields from primary and related tables" option and select the relevant information.) How large is the protein encoded by the RefSeq gene with the greatest number of exons? How long is the gene? How do these data compare to those of the longest gene examined above?
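The filters in the two questions above amount to simple conditions on a few columns of the gene tables. In the Table Browser free-form query box, the length condition would be something like txEnd - txStart > 1000000. The Python sketch below applies the same logic to made-up records; txStart, txEnd and exonCount are column names in the UCSC gene prediction tables, while the Transcript type, the helper names and every value are hypothetical.

```python
# Minimal sketch: length and exon-count filters on hypothetical transcripts.
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    chrom: str
    txStart: int
    txEnd: int
    exonCount: int


def footprint(transcript):
    """Genomic span of the transcript in base pairs."""
    return transcript.txEnd - transcript.txStart


def longer_than(transcripts, min_length=1_000_000):
    """Transcripts whose genomic footprint exceeds min_length."""
    return [t for t in transcripts if footprint(t) > min_length]


def more_exons_than(transcripts, min_exons=100):
    """Transcripts with more than min_exons exons."""
    return [t for t in transcripts if t.exonCount > min_exons]


# Made-up records (illustration only).
records = [
    Transcript("txA", "chr1", 1_000, 2_500_000, 79),
    Transcript("txB", "chr2", 500_000, 620_000, 12),
    Transcript("txC", "chr3", 10_000, 300_000, 150),
]

print([t.name for t in longer_than(records)])
print(max(records, key=footprint).name)
print([(t.name, t.exonCount) for t in more_exons_than(records)])
```

In these made-up records the transcript with the largest footprint is not the one with the most exons, which is the kind of contrast the last question asks you to look for in the real data.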
What fraction of human refseq transcripts on chromosome 17 overlap with a CpG island? What are CpG islands? Obtain the non-overlapping refseq genes as a custom track (Hint: Control the results reported by the intersection) and visualize the results in the browser. In particular, look at:
chr17:34,104,206-34,140,224
chr17:32,701,467-32,834,869
chr17:33,688,654-33,822,056
Can you think of three reasons why a RefSeq transcript might fail to overlap a CpG island?