Monday, March 30, 2015

Titin exon specific GERP scores

Coming back to Titin gene, are all exons equally important? GERP scores could tell us something about how the exons vary in conservation level at base-pair resolution.

In the above figure, the y-axis is the exon rank and the x-axis has boxplots of GERP scores for each exon. Exons with mean GERP scores above 0 are colored red and those below zero are colored blue. One can clearly see the GERP score reduce considerably between exons 160 and 200.

Tuesday, March 17, 2015

HGT candidate genes - GERP score distribution

HGT candidate genes have been identified in vertebrate species based on gene absence/presence in metazoans. The actual workflow can be found in figure S4.
Interestingly, they identify 145 genes that are candidates for HGT into all primates from various taxon and another 51 from viruses. They provide Ensemble gene id's for these genes in the Human genome.

Of the 145 genes, only 135 could be found in the Ensemble biomart today (Ensemble 79). The following ten genes have been "retired" and are not used in below analysis. These genes seem to be randomly drawn from the different "confidence classes" used in the paper.

ENSG00000107618    class A HGT    Validated
ENSG00000157358    class A HGT    Validated
ENSG00000196333    class C HGT    Not-validated
ENSG00000204486    class C HGT    Validated
ENSG00000204502    class C HGT    Validated
ENSG00000204513    class C HGT    Validated
ENSG00000212857    class C HGT    Not-validated
ENSG00000256062    class A HGT    Validated
ENSG00000260383    class B HGT    Not-validated
ENSG00000263074    class B HGT    Validated

ENSG00000107618 has been merged with ENSG00000265203 which is also a HGT candidate. ENSG00000256062 has been merged into ENSG00000175164, this is the much talked about ABO blood group gene. ENSG00000260383 has been merged into ENSG00000179832 which also happens to be a HGT candidate. ENSG00000263074 is merged into ENSG00000141337 (arylsulfatase G).

A total of 4018 exons could be found for these 135 genes. GERP scores could not obtained for 8 of these genes [ENSG00000066813,ENSG00000183248,ENSG00000183549,ENSG00000183747,
ENSG00000204510,ENSG00000229571,ENSG00000232423,ENSG00000243073]. Here, we obtain the GERP scores for each base of all these exons. Example code from the web was updated to use a bed file as input and print the GERP score for each base within the bed intervals.

 #!/usr/bin/env perl  
 use strict;  
 use warnings;  
 use lib '~/biomart-perl/lib';  
 use Bio::EnsEMBL::Registry;  
 use Bio::EnsEMBL::Utils::Exception qw(throw);  
 my $reg = "Bio::EnsEMBL::Registry";  
 my $species = "Homo sapiens";  
 my $line = "";  
 open FILE1, $ARGV[0] or die $!;  
 $reg->load_registry_from_db(  
    -host => "ensembldb.ensembl.org",  
    -user => "anonymous",  
 );  
 #get method_link_species_set adaptor  
 my $mlss_adaptor = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");  
 my $mlss = $mlss_adaptor->fetch_by_method_link_type_species_set_name("GERP_CONSERVATION_SCORE", "mammals");  
 throw("Unable to find method_link_species_set") if (!defined($mlss));  
 my $slice_adaptor = $reg->get_adaptor($species, 'core', 'Slice');  
 throw("Registry configuration file has no data for connecting to <$species>") if (!$slice_adaptor);  
 my $cs_adaptor = $reg->get_adaptor("Multi", 'compara', 'ConservationScore');                 
 while($line=<FILE1>){  
 #$line=$ARGV[0];  
 chomp $line;  
 my @parts=split('\t',$line);  
 my $seq_region = $parts[0];  
 my $seq_region_start = $parts[1];  
 my $seq_region_end =  $parts[2];  
 my $slice = $slice_adaptor->fetch_by_region('toplevel', $seq_region, $seq_region_start, $seq_region_end);  
 throw("No Slice can be created with coordinates $seq_region:$seq_region_start-$seq_region_end") if (!$slice);  
 my $display_size = $slice->end - $slice->start + 1;   
 my $scores = $cs_adaptor->fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice, $display_size);  
 #print "number of scores " . @$scores . "\n";  
 foreach my $score (@$scores) {  
   if (defined $score->diff_score) {  
 #     printf("position %d observed %.4f expected %.4f difference %.4f\n", $score->position, $score->observed_score, $score->expected_score, $score->diff_score);  
      printf("$parts[3]\t$parts[4]\t%.4f\n",$score->diff_score);  
   }  
 }  
 }  
 close FILE1;  

Negative GERP scores suggest that the sites are probably evolving neutrally.

 read.table(file="GERP.scores",header=FALSE,stringsAsFactors=FALSE)->M  
 as.data.frame(aggregate(M,list(M$V2),mean))->N  
 as.data.frame(aggregate(M,list(M$V2),max))->P  
 N$Group.1[N$V3<0]->negative.mean.GERP  
 jpeg("GERP_HGT_distribution.jpeg")  
 boxplot(as.numeric(M$V3)~M$V2,outline=FALSE,col=ifelse(unique(M$V2) %in% negative.mean.GERP, "red", "blue"),xaxt="n",ylab="GERP score",main="GERP score by gene")  
 legend("bottomright", c("Negative mean GERP", "Positive mean GERP"), fill = c("red", "blue"))  
 dev.off()  

Despite 41 genes having a mean GERP score less than 0, each of these genes have atleast a few sites that are having positive GERP scores.



Based on above arguments, it could be suggested that 10 of the 145 genes are probably annotation artifacts. The 127 genes for which GERP scores could be obtained from whole genome alignments show moderate levels of conservation, suggesting these genes are probably not artifacts.

Friday, March 13, 2015

Lack of Isolation By Distance pattern in Tiger mitochondrial data

A recent paper, that used ancient DNA sequences from museum samples concluded that Bali, Javan and Sumatran tigers had a closer genetic relationship by comparing it with diagnostic mtDNA sequences from various other tiger subspecies.

Due to lack of precise geographic information about the sampling locations, despite having a matrix of genetic distances they could not look at isolation by distance patterns. The historical geographic ranges of these subspecies can be found on wikipedia. One could, just assume sampling locations within the geographic range and check if a pattern of IBD can be seen. While performing such assumptions in a paper might not be possible, a blog provides you some freedom to do so. However, somebody familiar with Tigers might be able to provide better assumptions. 

ALT or Siberian Tiger is assumed to have been sampled from Sikhote-Alin    45°N 136°E. VIR or Caspian Tiger from Gansu 38°N 102°E. AMO or South China Tiger from Qin mountains 33°N 107°E. COR or Indo-chinese Tiger from Yunnan 25°N 101°E. JAX or Malayan Tiger from Kelantan 5°N 102°E. SUM or Sumatran Tiger from Bukit Barisan Selatan National Park 5°S 104°E. SON or Javan Tiger from Java 7°S 110°E. BAL or Bali Tiger from Bali 8°S 115°E. TIG or Bengal Tiger from Hazaribagh National Park 24°N 85°E.

The R package sp provides many useful functions to deal with spatial data. We use the spDistsN1 function to get the using Euclidean or Great Circle distance between two co-ordinates.

 library(sp)  
 library(reshape)  
 library("ecodist")  
 someCoords <- data.frame(long=c(136,102,107,101,102,104,110,115,85), lat=c(45,38,33,25,5,5,7,8,24))  
 apply(someCoords, 1, function(eachPoint) spDistsN1(as.matrix(someCoords), eachPoint, longlat=TRUE))->X  
 X[upper.tri(X)]->G  
 pop1<-c("ALT","ALT","ALT","ALT","ALT","ALT","ALT","ALT","VIR","VIR","VIR","VIR","VIR","VIR","VIR","AMO","AMO","AMO","AMO","AMO","AMO","COR","COR","COR","COR","COR","JAX","JAX","JAX","JAX","SUM","SUM","SUM","SON","SON","BAL")  
 pop2<-c("VIR","AMO","COR","JAX","SUM","SON","BAL","TIG","AMO","COR","JAX","SUM","SON","BAL","TIG","COR","JAX","SUM","SON","BAL","TIG","JAX","SUM","SON","BAL","TIG","SUM","SON","BAL","TIG","SON","BAL","TIG","BAL","TIG","TIG")  
 data.frame(pop1,pop2,G)->Geodist  
 Geodist2 <- with(Geodist, G)  
 nams <- with(Geodist, unique(c(as.character(pop1), as.character(pop2))))  
 attributes(Geodist2) <- with(Geodist, list(Size = length(nams),Labels = nams,Diag = FALSE,Upper = FALSE,method = "user"))  
 class(Geodist2) <- "dist"  
 read.table(file="Tiger_Fst.txt",header=TRUE,sep="\t",row.names = 1)->M  
 na.omit(melt(M))->N  
 data.frame(pop1,pop2,N$value)->Gendist  
 Gendist2 <- with(Gendist, N$value)  
 nams <- with(Gendist, unique(c(as.character(pop1), as.character(pop2))))  
 attributes(Gendist2) <- with(Gendist, list(Size = length(nams),Labels = nams,Diag = FALSE,Upper = FALSE,method = "user"))  
 class(Gendist2) <- "dist"  
 mantel(Geodist2 ~ Gendist2, nperm=10000)  
   mantelr    pval1    pval2    pval3  llim.2.5% ulim.97.5%   
 -0.06313036 0.67800000 0.32230000 0.65900000 -0.25930292 0.12999925   
 jpeg("Tiger_isolation_by_distance.jpeg")  
 plot(G,N$value/(1-N$value),xlab="Distance in Km",ylab="Fst/(1-Fst) from mtDNA sequences",main="Isolation by Distance pattern",pch=16,col="blue",xlim=c(0,6000))  
 text(G,N$value/(1-N$value),labels=paste(pop1,"  ",pop2))  
 dev.off()  
 jpeg("Tiger_Fst_cladogram.jpeg")  
 hc = hclust(dist(Gendist2))  
 op = par(bg = "#DDE3CA")  
 plot(hc, col = "#487AA1", col.main = "#45ADA8", col.lab = "#7C8071", col.axis = "#F38630", lwd = 3, lty = 3, sub = "", hang = -1, axes = FALSE,xlab="",main="Tiger Fst cladogram")  
 axis(side = 2, at = seq(0, 400, 100), col = "#F38630", labels = FALSE,   lwd = 2)  
 mtext(seq(0, 400, 100), side = 2, at = seq(0, 400, 100), line = 1, col = "#A38630", las = 2)  
 dev.off()  

The mantel's test shows that the IBD pattern has a very weak negative correlation. This is not entirely unexpected, given the high values of Fst that are almost saturated. Would using more markers with greater resolution capture a IBD pattern? How strong are the bottlenecks?

 
They have already used the data to build a phylogenetic tree and a network. As i need to do something different, i build a cladogram using the Fst matrix. While this cladogram shows that the SUM and BAL are close to each other, it fails to group SON with them. Various other differences from the phylogenetic tree can be seen.