Monday, June 10, 2013

Randomise order of lines in a file

Different programming languages are good at different things. R has many powerful statistical functions, while perl is good at data handling.

N random numbers from a certain range of numbers without re-sampling can be easily done in R with the "sample" function. To do the same thing in Perl, looping has (or atleast some form of iteration) to be used along with storing the results and checking to avoid re-sampling. 
 args<-commandArgs(TRUE)  
 totalsnps<-as.integer(args[1])  
 runumber<-as.integer(args[2])  
 sample(1:totalsnps,totalsnps,replace=F)->N  
 write.table(file=paste("rands.out",runumber,sep="."),N,col.names=F,row.names=F,quote=F)  

This Rscript can be run using the below line

 Rscript sampleit.r $linecount $iterationcount  

Once, the file with the new order of lines has been generated, it can be used by the perl script to write the file in the new order. We also keep the first two columns of the file unchanged and just randomise the remaining parts of the file.


 #!/usr/bin/perl  
 open RANDS, $ARGV[1] or die $!;  
 my %rhash = ();  
 my $count=1;  
#read in the file created by the R script in previous step
 while($lines = <RANDS>){  
 chomp $lines;  
 $rhash{$count}=$lines;  
 $count++;  
 }  
 close RANDS;  
#read the file that needs to be randomised and store it in hash with new order
 open STATS, $ARGV[0] or die $!;  
 my $hash = {CHR =>my $genename,POS =>my $pid,RESTATS =>my $pco1};  
 $mycount=1;  
 while($line = <STATS>){  
 chomp $line;  
 @tabs=split(/[ \t]+/,$line);  
 $hash{$mycount}{CHR}=$tabs[0];  
 $hash{$mycount}{POS}=$tabs[1];  
 $line =~ m/\w*\t\w*\t(.*)$/;  
 $hash{$rhash{$mycount}}{RESTATS}=$1;  
 $mycount++;  
 }#end of file while loop  
#check if number of lines match in old and new file
 if($mycount!=$count){print "mismatch in counts\n";}  
#print the file out in new order
 foreach $contigs (sort { $a <=> $b } keys %hash) {   
 print "$hash{$contigs}{CHR}\t$hash{$contigs}{POS}\t$hash{$contigs}{RESTATS}\n";   
 }  

This perl script just reads in the input file, stores it in a hash with new line order and then prints it out.

Friday, June 7, 2013

Distribution of the six most sighted birds

The folks over at the molecular Ecologist have a instructive tutorial on map making using R. The data from the birders in India, provides a novel dataset to try out these new map making skills. To start of with something simple, we look at the distribution of the 6 most sighted birds across India.

Only 6 species of birds have more than 700 recorded observations. The "Rosy Starling" leads the record with 864 observations followed by "Grey Wagtail" (857), "Common Sandpiper"(787),  "Barn Swallow" (752),"Pied Cuckoo" (750) and "Greenish Warbler" with 710 observations.

 read.csv("migrantwatch_reports.csv",row.names=NULL)->M  
 colnames(M)<-c("Species","Location name","City","State","Reporter","Date","Sighting type","Observation frequency","Start date","On behalf of","Latitude","Longitude")  
 numberofclusters<-10  
 N<-data.frame(M$Latitude,M$Longitude)  
 kmeans(N, numberofclusters)->P  
 M$cluster<-P$cluster  
 as.data.frame(P$centers)$M.Latitude->lat  
 as.data.frame(P$centers)$M.Longitude->lon  
 BSwallow<-as.data.frame(table(M$cluster[M$Species=="Barn Swallow"]))$Freq  
 CSandpiper<-as.data.frame(table(M$cluster[M$Species=="Common Sandpiper"]))$Freq  
 GWarbler<-as.data.frame(table(M$cluster[M$Species=="Greenish Warbler"]))$Freq  
 GWagtail<-as.data.frame(table(M$cluster[M$Species=="Grey Wagtail"]))$Freq  
 PCuckoo<-as.data.frame(table(M$cluster[M$Species=="Pied Cuckoo"]))$Freq  
 RStarling<-as.data.frame(table(M$cluster[M$Species=="Rosy Starling"]))$Freq  
 library(maps)  
 library(mapdata)  
 library(mapplots)  
 jpeg("top6.map.jpeg")  
 map("worldHires","India",col="gray90", fill=TRUE)  
 for (i in 1:numberofclusters) {  
 add.pie(z=c(BSwallow[i],CSandpiper[i],GWarbler[i],GWagtail[i],PCuckoo[i],RStarling[i]), x=lon[i], y=lat[i], col=c("blue","brown","Green","Grey","black","yellow"), labels="")  
 }  
 legend("topright",c("Barn Swallow","Common Sandpiper","Greenish Warbler","Grey Wagtail","Pied Cuckoo","Rosy Starling"),col=c("blue","brown","Green","Grey","black","yellow"),pch="*")  

Running the above code gives a nice(take up the politics with the guys who made the R package) map with pie-charts that looks below image:

The pie-charts are generated for the mean locations that are identified by k-means clustering on all observations with a K of 10. So, this also gives us an idea of where most of the observations come from.

Rosy Starling's are definitely observed more often on the west coast while the Grey Wagtail is seen more often to the south. A large number (337) of the Rosy Starling observations come from Shantilal Varu, whose favorite birds are waders. 


Thursday, June 6, 2013

Importance of recording observation frequency


With some feedback from MigrantWatch regarding the post about weekend bias in bird observations, it seems that the "Observation frequency" field is very important indeed. The whole weekend effect  does seem to disappear when the data is filtered for only Daily observations.



While, only 3029 of the 22622 observations ~13% of the observations are Daily, the weekend effect does have a nice control and should affect first and last sighting results much less. Just staring at the graphs might not be appealing to the more statistically minded, so doing a test like analysis of variance might be more productive.


 > data.frame(Day=format(as.Date(M$Date), format="%A"),Year=format(as.Date(M$Date), format="%Y"))->WD  
 > as.data.frame(table(WD))->WDD  
 > WDD[WDD$Year %in% c(2007:2013),]->WDDF  
 > aov(Freq~Day,WDDF)->WDDFA  
 > summary(WDDFA)  
       Df Sum Sq Mean Sq F value  Pr(>F)    
 Day     6 3875954 645992 5.0784 0.0005365 ***  
 Residuals  42 5342556 127204             
 ---  
 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1   
 > M[M$"Observation frequency" == "Daily",]->MD  
 >   
 > data.frame(Day=format(as.Date(MD$Date), format="%A"),Year=format(as.Date(MD$Date), format="%Y"))->DWD  
 > as.data.frame(table(DWD))->DWDD  
 > DWDD[DWDD$Year %in% c(2007:2013),]->DWDDF  
 > aov(Freq~Day,DWDDF)->DWDDFA  
 > summary(DWDDFA)  
       Df Sum Sq Mean Sq F value Pr(>F)  
 Day     6  466  77.75 0.0548 0.9993  
 Residuals  42 59625 1419.64          

While the ANOVA shows a significant effect of Day for the whole dataset, it does not show one for data filtered for only daily observations. A much larger sample size would be needed to ensure that this field is being used in a proper way.

The other point about these observations being affected by Holidays also seems very valid. 
 as.data.frame(table(data.frame(Day=format(as.Date(M$Date), format="%j"))))->DY  
 > DY[DY$Freq>200,]  
   Var1 Freq  
 20  020 246  
 27  027 224  
 337 337 261  
 358 358 213  
 365 365 204  
By tabulating the observations by day of year, annual holidays like Christmas and New years should be captured. Lo and behold! Christmas (358th day of the year) and New years eve (365th day) are in the top five.

Why does 337th day (Lawyers day in India) have the most observations? May be most of the birders are lawyers!!


Wednesday, June 5, 2013

Bird populations increase during weekends

Yes, the populations of birds increases drastically during weekends. This apparent bias in recording bird sightings during weekends in citizen science databases has been quantified for Europe and North America by Courter et al. They see a decreasing effect of the weekend effect over time for the bird species that they analyzed.

MigrantWatch is a 5 year old citizen science initiative in India that has recorded as many as 22,622 reports as of today. Although the web interface started in 2007, the dataset available for download does have 63 observations that are older, going all the way back to 1910 (Whitehead, CHT. 1911. JBNHS 21).

To get started with analyzing this dataset, it might be interesting to see how the weekend bias has changed over the years. Since, many of these reports are filed by amateur birders it should definitely show a weekend bias. In-fact Dr Jayant Wadatkar fondly remembers a weekend trip to Malkhed Reservoir when he sighted a Common Shelduck for the first time in Maharashtra. While such anecdotal information and common sense would suggest that the weekend effect is real, quantifying its magnitude might be more useful.

So lets tabulate it using the following lines:

 read.csv("migrantwatch_reports.csv",row.names=NULL)->M  
 colnames(M)<-c("Species","Location name","City","State","Reporter","Date","Sighting type","Observation frequency","Start date","On behalf of","Latitude","Longitude")  
 data.frame(Day=format(as.Date(M$Date), format="%A"),Year=format(as.Date(M$Date), format="%Y"))->WD  
 table(WD)  

Tabulation of all observations per day of week from 2007 onward

Day2007200820092010201120122013Total
Monday481862172262496622771865
Tuesday632492082282785623201908
Wednesday742281872363638072912186
Thursday642282522852707223132134
Friday492443003063066465082359
Saturday12349147761464216155874549
Sunday216848835895954247713337558
637247424762790306274913629

The weekend (Sunday more than Saturday) effect does seem rather clear just by eye. Observations for each of the weekdays ranges from 8 to 10% while Saturday has 20% and Sunday has more than 30% of the observations.

While this will obviously affect estimates of first sighting and last sighting for a season in the short term, over time this effect should not affect the larger patterns. However, one has to definitely check the effect of the day of week on all results that are obtained.

Due to the nature of the data collection effort the weekend effect is something that one has to live with.