For today, I have a new paper that is only partially bioinformatic; however, I’ll obviously focus on that part. The article was written by Andrew T. Major, Yoichi Miyamoto, Camden Y. Lo, David A. Jans and Kate L. Loveland and can be found here: http://www.nature.com/articles/srep43323
Generally, the authors hypothesized that changes in IMPα proteins in paraspeckles are important for spermatogenesis. Because earlier attempts had been unsuccessful (owing to natural variability among cells), they built a high-throughput image analysis pipeline that analyzes thousands of cells, so that real effects can be separated from the background of normal biological variation. Images were acquired as tiled confocal z-series. The authors used Imaris software to identify cells, nuclei and PSPC1/SFPQ nuclear spots. Results were exported from Imaris in CSV format and processed with Python scripts, and the statistical analysis was then performed in R (all scripts, Python and R as well as shell, are available as supplementary data).
Thresholds were computed for two fluorescent signals: GFP, which was used to image the whole cell (cytoplasm and nucleus), including the autofluorescent signal from unmodified cells, and DAPI/Draq5, which was used to image nuclei. Cells located on the border of images were excluded from the analysis in the R scripts, as were cells with undetectable nuclei. The pipeline calculated the ratio between the nuclear and cytoplasmic fluorescent signals for proteins, which is very time-consuming and subjective when done manually. Results obtained with the automatic workflow were in general agreement with earlier results gathered by manual analysis of paraspeckles.
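To make the filtering and the ratio more concrete, here is a minimal sketch in Python. It is not the authors' code (they do the exclusion in R), and the column names (cell_id, nuclear_mean, cytoplasmic_mean, touches_border) and the sample values are my own assumptions, not the real Imaris export schema.

```python
import csv
import io

# Made-up per-cell export; the column names are assumptions for illustration,
# not the actual Imaris CSV layout.
SAMPLE_CSV = """cell_id,nuclear_mean,cytoplasmic_mean,touches_border
1,120.0,60.0,0
2,80.0,40.0,1
3,,55.0,0
4,90.0,60.0,0
"""


def nuclear_cytoplasmic_ratios(csv_text):
    """Return {cell_id: nuclear/cytoplasmic ratio} for the usable cells."""
    ratios = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["touches_border"] == "1":
            continue  # cell is cut off by the image edge
        if not row["nuclear_mean"]:
            continue  # no nucleus was detected for this cell
        cytoplasmic = float(row["cytoplasmic_mean"])
        if cytoplasmic <= 0:
            continue  # avoid dividing by zero
        ratios[row["cell_id"]] = float(row["nuclear_mean"]) / cytoplasmic
    return ratios


ratios = nuclear_cytoplasmic_ratios(SAMPLE_CSV)
print(ratios)
```

In this toy input, cell 2 is dropped because it touches the border and cell 3 because it has no nuclear measurement; only cells 1 and 4 get a ratio.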
To cope with the overwhelming amount of data, the authors used PCA to visualize, in 2D, the relationships among the analyzed parameters. For the statistical analysis, the R scripts used additional packages: “car”, “epitools”, “geepack” and “ggplot2”. Comparisons between groups were made using generalized linear models, with logistic or linear regression.
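For readers who have not used PCA for this kind of overview, a rough sketch of the idea in Python with NumPy follows. The authors do this step in R; the data here are randomly generated, and standardizing each parameter before the decomposition is my assumption, not something stated in the paper.

```python
import numpy as np


def pca_2d(data):
    """Project the rows of `data` onto their first two principal components."""
    centered = data - data.mean(axis=0)
    # Standardize each column so parameters on different scales are comparable
    # (an assumption for this sketch).
    scaled = centered / centered.std(axis=0, ddof=1)
    # Rows of `vt` are the principal axes, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    scores = scaled @ vt[:2].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:2]


# Synthetic stand-in for per-cell measurements: 200 cells x 4 parameters,
# three of which are strongly correlated with each other.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
correlated = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
measurements = np.hstack(correlated + [rng.normal(size=(200, 1))])

scores, explained = pca_2d(measurements)
print(scores.shape, explained)
```

Each cell ends up as one point in the 2D `scores` array, which is what would be drawn on a scatter plot, and `explained` says what fraction of the total variance each of the two axes carries.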
What is really important is that this pipeline could be adapted quite simply for analyses in other research areas. A disadvantage is the need to use Imaris software, which, besides its unquestionable advantages, is also very expensive. However, with some manipulation, results may also be exported in CSV format from free imaging software, e.g. ImageJ, although the pipeline would then need to be heavily modified and tested.
I’m actually quite impressed with the explanation of the scripts in the readme file and in the scripts themselves, especially as it’s not a bioinformatic paper. These pipeline scripts are sometimes easier to use than software published to be used as is. They worked perfectly from the first run on Ubuntu 16.04 LTS. Good job! It took 5 days to run all the analyses (on a 4-year-old personal computer) and, as expected, the results were the same as in the publication (no, I did not check them all). You just need to check all the dependencies (indicated in the scripts, e.g. R libraries). However, there is a need to explain in more detail (and somewhere easy to find) what each folder contains and how to interpret each CSV file and the results in the generated PDFs.
Moreover, some of the lines in the scripts are unnecessary; the authors knew that and left them in anyway. I don’t know why. In the StatsView.R script we can find these lines:
citation("ggbiplot") # Nothing used from this in final analysis
citation("reshape2") # Requirement for ggplot2?
So why include libraries that are not used, and why not check whether something is really a requirement? However, the script is enormous, and it tells the user in the shell which step it is currently on, so it’s pardonable.
Overall, the scripts are reusable, and this paper may be a good example of how sharing our scripts can help accelerate research 🙂