Quick Hacks for R BatchJobs: An awesome package that instantly enhanced my R workflow

I work on an HPC cluster, and we use the LSF scheduler to run our jobs. I conduct most of my interactive data analysis on my laptop, whether I am testing code on small datasets or doing daily post-processing such as generating plots.
Working with NGS datasets, even my interactive analysis workflow has shifted to the cluster, for two reasons:

  1. Most of the interactive work still involves quite large datasets, so my heavy lifting is usually done on the cluster in an interactive session (I use Emacs Org mode quite heavily for this).
  2. Eventually, my workflows usually turn into multiple parallel jobs submitted to the scheduler, and working on the cluster directly ensures there are no hiccups in terms of dependencies, versions, etc.

As many might agree, this was a little cumbersome: once I had a working piece of code that could now be run in parallel many times, I would have to somehow transform it into a script that could be submitted on the command line using Rscript or R CMD BATCH. Two issues immediately come to mind with this workflow:

  1. If something changed in the main code, it would take some time to get everything working right again.
  2. Many times I would have part of an analysis ready that could take a few hours to run, say a bootstrap estimation, and getting it onto the scheduler would require either interrupting my current interactive session or starting another session while the process ran in the current one. That meant setting up the new environment (data and variables) exactly like the previous one to pick up where I left off.

I came across the R BatchJobs package a few months ago and was excited, but was unable to play around with it. Recently, I started working on some co-expression analysis using the WGCNA package and was also testing out some glasso approaches. With the number of genes involved, my code for bootstrapping WGCNA was taking a couple of hours, and one of the glasso runs was taking about 10 hours. I had the bootstrap code ready to go, but I would have had to step away from my interactive analysis to write the scripts to submit the batch jobs.

Enter R BatchJobs to save the day. Here I will present a quick hack to get started right away; I haven't gone through the entire package in detail, rather I just picked the functions that would get me off the ground running on a cluster with an LSF scheduler (I imagine it will be somewhat different for the other supported schedulers).

The first step is to create a cluster template file; mine is posted below.

## Default resources can be set in your .BatchJobs.R by defining the variable
## 'default.resources' as a named list.

## remove everything in [] if your cluster does not support array jobs
#BSUB -J <%= job.name %>[1-<%= arrayjobs %>] # name of the job / array jobs
#BSUB -o <%= log.file %> # output is sent to logfile, stdout + stderr by default
#BSUB -q <%= resources$queue %> # Job queue
##BSUB -W <%= resources$walltime %> # Walltime in minutes
##BSUB -M <%= resources$memory %> # Memory requirements in Kbytes

# we merge R output with stdout from LSF, which gets then logged via -o option
module load R/3.2.2
Rscript --no-save --no-restore --verbose <%= rscript %> /dev/stdout
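The <%= %> placeholders are brew template tags that BatchJobs fills in at submission time; the resources$... values come from the resources list you pass to submitJobs. For example, a call along these lines (the queue name is just an illustration) is what populates the #BSUB -q line above:

submitJobs(reg, resources=list(queue="medium"))  # resources$queue in the template becomes "medium"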


Then I saved it in a specific location with the name lsfTemplate.tmpl, and I was off and running with just three functions, as below.

The first function reads in the LSF template and sets up the configuration for the scheduler:
cluster.functions <- makeClusterFunctionsLSF("/data/talkowski/ar474/lsfTemplate.tmpl")
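BatchJobs reads its configuration from a .BatchJobs.R file sourced from your home or working directory, so this assignment is best placed there (or registered in the session via setConfig). A minimal sketch of such a file, assuming the same template path as above; the default.resources values are purely illustrative:

## .BatchJobs.R -- sourced by BatchJobs when the package loads
cluster.functions <- makeClusterFunctionsLSF("/data/talkowski/ar474/lsfTemplate.tmpl")
## optional: defaults for the resources$... template values when not passed to submitJobs()
default.resources <- list(queue="medium", walltime=60, memory=4096)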

Next is to create a registry, for which you need three pieces of information:

  1. the id: I think of this as a project ID rather than a job ID, and I will explain why soon.
  2. the file.dir: This is where everything gets stored, so make sure you have plenty of space here.
  3. src.dirs: Any .R files in this folder will be "sourced" in the bsub job.

reg <- makeRegistry(id="test_boot_reg", seed=123,
    file.dir="/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs",
    src.dirs="/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rfunctions")

The next function is batchMap, which I think of as the .lsf scripts we write to submit jobs. You need to pass the registry, the function you want to run, a vector to split over, and any additional arguments you want to pass to the function:

batchMap(reg, bootRun, seq(from=10, by=120, length.out=10),
    more.args=list(indat=thresh.wide.dat, nBoot=nBoot, nSamp=nSamp, nCol=nCol))

The main trick here is that you can wrap almost anything within the function, but you need to specify a vector or list after the function, which the documentation describes as

… [any]
Arguments to vectorize over (list or vector).

Suppose you had a large job that uses a for loop to bootstrap a dataset 10,000 times, calculating the correlation matrix each time, and then returns the average of the correlation matrices, and say you wanted to submit this as a single job to the cluster with batchMap. I found that if I used the dataset as the main argument it got vectorized, so I made the first argument to my function a dummy variable and passed it a single value:

exampleBoot <- function(a, indat, nBoot) { # bootstrap using a for loop here and return the result
}

and then call batchMap like below

batchMap(reg, exampleBoot, 0, more.args=list(indat=thresh.wide.dat, nBoot=nBoot))

The above generates one job to submit to the cluster, and we can see the job IDs using

getJobIds(reg)
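For concreteness, here is a minimal runnable sketch of what exampleBoot could look like filled in. The body is my own illustrative filler, assuming indat is a numeric matrix with samples in rows:

exampleBoot <- function(a, indat, nBoot) {
    ## 'a' is the dummy argument that batchMap vectorizes over
    corSum <- matrix(0, ncol(indat), ncol(indat))
    for (b in seq_len(nBoot)) {
        idx <- sample(nrow(indat), replace=TRUE)  # resample rows with replacement
        corSum <- corSum + cor(indat[idx, ])      # accumulate correlation matrices
    }
    corSum / nBoot  # return the averaged correlation matrix
}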

Finally, you submit the jobs using the command below; any additional bsub options can be passed via the resources argument.

submitJobs(reg, resources=list(queue="medium"), progressbar=FALSE, max.retries=0)

One thing to note is that I had to change the max.retries value from the default, as I was getting an error (documented in the troubleshooting section below), so you should check whether the default works for you.

The fun part: suppose you wanted to submit 10 jobs of 100 bootstraps each; that is exactly what I did in the first batchMap example above. There I provide a sequence for the vector, and the function automatically creates a job for each element of the vector. So I used the dummy variable to pass a sequence of seeds, and I specify the number of bootstraps I want in my function, passed via the additional arguments:

batchMap(reg, bootRun, seq(from=10, by=120, length.out=10),
    more.args=list(indat=thresh.wide.dat, nBoot=nBoot, nSamp=nSamp, nCol=nCol))

The tremendous advantage of this whole process is that I am still in my interactive session. I tested out some code to run WGCNA and wanted to run a full-fledged bootstrap; I just wrapped my code in a function in the session, submitted it as jobs to the LSF scheduler, and now I can continue working with the same dataset to test other types of analysis.

Other useful functions are:

showStatus(reg)

res1 <- loadResult(reg, 1)
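Once the jobs finish, the per-job results can be gathered back into the session in one shot. A minimal sketch; the averaging step assumes each job returned a matrix, as in the bootstrap example above:

waitForJobs(reg)  # block until all jobs are done
resList <- loadResults(reg)  # list with one result per job
avgCor <- Reduce(`+`, resList) / length(resList)  # combine, assuming matrix results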

Update: multiple runs

I wanted to run bootstrap estimates at multiple sample sizes, so I created a sampleSizes vector to loop over. You can also point to a directory where you store your generic R scripts, and these will get sourced for each job you run on the cluster; otherwise, if you have already sourced those files in your current environment, they will be accessible to the jobs. Below I set up 1000 bootstraps (10 jobs of 100 each) for each of the sample sizes and create a list to hold the registries, which seems to work well. I will add more explanation if anyone needs it.

sampleSizes <- c(8,10,12,14,16,18,20,25,30,50)
outDirPrefix <- "/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs"
funcDirs <- "/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rfunctions"
seedMat <- matrix(seq(from=10, by=110, length.out=10*length(sampleSizes)), ncol=10, byrow=TRUE)
if (nCol=="all") nCol <- ncol(expr)
nBoot <- 100
reg <- list()
for (i in 1:length(sampleSizes)) {
    sSize <- sampleSizes[i]
    regId <- paste("boot_wgcna", sSize, sep="_")
    outDir <- paste(outDirPrefix, sSize, sep="_")
    reg[[i]] <- makeRegistry(id=regId, seed=123, file.dir=outDir, src.dirs=funcDirs)
    batchMap(reg[[i]], bootRun, as.vector(seedMat[i,]),
        more.args=list(indat=expr, nBoot=nBoot, nSamp=sSize, nCol=nCol))
    getJobIds(reg[[i]])
}

for (i in 1:length(reg)) {
    submitJobs(reg[[i]], resources=list(queue="medium"), progressbar=FALSE, max.retries=0)
}


##reg <- loadRegistry(file.dir=outDirPrefix)
##removeRegistry(reg)
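Because everything lives in file.dir, the commented lines above point to another nicety: you can reattach to a registry from a fresh R session later. A minimal sketch, using the sample-size-8 registry path built above:

reg8 <- loadRegistry(file.dir=paste(outDirPrefix, 8, sep="_"))  # reload an existing registry
showStatus(reg8)  # submitted / started / done / errors
findDone(reg8)    # ids of completed jobs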

Troubleshooting R BatchJobs Error: is.list(control)

I was testing out the R package BatchJobs and ran into this error, which was a little hard to troubleshoot.

The output (if any) follows:

running
'/source/R/3.2.2/lib64/R/bin/R --slave --no-restore --no-save --no-restore --file=/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs/jobs/01/1.R --args /dev/stdout'

WARNING: ignoring environment value of R_HOME
Loading required package: BBmisc
Loading required package: methods
Loading registry: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs/registry.RData
Loading conf:
2016-03-24 16:09:17: Starting job on node cmu095.research.partners.org.
Auto-mailer settings: start=first+last, done=first+last, error=all.
Setting work dir: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rscripts
Error : is.list(control) is not TRUE
Error in doWithOneRestart(return(expr), restart) : bad error message
Calls: <Anonymous> -> sendMail -> warningf
Setting work back to: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rscripts
Memory usage according to gc:
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  352464 18.9     592000 31.7   460000 24.6
Vcells  475908  3.7    1023718  7.9   786431  6.0
Execution halted

After loading the .RData files in the registry folder and working through the traceback, I finally figured out that it was mainly due to the max.retries option of the submitJobs function. I changed it from the default of 10 to 0; the error was resolved, and I could start submitting my jobs to LSF from R.

I am really grateful to the people who developed this package for their efforts. Given the fix I found, I would rather not take up their time by posting a bug on GitHub, as this might be a system-specific issue. If the developers ever have the time and inclination to comment on this, that would be great.