Sphinx docs make : ImportError: cannot import name normalize

I ran into this simple error that could have consumed my whole day. I had set up Sphinx in a conda environment. However, when I tried to run `make html`, I ended up with the following error:

CIS2X1NFGTF1:bioflows aragaven$ make html
Traceback (most recent call last):
  File "/Users/aragaven/anaconda2/lib/python2.7/site-packages/Sphinx-1.4.6-py2.7.egg/sphinx/__main__.py", line 14, in <module>
  File "/Users/aragaven/anaconda2/lib/python2.7/site-packages/Sphinx-1.4.6-py2.7.egg/sphinx/__init__.py", line 51, in main
  File "/Users/aragaven/anaconda2/lib/python2.7/site-packages/Sphinx-1.4.6-py2.7.egg/sphinx/__init__.py", line 61, in build_main
    from sphinx import cmdline
  File "/Users/aragaven/anaconda2/lib/python2.7/site-packages/Sphinx-1.4.6-py2.7.egg/sphinx/cmdline.py", line 14, in <module>
    import optparse
  File "/Users/aragaven/anaconda2/lib/python2.7/optparse.py", line 419, in <module>
    _builtin_cvt = { "int" : (_parse_int, _("integer")),
  File "/Users/aragaven/anaconda2/lib/python2.7/gettext.py", line 569, in gettext
    return dgettext(_current_domain, message)
  File "/Users/aragaven/anaconda2/lib/python2.7/gettext.py", line 533, in dgettext
  File "/Users/aragaven/anaconda2/lib/python2.7/gettext.py", line 468, in translation
    mofiles = find(domain, localedir, languages, all=1)
  File "/Users/aragaven/anaconda2/lib/python2.7/gettext.py", line 440, in find
    for nelang in _expand_lang(lang):
  File "/Users/aragaven/anaconda2/lib/python2.7/gettext.py", line 133, in _expand_lang
    from locale import normalize
ImportError: cannot import name normalize
make: *** [html] Error 1

I was puzzled, as this had worked just before. Some googling revealed that this could be due to conflicting Python versions. This immediately led me to suspect that the Python versions differed between my root environment and another environment I was using to create some of the documentation. This was indeed the case: once I ran `source activate` for the other environment, the build worked without any errors.


Error installing R/igraph: unable to load shared object ‘../igraph.so’: libgfortran.so.4: cannot open shared object file: No such file or directory


I ran into this tiny error that could have consumed my whole day. I had set up an AWS Ubuntu 16.04 (Xenial) image and installed R. I think I followed some random web page and ended up installing the latest version of R, v3.4.2.

I was trying to install the package `phangorn`, which has igraph as its dependency, and lo and behold, I could not install it; it kept failing with this error:


Google turned up a few suggestions that seemed helpful, including installing libxml2-dev.

The link below helped me first troubleshoot the foreign-graphml error:

igraph_hacks_internal.h:42:0: warning: "strdup" redefined
 ^
In file included from /usr/include/string.h:630:0,
                 from src/foreign-gml-parser.y:54:
/usr/include/x86_64-linux-gnu/bits/string2.h:1291:0: note: this is the location of the previous definition
 ^
gcc -std=gnu99 -I/usr/share/R/include -DNDEBUG -DUSING_R -I. -Iinclude -Ics -Iglpk -Iplfit -ICHOLMOD/Include -IAMD/Include -ICOLAMD/Include -ISuiteSparse_config -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -DNDEBUG -DNPARTITION -DNTIMER -DNCAMD -DNPRINT -DPACKAGE_VERSION=\"1.1.1\" -DINTERNAL_ARPACK -DIGRAPH_THREAD_LOCAL=/**/ -fpic -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c foreign-graphml.c -o foreign-graphml.o
foreign-graphml.c: In function ‘igraph_write_graph_graphml’:
foreign-graphml.c:1408:46: error: expected ‘)’ before ‘GRAPHML_NAMESPACE_URI’
 ret=fprintf(outstream, "\n");
 ^
/usr/lib/R/etc/Makeconf:159: recipe for target ‘foreign-graphml.o’ failed
make: *** [foreign-graphml.o] Error 1
ERROR: compilation failed for package ‘igraph’
removing ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.4/igraph’
ERROR: dependency ‘igraph’ is not available for package ‘phangorn’
removing ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/3.4/phangorn’

The downloaded source packages are in ‘/tmp/Rtmp7mCd4h/downloaded_packages’
Warning messages:
1: In install.packages("phangorn", repos = "http://cran.mtu.edu") :
  installation of package ‘igraph’ had non-zero exit status
2: In install.packages("phangorn", repos = "http://cran.mtu.edu") :
  installation of package ‘phangorn’ had non-zero exit status

However, none of them seemed to address the second round of errors where, even after compiling, igraph.so failed to load.

It turns out it was a simple thing to fix. The key is to recognize that the second line of the error message was the culprit, even though it is not itself flagged as an error:

libgfortran.so.4: cannot open shared object file: No such file or directory

There was no libgfortran.so.4 installed on my machine and this is not available by default on Xenial.

ubuntu@ip-172-31-93-178:/usr/local/lib$ find /usr -name "libgfortran*"

However, the R version I had installed was somehow compiled against this version, and further investigation revealed that it is basically part of the gcc-7 toolchain. So, for my purposes, I installed gcc-7 and gfortran-7 from a PPA on Ubuntu, based on this SO post and this post. I added ppa:jonathonf/gcc-7.1, as specified in one of the comments, and then installed as follows:

sudo apt-get install gcc-7 g++-7 gfortran-7

The gfortran-7 package is key, as that is what installs the gfortran command, and Voila!!! Now I can install igraph.
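To confirm the fix from inside R before retrying the install, a quick check along these lines might help. This is only a sketch: the executable names and the list of directories searched are assumptions about a typical Ubuntu layout, so adjust them for your system.

```r
# Is a gfortran executable on the PATH? Sys.which() returns "" for names it cannot find.
# Both names are checked because the versioned package may only provide gfortran-7.
gfortran_path <- Sys.which(c("gfortran", "gfortran-7"))

# Look for libgfortran shared objects in some common library directories.
# These directories are an assumption; ones that do not exist are simply skipped.
lib_dirs <- c("/usr/lib", "/usr/local/lib")
libgfortran_files <- list.files(lib_dirs, pattern = "^libgfortran",
                                recursive = TRUE, full.names = TRUE)

print(gfortran_path)
print(libgfortran_files)

# If libgfortran.so.4 is not among the files found, igraph.so will still fail to load.
has_so4 <- any(grepl("libgfortran\\.so\\.4", libgfortran_files))
print(has_so4)
```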

Hope this is helpful to someone out there.

Fixing rJava error on Mac OSX El Capitan

I was trying to reinstall RDAVIDWebService after upgrading to R-3.3.1 on El Capitan, and I followed the steps from my previous blog post. You no longer have to fix the SSL issue, but you still have to reconfigure Java. After the step for reinstalling rJava, I still ran into this error:

Error : .onLoad failed in loadNamespace() for 'rJava', details:
  call: dyn.load(file, DLLpath = DLLpath, ...)
  error: unable to load shared object '/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so, 6): Library not loaded: @rpath/libjvm.dylib
  Referenced from: /Library/Frameworks/R.framework/Versions/3.3/Resources/library/rJava/libs/rJava.so
  Reason: image not found
Error: package or namespace load failed for ‘RDAVIDWebService’

Ugh... some issue with the rJava library. Google turned up this result from StackOverflow, and I just ran the following command in the terminal:

sudo ln -f -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib

and it worked right away.

UPGRADING R on Mac OSX: Quick and Dirty

A quick note on a painless upgrade of R in the OSX environment; I think this should work on most systems. I was originally running 3.2.2 on my MacBook, and when I tried to install a package I ran into dependency issues and bugs that were fixed in a later version. So, of course, it was back to installing the latest version of R. One issue that always crops up is that since the last install I have downloaded a bunch of packages from both CRAN and Bioconductor, and I wanted a quick way of updating and re-installing them. Previously, I had just worked from a clean install, as I did not have any major analysis ongoing. This time was different: I do have analysis ongoing and did not want to deal with installing packages on demand. Googling turned up a few suggestions, and I used them to construct a quick way to upgrade.

First, before re-installing, get the .libPaths() value for the current install. Now reinstall R. After reinstalling, we get a list of the packages installed in the previous version using the commands below.

> .libPaths()  ## shows the path for the new version
[1] "/Library/Frameworks/R.framework/Versions/3.3/Resources/library"

So we use the paths from the previous version of R

> package_df <- as.data.frame(installed.packages("/Library/Frameworks/R.framework/Versions/3.2/Resources/library"))
> package_list <- as.character(package_df$Package)
> install.packages(package_list)

This works to install all the packages from CRAN; however, I got this warning message afterwards:

Warning message:
packages ‘affy’, ‘affyio’, ‘airway’, ‘ALL’, ‘annotate’, ‘AnnotationDbi’, ‘AnnotationForge’, ‘aroma.light’, ‘Biobase’, ‘BiocGenerics’, ‘biocGraph’, ‘BiocInstaller’, ‘BiocParallel’, ‘BiocStyle’, ‘biomaRt’, ‘BioNet’, ‘Biostrings’, ‘biovizBase’, ‘BSgenome’, ‘BSgenome.Hsapiens.UCSC.hg19’, ‘Category’, ‘cellHTS2’, ‘chipseq’, ‘clipper’, ‘clusterProfiler’, ‘ComplexHeatmap’, ‘cqn’, ‘DEGraph’, ‘DESeq’, ‘DESeq2’, ‘DO.db’, ‘DOSE’, ‘DynDoc’, ‘EDASeq’, ‘edgeR’, ‘EnrichmentBrowser’, ‘FGNet’, ‘fibroEset’, ‘gage’, ‘gageData’, ‘genefilter’, ‘geneplotter’, ‘GenomeInfoDb’, ‘GenomicAlignments’, ‘GenomicFeatures’, ‘GenomicRanges’, ‘ggbio’, ‘globaltest’, ‘GO.db’, ‘GOSemSim’, ‘GOSim’, ‘GOstats’, ‘GOsummaries’, ‘graph’, ‘graphite’, ‘GSEABase’, ‘hgu133a.db’, ‘hgu133plus2.db’, ‘hgu95 [... truncated]

As I suspected, the packages from Bioconductor were not installed. So I decided to use the same approach and came up with the following:

  • Get a list of the packages installed in the current version from CRAN
> package_df_new <- as.data.frame(installed.packages("/Library/Frameworks/R.framework/Versions/3.3/Resources/library"))
> package_list_new <- as.character(package_df_new$Package)
  • Compare that list to the old list; the packages missing from the new list are the ones from Bioconductor
> package_bioc <- setdiff(package_list, package_list_new)
  • Finally, install those packages from Bioconductor
> source("https://bioconductor.org/biocLite.R")
trying URL 'https://bioconductor.org/packages/3.3/bioc/bin/macosx/mavericks/contrib/3.3/BiocInstaller_1.22.3.tgz'
Content type 'application/x-gzip' length 54312 bytes (53 KB)
downloaded 53 KB

The downloaded binary packages are in
Bioconductor version 3.3 (BiocInstaller 1.22.3), ?biocLite for help
> biocLite(package_bioc)
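The comparison step is easy to get subtly wrong, so here is a toy sketch of it using base R's setdiff(). The package names below are made-up stand-ins for the real old and new lists, not the actual contents of either library.

```r
# Hypothetical stand-ins for the package names found in the old and new libraries.
package_list     <- c("ggplot2", "plyr", "DESeq2", "edgeR", "Biobase")
package_list_new <- c("ggplot2", "plyr")

# Packages present in the old library but missing from the new one.
package_bioc <- setdiff(package_list, package_list_new)
print(package_bioc)

# Caveat: negative indexing with which() breaks when nothing matches,
# because x[-integer(0)] selects nothing instead of everything.
none_installed <- character(0)
print(package_list[-which(package_list %in% none_installed)])  # character(0)
print(setdiff(package_list, none_installed))                   # all five names
```

Note that anything missing from the new library ends up in this list, so a package that simply failed to install from CRAN would be handed to Bioconductor too.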

Steps to create R packages in Emacs Org-mode

Create a package skeleton

I am following the steps in Hadley Wickham’s excellent tutorial on how to create R packages, but I have decided to see whether I can use Emacs Org-mode to do it. First, let’s load the devtools package with library(devtools).


Now we create a skeleton for the package using devtools::create. Since we are using Emacs and not RStudio, I set the rstudio option to FALSE.

Package: bootWGCNA
Title: What the Package Does (one line, title case)
Authors@R: person("First", "Last", email = "first.last@example.com", role = c("aut", "cre"))
Description: What the package does (one paragraph).
Depends: R (>= 3.2.2)
License: What license is it under?
LazyData: true

Now we edit the basic metadata in the DESCRIPTION file and add the title, etc.

We can add Imports and Suggests to the DESCRIPTION using devtools as follows

devtools::use_package("WGCNA", pkg="~/Documents/Research/R-packages/bootWGCNA")
Adding WGCNA to Imports
Refer to functions with WGCNA::fun()

It might print the message above, depending on whether you run the function in the R console or not. The best way to check is to look at the DESCRIPTION file in another Emacs buffer.

Finally, we create an automated tests folder as follows


I think you can now just create your R files in the package and edit and clean them up using Emacs. You could test a function in Org-mode, keeping track of your work, and once done, just create an R file in the package directory structure.

Quick Note: ggplot2 namespace error on loading

More than anything else, this is a quick reminder to myself and perhaps anyone else who encounters this error. When running an R session with the WGCNA package loaded, I get an error with ggplot2. This is because WGCNA loads the Hmisc namespace, which in turn loads the ggplot2 namespace. However, neither the ggplot2 nor the Hmisc package is attached. Therefore, when I try to load ggplot2, I get this error:

Error in unloadNamespace(package) :
namespace ‘ggplot2’ is imported by ‘Hmisc’ so cannot be unloaded
Error in library(ggplot2) :
Package ‘ggplot2’ version 2.0.0 cannot be unloaded

I keep forgetting the load/detach/unload stuff, so seeing that I had to google this a second time, I decided to write it up so that I remember.

First unload the WGCNA package


then unload the WGCNA namespace


finally unload the Hmisc namespace


and now library(ggplot2) will work as before
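The commands for these steps did not survive in the post, so here is a minimal sketch of the same sequence using base R's detach() and unloadNamespace(). The helper name unload_pkg is hypothetical, purely for illustration, and the calls do nothing if the packages are not loaded.

```r
# Hypothetical helper: detach a package from the search path (if attached)
# and then unload its namespace (if loaded).
unload_pkg <- function(pkg) {
  search_name <- paste0("package:", pkg)
  if (search_name %in% search()) {
    detach(search_name, character.only = TRUE)
  }
  if (pkg %in% loadedNamespaces()) {
    unloadNamespace(pkg)
  }
  invisible(!(pkg %in% loadedNamespaces()))
}

# Order matters: WGCNA is what pulled in Hmisc, so WGCNA has to go first.
unload_pkg("WGCNA")
unload_pkg("Hmisc")

# After this, library(ggplot2) should load cleanly.
```

If some other loaded package also imports Hmisc, unloadNamespace() will stop with the same kind of "is imported by" error, and that package would need to be unloaded first.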


Quick Hacks for R BatchJobs: An awesome package that instantly enhanced my R workflow:

I work on an HPC cluster, and we use the LSF scheduler to run our jobs. I conduct most of my interactive data analysis on my laptop when I am testing code on small datasets, or for day-to-day post-processing of data such as making plots.
Working with NGS datasets, even my interactive analysis workflow has shifted to the cluster, for two reasons:

  1. Most of the interactive work still involves quite large datasets, so my heavy lifting is usually done on the cluster using an interactive session (I use Emacs Org-mode quite heavily for this).
  2. Eventually, my workflows usually turn into multiple parallel jobs that are submitted to the scheduler, and working on the cluster directly ensures that there are no hiccups in terms of dependencies, versions, etc.

As many might agree, this was a little cumbersome: once I had a working piece of code that could be run in parallel many times, I would have to somehow transform it into a script that could be submitted from the command line using Rscript or R CMD BATCH as batch jobs. Two issues immediately come to mind with this workflow:

  1. If something changed in the main code, it would take some time to get everything working right again.
  2. Many times I would have part of the analysis ready that could take a few hours, say a bootstrap estimation, and getting it up and running on the scheduler would require either interrupting my current interactive session or starting another session while the process runs in the current one. This means I would have to set up my environment (data and variables) exactly as in the previous session to pick up where I left off.

I came across the R BatchJobs package a few months ago and was excited, but was unable to play around with it. Recently, I started working on some co-expression analysis using the WGCNA package and was also testing out some glasso approaches. With the number of genes involved, some of my code for bootstrapping WGCNA was taking a couple of hours, and one of the glasso runs was taking about 10 hours. I now had the bootstrap code ready to go, but I would have had to take myself away from my interactive analysis to write the scripts to submit the batch jobs.

Enter R BatchJobs to save the day. Here I will present a quick hack to get started right away. I haven’t gone through the entire package in detail; rather, I just picked the functions that would get me off the ground and running on a cluster using an LSF scheduler (I can imagine that it will be quite different for the other schedulers supported).

The first step is to create a cluster script template file; mine is posted below.

## Default resources can be set in your .BatchJobs.R by defining the variable
## 'default.resources' as a named list.

## remove everthing in [] if your cluster does not support arrayjobs
#BSUB -J <%= job.name %>[1-<%= arrayjobs %>] # name of the job / array jobs
#BSUB -o <%= log.file %> # output is sent to logfile, stdout + stderr by default
#BSUB -q <%= resources$queue %> # Job queue
##BSUB -W <%= resources$walltime %> # Walltime in minutes
##BSUB -M <%= resources$memory %> # Memory requirements in Kbytes

# we merge R output with stdout from LSF, which gets then logged via -o option
module load R/3.2.2
Rscript --no-save --no-restore --verbose <%= rscript %> /dev/stdout


Then I saved it in a specific location with the name lsfTemplate.tmpl, and I was off and running with just 3 functions, as below.

The first function reads in the LSF template and sets up the configuration for the scheduler in the environment
cluster.functions <- makeClusterFunctionsLSF("/data/talkowski/ar474/lsfTemplate.tmpl")

Next, create a registry, for which you need 3 pieces of information:

  1. the id: I think of this as a project ID rather than a job ID, and I will explain this soon
  2. the file.dir: This is where everything gets stored, so make sure you have plenty of space here
  3. src.dirs: Any .R files in this folder will be “sourced” in the bsub job

reg <- makeRegistry(id="test_boot_reg", seed=123,

The next function is the batchMap function, which I think of as the equivalent of the .lsf scripts we write to submit jobs. You need to pass the registry, the function that you want to run, a vector to split over, and any additional arguments that you want to pass to the function.

batchMap(reg,bootRun,seq(from=10,by=120, length.out=10),more.args=list(indat=thresh.wide.dat,nBoot=nBoot, nSamp=nSamp, nCol=nCol))

The main trick here is that you can wrap almost anything within the function, but you need to specify a vector or list after the function, which the documentation describes as

… [any]
Arguments to vectorize over (list or vector).

Suppose you wanted to run a large job that uses a for loop to bootstrap a dataset 10000 times, calculating the correlation matrix each time, and then returns the average of the correlation matrices. Say you just wanted to submit this as a single job to the cluster with the batchMap function. I found that if I used the dataset as the main argument it got vectorized, so I made the first argument to my function a dummy variable and passed it a single value.

exampleBoot <- function(a, indat, nBoot) {
  ## bootstrap using a for loop here and return the result
}

and then call batchMap like below


Now the above generates the code to submit one job to the cluster, and we can see the job IDs using


Finally, you submit the jobs using the command below and any additional bsub options can be passed using the resources argument.

submitJobs(reg,resources=list(queue="medium"), progressbar=FALSE,max.retries=0)

One thing to note is that I had to change the max.retries value from the default, as I was getting an error documented in my other post, so you should check whether that works for you.

The fun part is this: suppose you wanted to submit 10 jobs of 100 bootstraps each. That is exactly what I did earlier in the first batchMap example. Here I provide a sequence for the vector, and the function automatically creates a job for each element of the vector. So I decided to use the dummy variable to pass a sequence of seeds. I also now specify the number of bootstraps I want in my function, which I pass as an additional argument.

batchMap(reg,bootRun,seq(from=10,by=120, length.out=10),more.args=list(indat=thresh.wide.dat,nBoot=nBoot, nSamp=nSamp, nCol=nCol))
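To make the dummy-argument trick concrete, here is a small local sketch in base R. It uses lapply() as a stand-in for batchMap() and simulated data in place of the real expression matrix, so nothing is submitted to a cluster. The body of exampleBoot is an illustrative guess at the bootstrap described above, not the original code.

```r
# Illustrative bootstrap: resample rows nBoot times and average the
# correlation matrices. The first argument is the value being mapped over.
exampleBoot <- function(seed, indat, nBoot) {
  set.seed(seed)
  acc <- matrix(0, ncol(indat), ncol(indat))
  for (b in seq_len(nBoot)) {
    rows <- sample(nrow(indat), replace = TRUE)
    acc <- acc + cor(indat[rows, , drop = FALSE])
  }
  acc / nBoot
}

# Simulated data standing in for the real expression matrix.
set.seed(1)
indat <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)

# One "job": map over a single dummy value; the data goes in as an extra argument.
one_job <- lapply(1, exampleBoot, indat = indat, nBoot = 20)

# Ten "jobs": map over ten seeds, one result per seed.
seeds <- seq(from = 10, by = 120, length.out = 10)
ten_jobs <- lapply(seeds, exampleBoot, indat = indat, nBoot = 20)

print(length(one_job))   # 1
print(length(ten_jobs))  # 10
```

With BatchJobs, the second lapply() call corresponds roughly to batchMap(reg, exampleBoot, seeds, more.args = list(indat = indat, nBoot = 20)): the vector after the function decides how many jobs are created, and everything in more.args is passed to every job unchanged.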

The tremendous advantage of this whole process is that I am still in my interactive session. I tested out some code to run WGCNA and wanted to run a full-fledged bootstrap, so I just wrapped my code in a function in the session and submitted it as jobs to the LSF scheduler. Now I can continue working with the same data set to test other types of analysis.

Other useful functions are


res1 <- loadResult(reg,1)

To Be UPDATED .. multiruns

I wanted to run bootstrap estimates at multiple sample sizes, so I created a sampleSizes vector to loop over. You can also point to a directory where you store your generic R scripts, and these will get sourced for each job that you run on the cluster; otherwise, if you have already sourced those files in your current environment, they will be accessible to the jobs. So I am setting up to run 1000 bootstraps for each of the sample sizes below, and I create a list for the registries as well, which seems to work well. I will add more explanation if anyone needs it.

sampleSizes <- c(8,10,12,14,16,18,20,25,30,50)
outDirPrefix <- "/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs"
funcDirs <- "/data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rfunctions"
seedMat <- matrix(seq(from=10,by=110, length.out=10*length(sampleSizes)),ncol=10, byrow=TRUE)
if(nCol=="all") nCol <- ncol(expr)
nBoot <- 100
##regId <- "boot_wgcna"
reg <- list()
for (i in 1:length(sampleSizes)) {
  sSize <- sampleSizes[i]
  regId <- paste("boot_wgcna",sSize, sep="_")
  outDir <- paste(outDirPrefix,sSize, sep="_")
  reg[[i]] <- makeRegistry(id=regId, seed=123,file.dir=outDir,src.dirs=funcDirs)
  ## NOTE: the first line of this call was missing from the post; bootRun and
  ## seedMat[i, ] are assumed from the earlier batchMap example
  batchMap(reg[[i]], bootRun, seedMat[i, ],
           more.args=list(indat=expr,nBoot=nBoot, nSamp=sSize, nCol=nCol))
}

for (i in 1:length(reg)) {
  submitJobs(reg[[i]],resources=list(queue="medium"), progressbar=FALSE,max.retries=0)
}


##reg <- loadRegistry(file.dir=outDirPrefix)
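The layout of seedMat is worth a quick look, since it decides which seeds each registry would get. A small base R sketch follows; it assumes that row i holds the ten seeds for sampleSizes[i], which is an interpretation of the byrow layout rather than something the post states.

```r
sampleSizes <- c(8, 10, 12, 14, 16, 18, 20, 25, 30, 50)

# 10 seeds per sample size, filled row by row.
seedMat <- matrix(seq(from = 10, by = 110, length.out = 10 * length(sampleSizes)),
                  ncol = 10, byrow = TRUE)

print(dim(seedMat))    # 10 10
print(seedMat[1, ])    # seeds for sample size 8: 10, 120, ..., 1000
print(seedMat[2, 1])   # first seed for sample size 10: 1110
print(anyDuplicated(as.vector(seedMat)))  # 0, i.e. no seed is reused
```

Ten seeds per row with nBoot set to 100 gives the 1000 bootstraps per sample size mentioned above.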

Fixing Kernel panic in mac OSX El Capitan with Reference to hk.uds.netusb.controller

I had installed the Hawking wireless function app on Yosemite, and on updating to El Capitan a few weeks ago I realized that it was no longer compatible. More annoyingly, my Mac kept rebooting with a kernel panic error, and the backtrace pointed to the hk.uds.netusb.controller kernel extension. Googling traced the error back to the Hawking kernel module. The first time round, I ended up booting into recovery mode and re-installing the OS.

Today, my Mac gave the same trouble; apparently the kernel module got installed again somehow (or it was never removed), and I was geared up to re-install. However, when booting into recovery mode (press Option while booting up) and exploring a little, I found that I could get a terminal open. I was relieved, as I had not backed up the Mac and was a little worried about a re-install wiping my drive.

So I wondered if I could somehow remove the kernel module manually instead of re-installing like last time. I came across this post on the Apple forums, which pointed out that I could remove the kernel module. One point to remember is that your root is the recovery HD, not your normal boot-up disk. To get into your normal boot-up disk you need to cd into /Volumes/MacHD (my startup disk was called MacHD) and then follow the instructions in the above post. Basically, move the /Library/Extensions/kudshawking.kext folder under /Volumes/MacHD somewhere else.

As an aside, this might also be a good time to back up, at least over the network using rsync or something, if needed... but of course I didn’t do it yet.

Then I came across this post on manually installing kernel extensions. However, I could not find the kext caches they referenced. Further Googling brought me to this post by someone trying to re-install a custom kernel, and I found they used the kextcache command. The kextcache command has a -system-cache option that updates the system cache, but we need to point it to the right system cache (in my case on MacHD) using the -u option. My final command was something like

kextcache -system-cache -u /Volumes/MacHD

and Voila!!! After the reboot, my Mac was back online again.

Hopefully this post is useful to someone. I think this might work for any kernel module that breaks under El Capitan; the one that comes to mind is the NetGear Genie modules.

Troubleshooting R BatchJobs Error is.list(control)

I was testing out the R package BatchJobs and ran into this error, which was a little hard to troubleshoot.

The output (if any) follows:

‘/source/R/3.2.2/lib64/R/bin/R --slave --no-restore --no-save --no-restore --file=/data/talkowski/Samples/16pMouse:
_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs/jobs/01/1.R –args /dev/stdout’

WARNING: ignoring environment value of R_HOME
Loading required package: BBmisc
Loading required package: methods
Loading registry: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/boot_wgcna/BatchJobs/regis:
Loading conf:
2016-03-24 16:09:17: Starting job on node cmu095.research.partners.org.
Auto-mailer settings: start=first+last, done=first+last, error=all.
Setting work dir: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rscripts
Error : is.list(control) is not TRUE
Error in doWithOneRestart(return(expr), restart) : bad error message
Calls: <Anonymous> -> sendMail -> warningf
Setting work back to: /data/talkowski/Samples/16pMouse_TissueAtlas/DataAnalysis/wgcnaAnalysis/Rscripts
Memory usage according to gc:
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 352464 18.9 592000 31.7 460000 24.6
Vcells 475908 3.7 1023718 7.9 786431 6.0
Execution halted

After loading the .RData files in the registry folder and tracing through the calls, I finally figured out that the error was mainly due to the max.retries option in the submitJobs function. I changed it from the default of 10 to 0. The error was resolved, and I could start submitting my jobs to LSF using R.

I am really grateful to the people who developed this package for their efforts. Given the fix I found, I would rather not take up their time by posting a bug on GitHub, as this might be a system-specific issue. If the developers have time and are inclined to post their comments on this someday, it would be great.

R sink() … closeAllConnections

This is more to remind me about the closeAllConnections() function in R, and I hope it’s also useful for someone out there. While using sink() in a function to write output to a file, if that function exits, R does not resume writing to the console, even if you call sink() again to close the diversion. I think this is because the connection is still open in the function’s environment. The closeAllConnections() tip that I found in this StackOverflow thread resolved it. See also the documentation in base R.
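One way to avoid the stuck sink in the first place is to register the closing sink() with on.exit(), so the diversion is undone however the function exits. Below is a minimal sketch, with a hypothetical function name and temporary files; closeAllConnections() remains the fallback when output is already stuck.

```r
# Hypothetical helper: write a summary of x to a file via sink().
write_summary <- function(path, x) {
  sink(path)
  # Undo the diversion when the function exits, even after an error.
  on.exit(sink(), add = TRUE)
  print(summary(x))
  invisible(path)
}

depth_before <- sink.number()   # number of active output diversions

out_file <- tempfile(fileext = ".txt")
write_summary(out_file, c(1, 2, 3, 4))
report <- readLines(out_file)

# An error raised while the sink is active still restores console output:
# the stop() is only evaluated inside the function, after sink(path) has run.
try(write_summary(tempfile(), stop("something went wrong")), silent = TRUE)

print(sink.number() == depth_before)   # TRUE
print(report)
```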