Jeffrey Perkel / Jul 30 2019

Julia on Nextjournal

This computational notebook demonstrates a few features of the Julia language. It was written to accompany a Nature Toolbox article on Julia, published 30 July 2019 (Nature 572, 141-142 (2019)).

This is an executable document. To modify and run it, create an account on Nextjournal, then open this notebook and click the 'remix' button to create your own working copy. (A copy of the underlying Markdown document is available on GitHub.)

Syntactic sugar

Julia allows programmers to insert Greek characters (really, any Unicode symbol) into their code, making it easier to read. Some of these symbols are predefined, such as pi (3.141593...). We'll use that symbol to calculate the area and circumference of a circle. (Type \pi [TAB] to create the pi symbol.)

# We use semicolons to suppress the output of certain cells
r = 5; # radius of a circle
A = π * r^2;
C = 2 * π * r;
println("Area = ", round(A, digits = 2))
println("Circumference = ", round(C, digits = 2))

Greek characters can also represent functions. Here, we'll define two functions f(x) and g(x), and use the Greek sigma (\Sigma) to sum those functions over an interval.

function (f::Function,from::Integer,to::Integer)
  # note that Julia allows you to specify the 'type' of parameters passed to 
  # a function, which makes for more optimized (and error-checked) code. 
  # Here, the parameter 'f' is defined as a function; 'from' and 'to'
  # are integers.
    sum = 0
    for i = from:to
        sum += f(i)
    end
    return sum
end
∑ (generic function with 1 method)
f(x) = 2x + 4 # function `f` (note that no multiplication symbol is required between 2 and x.)
g(x) = x^2 + 4 # function `g`

sum = (f,1,10);
# note the '$sum' string formatting syntax
println("The sum of \'f(x) = 2x + 4\' over the interval [1,10] is: $sum") 
sum = (g,1,10);
println("The sum of \'g(x) = x^2 + 4\' over the interval [1,10] is: $sum")

If you're so inclined, you can even use emoji as variable and function names. How about a 'smiley cat'? (Type \:smiley_cat: [TAB])

😺= 7
for i in 1:😺
  println(i)
end

Distributed computing

Julia also supports distributed computing -- the distribution of computational work across multiple processor cores, whether locally or remotely. The following example was generously provided by Simon Byrne, a software engineer at Caltech and member of the Climate Modeling Alliance team.

Consider a simple Monte Carlo estimation of (see here for further details). If we generate uniform random variables , then

P(x2+y2<=1)=π4P(|x^2 + y^2| <= 1) = \frac{\pi}{4}

Then we can estimate by sampling and scaling the proportion by 4.

Firstly we consider a simple serial implementation, where we accumulate into a variable c:

function mc_pi(n)
    c = 0
    for i = 1:n
        x = rand()
        y = rand()
        c += (x^2 + y^2 <= 1)
    end
    c/n*4
end
mc_pi (generic function with 1 method)

Julia code is compiled 'just-in-time'. The first time it runs, it is relatively slow. So, we run it twice, and time the output. We run the calculation a billion times.

@time mc_pi(10^9) 
3.141491676

This second run takes advantage of precompiled code, and thus should be slightly faster

@time mc_pi(10^9) 
3.141545028

To do this in parallel, first we load the Distributed package (which comes with Julia), and add some processes

using Distributed
localprocs = addprocs(4); # start 4 workers on my local computer

We need to slightly rewrite the code to be amenable to parallelism: in this case, we use the @distributed macro, specifying a "reducer" function (in this case, +), which will combine the outputs of all the loops.

function mc_pi_parallel(n)
    c = @distributed (+) for i = 1:n
        x = rand()
        y = rand()
        x^2 + y^2 <= 1
    end
    c/n*4
end
mc_pi_parallel (generic function with 1 method)

Again, the first time we run this, the code is slowed due to JIT compilation overhead

@time mc_pi_parallel(10^9) 
3.141642008

If we rerun the code, we should see a small speedup -- my local machine only has 2 physical cores. Still, the code is about 43% faster than the non-parallelized implementation above (2.4 sec vs 4.2 sec). And it's also faster than the first run of mc_pi_parallel(), which had to be compiled.

@time mc_pi_parallel(10^9) # only a small speedup
3.14159282
rmprocs(localprocs...) # release the allocated workers...
Task (done) @0x00007fcebb9b1fc0

Graphing

Next, we'll graph the Fibonacci sequence, where each value equals the sum of the two previous values: Thus, the sequence is: 0, 1, 1, 2, 3, 5, 8, ...

function fibonacci(i = 25) 
    # compute the Fibonacci sequence. By default, return the first 25 values.
    @assert(i >= 2) # basic input check
    f1 = 0
    f2 = 1
    result = [f1, f2]
    for i in 1:(i-2)
        f3 = f1 + f2
        push!(result, f3) # add the new value to the array
        f1 = f2
        f2 = f3
    end
    return result
end
fibonacci (generic function with 2 methods)
myarray = fibonacci();

We use Julia's package manager to load the libraries needed to plot our results.

using Pkg
Pkg.add("Plots")
using Plots
Pkg.add("PyPlot")
pyplot()
Plots.PyPlotBackend()
plot(1:25,myarray,label="",color="blue", markershape = :circle, markercolor = :red, 
    xlabel = "Sequence no.", ylabel = "Fibonacci no.", title = "The first 25 Fibonacci numbers") 

Sequence analysis

Julia can also handle DNA sequence analysis, using the BioJulia packages.

Pkg.add("BioSequences")
using BioSequences

First, we'll read in a sequence file -- a circular DNA, called a plasmid, from the bacterium, Yersinia pestis. sequence.fasta is a FASTA-formatted file downloaded from https://www.ncbi.nlm.nih.gov/nuccore/NC_005816.1?report=fasta&log$=seqview.

sequence.fasta
reader = FASTA.Reader(open(
sequence.fasta
,"r")) my_dna = read(reader) close(reader)

Convert the file into a BioJulia DNASequence object.

my_dna = DNASequence(sequence(my_dna))
9609nt DNA Sequence: TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCT…TAATATTGACGCATGAGGGAATGCGTACCCCGACCCCTG

From https://github.com/jperkel/example_notebook/blob/master/My_sample_notebook.ipynb, we know that there is a 'hypothetical protein' located at 3485:3857 in this plasmid. But those coordinates are from Python, which is zero-indexed (that is, the first element of an array is 0, not 1). For Julia, which is not (the first element is 1), the location would be 3486:3857. Here, we extract that gene sequence, convert it to RNA, and translate it.

my_gene=my_dna[3486:3857]
my_rna=RNASequence(my_gene)
372nt RNA Sequence: GUGAGCAAAAAACGAAGACCCCAGAAAAGGCCGCGCCGG…CCUUAUCCGGUAACUAUCGUCUUGAGUCCAACCCGGUAA
my_ptn = BioSequences.translate(my_rna)
124aa Amino Acid Sequence: VSKKRRPQKRPRRRRFFHRLRPPDEHHKNRRSSQRWRNP…GISVRCRSFAPSWAVCTNPPFSPTTAPYPVTIVLSPTR*

Now let's see what info the NCBI Entrez database has on this protein. From https://github.com/jperkel/example_notebook/blob/master/My_sample_notebook.ipynb we learned that the accession number for this sequence is NP_995570.1.

Pkg.add("BioServices")
using BioServices.EUtils
Pkg.add("EzXML")
using EzXML
res = efetch(db="protein",id="NP_995570.1",rettype="gb",retmode="xml");

We'll need to parse the XML output. What gene is this?

doc = parsexml(res.body)
seq = findfirst("/GBSet/GBSeq",doc)
nodecontent(findfirst("GBSeq_definition", seq))
"hypothetical protein YP_pPCP04 (plasmid) [Yersinia pestis biovar Microtus str. 91001]"

What is its sequence?

entrez_ptn = AminoAcidSequence(nodecontent(findfirst("GBSeq_sequence",seq)))
123aa Amino Acid Sequence: MSKKRRPQKRPRRRRFFHRLRPPDEHHKNRRSSQRWRNP…VGISVRCRSFAPSWAVCTNPPFSPTTAPYPVTIVLSPTR

Let's compare the Entrez sequence with our FASTA sequence using a pairwise alignment...

Pkg.add("BioAlignments")
using BioAlignments
costmodel = CostModel(match=0, mismatch=1, insertion=1, deletion=1);
pairalign(EditDistance(), my_ptn, entrez_ptn, costmodel)
PairwiseAlignmentResult{Int64,BioSequence{AminoAcidAlphabet},BioSequence{AminoAcidAlphabet}}: distance: 2 seq: 1 VSKKRRPQKRPRRRRFFHRLRPPDEHHKNRRSSQRWRNPTGLKDTRRFPPEAPSCALLFR 60 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ref: 1 MSKKRRPQKRPRRRRFFHRLRPPDEHHKNRRSSQRWRNPTGLKDTRRFPPEAPSCALLFR 60 seq: 61 PCRLPDTSPPFSLREAWRFLIAHAVGISVRCRSFAPSWAVCTNPPFSPTTAPYPVTIVLS 120 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ref: 61 PCRLPDTSPPFSLREAWRFLIAHAVGISVRCRSFAPSWAVCTNPPFSPTTAPYPVTIVLS 120 seq: 121 PTR* 124 ||| ref: 121 PTR- 123

It's a match!

Calling Python

Finally, we'll demonstrate Julia's ability to utilize code written in another language, using the PyCall package to call the Python folium mapping library (see our June 2018 mapping feature).

Pkg.add("PyCall")
using PyCall

If you don't have folium installed, you'll need to do so. You can do that here using the Julia Conda package.

Pkg.add("Conda")
using Conda
Conda.add_channel("conda-forge")
Conda.add("folium")

Now we create a simple map: a few points in London, Oxford and Cambridge, overlaid on either a street map, or on a map of geological data provided by the Macrostrat Project. Note that the code below is written in Python. By bracketing the code with the py macro (py"""{code}""" or py"{code}"), Julia knows to pass that code to PyCall directly.

The resulting map is interactive: you can zoom, pan, click the points of interest, and alternate between the two layers (by clicking on the tiles icon in the upper-right corner of the map). If you click anywhere on the map, a popup will appear showing the latitude and longitude of that position. [NOTE: IF THE MAP RENDERS WITH NO LAYER SHOWING, CLICK THE TILES ICON IN THE UPPER-RIGHT CORNER TO ACTIVATE THEM]

0.6s
py"""
import folium

def draw_map():
    coords = { 
        0: { "name": "Nature", "lat": 51.533925, "long": -0.121553 },
        1: { "name": "Francis Crick Institite", "lat": 51.531877, "long": -0.128767 },
        2: { "name": "MRC Laboratory for Molecular Cell Biology", "lat": 51.524435, "long": -0.132495 },
        3: { "name": "Kings College London", "lat": 51.511573, "long": -0.116083 },
        4: { "name": "Imperial College London", "lat": 51.498780, "long": -0.174888 },
        5: { "name": "Cambridge University", "lat": 52.206960, "long": 0.115034 },
        6: { "name": "Oxford University", "lat": 51.754843, "long": -1.254302 },
        7: { "name": "Platform 9 3/4", "lat": 51.532349, "long": -0.123806 }
    }
    
    m = folium.Map(location = [51.8561, -0.2966], tiles = 'CartoDB positron', zoom_start = 9)
		
    # add the points of interest...
    for key in coords.keys():
        folium.CircleMarker(
        location=[coords[key]['lat'], coords[key]['long']],
        popup=coords[key]['name'],
        color=('crimson' if coords[key]['name'] == 'Nature' else 'blue'),
        fill=False,
        ).add_to(m)
    
    # pull in the Macrostrat tile layer
    folium.TileLayer(tiles='https://tiles.macrostrat.org/carto/{z}/{x}/{y}.png', 
                 attr='Macrostrat', name='Macrostrat').add_to(m)
    folium.LayerControl().add_to(m) # allow user to switch between map layers
    folium.LatLngPopup().add_to(m) # provide a lat/long popup

    return m
"""

# draw the map
py"draw_map()"

Computational reproducibility

Document session for computational reproducibility.

versioninfo()
Pkg.installed()
Dict{String,Union{Nothing, VersionNumber}} with 44 entries: "Interact" => v"0.10.2" "HTTP" => v"0.8.0" "PlotlyBase" => v"0.2.5" "BioServices" => v"0.3.2" "Netpbm" => v"0.3.1" "BioSequences" => v"1.1.0" "EzXML" => v"0.9.3" "ImageMagick" => v"0.7.3" "RecipesBase" => v"0.6.0" "DataFrames" => v"0.18.1" "PlotlyJS" => v"0.12.3" "GR" => v"0.39.1" "BioAlignments" => v"1.0.0" "CSSUtil" => v"0.1.0" "FixedPointNumbers" => v"0.5.3" "NRRD" => v"0.5.1" "MAT" => v"0.5.0" "HDF5" => v"0.11.1" "Makie" => v"0.9.3" "AbstractPlotting" => v"0.9.6" "FileIO" => v"1.0.6" "ImageCore" => v"0.7.4" "GLMakie" => v"0.0.5" "CSV" => v"0.4.3" "CSVFiles" => v"0.15.0" ⋮ => ⋮