Tips to create beautiful, publication-quality plots in Julia
I was about to send an e-mail to my students with a series of tips to produce good looking plots with Julia, and decided to post the tips here instead. I hope this is useful for more people, and please let me know of any other tips, nice examples, and possible corrections.
First example
I will describe how to produce the figure below [LINK]. It contains many details in its construction which are worth mentioning:
To start with, I am using Plots with the GR backend (the default option), loaded with
using Plots
I will also use the following packages:
using LaTeXStrings
using Statistics
using ColorSchemes
And I will use one function from an in-house package to build a density function from data (other options probably exist):
using M3GTools # from https://github.com/mcubeg/M3GTools
Initially, the layout of the plot is set using
plot(layout=(2,2))
meaning two rows and two columns. I start by defining a variable called sp (for subplot), which selects the subplot on which the following commands will operate:
sp=1
Subplot 1 contains data for a series of labels (1G6X to 1BXO) which are colored sequentially. This was done as follows. The list of labels is defined with
names = [ "1AMM", "1ARB", "1ATG", "1B0B", "1BXO", "1C52", "1C75", "1D06",
"1D4T", "1EW4", "1FK5", "1G67", "1G6X", "1G8A", "1GCI" ]
To plot the data associated with each label with a different color, I used:
for i in 1:length(names)
    c = get(ColorSchemes.rainbow,i./length(names))
    plot!(subplot=sp,x,y[i,:],linewidth=2,label=names[i],color=c)
end
(I am assuming that the x data is the same for all plots and is stored in a vector x of length ndata, and that the plotted data is in an array y of size (length(names), ndata).)
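As a small aside on the loop above, the sketch below only looks at the positions at which the color scheme is sampled. With i/length(names) the positions run from 1/n to 1, so position 0 of the scheme is never used. The evenly spaced alternative is an assumption of mine, not something from the original figure.

```julia
# Sketch: positions in [0, 1] at which a color scheme can be sampled for n series.
n = 15  # number of series, e.g. length(names)

# Positions used in the loop above: i/n, which runs from 1/n to 1
loop_positions = [i / n for i in 1:n]

# Assumed alternative: evenly spaced positions that include both ends of the scheme
full_positions = collect(range(0, stop=1, length=n))

println(first(loop_positions), " ", last(loop_positions))
println(first(full_positions), " ", last(full_positions))
```

Each position t can then be passed to get(ColorSchemes.rainbow, t) exactly as in the loop.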
One of the limitations of GR as a plotting backend is its handling of special characters. To define the axis labels, therefore, we use LaTeXStrings and, furthermore, change the font of the text so that it does not differ much from the standard font of the tick labels and legend:
plot!(xlabel=L"\textrm{\sffamily Contact Distance Threshold / \AA}",subplot=sp)
plot!(ylabel=L"\textrm{\sffamily Probability of~}n\leq n_{XL\cap DCA}",subplot=sp)
The interesting features of the second plot are the overlapping bars, and the custom labels on the x-axis and their angle.
The labels of the x-axis are defined in a vector (here, amino acid residue types):
restypes = [ "ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN",
"GLY", "HIS", "ILE", "LEU", "LYS", "MET", "PHE",
"PRO", "SER", "THR", "TRP", "TYR", "VAL" ]
Start with
sp=2
to change where the next commands will operate.
The plot contains two sets of data (red and blue), which we plot using bar!. First the red data, labeled DCAs. We use alpha=0.5 so that the red color becomes softer:
bar!(dca_data,alpha=0.5,label="DCAs",color="red",subplot=sp)
The second set of data, "XLs", will be blue and will overlap the red data. We also use this call to bar! to define the xticks with custom labels, and the rotation of the labels:
bar!(xl_data,alpha=0.5,xrotation=60,label="XLs",
xticks=(1:1:20,restypes),color="blue",subplot=sp)
Finally, we set the labels of the axes, also using LaTeX and changing fonts:
bar!(xlabel=L"\textrm{\sffamily Residue Type}",
ylabel=L"\textrm{\sffamily Count}",subplot=sp)
The peculiarity of the third plot (sp=3, bottom left) is that we have two data sets defined in different ranges, but we want to plot bars with the same width for both sets. This requires a "trick".
Initially, we tested different numbers of bins for one of the sets until we liked the result. We found that 40 bins looked good for the blue set:
histogram!(xl_data,bins=40,label="XLs",alpha=1.0,color="blue",subplot=sp)
Now we need to adjust the number of bins of the other set such that both have the same width. We find out the bin width by computing the range of the "XL" (blue) set above, and dividing it by 40:
xl_bin = ( maximum(xl_data) - minimum(xl_data) ) / 40
The number of bins of the other (DCA - red) set, will be, therefore, computed from the maximum and minimum values of this set and the bin width:
ndcabins = round(Int64,( maximum(dca_data) - minimum(dca_data) ) / xl_bin)
And this number of bins is used to plot the bars of the red set:
histogram!(dca_data,bins=ndcabins,label="DCAs",alpha=0.5,color="red",subplot=sp)
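The arithmetic behind this trick can be checked in isolation. Below is a minimal sketch using made-up data. The helper name matching_nbins and the example vectors are hypothetical and only serve to illustrate the calculation.

```julia
# Sketch of the bin-matching arithmetic; `matching_nbins` is a hypothetical helper name.
function matching_nbins(reference, other, nref)
    # Bin width of the reference set when split into nref bins
    width = (maximum(reference) - minimum(reference)) / nref
    # Number of bins of that same width needed to span the other set
    return round(Int, (maximum(other) - minimum(other)) / width)
end

# Made-up data: the reference spans 0..20, the other set spans 5..35
xl_example = [0.0, 4.0, 12.5, 20.0]
dca_example = [5.0, 9.0, 22.0, 35.0]

# Reference width is 20/40 = 0.5, so a range of 30 needs 60 bins
nbins = matching_nbins(xl_example, dca_example, 40)
println(nbins)
```

As far as I know, an integer passed to bins is treated by Plots as an approximate target, so if the widths must match exactly it may be safer to pass explicit bin edges (a range or vector) instead.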
In this plot we also plot some dots indicating the mean of each distribution, something that we did with:
m1 = mean(dca_data)
scatter!([m1,m1,m1,m1],[100,104,108,112],
label="",color="red",linewidth=3,linestyle=:dot,subplot=sp,markersize=3)
(the y-positions of the dots were set by hand). And, of course, we use LaTeX to set the axis labels again:
histogram!(xlabel=L"\textrm{\sffamily C}\alpha\textrm{\sffamily~Euclidean Distance} / \textrm{\sffamily~\AA}",subplot=sp)
histogram!(ylabel=L"\textrm{\sffamily Count}",subplot=sp)
The fourth plot (sp=4, bottom right) is similar to the third, but it contains a density function (instead of the bars) for one of the data sets ("All contacts", green). This density function was computed with our own function:
x, y = M3GTools.density(all_max_contact_surfdist,step=1.0,vmin=1.0)
and plotted with:
plot!(x,y,subplot=sp,label="All contacts",linewidth=2,color="green",alpha=0.8)
We also added the figure labels A, B, C, D. This was done with the annotate! function. The trick here is to add these annotations to the last plot, so that they stay above every other plot element:
fontsize=16
annotate!( -1.8-16.5, 500, text("A", :left, fontsize), subplot=4)
annotate!( -1.8, 500, text("B", :left, fontsize), subplot=4)
annotate!( -1.8-16.5, 200, text("C", :left, fontsize), subplot=4)
annotate!( -1.8, 200, text("D", :left, fontsize), subplot=4)
(the positions were set by hand, but they are quite easy to align because we need only two positions in x and two positions in y).
Last but not least, we save the figure in PDF format (saving it to PNG directly does not provide the same result, at least in my experience):
plot!(size=(750,750))
savefig("./all.pdf")
PDF is a vector graphics format, so the size does not define the resolution. Instead, size=(750,750) sets the overall size of the plot relative to the font sizes. Thus, this size is adjusted until the fonts look right for the final desired plot size in print.
If required (and I do that), I open this final plot in GIMP, converting it to a bitmap with 300dpi resolution, and save it to TIFF or PNG depending on what I want to do with the figure later.
Second example
A second example [LINK]. This example is interesting because we have added non-linear fits to scatter plots, and there are some tricks to get the same colors for specific sets of data in different plots and annotations.
The example figure is this one:
Here, we use the following packages:
using Plots
using DelimitedFiles
using LsqFit
using LaTeXStrings
We used the DelimitedFiles package to read the data, with
file = "./data/data.dat"
data = readdlm(file,comments=true,comment_char='#')
time = data[:,1] # time in the first column
hbonds = data[:,3] # data in the third column
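To see what these options do without needing the actual data file, here is a small sketch that reads the same kind of content from an in-memory buffer. The file contents and column meanings below are made up for illustration.

```julia
using DelimitedFiles

# Hypothetical file contents: '#' lines are comments, columns are whitespace-separated
text = """
# time  other  hbonds
0.0  1.0  3.0
0.5  2.0  4.0
1.0  3.0  5.0
"""

# Same keyword arguments as in the text, reading from an IOBuffer instead of a file
data = readdlm(IOBuffer(text), comments=true, comment_char='#')
times = data[:,1]   # first column
hbonds = data[:,3]  # third column
println(size(data))
```

The comment line is skipped, so only the three numeric rows end up in the matrix.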
The layout is the same as that of the first example, plot(layout=(2,2)), and I will focus only on the new features. Subplots 1 and 2 (the upper ones) are bar plots with error bars:
labels=["BCL as acceptor","BCL as donor"]
bar!(labels,y,label="",subplot=sp,color=[1,2],yerr=yerr,ylim=[0,ymax])
plot!(ylabel=L"\textrm{\sffamily Urea H-bonds}",subplot=sp)
plot!(title="Deprotonated",subplot=sp)
Note that ymax was adjusted so that, in this case, it is the same in both plots, for comparison. The error bars are added with yerr, and the labels of the x-axis come from the labels vector, defined before the plot.
We will perform exponential fits to some of our data to produce plots "C" and "D". We define the model here (it will be used by the LsqFit package):
# Exponential fit model
@. model(x,p) = exp(-x/p[1])
p0 = [ 0.5 ] # initial guess
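If a fixed initial guess turns out to be far from the solution, one possible way to obtain a starting value is a straight-line fit of log(y) against x. This is not part of the original workflow: the helper loglinear_tau and the synthetic data below are hypothetical, and the approach assumes all y values are positive.

```julia
# Sketch: estimate tau for y = exp(-x/tau) from a straight-line fit of log(y) vs x.
# Not part of the original workflow; `loglinear_tau` is a hypothetical helper.
function loglinear_tau(x, y)
    # log(y) = -x/tau is a line through the origin with slope -1/tau
    slope = sum(x .* log.(y)) / sum(x .^ 2)
    return -1 / slope
end

# Synthetic, noise-free data generated with tau = 2.0
xdata = collect(0.0:0.5:5.0)
ydata = exp.(-xdata ./ 2.0)

p0 = [loglinear_tau(xdata, ydata)]  # candidate initial guess for the nonlinear fit
println(p0)
```

With noisy data the estimate will only be approximate, which is fine for an initial guess.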
For each data set, the fit is performed with
fit = curve_fit(model,times,lifetime,p0)
(times and lifetime are the vectors containing the actual x and y data).
The final characteristic time is, in this case, the first element of the array returned by the coef function of LsqFit, given the fit result:
tau = coef(fit)[1]
Using the parameter from the fit, we can generate data to plot a line corresponding to the model. The trick here is to use the collect function to generate an x vector, and then the model already defined to obtain the y data given the parameters:
x = collect(0:0.01:10)
y = model(x,[tau])
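The evaluation step can be tried on its own, without running a fit. In the sketch below the value of tau is hypothetical (in practice it comes from coef(fit)[1]), and a plain range is used for x, which also works because the model broadcasts over its first argument.

```julia
# Same model as in the text
@. model(x, p) = exp(-x / p[1])

tau = 2.0        # hypothetical value; in practice it comes from coef(fit)[1]
x = 0:0.01:10    # a range works here, since the model broadcasts over x
y = model(x, [tau])

println(length(y))
```

Whether to use collect or a range is a matter of taste here, since both can be passed to plot!.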
The fit will be plotted as a line, accompanied by the scatter of the actual data:
idata=1
plot!(x,y,linewidth=2,subplot=sp,label="BCL as acceptor",color=idata)
scatter!(times,lifetime,label="",color=idata,subplot=sp)
Note the color definition idata=1. This guarantees that the two data sets are plotted with the same color. Now we want to write an annotation with that same color. This is tricky, and is done with:
color=get_color_palette(:auto, plot_color(:white), 5)[idata]
(I don't even understand the details of this command, but it works). It retrieves the color associated with the index idata in the current color palette. With this it is possible to write annotations with the desired colors, but again some tricks are required. We need to build the string using raw and LaTeXStrings, and use the text option of the annotate! function to change the color of the text:
note=raw"\large\textrm{\sffamily "*"$tau_avg"*raw"} \pm \textrm{\sffamily "*"$tau_std"*raw"}"
annotate!( 0.0, 0.04, text(latexstring(note), :left, 7, color=color),subplot=sp)
(the complication here with the raw strings is only because we want to use the LaTeX fonts and the \pm symbol in those annotations).
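The string construction can be inspected separately from the plot. The sketch below builds the same kind of note with string() instead of chained * operators. The numeric values are hypothetical and are rounded only to keep the annotation short.

```julia
# Hypothetical values; in the text they come from the fits
tau_avg = round(1.23456, digits=2)
tau_std = round(0.0512, digits=2)

# raw"..." keeps backslashes literal; string() joins the pieces and the numbers
note = string(raw"\large\textrm{\sffamily ", tau_avg,
              raw"} \pm \textrm{\sffamily ", tau_std, "}")
println(note)
```

The result can then be passed to latexstring(note) as in the annotate! call.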
Using LaTeX fonts and formatting for tick labels
One way to change the tick labels to whatever format one wants is to format them by converting them to strings. For example:
using Plots, Printf
x = rand(10); y = rand(10);
ticks = collect(0:0.2:1)
ticklabels = [ @sprintf("%5.1f",x) for x in ticks ]
plot(x,y)
plot!(xticks=(ticks,ticklabels))
A more sophisticated version of this procedure, using the Formatting package, allows one to use LaTeX fonts and scientific notation on the axis. Here is a function that converts a number to scientific notation using LaTeX, so that the result can be used as a tick label:
using LaTeXStrings
using Formatting
# First parameter: number, second parameter: number of decimal places
# optional font parameter: if anything else than "sf", will be default latex font
function latex_sci_not( x , ndec; font="sf" )
    xchar = strip(Formatting.sprintf1("%17.$(ndec)e",x))
    data = split(xchar,"e")
    inonzero = findfirst( i -> i != '0', data[2][2:length(data[2])])
    if font == "sf"
        f = "\\textrm{\\sffamily "
        fe = "\\textrm{\\sffamily\\scriptsize "
    else
        f = "{"
        fe = "{"
    end
    if inonzero === nothing
        string = latexstring("$f$(data[1])}")
    else
        if data[2][1] == '-'
            string = latexstring("$f$(data[1])}\\times $f 10}^{$fe$(data[2][1])$(data[2][inonzero+1:length(data[2])])}}")
        else
            string = latexstring("$f$(data[1])}\\times $f 10}^{$fe$(data[2][inonzero+1:length(data[2])])}}")
        end
    end
    return string
end
x = rand(10) ; y = rand(10) ;
ticks = collect(0:0.2:1)
ticklabels = [ latex_sci_not(x,2) for x in ticks ]
plot(x,y,xticks=(ticks,ticklabels))
plot!(size=(300,300))
savefig("teste.pdf")
The resulting plot is below, where the tick labels of the x-axis were converted to the LaTeX sans-serif font family and to scientific notation using the function above.
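The core of the function above is splitting a number into its mantissa and exponent. The sketch below shows that step alone, using only the Printf standard library. The helper name sci_parts is hypothetical, and it assumes Julia 1.6 or later, where Printf.Format and Printf.format are available.

```julia
using Printf

# Hypothetical helper: split x into a mantissa string and an integer exponent
function sci_parts(x, ndec)
    fmt = Printf.Format("%.$(ndec)e")   # e.g. "%.2e" for ndec = 2
    mantissa, exponent = split(Printf.format(fmt, x), "e")
    # parse(Int, ...) drops the sign padding and leading zeros of the exponent
    return String(mantissa), parse(Int, exponent)
end

for value in (0.2, 1.0, 12345.0)
    println(value, " -> ", sci_parts(value, 2))
end
```

Parsing the exponent as an integer replaces the manual search for the first non-zero character, and a zero exponent can be tested directly to decide whether the power of ten should be shown at all.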