using LinearAlgebra
using Distributions
using Random
using StatsBase
B=zeros(nrep,nanim) # matrix of 0's
Z=Array{Float64}(undef,nanim,nsnp) # uninitialized matrix
In R or Fortran we are used to doing element-wise operations on a matrix, like multiplication by a scalar (\(a \times \mathbf{B}\)), raising the elements of a matrix to the square (\(\mathbf{B} \odot \mathbf{B}\)), or taking the absolute value of a matrix (\(abs(\mathbf{B})\)), as:
a*B
B^2    # B**2 in Fortran
abs(B)
sum2pq=2*sum(freq*(1-freq))
A*B
In Julia this doesn't work (references here and here). We need to do:
a .* B
B .^ 2
abs.(B)
sum2pq=2*sum(freq .* (1 .- freq))
A .* B
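As an aside (my addition, not in the original notes): when an expression has many dots, the @. macro adds them everywhere for you. A minimal sketch:
a = 2.0
B = rand(3,3)
C = similar(B)
C .= a .* B .+ B .^ 2   # explicit dots
@. C = a*B + B^2        # @. broadcasts every operation in the expression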
The syntax requires putting ":" to indicate an array slice (this is optional in Fortran, and the opposite of R, where the index is simply omitted):
for i in 1:nsnp
# sum paternal and maternal, skip id
pos=1+(i-1)*2+1
Z[:,i]=geno[:,pos];
Z[:,i]=Z[:,i]+geno[:,pos+1];
end
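A related detail about slices (my addition, not from the original notes): indexing with ":" makes a copy, while @view gives a reference into the original array. A minimal sketch:
M = rand(5,4)
col  = M[:,2]        # a slice copies the data
vcol = @view M[:,2]  # a view references M without copying
vcol[1] = 0.0        # modifies M; col is unaffected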
This dot syntax can be pretty dangerous. For instance, this is correct:
LHS=zeros(5,5)
LHS[1,2:5] .= 1
#or
LHS .= 1
but this is not correct:
LHS=zeros(5,5)
LHS[1,2:5]=1 #error
LHS=1 # changes our matrix to a scalar
Matrix algebra works naturally, for instance to get \(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\):
(X'*inv(V)*X)
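A side note of mine (not from the original): for a big \(\mathbf{V}\) it is usually cheaper and more stable to solve with \ or a Cholesky factorization than to form inv(V) explicitly; a sketch:
using LinearAlgebra
n, p = 100, 3
X = randn(n,p)
A = randn(n,n)
V = A*A' + n*I               # some positive definite matrix
F = cholesky(Symmetric(V))   # factorize once
XtViX = X' * (F \ X)         # same as X'*inv(V)*X, without the explicit inverse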
There are dedicated functions in the package DelimitedFiles:
using DelimitedFiles
# reading (there are many possibilities)
@time Y=readdlm("kk")
I have found that on the cluster it is good to indicate the dimensions (probably related to I/O latency problems), as follows:
f=open("kk","r")
nanim=countlines(f) # careful: this leaves the file position at the end
close(f)
f=open("kk","r")
nsnp=size(split(readline(f)),1)
close(f)
@time Y=readdlm("kk",dims=(nanim,nsnp))
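If the file is purely numeric, it may also help to give the element type, so that readdlm returns a Float64 matrix instead of a matrix of Any (my addition; here I assume the separator is a space):
@time Y=readdlm("kk", ' ', Float64, dims=(nanim,nsnp))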
To read a vector (a matrix of one column):
f=open("kk","r")
z=vec(readdlm(f))
close(f)
More compactly:
z=vec(readdlm("kk"))
For writing, again, there are many options. I like using spaces (not tabs) as separators, and thus:
f=open("kk","w")
writedlm(f,z," ") # the 3rd value is the separator
close(f)
More compactly:
writedlm("kk",z," ")
In the package Statistics there are a few elementary things such as correlations and variances, but other things have moved to different packages.
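For instance, a minimal sketch of those elementary functions (my example):
using Statistics
x = randn(1000)
y = 2 .* x .+ randn(1000)
mean(x), var(x), std(x)   # basic descriptive statistics
cor(x,y)                  # correlation
cov(x,y)                  # covariance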
To get the correlation and a basic linear regression between two variables x and y, I found the package GLM here. It seems that GLM will admit only incidence matrices, not vectors, so for a linear regression \(y=a+bx\) I need to translate \(x\) to a matrix (2D array). I found the trick here:
using GLM
using Statistics # for cor()
x=randn(10000)
y= x .+ randn(10000)
cor(x,y)
X=reshape(x,:,1)
lm(X,y)
# add general mean
X=hcat(ones(10000,1),x)
lm(X,y)
Using @formula is more R-like.
I found this here. The Julia documentation for statistical analysis is confusing, and very quickly most of it turns into Data Science (plotting data) or ML (Machine Learning), but a good honest multiple regression is hard to find. It seems that using @formula requires DataFrames.
using DataFrames
using GLM
x = randn(100)
y = 0.9 .* x + 0.5 * rand(100)
df = DataFrame(x=x, y=y)
ols = lm(@formula(y ~ x), df) # R-style notation
The coefficients of ols are extracted using coef():
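For instance (a sketch; stderror and r2 are, as far as I know, the accessor names in GLM/StatsBase):
coef(ols)       # intercept and slope
stderror(ols)   # standard errors of the coefficients
r2(ols)         # coefficient of determination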
In the same notes there is an example of cross-classified effects.
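Not the example from those notes, but a minimal sketch of how a cross-classified effect can be fitted with @formula, assuming a column herd coding the class:
using DataFrames
using GLM
herd = repeat(["h1","h2","h3"], inner=30)           # a cross-classified factor
herd_effect = Dict("h1"=>0.0, "h2"=>1.0, "h3"=>2.0) # true effects for simulation
x = randn(90)
y = 0.9 .* x .+ [herd_effect[h] for h in herd] .+ 0.5 .* randn(90)
df = DataFrame(y=y, x=x, herd=herd)
fit = lm(@formula(y ~ x + herd), df)  # herd is expanded into dummy variables
coef(fit)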