Linux board - python code performance --- normal or too slow? (reposted)
i***r
Posts: 1035
1
[The following text is reposted from the Programming board]
From: iiiir (哎呀我最牛), Board: Programming
Subject: python code performance --- normal or too slow?
Posted at: BBS 未名空间站 (Tue Jan 7 11:21:52 2014, US Eastern)
The file is 2.5 GB with 18,217,166 lines.
My Python script took about 20-30 minutes to finish.
Does that seem slow?
Thanks!!
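For scale, those figures work out to roughly 10,000 to 15,000 lines per second. A quick back-of-the-envelope sketch, assuming the 35 sample columns counted in the header shown in the post apply to every line (the function name `lines_per_second` is only for illustration):

```python
def lines_per_second(total_lines, minutes):
    """Average throughput for a run that took `minutes` minutes."""
    return total_lines / (minutes * 60)


total_lines = 18217166   # line count reported in the post
samples = 35             # sample columns counted in the header (assumption)

for minutes in (20, 30):
    rate = lines_per_second(total_lines, minutes)
    print('{} min: {:,.0f} lines/s, about {:,.0f} genotypes/s'.format(
        minutes, rate, rate * samples))
```

So each genotype costs on the order of a few microseconds, which is the number to compare against when trying alternatives.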
Input file data structure (showing the first two lines):
chromo pos ref alt dc1 dc2 dc3 dtm bas din crw itb ptw spw isw irw inw ru1 ru2 ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2 ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa jac
chr01 242806 G T 0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 . 0/0 0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0
My Python code does two things:
1. parse the header and produce the first file
2. parse the body and translate 0s and 1s to ATGC etc. to produce the second file
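To make the two steps concrete, here is a small sketch run on a hypothetical three-sample input. The sample names s1 to s3 and the data row are made up for illustration and are not from the real file, and the helper names are not the ones used in the script. It assumes tab-separated rows for the first file, space-separated rows for the second, and `0 0` for a missing genotype.

```python
def translate(ref, alt, genotype):
    """Map '.', '0/0', '0/1', ... to a pair of bases ('0 0' when missing)."""
    if genotype == '.':
        return '0 0'
    first, second = genotype.split('/')
    bases = {'0': ref, '1': alt}
    return '{} {}'.format(bases[first], bases[second])


def tfam_rows(header):
    """One row per sample column (everything after the four fixed columns)."""
    return ['{}\t{}\t0\t0\t0\t0'.format(i, name)
            for i, name in enumerate(header.split()[4:])]


def tped_row(line, snp_index):
    """Translate one data row; the chromosome keeps only its number."""
    chrom, pos, ref, alt, *genos = line.split()
    calls = ' '.join(translate(ref, alt, g) for g in genos)
    return '{} snp{} 0 {} {}'.format(int(chrom[3:]), snp_index, pos, calls)


header = 'chromo pos ref alt s1 s2 s3'   # hypothetical sample names
row = 'chr01 242806 G T 0/0 . 0/1'       # hypothetical genotypes

print(tfam_rows(header))
print(tped_row(row, 1))   # 1 snp1 0 242806 G G 0 0 G T
```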
import sys

def geno_to_base(ref, alt, genotype):
    # genotype is a 3-character string such as '0/1'
    assert len(genotype) == 3, "genotype not in 0/1 format"
    # Compare against '1' explicitly: the characters '0' and '1' are both
    # non-empty strings, so a bare truth test would always pick alt.
    allele1 = alt if genotype[0] == '1' else ref
    allele2 = alt if genotype[-1] == '1' else ref
    return '{} {} '.format(allele1, allele2)

def translate_geno(ref, alt, genotype):
    '''genotype needs to be either . or 0/0 format'''
    return '0 0 ' if genotype == '.' else geno_to_base(ref, alt, genotype)

def line_parse(line):
    # first four columns are fixed; the rest are per-sample genotypes
    chrs, pos, ref, alt, *geno = line.split()
    all_genotype = [translate_geno(ref, alt, g) for g in geno]
    return chrs, pos, ''.join(all_genotype)

if __name__ == "__main__":
    fn = sys.argv[1] # required
    fin = open(fn)
    tfam = open('out.tfam', 'w')
    tped = open('out.tped', 'w')
    # write tfam
    header = next(fin)
    for i, h in enumerate(header.split()[4:]):
        tfam.write('{}\t{}\t0\t0\t0\t0\n'.format(i, h))
    # write tped
    morgan = 0
    for i, l in enumerate(fin):
        rs_id = 'snp{}'.format(i+1)
        chrs, pos, all_geno = line_parse(l)
        chrs = int(chrs[3:]) # only need the number
        tped.write('{} {} {} {} {}\n'.format(chrs, rs_id, morgan, pos, all_geno))
    fin.close()
    tfam.close()
    tped.close()
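On the speed question, one plausible source of overhead is that every genotype goes through two function calls, an assert, and a string format. A possible variation, offered as a sketch whose speed has not been measured here, builds a small lookup table once per line and joins the results. The helper names `make_table` and `convert_line` are hypothetical, and the sketch assumes genotypes are written only as 0/0, 0/1, 1/0, 1/1 or `.`; anything else raises a KeyError.

```python
def make_table(ref, alt):
    """Precompute the output text for every genotype expected on one line."""
    return {
        '0/0': '{} {}'.format(ref, ref),
        '0/1': '{} {}'.format(ref, alt),
        '1/0': '{} {}'.format(alt, ref),
        '1/1': '{} {}'.format(alt, alt),
        '.': '0 0',   # missing call
    }


def convert_line(line, snp_index, morgan=0):
    """Turn one input row into one .tped row (without the trailing newline)."""
    chrom, pos, ref, alt, *genos = line.split()
    table = make_table(ref, alt)
    # One dict lookup per genotype instead of function calls plus format().
    calls = ' '.join([table[g] for g in genos])
    return '{} snp{} {} {} {}'.format(
        int(chrom[3:]), snp_index, morgan, pos, calls)


# Hypothetical four-sample row, not taken from the real file.
print(convert_line('chr01 242806 G T 0/0 . 0/1 1/1', 1))
```

Timing both versions on the first million lines or so would show whether the difference matters for the full file.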