One of the most common tasks I run into in bioinformatics is reading sequence data, often from a FASTA file, and getting it into a usable format.
Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is on a line starting with '>' and that the sequence starts on the following line. In my function any underscores ('_') in the names are replaced by spaces, which isn't necessary but is useful for the data I tend to use. A trailing '*' on a sequence line is stripped as well.
The function returns a list of the sequence names, so you can read them in their original order if you wish, and a dictionary of the sequences keyed by name (EDIT: Using an OrderedDict would be better).
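def FASTA(filename):
    # Returns (order, sequences), or None if the file cannot be opened.
    try:
        f = open(filename)
    except IOError:
        print("The file, %s, does not exist" % filename)
        return
    order = []       # sequence names, in the order they appear in the file
    sequences = {}   # sequence name -> sequence string
    name = None      # no header line seen yet
    with f:          # make sure the file is closed when we are done
        for line in f:
            if line.startswith('>'):
                # Header line: drop the '>' and swap underscores for spaces
                name = line[1:].rstrip('\n')
                name = name.replace('_', ' ')
                order.append(name)
                sequences[name] = ''
            elif name is not None:
                # Sequence line: strip the newline and any trailing '*'
                sequences[name] += line.rstrip('\n').rstrip('*')
    print("%d sequences found" % len(order))
    return order, sequences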
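As a rough sketch of the OrderedDict idea mentioned in the EDIT above, the variant below keeps the names and sequences in one ordered mapping, so the separate list of names is no longer needed. The name read_fasta_ordered and the example data are made up for illustration and are not part of the original function. It takes any iterable of lines (an open file works too) and follows the same conventions as FASTA(): underscores in names become spaces and a trailing '*' is stripped.

```python
from collections import OrderedDict


def read_fasta_ordered(lines):
    """Parse FASTA-formatted lines into an OrderedDict of name -> sequence.

    Hypothetical helper for illustration. `lines` can be an open file
    or any iterable of strings.
    """
    sequences = OrderedDict()
    name = None  # no header line seen yet
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('>'):
            # Header line: drop the '>' and swap underscores for spaces
            name = line[1:].replace('_', ' ')
            sequences[name] = ''
        elif name is not None:
            # Sequence line: drop a trailing '*' if present
            sequences[name] += line.rstrip('*')
    return sequences


# Made-up example data, just to show the shape of the result
example = [
    ">seq_one\n",
    "ATGC\n",
    "GGTA*\n",
    ">seq_two\n",
    "TTAA\n",
]
records = read_fasta_ordered(example)
for name, seq in records.items():
    print("%s: %s" % (name, seq))
```

Iterating over records.items() gives the sequences in file order. On Python 3.7 and later a plain dict also preserves insertion order, so OrderedDict mostly matters for older versions.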
Comments (3)
HNS on 23 Oct 2013, 7:19 p.m.
Hi Sir,
Can you explain this code in details?
Anonymous on 14 Aug 2015, 12:02 a.m.
Wow- thank you! I was most of the way through building something similar, but stuck. Your solution is clear and useful!
Sewunet Abera on 21 Feb 2017, 9:32 a.m.
Could you pls attach comments for the command lines (alternative commands, why those commands are needed and examples you think works best)