FASTA parser


5 May 2011

I find that one of the most common tasks in bioinformatics is reading sequence data, often from a FASTA file, and getting it into a usable format.

Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is on a line starting with '>' and that the sequence starts on the following line. In my function any underscores ('_') in the names are replaced by spaces, which isn't necessary, but is useful for the data I tend to use. A trailing '*' on a sequence line is also stripped.

The function returns a list of the names, so you can read the sequences in their original order if you wish, and a dictionary of the sequences with the names as keys (EDIT: Using an OrderedDict would be better).

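def FASTA(filename):
  # open() and print() are used so this runs on both Python 2 and 3
  # (the file() builtin and the print statement are Python 2 only).
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, could not be opened" % filename)
    return None

  order = []      # sequence names, in the order they appear in the file
  sequences = {}  # sequence name -> sequence string
  name = None     # no header line seen yet

  with f:         # make sure the file is closed when we are done
    for line in f:
      if line.startswith('>'):
        # Header line: drop the '>' and swap underscores for spaces.
        name = line[1:].rstrip('\n')
        name = name.replace('_', ' ')
        order.append(name)
        sequences[name] = ''
      elif name is not None:
        # Sequence line: strip the newline and any trailing '*'.
        # Lines before the first header are skipped.
        sequences[name] += line.rstrip('\n').rstrip('*')

  print("%d sequences found" % len(order))
  return order, sequences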
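
As a rough sketch of the OrderedDict idea mentioned in the EDIT note, the variant below keeps the names and sequences in a single object that remembers insertion order. The function name read_fasta_ordered, the made-up sample data, and the choice to let a missing file raise an exception rather than print a message are my own assumptions for illustration, not part of the original function. It is written for Python 3.

```python
import os
import tempfile
from collections import OrderedDict


def read_fasta_ordered(filename):
    """Parse a FASTA file into an OrderedDict of name -> sequence.

    Hypothetical variant of the FASTA() function above: it follows the
    same parsing rules but returns one ordered mapping instead of a
    list plus a dictionary.
    """
    sequences = OrderedDict()
    name = None  # no header line seen yet
    with open(filename) as handle:
        for line in handle:
            line = line.rstrip('\n')
            if line.startswith('>'):
                # Header line: drop '>' and replace underscores with spaces.
                name = line[1:].replace('_', ' ')
                sequences[name] = ''
            elif name is not None:
                # Sequence line: append it, dropping any trailing '*'.
                sequences[name] += line.rstrip('*')
    return sequences


# Demonstration on a small made-up file written to a temporary location.
sample = ">seq_one\nACGT\nTTGA\n>seq_two\nMKV*\n"
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    records = read_fasta_ordered(path)
finally:
    os.remove(path)  # clean up the temporary file

# Iterating over the OrderedDict yields records in file order.
for name, seq in records.items():
    print("%s: %s" % (name, seq))
# prints:
# seq one: ACGTTTGA
# seq two: MKV
```

With this shape there is no separate order list to keep in sync: looping over records.items() gives the sequences in the order they appeared in the file, and records['seq one'] still works as a normal dictionary lookup.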

Comments (3)

HNS on 23 Oct 2013, 7:19 p.m.

Hi Sir,

Can you explain this code in details?

Anonymous on 14 Aug 2015, 12:02 a.m.

Wow- thank you! I was most of the way through building something similar, but stuck. Your solution is clear and useful!

Sewunet Abera on 21 Feb 2017, 9:32 a.m.

Could you pls attach comments for the command lines (alternative commands, why those commands are needed and examples you think works best)