I find that one of the most common tasks in bioinformatics is reading sequence data from a file, often a FASTA file, and getting it into a usable format.

Below is a function that tries to open a given filename and read it as a FASTA file. It assumes that each sequence name is on a line starting with '>' and that the sequence starts on the following line. In my function any underscores ('_') in the names are replaced by spaces, which isn't necessary, but is useful for the data I tend to use.
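
To make the assumed format concrete, here is a small sketch of what such a file looks like and how a header line becomes a name. The sequence names and residues are made up for illustration, and the variable names are my own.

```python
# Made-up FASTA text: each record is a '>' header line followed by sequence lines
example = ">seq_one\nMKTAYIAK*\n>seq_two\nGATTACA\n"

names = []
for line in example.splitlines():
    if line.startswith('>'):
        # Drop the leading '>' and turn underscores into spaces
        name = line[1:].replace('_', ' ')
        names.append(name)
        print("name: %s" % name)
    else:
        # Sequence line: a trailing '*' is dropped
        print("sequence: %s" % line.rstrip('*'))
```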

The function returns a list of the names, in the order they appear in the file, and a dictionary of the sequences keyed by name, so you can read the sequences back in their original order if you so wish.

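def FASTA(filename):
  # Try to open the file; report the problem and give back empty results if it fails
  try:
    f = open(filename)
  except IOError:
    print("The file, %s, does not exist" % filename)
    return [], {}

  order = []      # sequence names, in the order they appear in the file
  sequences = {}  # maps each name to its sequence string
  name = None
  for line in f:
    if line.startswith('>'):
      # Header line: everything after the '>' is the name
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    elif name is not None:
      # Sequence line: add it to the current record, dropping any trailing '*'
      sequences[name] += line.rstrip('\n').rstrip('*')
  f.close()

  print("%d sequences found" % len(order))
  return order, sequences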
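
As a rough usage sketch, the snippet below writes a small made-up FASTA file to a temporary location, parses it, and walks through the sequences in their original order. So that it runs on its own, it includes a condensed stand-in for the function above under the hypothetical name `read_fasta`; the file contents and sequence names are invented for illustration.

```python
import os
import tempfile


def read_fasta(filename):
    # Condensed stand-in for the FASTA function above (hypothetical name)
    order = []
    sequences = {}
    name = None
    with open(filename) as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('>'):
                name = line[1:].replace('_', ' ')
                order.append(name)
                sequences[name] = ''
            elif name is not None:
                sequences[name] += line.rstrip('*')
    return order, sequences


# Write a tiny made-up FASTA file to a temporary location
handle, path = tempfile.mkstemp(suffix='.fasta')
with os.fdopen(handle, 'w') as out:
    out.write(">seq_one\nMKTAY\nIAK*\n>seq_two\nGATTACA\n")

order, sequences = read_fasta(path)
os.remove(path)

# The list preserves file order; the dictionary gives lookup by name
for name in order:
    print("%s: %s" % (name, sequences[name]))
```

The separate list is what makes the original ordering recoverable, since a plain dictionary is not guaranteed to keep insertion order in older Python versions.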


Hi Sir,

Can you explain this code in detail?

