FetchSeqs Practical

NOTE - THIS PAGE IS UNDER REVISION

PRACTICAL ASSIGNMENT – NCBI C Toolkit

 

Processing Arguments: Strings, Memory, Variables, Pointers, Casting and Linked Lists.

 

RESOURCES: NCBI C Toolkit Cross Reference

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/C_DOC/lxr/ident/

 

NCBI C toolkit Documentation (Old…)

http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML


In this practical session you will create an application that retrieves sequences by GI or Accession number from BLAST style databases starting with the readseq code.

The application will utilize a number of C functions, variable structures and memory constructs like Linked Lists.

 

Make sure you have downloaded and compiled the NCBI C toolkit.  Ensure you can compile and run the test application from the previous lecture notes. Ensure that you have defined the $NCBI environment variable.

Obtain a fresh copy of the readseq code from the course website attachments.
readseq.tar.gz

Previous Instructions:

Extract readseq.tar.gz to a separate directory

make -f make.readseq

Copy formatdb from $NCBI/build  to your directory

Format the pdbaa.faa database of PDB amino acid sequences with the command:

formatdb -t PDB -i pdbaa.faa -o T

 

Run the newly compiled readseq application and see what happens.

 
Comment out the “printf” statement in the Main() function at the bottom of readseq.c and try compiling and running it again.

 




PART A. Prepare a new library and program.
(45 minutes)

Task 1 - Create a linkable library out of the readeseqs.c code:

Make a separate library copy of the C code:

  • Copy the readseqs.c code to a new source file seqfast.c
  • Remove the Main function from the seqfast.c
  • Change the #include <devtools.h> to use the new seqfast.h

Create a matching header file:

  • Copy the devtools.h header file to a new source file seqfast.h
  • Change _DEVTOOLS_DBITER_ to _SEQFAST_
  • Update the comments in the *.h file to reflect your changes.

Make your functions “Library” portable for either static or dynamic linking.

  • Look up the LIBCALL macro on the NCBI C Toolkit Cross Reference Site – see how it is used in function calls in various library *.c and *.h files.
  • Insert the LIBCALL macro into the *.c and *.h files for each function call
  • note that some functions are declared static for internal libarary use and should not be exprorted from the library.

Question – how does the LIBCALL macro make your library portable?


Make a makefile for your new library

  • Copy make.readseq to make.seqfast
  • Edit make.seqfast so that its target is just the file seqfast.o
  • Compile seqfast.c with the makefile
  • Using the command line ar command – create a static library called libseqfast.a that contains seqfast.o
  • Edit make.seqfast so that it creates the libseqfast.a library automatically

 

Task 2 - Create a simple program streamseqs that uses the libseqfast.a library to echo all the sequences out of the database.

  • Copy the original readseqs.c file into a new c file streamseqs.c
  • Change the include directive (remove devtools.h) to include seqfast.h
  • Remove all the functions except Main()
  • Update the comments accordingly
  • Copy the original make.readseq to a file named make.streamseqs
  • Edit the makefile so that it makes the executable streamseqs from streamseqs.c and properly links with libseqfast.a
  • Compile and test streamseqs

 Question - Which program is larger - readseqs or streamseqs?

 


PART B Create a simple program fetchseqs that processes command-line parameters with GetArgs

 

  • Copy the working streamseqs.c file into a new c file fetchseqs.c
  • Update the comments accordingly
  • Copy the working make.streamseqs to a file named make.fetchseqs
  • Edit the makefile so that it makes the executable fetchseqs from fetchseqs.c
  • Compile and test fetchseqs – it should be exactly like streamseqs at this point.
  • Comment out the "printf" statement so it stops printing for now.

 

Look up “Getting Program Arguments” in the CoreLib part of the NCBI C Toolit Documentation web site.  Search for “GetArgs” and “Args” in the C Toolkit Cross Reference. Look at a number of instances of programs in the /demo directory that use arguments for examples.


Modify fetchseqs.c so that it handles arguments as follows:

  • Add the following Args structure before the Main() function.
  • This defines the command line arguments we will use:

#define NUMARGS 7

static Args myargs[NUMARGS] = {
/*0*/ { "Single GI number", "0", NULL, NULL, TRUE, 'g', ARG_INT, 0.0, 0, NULL},
/*1*/ { "Input File List of GI or Accessions, one per line", "stdin", NULL, NULL, FALSE, 'i', ARG_FILE_IN, 0.0, 0, NULL},
/*2*/ { "Output File", "stdout", NULL, NULL, FALSE, 'o', ARG_FILE_OUT, 0.0, 0, NULL},
/*3*/ { "Single Accession Code", "NULL", NULL, NULL, TRUE, 'a', ARG_STRING, 0.0, 0, NULL},
/*4*/ { "Quiet Mode (T/F)", "F", NULL, NULL, TRUE, 'q', ARG_BOOLEAN, 0.0, 0, NULL},
/*5*/ { "Database To Use", "nr", NULL, NULL, TRUE, 'd', ARG_DATA_IN, 0.0, 0, NULL},
/*6*/ { "Report ONLY \n\t(0) FASTA Files\n\t\(1) FASTA Definition Lines \n\t(2) Accession Codes \n\t(3) GI numbers\n\t", "0", NULL, NULL, TRUE, 'r', ARG_INT, 0.0, 0, NULL} };

 
 
 
  • in Main() add in the call to GetArgs
 if ( !GetArgs("FetchSeqs", NUMARGS, myargs) ) return 1;

  • Compile and test.

 FetchSeqs   arguments:

  -g  Single GI number [Integer]  Optional
    default = 0
  -i  Input File List of GI or Accessions, one per line [File In]
    default = stdin
  -o  Output File [File Out]
    default = stdout
  -a  Single Accession Code [String]  Optional
    default = NULL
  -q  Quiet Mode (T/F) [T/F]  Optional
    default = F
  -d  Database To Use [Data In]  Optional
    default = nr
  -r  Report ONLY
        (0) FASTA Files
        (1) FASTA Definition Lines
        (2) Accession Codes
        (3) GI numbers
         [Integer]  Optional
    default = 0

  • Using if and printf statements, write C code that prints out the value of any argument that is passed in.

FYI These are the variables within the Args structure (this is declared elsewhere - do not put into your program!)
typedef struct mainargs {
      const char *prompt;        /* prompt for field */
      const char *defaultvalue;      /* default */
      char *from;            /* range or datalink type */
      char *to;
      Nlm_Boolean   optional;  /* is the arg optional? */
      Nlm_Char      tag;      /* argument on command line */
      Nlm_Int1      type;     /* type of value */
      Nlm_FloatHi   floatvalue; /* result goes here float */
      Nlm_Int4      intvalue;  /* result goes here int */
      CharPtr       strvalue; /* result goes here string */
  } Nlm_Arg, * Nlm_ArgPtr; 

 
A working version of this is attached below as fetchseqs1.c
 
  • Compile and test your version of fetchseqs. 
  • Proceed when you have successfully built the command-line interface, and it recognizes and reports what parameters you pass it.
  • Make sure to test that every command line parameter is correctly handled.

PART C.  Change fetchseqs to perform the required command-line operations and fetch and output information from the database.

 

Task 1 – Add an output function that writes to a stream a sequence given a single GI number.


This is the output function:
Boolean WriteSequenceToStream(FILE *fpStream, ReadDBFILEPtr rdbfp, Int4 Gi)
{

        CharPtr pcSequence = NULL;
       
    if(fpStream==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WriteSequenceToStream: Passed fpStream was NULL.");
        return(FALSE);
    }
    if(rdbfp==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WriteSequenceToStream: Passed ReadDBFILEPtr was NULL.");
        return(FALSE);
    }
      
        pcSequence = GetSeqByGI(rdbfp, Gi);
/* can change this to break FASTA up into readable length lines according to Unix or MS-DOS standard formats.. */
/* for now this just prints out one long string without any breaks in it */
    fprintf(fpStream, "%s\n", pcSequence);
        MemFree(pcSequence);
    return(TRUE);
}

Your Main() must be altered from the previous version to do the following:

  • You need a file handle to the output file or stdout
  • Compare the input string to the string "stdout"
  • Use FileOpen to open the file name that is passed as an argument
    if (!StringCmp(myargs[arg_output_filename].strvalue,"stdout")) {
     /* this test returns 0=FALSE if they match */

            f = stdout;
        }
        else  {
            f = FileOpen(myargs[arg_output_filename].strvalue, "w");
            if (f == NULL) {
                printf("Unable to open output stream %s\n",
                         myargs[arg_output_filename].strvalue);   

                return 2;
  /* you will hit this error if you don't have permissions to write in an intended directory */

           }
        }


  • Fix the database default to "pdbaa.aa" in the Args structure.
  • Write code to open the database passed in as argument 

 
    rdbfp=OpenProteinFastaDB(myargs[arg_database].strvalue); 

        /* This library function reports mislabeled file errors by itself */
    if (rdbfp == NULL) return 3;

  • If the GI number passed in with -g is > zero, then call the WriteSequenceToStream function with the GI passed in and the output stream file handle. 

/* At this stage the program can retrieve ONE sequence by GI number */

        if (myargs[arg_gi_number].intvalue > 0)  { /* single GI value as argument - ignore other arguments */
            WriteSequenceToStream(f, rdbfp, myargs[arg_gi_number].intvalue);
            return 0;
        }

  • Use FileClose to close the file handle f, which ensures its contents are flushed out.

 

Attached below is the working example of the code to this point is fetchseqs2.c

Task 2 – Add a function that opens the input file (if specified) and reads GI numbers one line at a time, building a ValNode linked list.

 

  •  A function called ProcessInput returns the finished linked list.
  •  It uses the C function fgets which reads into a static character array pcBuf
  • ValNodeAddInt allocates memory for a new node, copys the GI number into it and adds it to the tail of the list that starts at pvnHead

ValNodePtr ProcessInputFile(FILE *fpStream)
{
   ValNodePtr pvnHead = NULL;
   Char pcBuf[100];
   CharPtr pcTest = NULL;
   static long iGI = 0;

   do {
       pcBuf[0] = '\0';
       pcTest = fgets(pcBuf, (size_t) 100, fpStream);
       if (pcTest != NULL) {  /* we assume that no input file error checking is needed - but may change later to recognize Accession Numbers */
          sscanf(pcBuf, "%ld", &iGI);
          ValNodeAddInt(&pvnHead, 0, (Int4) iGI); /* This adds a link onto the linked list */
       }
   }  while (pcTest != NULL);
   return pvnHead;
}


  • Before you can use this function you need to open the Input file with a file handle fin, as above.
  • After you call it, you need to close the Input file.
fin = FileOpen(myargs[arg_input_filename].strvalue, "r"); /* open input file for reading */
if (fin == NULL) {
     fprintf(f,"Unable to open input stream %s\n", myargs[arg_input_filename].strvalue);   
     goto bad_exit; 
    /* you will hit this error if you specify the wrong file name or a missing file */

}


pvnList = ProcessInputFile(fin); /* Parses out GIs, one per line, and makes a linked list */
FileClose(fin);

  • You can now use the linked list for the output
  • Keep the pointer to the head of the linked list for freeing it later.
        pvnHere = pvnList;
        while (pvnHere != NULL) {
           WriteSequenceToStream(f, rdbfp, (Int4) pvnHere->data.intvalue);
           pvnHere = pvnHere->next;
        }

  • Now free the linked list at the end of the program

ValNodeFree(pvnList); /* free the linked list if any... */ 

Attached below is the working example of the code to this point is fetchseqs3.c


Task 3. Create Functions that write Deflines or PDB codes to the output stream, then call them depending on the -r (Report) parameter value.

 
Boolean WriteDeflineToStream(FILE *fpStream, ReadDBFILEPtr rdbfp, Int4 Gi)
{

        CharPtr pcDefline = NULL;
       
    if(fpStream==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WriteDeflineToStream: Passed fpStream was NULL.");
        return(FALSE);
    }
    if(rdbfp==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WriteDeflineToStream: Passed ReadDBFILEPtr was NULL.");
        return(FALSE);
    }
      
        pcDefline = GetDeflineByGI(rdbfp, Gi);
    fprintf(fpStream, "%s\n", pcDefline);
        MemFree(pcDefline);
    return(TRUE);
}


Boolean WritePDBCodeToStream(FILE *fpStream, ReadDBFILEPtr rdbfp, Int4 Gi)
{

        CharPtr pcPDBCode= NULL;
       
    if(fpStream==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WritePDBCodeToStream: Passed fpStream was NULL.");
        return(FALSE);
    }
    if(rdbfp==NULL) {
        ErrPostEx(SEV_ERROR,0,0,"WritePDBCodeToStream: Passed ReadDBFILEPtr was NULL.");
        return(FALSE);
    }
      
        pcPDBCode = GetPDBCodeByGI(rdbfp, Gi);
    fprintf(fpStream, "%s\n", pcPDBCode);
        MemFree(pcPDBCode);
    return(TRUE);
}
 


This handles the case of the -r flag for the single GI passed in with -g:


if (myargs[arg_gi_number].intvalue > 0)  {
/* single GI value as argument - which kind of output? */

               if ((int) myargs[arg_report].intvalue == report_fasta)
                 WriteSequenceToStream(f, rdbfp, myargs[arg_gi_number].intvalue);
               else if ((int) myargs[arg_report].intvalue == report_defline)
                 WriteDeflineToStream(f, rdbfp, myargs[arg_gi_number].intvalue);
               else if ((int) myargs[arg_report].intvalue == report_acc)
                 WritePDBCodeToStream(f, rdbfp, myargs[arg_gi_number].intvalue);              
            goto done;
}


This handles the -r flag for the ValNode list of GIs pass in as a file:

pvnHere = pvnList;
while (pvnHere != NULL) {  /* go through the linked list - pick the required output */
           if ((int) myargs[arg_report].intvalue == report_fasta)
                 WriteSequenceToStream(f, rdbfp, (Int4) pvnHere->data.intvalue);
           else if ((int) myargs[arg_report].intvalue == report_defline)
                 WriteDeflineToStream(f, rdbfp, (Int4) pvnHere->data.intvalue);
           else if ((int) myargs[arg_report].intvalue == report_acc)
                 WritePDBCodeToStream(f, rdbfp, (Int4) pvnHere->data.intvalue);
           pvnHere = pvnHere->next;
}

Attached below is the working example of the code to this point is fetchseqs4.c .  Note that some of the arguments are still "vaporware".



PART D. SELF PRACTICE

There is a limit to how much you can learn from worked examples.  
Next you need to write some code all by yourself.  Here are some tasks you can do on your own to gain the necessary practice. You will
 add Accession number searching, line breaking and true FASTA file output to fetchseqs.

For this you will need to download an NCBI FASTA formatted file with accession numbers (not PDB codes).  A sample can be found here:

ftp://ftp2.blueprint.org/pub/chogue/EcoliBL21(DE3).faa

Use formatdb and test it out with your current version of fetchseqs.c

Task 1. - Make a new function in the libseqfast.a library that retrieves a sequence by Accession number using the internal NCBI toolkit readdb_acc2fasta call.

·         Look up readdb_acc2fasta in the NCBI C Toolkit Cross Reference

·         Add a new library function GetSeqByAcc that retrieves a sequence by Accession number into the libseqfast.a library.

·         The function will call readdb_acc2fasta.  Make the function by copying and modifying the GetSeqByGI functions in seqfast.c

·         Modify the error reporting appropriately if a NULL string is passed in or if a search result does not find a match.

·         Add a new function prototype for GetSeqByAcc in seqfast.h

·         Compile the library using the makefile you prepared earlier

 

 

Task 2. - Alter the Args and command line processing code in fetchseqs so it will work with either PDB codes or Accession numbers using your new library routine.  Clean up any unused arguments like "q".

 

Task 3. – Rebuild the input file loader so that it can work with either lists of GI or Accession number, or mixed lists.

  • Sort out whether each loaded line is an integer number (GI) or an accession number (string).  See if there are any non-numeral characters in the string, if so, assume it is an Accession number or PDB code.
  • Is there a decimal version code in the accession number?
  • If all characters are numbers – it maybe a GI?  Convert the string to an integer and see if it returns a non-NULL result? 
  • Add the integer or string value to a ValNode linked list, and flag the choice field with a number that represents the kind of identifier held by the ValNode.
  • How do you free a ValNode list that has allocated strings attached to it?
  • When you process the linked list, use the appropriate function call at each value to handle GI or Accession numbers
Task 4. – Change the Sequence Output code WriteSequenceToStream so that it will break lines like a real FASTA file.
 
Task 5. - Compose a new function WriteFASTAToStream that accurately reconstructs FASTA files from the databae.
 

 

ċ
fetchseqs1.c
(3k)
Christopher Hogue,
Nov 13, 2009, 3:59 AM
ċ
fetchseqs2.c
(5k)
Christopher Hogue,
Nov 13, 2009, 4:06 PM
ċ
fetchseqs3.c
(7k)
Christopher Hogue,
Nov 13, 2009, 4:06 PM
ċ
fetchseqs4.c
(9k)
Christopher Hogue,
Nov 13, 2009, 4:06 PM
ċ
make.fetchseqs
(4k)
Christopher Hogue,
Nov 13, 2009, 7:32 PM
ċ
make.seqfast
(4k)
Christopher Hogue,
Nov 13, 2009, 6:47 PM
ċ
make.streamseqs
(4k)
Christopher Hogue,
Nov 13, 2009, 7:07 PM
ċ
testGIs
(0k)
Christopher Hogue,
Nov 13, 2009, 6:42 PM
Comments