GCG Workshop, October 2003

GCG User Manual


Emphasis on SeqLab Interface


Author: Thomas Girke (thomas.girke@ucr.edu)

Bioinformatics Core, UCR


Contents


    A. Introduction

    B. Starting SeqLab and Command-Line GCG

    C. Using the Working Directory

    D. Working with the Main SeqLab Window

    E. Editing and Annotating Sequences

    F. Importing and Exporting of Sequences

    G. Trace Files, Assembly and Mapping

    H. Sequence Features

    I. Print and Export Graphics

    J. Overview of GCG Programs

        1. Database Searches: BLAST, NetBLAST, PsiBLAST, HMMER, etc.

        2. Alignments: Pileup

        3. Pattern Finding: MEME, Motifs, FindPattern

        4. Protein Analysis

        5. Useful Tools: FrameSearch, FrameAlign, etc.

    K. Large-Scale Sequence Analysis: BLAST Example

    L. List Files

    M. Building a Search Set

    N. Creating Personal Sequence Databases with DataSet

    O. Creating BLASTable Sequence Databases

    P. Exercises



A. Introduction


The Wisconsin Package GCG is a software suite that contains over 130 sequence analysis tools. It was developed by the Genetics Computer Group in Madison and is now maintained and distributed by Accelrys. It can be accessed remotely from any networked computer. There are three different interfaces for accessing GCG:

All three applications are installed on the UNIX server cache.ucr.edu where they share the same sequence databases. The instructions for setting up an account can be found on our GCG page. This workshop will focus on SeqLab, since it is the most powerful and versatile GCG interface.


To run SeqLab from a PC, you need to configure X-win32 (for Mac OS X: X11, configuration) and PuTTY according to the configuration page. For transfering files between your local machine and the GCG server, I recommend using WinSCP (for Mac OS X: Fugu). More help on configuration issues can be found on our GCG page.



B. Starting SeqLab and Command-Line GCG


SeqLab:


Start GCG command-line (can be in addition to SeqLab):


Help: In SeqLab you find help documents by clicking on the Help menu in the window of the different applications. On the command-line you can open these help files with the command "genhelp" or "genmanual". To retrieve help for specific programs, just type its name ofter these commands. Additional information can be found in the (Online GCG Manual, usr: genhelp, pwd: version102). General help on UNIX can be found on the same page under User's Guide.


C. Using the Working Directory

The Working Directory window is one of the most important components of SeqLab. In this window you specify the directory where SeqLab writes output files. Remember, in GCG you generally create in each session many output files. Not using this feature can create a big mess in your account.

To access this function, go to:



New directories can be created by typing in their name in the Selection text box and then clicking OK. A convenient tool to create and manage new directories is WinSCP.

SeqWeb users can copy their files on the command-line from /usr/local/seqweb/2.0.2/seqweb/html/user/your_account_name/work/ into their home directory.


D. Working with the Main SeqLab Window

Main List

The Main List window is SeqLab's project managing tool that allows you to organize data on a project-by-project basis. Here and in the Editor (s. below) you select the sequences you want to analyze with the different tools which are available in GCG. You can switch between the Editor and Main List in the Mode menu (3).




Editor




E. Editing and Annotating Sequences


Transformations

The common sequence editing and search functions can be found in the Edit menu:

            • select the sequence files or sequence areas you want

            • go into the Edit menu and select Reverse, Find, Translate, etc.

A nice feature of SeqLab is that it allows you to perform these operations on many sequences at once instead of doing it on a one-by-one basis as it is the case in most other sequence editors.

Annotations

To add annotations to a sequence or alignment you can do this within the sequences (see Sequence Features) or in a separate comment line. To add a comment line, you select in the Editor window File -> New Sequence -> Text. A new line appears which can be moved under the sequence of your choice by using the copy and paste buttons. Switch to Insert mode and add your comments. All changes can be saved in RSF format.




Note: To create and edit sequences from the command-line, you can use SeqEd which is an additional interactive sequence editor in GCG.


F. Importing and Exporting Sequences

Import

There are three major ways to import sequences into GCG:

  1. One-by-one import:

  1. Import of MSF alignments (FASTA formatted alignments can be imported via batch import):

  1. Batch import (imports single sequences as well):


Export

There are two possibilities to export sequences from GCG:


  1. Sequences and alignments that were modified in the Editor can be exported into MSF or GenBank format by selecting them in the Editor and choosing File -> Export -> <select format>.

  2. To export sequences to FastA and Staden format, you select the sequences in the Main List and choose Functions -> Importing/Exporting -> <select format>. When you select FastA as output format you have the choice (under Options) to export each sequence to a separate file or into one FastA batch file. The latter one is often preferred if you want to import your files later into other databases.



G. Trace Files, Assembly and Mapping

Import of trace files:




Assembly

Due to time restrictions, the workshop will provide only a brief summary of the different sequence assembly tools which are available in GCG.


The "Gel..." tools are linked together and need to be used in the specified sequence:





Mapping









H. Sequence Features


View Features

Annotation features such as introns, domains and structural information from public and personal databases can be graphically displayed in the Edit window by choosing in the Display menu the options Features Coloring or Graphics Features.

To display features from unaligned sequences in a Pileup alignment, do the following from the Editor:


Edit Features

Highlight sequence file or sequence area in the Edit window, then choose from the Windows menu the option Features and provide in the resulting window your annotation information. Graphical symbols can be chosen in the sub-windows Edit and Add.




For advanced users:

Features can be customized in the file feature.cols, which needs to be localized in the directory from where you start SeqLab (/home/user/). To move this file there, type on the command-line $ fetch feature.cols. Use your favorite editor to modify this file according to your preferences.



I. Print and Export Graphics


The easiest way to print graphics or integrate them into other graphical applications, is to save them in PostScript format and to transfer the resulting file to your local computer, where you can modify and print it in Ghostview, a free software which can be downloaded from this page: http://www.cs.wisc.edu/~ghost/index.htm. When you do this for the first time you have to enable the PostScript format in SeqLab under Options -> Graphics Devices -> Language: PostScript


a) To print sequences and alignments to a file:

b) To print graphics from other GCG applications such as PepPlot:





J. Overview of GCG Programs

Most GCG programs can be accessed through the option Functions in the Menu Bar of the main window, which provides access to currently 111 different sequence analysis tools. This workshop can provide only a brief introduction into a small selection of this massive collection of GCG programs.





For an efficient job and database management, please make yourself familiar with the following functions in the Windows menu: Job Manager, Output Manager and Database Browser.





  1. Database Searches

    - Lookup identifies sequence database entries by name, accession number, author, organism, keyword, title, reference, feature, definition, length, or date. The output is a list file of sequences, which can be used to load all specified sequences into the Main List or Editor.

    - BLAST searches local nucleic acid or protein databases. This important function will be introduced in the next paragraph (K).

    - NetBLAST searches NCBI's database online.

    - FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type. For nucleotide searches, FastA may be more sensitive than BLAST.

    - SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

    - PSI-BLAST: Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST in which a profile (or position specific scoring matrix, PSSM) is constructed (automatically) from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" used to refine the profile. This iterative searching strategy results in increased sensitivity.

    - HMMER can be used to perform sensitive database searching using statistical descriptions of a sequence family's consensus. Related software packages are PSI-BLAST and SAM. A very nice user guide on HMMER can be found on Sean Eddy's home page (http://hmmer.wustl.edu/).

    HmmerAlign aligns multiple sequences to a profile HMM. It can be used to create alignments of large numbers of sequences. HmmerBuild builds a profile HMM from a given multiple sequence alignment. HmmerCalibrate determines appropriate statistical significance parameters for a profile HMM prior to doing database searches. HmmerConvert converts HMMER profile HMMs to other formats. HmmerEmit generates sequences probabilistically from a profile HMM. HmmerPfam searches a profile HMM database with a sequence. HmmerSearch searches a sequence database with a profile HMM.

  2. Alignments (Example)

    - Pileup creates a multiple alignment of unaligned sequences. The alignment is written to a MSF file which can be imported into many alignment editing tools, such as GeneDoc.

  3. Pattern Finding

    - MEME finds conserved motifs in a group of unaligned sequences.

    - Motifs looks for sequence motifs by searching through proteins for patterns defined by PROSITE.

    - FindPatterns looks for patterns defined by the user.

  4. Protein Analysis: Browse through the different protein analysis tools to identify which ones may be useful for your work.

    Example:

    - PeptideStructure makes secondary structure predictions including alpha, beta, coil, turn, antigenicity, flexibility, hydrophobicity and surface probability. A very useful exercise on predicting structure and antigenicity of peptides can be found on this page: http://mcf.ahc.umn.edu/Tutorials.htmls/minitutor6.html

  5. Useful Tools

    - FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

    - FrameAlign creates an optimal alignment of the best segment of similarity (local alignment) between a protein sequence and the codons in all possible reading frames on a single strand of a nucleotide sequence. Optimal alignments may include reading frame shifts.

    - BackTranslate backtranslates an amino acid sequence into a nucleotide sequence. The output helps you identify areas with fewer ambiguities that might be candidates for synthetic probes.


K. Large-scale Sequence Analysis: BLAST Example


Many sequence analyses in GCG can be performed in a batch pipeline. The sequence search tools FASTA and BLAST are just two of many of those "batch" applications, which query sequences databases that are installed locally on cache.ucr.edu. The application NetBLAST allows you to perform online searches against sequence databases on the NCBI server, but it is limited to one sequence submission at a time.

To run many BLAST and FASTA searches at once on cache.ucr.edu, you must first select the sequences of your choice in the Main List or Editor. For selecting sequences you have several options:

To start the BLAST search with the selected sequences, you choose Functions -> Database Sequence Searching -> BLAST. In the open BLAST window you need to select whether you want to search a nucleotide/protein database (defines use of BLASTN, BLASTX, TBLASTN, TBLASTX) and the Search Set (specifies database). Under Options you set the search parameters:


Note: When you perform batch operations in GCG, the software names the output after the sequence/query ID#s and their file extensions correspond to the name of the search tool. Example: gi343848.tblastx.


For parsing of BLAST result, you can try to use on the command line our Perl script "blastParse" or this simple Perl one-liner:



L. List Files

List files are a very efficient way to perform analyses of specific sets of sequences. Since they contain only pointers to the sequences, they can save you a lot of storage space (no duplication of large sequence data) and allow very quick selections of defined sequence groups to perform various analyses simultaneously. For instance, one can quickly create a list file for thousands of sequences in a spread sheet program and submit it to the sequence search tools of your choice. The format of a list file looks like this:


One way of creating a list file is to select the sequences of your choice in the Man List window and then save it as *.list under File -> Save List As.

An alternative and often more flexible way of creating list files is to use a spread sheet program or WordPad on your local machine (use file extension *.list). To import a list file into the Main List, there are two options:

Note: List files with more than 2000 sequences cannot be expanded (viewed) in the Main List.


M. Building a Search Set

In addition to specifying query sequences, certain SeqLab application allow you to specify database records that will be used for a search or analysis. Programs that accept user-defined search sets are FastA, FindPatterns, FrameSearch, Overlap, ProfileSearch, SSearch and StringSearch. In all these programs you specify the search set by clicking on the Search Set button of the individual application, which opens a search set builder window. Note: Each application uses its own search set.


N. Creating Personal Sequence Databases with DataSet


To add your personal sequences to the Database Browser, you need to use the application DataSet. For this you first switch to the appropriate working directory (see C.), then you select your sequences or their list file in the Main List window, and choose: Functions -> Utilities -> Databases Utilities -> DataSet. You will be prompted with a dialog window where you assign a name and then press Run. This will add the following three files to your current working directory: *.header, *.ref and *.seq. When finished you should see your personal database in the Database Browser.


Note: A DataSet is different from a BLASTable database, which is explained in the next paragraph.



O. Creating BLASTable Sequence Databases


Create a new directory where you want to store your BLASTable databases and make it your working directory (see C.). Then you select the sequences that you wish to create a BLASTable set from, and choose: Functions -> Utilities -> Databases Utilities -> GCGtoBLAST.




You will be presented with a dialog window that allows you to assign a name to the set. Enter a name and press Run. This operation creates five new files in your current working directory: *.phr, *.pin, *.psd, *.psi and *.psq. All sequence data are contained in this file structure. To save storage space, you can now delete the initial sequence files. Searching the database that you created requires that you first access the Wisconsin Package from the command line so that you can properly modify a configuration file, which is necessary to add a reference to your new BLASTable database to the BLAST database Search Set menu. To do this you would do the following from the UNIX command line after starting the Wisconsin Package there:

At the end of the file, add a line like: /path/db-base-name p my own blast database

Here are some notes for editing this line:




P. Exercises


  1. File/Directory Management
    1. Within WinSCP: Create the following directories within the master directory Exercises: Seq, Pep, Database and Analysis. Use these directories to organize the work of the following exercises.

    2. Within SeqLab: Create the same directory structure with the working director manager in SeqLab (see C.).


  2. Import/Export
    1. Import trace files: Download the trace files 09.ab1 & 13.ab1, import them into SeqLab, view trace plus text sequences, export the latter into FASTA or GenBank format and view them with WordPad on your local machine.

    2. Import single sequences: Run in your web browser query "P450 & hydroxylase & acid & human [orgn]" against the NCBI Protein Database. Save the first ten proteins in FASTA and GenBank formats and import them one-by-one into SeqLab. Create alignment with Pileup.

    3. Batch import: Import entire proteome of Halobacterium spec. from ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Halobacterium_sp/AE004437.faa.

    4. Import alignments: Create multiple alignment of sequences from 2.2. using MultAlin. Import alignment in MSF and FASTA formats.

    5. Export: Export in single and batch sequence modes. Export alignment in MSF format.

  3. Features

    1. In sequence: view imported sequence from 2.2. in Editor, display and add features.

    2. In alignment: run Pileup with Lookup list file from 4.1. and transfer alignment annotations into Editor and find heme binding cystein residue, export alignment and view it in GeneDoc (only on PC).

  4. Database searches: Lookup, FASTA, SSearch, BLAST, HMMER

    1. Lookup: run query "CYPIII (All text) & P450 (Def)" in Lookup against SwissProt database.

    2. SSearch, FASTA, BLAST and PSI-BLAST: query with one of these sequences the SwissProt database using SSearch, BLAST and FASTA.

    3. HMMER: Align sequences from 4.1. Retrieve and align remote homologs from SwissProt database with HMMER: HmmerBuild, HmmerCalibrate, HmmerSearch and HmmerAlign.

  5. Create BLASTable database

    1. Create BLASTable database for proteome from Halobacterium spec. (imported under 2.3.).

  6. Pattern Searches

    1. Motifs: Use Motifs to find PROSITE patterns in protein alignment from 2.2., find pattern with Edit/Find and highlight it in all sequences at once using the Feature function.

    2. FindPattern: find out how many sequences in the SwissProt database share this pattern using FindPattern.

    3. Consensus and FitConsensus: retrieve the corresponding nucleotide sequences, align them, calculate consensus sequence with Consensus and query with it a small nucleotide database using FitConsensus.

    4. MEME and MotifSearch: use MEME to find conserved motifs in your set of unaligned nucleotide sequences. Use the resulting MEME profiles to query a small nucleotide database with MotifSearch.

  7. Phylogenetic Trees:

    1. PAUP: use PaupSearch to generate a bootstrapped tree from alignment under 2.2. Edit tree with PaupDisplay, Treeview (local) and PowerPoint.

    2. Distance Matrix: calculate distance matrix for alignment using Distances and plot its tree with Display.

  8. Molecular tools: Primer design, backtranslate

    1. Primer design: Design primers that amplify the longest ORFs of the two sequences from 2.1.

    2. Restriction map: generate restriction map for one of the sequences from 1.1. using Map.



    Back to Top

    5