2002b) A machine learning strategy for protein analysis

To sift through and analyze the massive and increasingly available data on proteins, researchers need new computing methods. The authors use machine-learning methods in a novel, three-step strategy for protein structure prediction.

Intelligent Systems in Biology II

A Machine Learning Strategy for Protein Analysis

Pierre Baldi and Gianluca Pollastri

Institute for Genomics and Bioinformatics, University of California, Irvine



Genome and other sequencing projects are producing a deluge of DNA and protein sequence data. In current databases and sequencing projects, roughly 30 percent of proteins do not resemble any other known sequence and have no assigned structure or function. Another 20 percent are homologous to a known sequence whose structure or function (or both) is largely unknown.

Proteomics is the protein counterpart to genomics, the large-scale analysis of complete genomes. Proteomes contain a cell’s total protein expression at a given time. Proteome analysis not only deals with determining protein-encoding genes’ sequence and function, but is also strongly concerned with the precise biochemical state of each protein in its post-translational form (that is, the form it takes after it has been translated from its original DNA encoding).

Traditional experimental techniques for determining a protein’s structure and function, such as x-ray diffraction or Nuclear Magnetic Resonance methods, remain slow and laborious, and do not scale up to current sequencing speeds. Furthermore, using experiments to determine how proteins function is a daunting task: Protein interactions are complex, and their native operating environments are very specific, which can be difficult to replicate in the laboratory.

Researchers are developing many new high-throughput experimental techniques for proteomics applications, including mass spectrometry and protein chips. Still, given proteins’ fundamental importance to biology, biotechnology, and medicine, we must continue developing computer methods that can rapidly sift through massive amounts of data and help determine the structure and function of all the proteins in a given genome.

We’re applying machine-learning methods to proteomic problems, and have developed a novel strategy for completely predicting protein 3D coordinates. The strategy has three stages: structural features prediction, topology prediction, and coordinate prediction. Here, we offer an overview of the domain and our machine-learning techniques, and describe the software suite we’ve developed, which is available at http://promoter.ics.uci.edu/BRNN-PRED/.

1094-7167/02/$17.00 © 2002 IEEE

Proteins: An overview

Proteins are polymer chains composed of 20 simpler building blocks, or amino acids, that function as the molecular machines of living organisms. Although researchers first characterize proteins by their primary sequences (that is, the corresponding amino acid sequence), proteins typically fold into complex, three-dimensional structures that are essential to their function. Some proteins serve as structural building blocks for cells, but most are molecular “processors” that interact with

each other (as in signaling networks),

smaller molecules (as in metabolic networks), and

genetic DNA information (as in regulatory networks)

to form life’s complex circuitry of biochemical reactions.

Protein classes

As Figure 1 shows, proteins can be partitioned into two classes:

Membrane proteins, which are embedded in cell membranes and therefore live in a lipid environment

Globular proteins, which are secreted from the cell or segregated to nonmembrane compartments (such as the nucleus or the cytoplasm), and therefore live in aqueous environments
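Whatever its class, a protein is first characterized by its primary sequence over the 20 amino acids, so a machine-learning method needs that sequence in numeric form. The sketch below shows one common way to do this, a one-hot encoding over the standard one-letter amino acid codes. It is an illustrative assumption, not a description of the input encoding used by the authors' software; the function name `one_hot_encode` and the example fragment are hypothetical.

```python
# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element 0/1 vector per residue of a primary sequence.

    Hypothetical helper for illustration only; the source does not
    specify how sequences are encoded.
    """
    encoded = []
    for residue in sequence.upper():
        if residue not in INDEX:
            raise ValueError(f"unknown amino acid code: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1  # mark the position of this amino acid
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Hypothetical five-residue fragment, not taken from any real protein.
    vectors = one_hot_encode("MKTAY")
    print(len(vectors), "residues,", len(vectors[0]), "features each")
```

Each residue becomes a vector with a single 1, so a sequence of length N yields an N-by-20 matrix that a learning method can take as input; codes outside the 20 standard letters are rejected rather than guessed.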


