RCSB PDB Help

Sequence-based Clustering

Introduction

The amino acid sequence of a protein directs its folding into shapes that enable specific functions. For most proteins in cells, folding is rapid and, in most cases, repeatable (Anfinsen, 1973), suggesting that a protein sequence carries the information needed to fold into a functional protein and that each protein sequence forms a characteristic structure. While local regions of a protein may adopt slightly different conformations in different biological contexts, the overall structure remains the same. In a few exceptional cases, a protein may adopt a completely different shape in a specific environment or in the presence of specific binding partner(s). Research directed towards predicting protein structure from sequence has been ongoing for more than 50 years. Recently, the ability to compute the 3D shapes of proteins from their amino acid sequences has made tremendous progress through the application of machine learning techniques to the archived experimental structural data in the PDB (Baek et al., 2021; Jumper et al., 2021).

When exploring PDB structures, the level of similarity between the amino acid sequences of two or more proteins can be used to infer their structural and functional similarity (Sander and Schneider, 1991). Protein sequences that are 100% identical to each other belong to the same protein, but high levels of sequence identity (e.g., >90%) are also indicative of the same protein, perhaps with a few mutations or variations due to different sources of the protein. Lower levels of sequence similarity between protein sequences may indicate some relationship between their structures and functions. The threshold of sequence similarity that indicates structural homology depends on the length of the alignment. As a rule of thumb, for protein sequences longer than 100 amino acids, >25% sequence identity indicates similar structure and function (Sander and Schneider, 1991).

Sequences and Sequence Clusters

As the single worldwide repository for macromolecular structures, the Protein Data Bank holds many structures with the same or similar sequences. This redundancy enables deep understanding of the biology of these proteins. However, some bioinformatics analyses may benefit from grouping these redundant sequences and structures. For example, all protein structures of the same protein have the exact same sequence. These may be grouped together. Two protein sequences in which 90% of the aligned residues are identical are said to have 90% sequence identity, while proteins whose sequences are only 30% identical have 30% sequence identity. Grouping proteins into clusters by sequence identity is a way to reduce or remove redundancy in 3D structures (including experimental structures and Computed Structure Models or CSMs). The sequences in a particular cluster are expected to share structural and functional properties to a degree that depends on the level of sequence identity.
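To make the notion of percent sequence identity concrete, the following minimal Python sketch counts identical residues between two sequences that have already been aligned. The function name, the gap character, and the choice to divide by the number of columns where both sequences have a residue are assumptions made for illustration; alignment tools may define the denominator differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residues over columns where both sequences have a residue.

    Illustrative only: the denominator used here is an assumption, not
    necessarily the definition used by any particular alignment tool.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only alignment columns without a gap in either sequence.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Two made-up 10-residue sequences that differ at the last position.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
print(f"{identity:.0f}% sequence identity")
```

With these two made-up sequences, nine of the ten aligned positions match, so the pair would fall into the same cluster at the 90% level under this definition.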

What are Sequence Clusters?

The amino acid sequences of all proteins whose 3D structures are available from RCSB.org (including experimental structures and CSMs) are grouped at different levels of sequence identity (e.g., 100%, 95%, 90%, 70%, 50% and 30%) to yield sequence clusters. These pre-computed sequence groups are available for exploring the PDB archive and grouping search results.

Why use Sequence Clusters?

Instead of using all sequences of the 3D structures available from RCSB.org for analysis, representative sequences from each of the sequence clusters can be used. Depending on the level of sequence similarity, properties and features of the representative proteins can be extended to other members in the cluster. Using sequence clusters has the following advantages:

  • It reduces the size of the sequence data set of all available 3D structures, which can simplify the analysis and make it more efficient.
  • Monitoring the growth of the non-redundant sequence clusters makes it possible to track the variety of structures being deposited to the PDB.
  • It can be used to organize sequences from both experimental structures and CSMs to explore evolutionary relationships between specific proteins.
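As a rough illustration of working with representatives rather than every sequence, the sketch below parses cluster assignments and picks one member per cluster. The input format (one cluster per line, whitespace-separated identifiers), the identifiers themselves, and the rule of taking the first member as the representative are all assumptions made for this example; they are not taken from this page.

```python
def parse_clusters(text: str) -> list[list[str]]:
    """Parse one cluster per line, members separated by whitespace (assumed format)."""
    return [line.split() for line in text.splitlines() if line.strip()]


def pick_representatives(clusters: list[list[str]]) -> list[str]:
    """Take the first listed member of each cluster as its representative (assumed rule)."""
    return [members[0] for members in clusters]


# Hypothetical identifiers, for illustration only.
example = """\
AAAA_1 BBBB_1 CCCC_2
DDDD_1
EEEE_1 FFFF_3
"""

clusters = parse_clusters(example)
reps = pick_representatives(clusters)
total = sum(len(members) for members in clusters)
print(f"{total} sequences reduced to {len(reps)} representatives: {reps}")
```

Any other selection rule (for example, choosing the member with the best experimental resolution) could be substituted in `pick_representatives` without changing the rest of the sketch.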

Documentation

How are protein sequences in the PDB clustered?

Sequence clusters are calculated using the MMseqs2 software (Steinegger and Söding, 2017). Currently, only protein sequences are subject to clustering. The clustering procedure is based on the following points:

  • All protein chains of at least 10 amino acids are included in the clusters.
  • Sequence identity is defined as the percentage of identical residues between the two amino acid sequences in the alignment.
  • The sequence clustering process begins with an all-by-all comparison of protein sequences in the PDB.
  • Only alignments with sequence identity scores above the threshold (100%, 95%, 90%, 70%, 50% and 30%) and covering at least 80% (-c 0.80) of both sequences are retained.
  • The clustering is run with the following parameters of the MMseqs2 software:
    • The clustering uses --cluster-mode 1, which corresponds to the connected component algorithm.
    • 50% sequence identity and above: computed with easy-linclust
    • Below 50%: computed with easy-cluster, with the sensitivity set to -s 8 (MMseqs2's highest alignment sensitivity for clustering)
  • For more details on the procedure, please refer to the MMseqs2 user guide.
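The filtering and connected-component steps listed above can be sketched in a few lines of Python. This is not the MMseqs2 implementation: it is a small union-find illustration in which the function name, the tuple layout of the alignments, and the sample values are assumptions. The identity test uses "greater than or equal to" so that a 100% threshold remains meaningful.

```python
from collections import defaultdict


def cluster_connected_components(ids, alignments, min_identity, min_coverage=0.80):
    """Group sequences linked by alignments that pass identity and coverage filters.

    `alignments` holds (id_a, id_b, identity, coverage_a, coverage_b) tuples,
    with identity and coverage expressed as fractions between 0 and 1.
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, cov_a, cov_b in alignments:
        # Retain only alignments that meet the identity threshold and cover
        # at least `min_coverage` of both sequences.
        if identity >= min_identity and cov_a >= min_coverage and cov_b >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(list)
    for seq_id in ids:
        groups[find(seq_id)].append(seq_id)
    return sorted(sorted(members) for members in groups.values())


# Made-up alignments: C-D has high identity but covers only half of C.
ids = ["A", "B", "C", "D"]
alignments = [
    ("A", "B", 0.95, 0.90, 0.95),
    ("B", "C", 0.92, 0.85, 0.90),
    ("C", "D", 0.96, 0.50, 0.99),
]
result = cluster_connected_components(ids, alignments, min_identity=0.90)
print(result)
```

In this example A and C end up in the same cluster even though they were never directly aligned, because both are linked to B. That transitivity is the defining property of connected-component clustering.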

Note: The sequence clusters are subject to change over time as new protein sequences continue to be added to the archive.
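For readers who want to run a comparable clustering locally, the sketch below assembles an MMseqs2 command line from the parameters described above. The exact command lines used by RCSB PDB are not given on this page, so the argument order, the use of --min-seq-id for the identity threshold, and the file names are assumptions; check them against the MMseqs2 user guide before use. The function only builds the argument list and does not run anything.

```python
def build_mmseqs_command(fasta: str, out_prefix: str, tmp_dir: str, min_seq_id: float) -> list[str]:
    """Assemble an MMseqs2 argument list mirroring the parameters described above.

    Assumed layout: mmseqs <workflow> <input.fasta> <output prefix> <tmp dir> [options].
    """
    # 50% identity and above: easy-linclust; below 50%: easy-cluster.
    workflow = "easy-linclust" if min_seq_id >= 0.5 else "easy-cluster"
    command = [
        "mmseqs", workflow, fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # identity threshold (assumed flag for this purpose)
        "-c", "0.8",                      # alignment must cover at least 80% of the sequences
        "--cluster-mode", "1",            # connected component clustering
    ]
    if workflow == "easy-cluster":
        command += ["-s", "8"]            # sensitivity setting used below 50% identity
    return command


# Hypothetical file names, for illustration only.
for threshold in (0.9, 0.3):
    print(" ".join(build_mmseqs_command("sequences.fasta", "clusters", "tmp", threshold)))
```

The resulting list can be passed to `subprocess.run` on a machine where MMseqs2 is installed.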

How to use sequence clusters to explore the 3D structures in RCSB.org?

Each week, RCSB PDB computes sequence clusters for all protein sequences available from RCSB.org [including experimental (PDB) structures and available CSMs]. You can use these pre-computed clusters in the following ways:

References

  • Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223–230; doi: 10.1126/science.181.4096.223
  • Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., Wang, J., Cong, Q., Kinch, L. N., Schaeffer, R. D., Millán, C., Park, H., Adams, C., Glassman, C. R., DeGiovanni, A., Pereira, J. H., Rodrigues, A. V., van Dijk, A. A., Ebrecht, A. C., Opperman, D. J., Sagmeister, T., Buhlheller, C., Pavkov-Keller, T., Rathinaswamy, M. K., Dalwadi, U., Yip, C. K., Burke, J. E., Garcia, K. C., Grishin, N. V., Adams, P. D., Read, R. J., Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871–876; doi: 10.1126/science.abj8754
  • Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589; doi: 10.1038/s41586-021-03819-2
  • Sander, C., Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68; doi: 10.1002/prot.340090107
  • Steinegger, M., Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026–1028; doi: 10.1038/nbt.3988


Please report any encountered broken links to info@rcsb.org
Last updated: 5/1/2023