SMART: recent updates, new developments and status in 2015

Letunic, Ivica; Doerks, Tobias; Bork, Peer

doi:10.1093/nar/gku949

Abstract

SMART (Simple Modular Architecture Research Tool) is a web resource (http://smart.embl.de/) providing simple identification and extensive annotation of protein domains and the exploration of protein domain architectures. In the current version, SMART contains manually curated models for more than 1200 protein domains, with ∼200 new models since our last update article. The underlying protein databases were synchronized with UniProt, Ensembl and STRING, bringing the total number of annotated domains and other protein features above 100 million. SMART's ‘Genomic’ mode, which annotates proteins from completely sequenced genomes was greatly expanded and now includes 2031 species, compared to 1133 in the previous release. SMART analysis results pages have been completely redesigned and include links to several new information sources. A new, vector-based display engine has been developed for protein schematics in SMART, which can also be exported as high-resolution bitmap images for easy inclusion into other documents. Taxonomic tree displays in SMART have been significantly improved, and can be easily navigated using the integrated search engine.

INTRODUCTION

Protein domain analysis remains an important research tool, made easy by various frequently used online domain resources and databases (e.g. 1–3). The SMART database (4) integrates manually curated hidden Markov models (5) for many domains with a powerful web-based interface offering various analysis and visualization tools. After more than 15 years since its inception it remains a popular and widely used by scientists worldwide. Here we give an overview of the major developments and new features introduced since our last update (6).

EXPANDED DOMAIN COVERAGE

SMART was never intended to be exhaustive, and was initially focused on mobile domains, which occur in various contexts while retaining similar function. Nevertheless, it continues to gradually expand its domain coverage with each new release. The current version introduces more than 200 new domains, bringing the total to 1204 distinct modules that can be detected. SMART's domain annotation includes a significant amount of manual work, in particular when selecting individual cut-off values and while creating the high quality underlying multiple sequence alignments. Other more exhaustive databases, like Pfam (3), already annotated many of these domains, but SMART's own manual annotation pipeline leads to partially different protein annotations, enabling increased hypothesis generation by biologists.

UPDATED PROTEIN DATABASES

The main protein database in SMART consists of the complete UniProt protein database (7) combined with predicted proteins from all stable Ensembl (8) genomes. The latest update greatly expanded the size of the database, which now contains more than 33 million proteins from around 350 000 species, subspecies and varietas. To minimize the impact of the inherently high redundancy of these databases, we use a per-species clustering method described in (9), which created 1.3 million multiprotein clusters with a total of 2.7 million proteins.

In addition to the regular protein database described above, SMART offers a ‘genomic’ analysis mode that contains only proteins from completely sequenced genomes. Synchronized with the upcoming STRING version 10 (10), this database has also been significantly extended, and currently contains ∼9.6 million proteins from 2031 complete genomes (238 Eukaryota, 1678 Bacteria and 115 Archaea), compared to 5 million proteins from 1133 species in the previous release.

REDESIGNED PROTEIN ANNOTATION PAGES

The current SMART version introduces completely re-designed protein annotation pages, with a new, vector-based protein schematic display engine (Figure 1). SMART protein schematics (‘bubblograms’) are now drawn using an Adobe Flash-based applet, greatly improving user experience. Schematics can be zoomed and exported into high resolution bitmap images. A function box within the interactive viewer provides access to several additional functions, for example, allowing users to toggle the display of intron positions or to navigate among various alternative representations of proteins containing overlapping domain predictions.

Figure 1.

Open in new tab Download slide

SMART annotation page for protein TEC_HUMAN. (a) Protein schematic representations are displayed using an interactive Flash applet. Schematics are zoomable without quality loss and exportable into high resolution bitmap images. Protein features selected in various data tables are dynamically highlighted directly in the viewer. Using the interactive scale, any protein region can be selected and submitted for further BLAST analysis. (b) The tabbed interface collects various sources of external information about the protein analyzed. (c) A movable and resizable popup dialog displays the most important bits of information for any selected feature, with links to complete annotation.

The protein viewer is connected to various parts of the annotation page. Selecting a predicted domain or other feature in any of the data tables (Figure 1) will automatically highlight its position in the protein. This is particularly useful for features that are not directly displayed in the protein schematic, either because they overlap other predicted domains or they get excluded due to E-value cutoffs.

Detailed information about any detected protein feature can be displayed in a simple floating popup dialog, streamlining the user experience and lowering the need to navigate across different web pages. In addition, these include several convenience functions, allowing users to copy the underlying amino acid sequence to their clipboard, or to submit the sequence for further Basic Local Alignment Search Tool (BLAST) analysis. These information dialogs contain condensed versions of respective annotation pages, with links to the full annotation. The new viewer also includes an interactive protein size scale, which allows users to directly select any protein region and submit it to a BLAST service of choice.

EXPANDED AND UPDATED EXTERNAL INFORMATION SOURCES

With the redesign of the protein annotation pages, the current version of SMART introduces two new external information sources: post-translational modification data and detailed orthology information (Figure 2).

Figure 2.

Open in new tab Download slide

New data sources included in the protein annotation pages and the redesigned taxonomy viewer. (a) A list of post-translational modifications present in the protein, as annotated by PTMcode (11). More than 60 types of modifications are included. (b) Orthologous groups that contain the protein, as annotated by eggNOG (12). Group descriptions and taxonomic classes are listed. (c) New taxonomic breakdown viewer, which supports very large trees and provides quick navigation through the integrated full text search engine.

Data on post-translational protein modifications is provided by version 2 of the PTMcode database (11), and is available for almost 400 000 proteins. SMART displays the total numbers of various post-translational modifications annotated in a particular protein, with links to the detailed annotation pages in PTMcode, where users can explore the modifications, their possible functional associations and the reasoning for calling them in detail.

Protein orthology data are parsed from the eggNOG database (12) and cover more than 7.7 million proteins from 3686 species. SMART's annotation pages show a detailed list of all orthologous groups that include the protein annotated, with their description and taxonomic class. Crosslinks to eggNOG are provided, with detailed overviews of each orthologous group as well as the associated alignments and phylogenetic trees.

EXPANDED BIOLOGICAL PATHWAY AND PROTEIN INTERACTION DATA

With the update of the underlying protein databases, we have also synchronized our protein interaction data with the latest version of the STRING database (10). Graphical representations of putative interaction partners are now available for more than 9.5 million proteins, which is a 3-fold increase from the previous release.

SMART's integration of biological pathways data was greatly expanded in the current version, and is now synchronized with version 2 of the interactive Pathways Explorer (iPath2) (13). Available for more than 2 million proteins, the pathway information now includes not only links to metabolic pathways, but also a selection of regulatory pathways and a large set of pathways involved in the biosynthesis of secondary metabolites.

Biological pathway data, protein interaction data and their associated graphical representations are now part of SMART's new tabbed annotation interface (Figure 1), displaying only subsets of the information as requested by users, making navigation simpler and more user-friendly.

UPDATED TAXONOMIC TREE DISPLAYS

SMART uses simple tree structures to display various taxonomic breakdowns in different parts of its user interface. For example, these are used to show the evolutionary information in domain annotation pages or to display the taxonomic breakdown of domain architecture queries. Since our current database contains proteins from more than 350 000 species, we developed a new tree display widget with an associated full text search engine (Figure 2c). It supports extremely large trees, which can still be navigated with ease. In addition, evolutionary breakdowns in the domain annotation pages now include both protein and domain counts for each taxonomic class, providing a much better overview for various domains that commonly occur in multiple copies per protein.

BACKEND OPTIMIZATIONS

The backend of SMART is a relational database management system, powered by the PostgreSQL engine, which stores the annotation of all SMART domains, protein annotation and sequences, taxonomy information and the pre-calculated protein analyses for the entire UniProt (7), Ensembl (8) and STRING (10) sequence databases. This includes the predictions of SMART and Pfam domains, as well as various protein intrinsic features, like signal peptides, transmembrane and coiled coil regions. Our last update expanded the number of annotated domains and other protein features to more than 100 million, which caused significant slowdowns in various domain architecture analysis queries and made it necessary to restructure significant portions of the database, and to rewrite many parts of the backend code.

FUNDING

Funding for open access charge: European Molecular Biology Laboartory.

Conflict of interest statement. None declared.

REFERENCES

1.

Sigrist

C.J.

,

de Castro

E.

,

Cerutti

L.

,

Cuche

B.A.

,

Hulo

N.

,

Bridge

A.

,

Bougueleret

L.

,

Xenarios

I.

.

New and continuing developments at PROSITE

,

Nucleic Acids Res.

,

2013

, vol.

41

(pg.

D344

-

D347

)

2.

Hunter

S.

,

Jones

P.

,

Mitchell

A.

,

Apweiler

R.

,

Attwood

T.K.

,

Bateman

A.

,

Bernard

T.

,

Binns

D.

,

Bork

P.

,

Burge

S.

, et al.

InterPro in 2011: new developments in the family and domain prediction database

,

Nucleic Acids Res.

,

2012

, vol.

40

(pg.

D306

-

D312

)

3.

Finn

R.D.

,

Bateman

A.

,

Clements

J.

,

Coggill

P.

,

Eberhardt

R.Y.

,

Eddy

S.R.

,

Heger

A.

,

Hetherington

K.

,

Holm

L.

,

Mistry

J.

, et al.

Pfam: the protein families database

,

Nucleic Acids Res.

,

2014

, vol.

42

(pg.

D222

-

D230

)

4.

Schultz

J.

,

Milpetz

F.

,

Bork

P.

,

Ponting

C.P.

.

SMART, a simple modular architecture research tool: identification of signaling domains

,

Proc. Natl. Acad. Sci. U.S.A.

,

1998

, vol.

95

(pg.

5857

-

5864

)

5.

Krogh

A.

,

Brown

M.

,

Mian

I.S.

,

Sjolander

K.

,

Haussler

D.

.

Hidden Markov models in computational biology. Applications to protein modeling

,

J. Mol. Biol.

,

1994

, vol.

235

(pg.

1501

-

1531

)

6.

Letunic

I.

,

Doerks

T.

,

Bork

P.

.

SMART 7: recent updates to the protein domain annotation resource

,

Nucleic Acids Res.

,

2012

, vol.

40

(pg.

D302

-

D305

)

7.

UniProt

Consortium.

.

Activities at the Universal Protein Resource (UniProt)

,

Nucleic Acids Res.

,

2014

, vol.

42

(pg.

D191

-

D198

)

8.

Flicek

P.

,

Amode

M.R.

,

Barrell

D.

,

Beal

K.

,

Billis

K.

,

Brent

S.

,

Carvalho-Silva

D.

,

Clapham

P.

,

Coates

G.

,

Fitzgerald

S.

, et al.

Ensembl 2014

,

Nucleic Acids Res.

,

2014

, vol.

42

(pg.

D749

-

D755

)

9.

Letunic

I.

,

Doerks

T.

,

Bork

P.

.

SMART 6: recent updates and new developments

,

Nucleic Acids Res.

,

2009

, vol.

37

(pg.

D229

-

D232

)

10.

Franceschini

A.

,

Szklarczyk

D.

,

Frankild

S.

,

Kuhn

M.

,

Simonovic

M.

,

Roth

A.

,

Lin

J.

,

Minguez

P.

,

Bork

P.

,

von Mering

C.

, et al.

STRING v9.1: protein-protein interaction networks, with increased coverage and integration

,

Nucleic Acids Res.

,

2013

, vol.

41

(pg.

D808

-

D815

)

11.

Minguez

P.

,

Letunic

I.

,

Parca

L.

,

Bork

P.

.

PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins

,

Nucleic Acids Res.

,

2013

, vol.

41

(pg.

D306

-

D311

)

12.

Powell

S.

,

Forslund

K.

,

Szklarczyk

D.

,

Trachana

K.

,

Roth

A.

,

Huerta-Cepas

J.

,

Gabaldon

T.

,

Rattei

T.

,

Creevey

C.

,

Kuhn

M.

, et al.

eggNOG v4.0: nested orthology inference across 3686 organisms

,

Nucleic Acids Res.

,

2014

, vol.

42

(pg.

D231

-

D239

)

13.

Yamada

T.

,

Letunic

I.

,

Okuda

S.

,

Kanehisa

M.

,

Bork

P.

.

iPath2.0: interactive pathway explorer

,

Nucleic Acids Res.

,

2011

, vol.

39

(pg.

W412

-

W415

)

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Download all slides

Month:	Total Views:
November 2016	8
December 2016	12
January 2017	30
February 2017	47
March 2017	57
April 2017	62
May 2017	43
June 2017	33
July 2017	37
August 2017	34
September 2017	35
October 2017	30
November 2017	33
December 2017	70
January 2018	64
February 2018	65
March 2018	81
April 2018	92
May 2018	63
June 2018	53
July 2018	61
August 2018	65
September 2018	69
October 2018	75
November 2018	82
December 2018	73
January 2019	61
February 2019	71
March 2019	97
April 2019	109
May 2019	88
June 2019	51
July 2019	85
August 2019	116
September 2019	79
October 2019	78
November 2019	91
December 2019	70
January 2020	71
February 2020	101
March 2020	61
April 2020	63
May 2020	64
June 2020	69
July 2020	140
August 2020	91
September 2020	84
October 2020	102
November 2020	66
December 2020	80
January 2021	54
February 2021	46
March 2021	80
April 2021	73
May 2021	48
June 2021	86
July 2021	72
August 2021	61
September 2021	46
October 2021	59
November 2021	81
December 2021	55
January 2022	52
February 2022	58
March 2022	98
April 2022	68
May 2022	73
June 2022	60
July 2022	48
August 2022	51
September 2022	54
October 2022	39
November 2022	51
December 2022	65
January 2023	79
February 2023	68
March 2023	68
April 2023	72
May 2023	189
June 2023	65
July 2023	36
August 2023	60
September 2023	92
October 2023	62
November 2023	79
December 2023	94
January 2024	109
February 2024	157
March 2024	130
April 2024	46

Article Contents

SMART: recent updates, new developments and status in 2015

Abstract

INTRODUCTION

EXPANDED DOMAIN COVERAGE

UPDATED PROTEIN DATABASES

REDESIGNED PROTEIN ANNOTATION PAGES

EXPANDED AND UPDATED EXTERNAL INFORMATION SOURCES

EXPANDED BIOLOGICAL PATHWAY AND PROTEIN INTERACTION DATA

UPDATED TAXONOMIC TREE DISPLAYS

BACKEND OPTIMIZATIONS

FUNDING

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

Article Contents

SMART: recent updates, new developments and status in 2015

Abstract

INTRODUCTION

EXPANDED DOMAIN COVERAGE

UPDATED PROTEIN DATABASES

REDESIGNED PROTEIN ANNOTATION PAGES

EXPANDED AND UPDATED EXTERNAL INFORMATION SOURCES

EXPANDED BIOLOGICAL PATHWAY AND PROTEIN INTERACTION DATA

UPDATED TAXONOMIC TREE DISPLAYS

BACKEND OPTIMIZATIONS

FUNDING

REFERENCES

Comments

Citations

Views

Altmetric

Email alerts

Citing articles via

Latest

Most Read

Most Cited

This Feature Is Available To Subscribers Only