exploration of novel motifs derived from mouse cdna sequences ·...

13
2002 10.1101/gr.193702. Article published online before print in February Access the most recent version at doi: 2002 12: 367-378; originally published online Feb 15, 2002; Genome Res. Hideo Matsuda Hideya Kawaji, Christian Schönbach, Yo Matsuo, Jun Kawai, Yasushi Okazaki, Yoshihide Hayashizaki and Exploration of Novel Motifs Derived from Mouse cDNA Sequences data Supplementary http://www.genome.org/cgi/content/full/12/3/367/DC1 "Supplemental Research Data" References http://www.genome.org/cgi/content/full/12/3/367#otherarticles Article cited in: service Email alerting click here top right corner of the article or Receive free email alerts when new articles cite this article - sign up in the box at the Notes http://www.genome.org/subscriptions/ go to: Genome Research To subscribe to © 2002 Cold Spring Harbor Laboratory Press Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.org Downloaded from

Upload: others

Post on 14-Jan-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

200210.1101/gr.193702. Article published online before print in FebruaryAccess the most recent version at doi:

2002 12: 367-378; originally published online Feb 15, 2002; Genome Res.  Hideo Matsuda Hideya Kawaji, Christian Schönbach, Yo Matsuo, Jun Kawai, Yasushi Okazaki, Yoshihide Hayashizaki and 

Exploration of Novel Motifs Derived from Mouse cDNA Sequences  

dataSupplementary

http://www.genome.org/cgi/content/full/12/3/367/DC1 "Supplemental Research Data"

References

http://www.genome.org/cgi/content/full/12/3/367#otherarticlesArticle cited in:  

serviceEmail alerting

click heretop right corner of the article or Receive free email alerts when new articles cite this article - sign up in the box at the

Notes  

http://www.genome.org/subscriptions/ go to: Genome ResearchTo subscribe to

© 2002 Cold Spring Harbor Laboratory Press

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

Exploration of Novel Motifs Derived from MousecDNA SequencesHideya Kawaji,1,2 Christian Schönbach,3,6 Yo Matsuo,4 Jun Kawai,5 YasushiOkazaki,5 Yoshihide Hayashizaki,5 and Hideo Matsuda1,61Department of Informatics and Mathematical Science, Graduate School of Engineering Science, Osaka University, Toyonaka560-8531, Japan; 2Nippon Telegraph and Telephone Software Corporation, Yokohama 231-8554, Japan; 3ComputationalGenomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center (GSC), Yokohama 230-0045, Japan;4Computational Proteomics Team, Bioinformatics Group, RIKEN Genomic Sciences Center (GSC), Yokohama 230-0045,Japan; 5Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), Yokohama 230-0045,Japan

We performed a systematic maximum density subgraph (MDS) detection of conserved sequence regions todiscover new, biologically relevant motifs from a set of 21,050 conceptually translated mouse cDNA(FANTOM1) sequences. A total of 3202 candidate sequences, which shared similar regions over >20 amino acidresidues, were screened against known conserved regions listed in Pfam, ProDom, and InterPro. The filteringprocedure resulted in 139 FANTOM1 sequences belonging to 49 new motif candidates. Using annotations andmultiple sequence alignment information, we removed by visual inspection 42 candidates whose members werefound to be false positives because of sequence redundancy, alternative splicing, low complexity, transcribedretroviral repeat elements contained in the region of the predicted open reading frame, and reports in theliterature. The remaining seven motifs have been expanded by hidden Markov model (HMM) profile searches ofSWISS-PROT/TrEMBL from 28 FANTOM1 sequences to 164 members and analyzed in detail on sequence andstructure level to elucidate the possible functions of motifs and members. The novel and conserved motifMDS00105 is specific for the mammalian inhibitor of growth (ING) family. Three submotifs MDS00105.1–3 arespecific for ING1/ING1L, ING1-homolog, and ING3 subfamilies. The motif MDS00105 together with a PHDfinger domain constitutes a module for ING proteins. Structural motif MDS00113 represents a leucine zipper-likemotif. Conserved motif MDS00145 is a novel 1-acyl-SN-glycerol-3-phosphate acyltransferase (AGPAT) submotifcontaining a transmembrane domain that distinguishes AGPAT3 and AGPAT4 from all other acyltransferasedomain-containing proteins. Functional motif MDS00148 overlaps with the kazal-type serine protease inhibitordomain but has been detected only in an extracellular loop region of solute carrier 21 (SLC21) (organic aniontransporters) family members, which may regulate the specificity of anion uptake. Our motif discovery not onlyaided in the functional characterization of new mouse orthologs for potential drug targets but also allowed us topredict that at least 16 other new motifs are waiting to be discovered from the current SWISS-PROT/TrEMBLdatabase.

The growing number of complete genomes and estimates thatthe human proteome may contain up to 500,000 proteins(Banks et al. 2000) underline the importance of understand-ing protein sequences, motifs, and their functional units(modules) to derive potential functions and interactions. Mo-tifs are conserved sequence patterns within a larger set of pro-tein sequences that share common ancestry. Conserved mo-tifs may be used to predict the functions of novel proteins ifthe relationship among the encoding genes is orthologous(Tatusov et al. 1996). However, the increasing number ofparalogs and mosaic proteins evolved from gene duplicationsand genomic rearrangement mechanisms led to different in-terpretations of the term motif and the concept of modules asconserved building blocks of proteins that have a distinctfunction (Bork and Koonin 1996, Henikoff et al. 1997). Amodule can consist of one motif or multiple adjacent motifs.

For example, C2H2 zinc fingers, leucine zippers, and POUdomains are DNA-binding modules. However, structural mo-tifs or active site conservation in a short stretch of sequencesdo not often reflect common ancestry. Therefore, biologicalinterpretation of motif findings requires additional efforts, forexample, literature, structural, and phylogenetic analysis.

To annotate and characterize new protein sequences orextend known protein families, motif searches are conductedby using regular expressions (PROSITE search), similaritysearch with matrices (BLASTP, profile search), or nonlinearpattern recognition (HMMor artificial neural network) withtraining sets comprising already known motifs. Strictly de-fined, new protein motifs are either conserved sequences ofcommon ancestry or convergence (functional motifs) withinseveral proteins that group together for the first time by simi-larity search and show statistical significance (Bork andOuzounis 1995). This definition poses a problem for motifextension or submotifs. Therefore, we adopted a case-by-caseapproach using the functional importance as threshold for anovel motif. Here we have chosen a systematic linkage clus-tering based on sequence similarity, followed by visual inspec-

6Corresponding authors.E-MAIL [email protected]; FAX 81-6-6850-6602.E-MAIL [email protected]; FAX 81-45-503-9158.Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.193702. Article published online before print in February 2002.

Letter

12:367–378 ©2002 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org Genome Research 367www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

tion, sequence, topological, and literature analysis of the mo-tif candidates to explore new motifs derived from 21,076mouse cDNA clones (RIKEN Genome Exploration ResearchGroup Phase II Team and FANTOM Consortium 2001).

RESULTS AND DISCUSSION

Computational Motif DetectionWe used a linkage-clustering method with all-to-all sequencecomparison (Matsuda et al. 1999) to extract 2196 homologousgroups of sequences from a nonredundant set of RIKENmouse cDNA sequences (Fig. 1). In 465 groups that containedfour or more sequences, 1531 motif candidates were detectedwith a motif detection algorithm as described in the Methodssection. The candidates, comprising 3202 sequences with12,251 conserved regions, were screened against known con-served regions found in Pfam (Bateman et al. 2000), ProDom(Corpet et al. 2000), and InterPro (Apweiler et al. 2000) data-

bases. The filtering procedure resulted in 49 novel motif can-didates containing 139 sequences and 216 conserved regions.Subsequent inspection using annotations and multiple se-quence alignments yielded seven possible new motifs(MDS00105, MDS00113, MDS00132, and MDS00145–MDS00148) comprising 28 sequences and 42 false positivemotifs.

HMMs were constructed from the motifs and applied ina HMMERsearch of the nonredundant SWISS-PROT/TrEMBL(SPTR) database (Bairoch et al. 2000). All motifs, includinginformation on HMMscores, E-values, cutoff thresholds, motifalignments, and chromosomal localization can be accessed atthe MDS motif database (URL http://motif.ics.es.osaka-u.ac.jp/MDS/). At present, the MDS motif database contains164 sequences belonging to seven motifs. From 28 FANTOM1sequences, seven sequences were derived from EST assembliesand therefore do not carry DDBJ accessions.

Estimation of Motif CoverageThe InterPro coverage of 21,050 FANTOM1 translated se-quences is 27.9% (5873 sequences) (RIKEN Genome Explora-tion Research Group Phase II Team and FANTOM Consor-tium, 2001). The coverage of seven MDS motifs is 0.133% (28sequences). If we extrapolate from the 136 hits of sevenmotifsto 707,571 sequences of the nonredundant SPTR database,excluding the 10,465 FANTOM1 sequences, the estimatednumber of new MDS motif-containing sequences would be927 (0.133% of 697,106 sequences). Because the number ofsequences per MDS motif varies from four (the minimumnumber of motif-containing sequences that our method de-tects) to 57 (MDS00148), the estimated number of not yetdiscovered MDS motifs in the current release of SPTR wouldrange from 16 to 231. The low number of new motifs mayreflect a constraint on the number of possible functions andinteractions for a given protein in the proteome. In addition,some of the new motifs will be lineage-specific because ofspecies-specific expansion of regulatory genes (Mortlock et al.2000). If the effects of cDNA library normalization and insertsize limitation are neglected, the FANTOM clones represent arandom sample of the mouse transcriptome.

Hypothetical Protein Comprising MotifsThree of seven new motifs have been found in hypotheticalproteins. Because we lack experimental information on theseproteins, we briefly summarize the predicted functions.MDS00132 members are encoded by mouse 2210414H16Rikand 330001H21Rik and human DKFZP586A0522 (SPTR acces-sions Q9H8H3, Q9H7R3, Q9Y422, and AAH08180) loci. Thehuman proteins belong to the generic methyltransferasefamily (InterPro IPR001601) and contain adjacent to theN-terminal located MDS00132 a SAM (S-adenosyl-L-methionine) binding motif (IPR000051). Consideringthe 80% sequence identity and 90% similarity toDKFZP586A0522 >146 residues (data not shown), it is likelythat hypothet ica l prote ins 2210414H16RIK and3300001H21Rik belong to the methyltransferase family. Mo-tif MDS00146 comprises 21 members of hypothetical proteinsor fragment derived from human, mouse, rat, fruitfly, andworm. Three members, mouse 1200017A24Rik (SPTR acces-sion Q9DB92) and human BA165F24.1 (Q9H1Q3) andFLJ00026 (Q9H7P2), carry at their C-terminus an aminoacyl-transfer RNA synthetases class-II signature (IPR002106), indi-cating possible involvement in the protein synthesis. The ho-Figure 1 Strategy of motif exploration.

Kawaji et al.

368 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

mology with a rat TRG protein fragment (SPTR accessionQ63603, not to be confused with TRG T-cell receptor gamma)did not reveal any functional information. Motif MDS00147is located at the N-terminus of four mouse and two humanhypothetical proteins. No other motifs have been detected inthe sequences. The remaining motifs—MDS00105,MDS00145, MDS00148, and MDS00113—have been the sub-ject of detailed analyses to explore potential functions of themotifs and their member sequences.

Inhibitor of Growth MotifMotif MDS00105 is specific for the ING family, comprisingthree subfamilies: the ING1/ING1L subfamily (Zeremski et al.1999, Gunduz et al. 2000, Saito et al. 2000), the ING3 sub-family, and the ING1-homolog subfamily including distanthomologs inDrosophila melanogaster, Arabidopsis thaliana, andSchizosaccharomyces pombe (Fig. 2A,B,C). The putative transla-tions of five RIKEN genes (21810011M06Rik, D6Wsu147e,1700027H23Rik, 1810018M11Rik, and 1300013A07Rik), inaddition to another 35 sequences derived from the nonredun-dant SPTR, share the conserved 51 amino acid MDS00105region near the N-terminus. Previous research on ING1 hasfocused predominantly on the PHD finger region because it isalso conserved in the yeast homolog YNG2 (Loewith et al.2000). However, the homology breaks down toward the N-terminus. It was shown that the PHD finger is not required forhistone acetyltransferase (HAT) activity association of YNG2but for human P33ING1 (Loewith et al. 2000). Recently,P33ING1 has been associated with SIN3/HDAC1 complex-mediated transcriptional repression (Skowyra et al. 2001).P24ING1 is believed to be a p53-mediated growth suppressorthat acts through transcriptional activation of CDKN1A.

The PHD finger contains seven conserved Cys and oneHis residues with metal-chelating potential. These residues areconserved throughout all subfamily members of the ING fam-ily. In contrast, the MDS00105 sequence region shows mod-erate sequence similarity among all ING family members butis conserved between human, mouse, and fruitfly within eachsubfamily. We therefore reconstructed HMMprofiles ofMDS00105 for each of the three subfamilies and derived thethree submotifs MDS00105.1–MDS00105.3 that distinguishING1/ING1L, ING1-homolog, and ING3 subfamily membersfrom each other. The E-values of each submotif HMMsearchresult significantly increased toward random values when ithit a sequence of another submotif member. For example,within the ING1/ING1L submotif HMM the E-values rangedfrom 4.1E-33 to 1E-27 for the ING1/ING1L members, whereasthey reached 1E-2 and 3.3E-1 for the ING1-homolog andING3 members, respectively. The E-value based discrimina-tion was also supported by regular expression submotifs Q-E-L-G-D-E-K-[IM]-Q, K-E-[FY]-[SG]-D-D-K-V-Q and [LM]-E-D-A-D-E-K-V-[AQ] that are specific for ING1/ING1L, ING1-homolog, and ING3 subfamilies, respectively (Fig. 2A,B,C).The A. thaliana and yeast sequences that were aligned withthe ING1-homolog sequences (Suppl. Fig. 1B) do not bear theK-E-[FY]-[SG]-D-D-K-V-Q signature, but they do share similarresidues [DTN]-E-K-V-[LTQ].

The ING1-homolog (Suppl. Fig. 1B) and ING3 (Suppl.Fig. 1C) subfamily members, which are moderately similar toING1 and ING1L, are still uncharacterized. The ING1 andING1L proteins are highly similar except for the N-terminusand a short region within the MDS00105 motif (Suppl. Fig.1A). The variation in the N-terminal region is the result of

alternative transcripts (Zeremski et al. 1999). The ING3 sub-family contains up to 14 consecutive serine residues betweenthe MDS00105.3 submotif and the PHD finger. All ING mem-bers were predicted to be nuclear proteins that share a C-terminal PHD (plant homeo domain) finger domain (Aaslandet al. 1995, Pascual et al. 2000), indicating involvement inDNA binding and transcriptional regulation.

21810011M06Rik or Ing2 (Fig. 2A, Suppl. Fig. 1A) is themouse ortholog of human ING1L (Chr 4q35.1), which ap-pears to be the paralog of ING1 (Chr 13q34). D6WSU147E,1700027H23Rik, and 1810018M11Rik are members of theING1-homolog subfamily (Fig. 2B, Suppl. Fig. 1B).D6Wsu147e appears to be the ortholog of the human ING1-homolog and was mapped to mouse Chr 6 (59.3 cM). Thehuman ING1-homolog (LocusLink accession LOC51147) hasbeen mapped to the syntenic region of Chr 12pter-12q14.3.ING1-homologs, 1700027H23RIK and 1810018M11RIK, areidentical to each other except for the N-terminal region. Theyare 64% identical to D6WSU147E and human ING1-homolog(p33 variant) and may therefore represent paralogs of ING1-homolog. 1300013A07Rik seems to be the ortholog of humanING3 (Chr 7q31) and a variant of mouse ING3 (Fig. 2C, Suppl.Fig1C). We identified an unspliced intron between potentialdonor splice site 799 (ACAG|GTAA) and position 1277, whichmay lead to premature termination and render the translationproduct nonfunctional.

Considering the earlier described experimental findings,we hypothesize that MDS00105 and its submotifs representbinding sites for distinct subfamily specific protein–proteininteraction with HAT, HDAC, MYC (Helbing et al. 1997) andother cell cycle-related proteins, whereas the unique regionsof each subfamily member may modulate interactions. Forexample, ING1-homolog subfamily members share a tyrosinekinase phosphorylation site in the immediate vicinity of themotif, whereas ING3 and ING1/ING1L subfamily memberslack this site. The conservation of the submotifs in humanand Drosophila coincides also with the presence of TP53 andits Drosophila homolog p53 (Ollmann et al. 2000). Yeast andplants lack TP53 homologs and do not share the conservedsubmotifs. Therefore, the functions of yeast and plant INGhomologs in transcriptional and cell cycle regulation mayhave been differently conserved.

1-Acyl-SN-Glycerol-3-Phosphate AcyltransferaseSubfamily MotifMotif MDS00145 is specific for mammalian 1-acyl-SN-glycerol-3-phosphate acyltransferases (AGPAT) AGPAT3 andAGPAT4 (Fig. 3A). RIKEN clones 4930526L14 and2210417G15 represent Agpat3 (Chr 10 41.8 cM), which is theortholog of human AGPAT3 on Chr 21q22.3. The FANTOM1mapping of RIKEN clones 4930526L14 and 2210417G15 toChr 16 69.90–71.20 appears to be caused by an Agpat3 relatedsequence on Chr 16. Agpat4 (clone 1500003P24) has beenmapped to mouse Chr 17, 7.3–8.2 cM and a syntenic regionon human Chr 6 that contains AGPAT4 and is close toMAP3K3. AGPAT4 is a paralog of AGPAT1 (Chr 6p21) locatedin major histocompatibility complex class III region.

PSORT(Nakai et al. 1999) program results indicated thatAGPAT3 and AGPAT4 are endoplasmic reticulum (ER) mem-brane proteins. The latter is in concordance with previousfindings for human AGPAT1 (Aguado and Campbell 1998). Ifthe signal peptide is not cleaved at predicted positions 19 or73—as it is often observed for ER membrane proteins—

Exploration of Novel Motifs from Mouse cDNA

Genome Research 369www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

AGPATs carry three transmembrane helices. MDS00145 spansthe C-terminal region facing the ER lumen (Suppl. Fig. 2).AGPAT3 and AGPAT4 lack a 25 amino acid region at the C-terminus that causes a split of MDS00145 when AGPAT1 andAGPAT2 in the alignment (Fig. 3B). The Pfam defined acyl-

transferase domain is located between the second and thirdtransmembrane helices and shared by all AGPAT members.

AGPATs (EC2.3.1.51) are involved in phospholipid me-tabolism by converting lysophosphatidic acid (LPA) intophosphatidic acid (PA). The AGPAT substrate is 1-acylglycerol

Figure 2 Multiple sequence alignments of the MDS00105 motif region. The proteins are designated by the species abbreviation and the name.The positions of motif MDS00105 are indicated by “+”. Submotif positions based on regular expressions are designated by “#&”. Multiplesequence alignments of representative complete sequences are shown in Supplement Figure 1A,B,C. (A) Alignment ING1 and ING1L MDS00105.1motif region. The SWISS-PROT/TrEMBL accessions of the sequences are as follows: (HSA)P47ING1, Q9UIJ4; (HSA)P33ING1A, O43658; (MMU)Ing1[isoform 1], Q9QXV3 isoform 1; (HSA)P47ING1A, Q9UK53; (HSA)P24ING1B, Q9UBC6; (HSA)ING1C, Q9P0U6; (HSA)P33ING1B [fragment];CAC38067; (HSA)P33ING1, O00532; (HSA)P24ING1C, Q9NS8; (HSA)P33ING1, Q9UIJ2; (HSA)P33ING1B, Q9UK52; (HSA)ING1 [isoform],Q9H007; (HSA)ING1, Q9HD98; (HSA)ING1 [fragment], Q9HD99; (HSA)ING2, Q9H160; (HSA)ING1L, O95698; (MMU)Ing2, Q9ESK4;(MMU)2810011M06Rik, Q9CZD8; and (DMEL)CG7379, Q9VEF5. (B) Alignment of ING1 homolog MDS00105.2 motif region. The SWISS-PROT/TrEMBL accessions of the ING1 homolog sequences are as follows: (HSA)LOC51147 [P33ING1 homolog], Q9UNL4; (HSA)MY036, Q9H3J0;(HSA)MGC12557, AAH07781; (HSA)MGC4757 [similar to (MMU)1700027H23Rik], AAH13038; (HSA)MGC12080, AAH09127;(MMU)D6Wsu147e, Q9D7F9; (MMU)D6Wsu147e [variant]; (MMU)1810018M11Rik, Q9D8Y8; (MMU)1810018M11Rik [partial];(MMU)1700027H23Rik, Q9D9V8; (HSA)MGC12485 [similar to (MMU)1819918M11Rik], Q9BS30; (DMEL)CG9293, Q9VJY8; (SPOM)C3G9.08,O42871; (ATHA)14013, Q9LIQ6; and (ATHA)F20D21.22, Q9SLJ4. (C) Alignment of the ING3MDS00105.3 motif region. The SWISS-PROT/TrEMBLaccessions of the other ING3 subfamily sequences are as follows: (HSA)P47ING3 [variant 1], Q9NXR8; (HSA)P47ING3 [variant 2], Q9HC99;(HSA)ING3, O60394; (MMU)p47Ing3, Q9ERB2; (MMU)13000 13A07Rik, Q9DBG0; (HSA)DKFZp586C1218, CAC48260; and (DMEL)CG6632,Q9VWS0.

Kawaji et al.

370 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

3-phospate. Neither the catalytic nor the substrate bindingsites of AGPAT are known. Because AGPAT1 and AGPAT2overexpression experiments by West et al. (1997) showed en-hanced IL6 and TNF-alpha transcription and synthesis onIL1B stimulation, the AGPAT family members are potentialtargets to modulate inflammatory responses. The transmem-brane domain-based alignment of mouse and human AGPATsequences with distant homologs revealed two conservedblocks in the ER and cytoplasmic loop region. The latter wasalso described in an experimental E. coli AGPAT study (Lewinet al. 1999). The presence of potential active sites on both sitesof the ER is corroborated by the recently discovered reverse,ATP-independent and CoA-dependent LPA synthesis from PA(Yamashita et al. 2001). The signature E-G-T-R-X(4–8)-[SD]-X(1–18)-L-P is conserved among all AGPAT family members.We propose that these residues are required for substrate bind-ing on the cytoplasmic site. Possible catalytic sites may in-volve the N-H-X4-D and AGPAT3 and four specific H-K-L-Y-Q-E and G-N-X-[DE] signatures (Suppl. Fig. 2) because His or

Gln and Glu or Asp can facilitate a nucleophilic attack onthioester groups of the glycerol phosphate group of fatty acidCoAs. The divergence of AGPATs in the ER-sided regionaround MDS00145 motif of AGPAT3 and AGPAT4 indicates aregulatory function MDS00145 for transacylation specificityor activity.

Solute Carrier 21 Family MotifMotif MDS00148 (Fig. 4A) spans a 35–37 amino acids longextracellular oriented loop region between two transmem-brane domains that is conserved among members of themammalian SLC21 family (organic anion transporters), or-ganic anion transporter polypeptide-related (OATPRP), re-lated Drosophila and Caenorhabditis elegans organic aniontransporters (Tsuji et al. 1999). All member sequences haveseven to 12 transmembrane regions depending on the trans-membrane helix prediction algorithm used. We assume 12transmembrane regions. Sequences of rat Slc21a10 (Lst-1), ratOat-k2, and 1700022M03Rik are shorter because of exon de-

Figure 3 (A) Multiple sequence alignment of AGPAT3 and AGPAT4 specific motif MDS00145. The SWISS-PROT/TrEMBL accessions of thesequences are as follows: (HSA)AGPAT3 [isoform 1], Q9NRZ7; (HSA)MGC16906, AAH11971; (HSA)AGPAT3 [isoform 2], Q9NRZ701; (MMU)Ag-pat3, Q9D517; (MMU)Agpat4, Q9DB84; (RNO)Agpat4, BAB62290; and (HSA)AGPAT4, Q9NRZ5. The putative translations of Agpat3 variant 1(clone ID 2210417G15) and variant 2 (clone ID 4831410M06) are derived from DDBJ entry AK008965 and FANTOM1 database because variant2 has no DDBJ accession number. (B) Alignment of AGPAT3 and AGPAT4 C-terminal region with other AGPAT family members. The positions ofthe AGPAT3 and AGPAT4 specific motif MDS00145 are indicated by “+”. The multiple sequence alignment of the complete sequences is shownin Suppl. Fig. 2.

Exploration of Novel Motifs from Mouse cDNA

Genome Research 371www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

Figure 4 (A) Multiple sequence alignment of extracellular domain containing the OATP family motif MDS00148. The proteins are designatedby the species abbreviation and the name. The positions of motif MDS00148 are designated by “+”. Potential N-glycosylation sites N-x-[ST] areindicated. A multiple sequence alignment of MDS00148 representatives covering the entire sequence length and a phylogenetic tree are shownin Suppl. Fig. 3A and B. The SWISS-PROT/TrEMBL accessions of the sequences are as follows: (MMU)4921511I05Rik, Q9D5W6;(MMU)4933404A18Rik, Q9D4B7; (MMU)1700022M03Rik, Q9DA23; (RNO)Tst1, AAK63015; (MFA)hypo. protein [61.1 kD], BAB69708;(DMEL)CG11332, Q9I7P1; (HSA)SLC21A12, Q9H4T8; (HSA)SLC21A12 [POATP], Q9UI35; (HSA)MGC22943 [similar to SLC21A12], AAH15727;(HSA)SLC21A12 [OATP-E], Q9UIG7; (HSA)LOC115191 [SLC21A12-related], Q9H8P2; (RNO)Oatp-e, Q99N01; (MMU)Slc21a2, Q9EPT5;(RNO)Slc21a2, Q00910; (HSA)SLC21A2, Q92959; (MMU)5830414C08Rik, Q9JKV0; (HSA)OATPRP3, Q9GZV2; (HSA)SLC21A11, Q9UIG8;(HSA)MGC659 [similar to SLC21A11], Q9BW73, (RNO)Pgt2, Q99N02; (RNO)Oat-k2, Q9WTM0; (RNO)Slc21a4, P70502; (MMU)Slc21a1,Q9QXZ6; (RNO)Oatp2, O35913; (RNO)Oatp3, O88397; (RNO)Slc21a7, Q9EQR8; (RNO)Slc21a3, P46720; (MMU)MGC6720, AAH13594;(MMU)Slc21a7, AAK39416; (MMU)Slc21a5, Q9EP96; (MMU)Slc21a13, Q99J94; (RNO)Oatp5, Q9QYE2; (HSA)OATP, AAG30037; (HSA)SLC21A3[OATP-A], CAB97006; (HSA)SLC21A3 [OATP], P46721; (HSA)SLC21A3 [OATP1B], Q9UL38; (MMU)Slc21a14, Q9ERB5; (RNO)LOC845111,Q9EPZ7, (HSA)SLC21A14, Q9NYB5; (MFA)hypo. protein [66.7 kD], Q9GMU6; (HSA)SLC21A8, Q9NPD5; (HSA)SLC21A6, Q9Y6L6;(MMU)Slc21a10(MLST-1), Q9JJJ1; (MMU)Slc21a10(Lst-1), Q9JJL3; (MMU)MG1351894, Q9JI79; (RNO)Slc21a10(Lst-1a), Q9JHF6;(RNO)Slc21a10(Lst-1c), Q9JIM2; (RNO)Slc21a6, Q9QZX8; (HSA)OATRPR4, Q9H2Y9; (DMEL)CG3811, Q9VLB3; (DMEL)CG3380, Q9W269;(DMEL)CG3380 [variant], AAK77236; (DMEL)CG3387, Q9W271; (HSA)OATPRP2, Q9H2Z0; (HSA)SLC21A9, O94956; (HSA)DKFZP586I0322,Q9UFU1; (RNO)moatp1 [splice variant], Q9JHI3; (CELE)Y70G10A.3, Q9XWC5; (DMEL)CG7571, Q9VVH9; and (CELE)F21G4.1, Q93550. Motifsequences that contain an hmmpfam-predicted kazal-type serine protease inhibitor domain are labeled with a “k”. (B) Proposed topology of anorganic anion transporter. The primary structure is shown for 4921511I05Rik, assuming 12 transmembrane helices. The intra- and extracellulardomains are separated by horizontal dashed lines, which symbolize the cell membrane. The motif region MDS00148 is indicated by arrows.Potential extracellular N-glycosylation sites are symbolized by asterisks. Potential disulfide bonds of Cys residues are shown as double lines.

372 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

letion or truncation (Suppl. Fig. 3A). SLC21 family proteinscatalyze the Na+-independent transport of organic anions(e.g., estrogen and prostaglandin) as well as conjugated andunconjugated bile acids (taurocholate and cholate). SLC21members are tissue specific, for example, SLC21A6 (Abe et al.1998, Noe et al. 1997) is expressed in brain and liver andkidney but not in testis, heart, spleen, lung, and muscle.SLC21A6 can uptake taurocholate, estrogen conjugates, di-goxin, and cardiac glycosides.

The motif MDS00148 is located in the extracellular re-gion between transmembrane helices 9 and 10. Interestingly,33 of 60 proteins shown in Fig. 4A were predicted to carry inthis region a kazal-type serine protease inhibitor domain asdefined by InterPro (IPR002350, Release 3.2). The 3D struc-ture of kazal-type domains typically includes two short �-he-lices and a three-stranded antiparallel ß-sheet. Secondarystructure prediction using the DSCprogram (King et al. 1997)and the PredictProtein server at http://maple.bioc.columbia.edu/predictprotein/ (Rost 1996) confirmed the antiparallel ß-sheet but not the presence of �-helices. Moderate structuralsimilarities of the kazal-type domain to the follistatin andosteonectin domains were noted earlier, but closer analysisshowed that they differ in the cysteine bonding pattern andN-terminal regions (Hohenester et al. 1997). Our HMM profiledetected none of the serine protease inhibitor members, in-cluding membrane-associated agrins. The hmmpfam searchusing the Pfam profile domains detected a kazal-type domainswith E-values ranging from 0.0057 to 0.041 in 11 sequences(Fig. 4A). When applying the threshold of the Pfam gatheringmethod to build Pfam full alignments for the kazal-type do-main in an hmmpfam search against the Pfam Release 6.6,

only two sequences , (HSA)SLC21A3 [OATP] and(HSA)SLC21A3 [OATP1B] (SPTR accessions P46721 andQ9UL38), showed a bit score >3.0, which qualifies them askazal-type domain members. In the absence of crystal struc-ture data of an SLC21 family protein, we conclude thatMDS00148 represents a novel module with some structuralsimilarities to the kazal-type domain. MDS00148 might haveevolved from a protease inhibitor domain but acquired dif-ferent functionality when transferred into an ancestral trans-membrane SLC21 family protein.

The presence of 11 Cys residues in the pattern C-X(22–23,27–28)-C-X(3,5)-C-X-C-X(7,8)-C-X(10–11)-C-X3-C-X(7,14–

16)-C-X-C-X(13–37)-C-X(3–5)-C could support four loops (Fig.4B) that are stabilized by disulfide bonds between Cys resi-dues. The other extracellular regions between transmembranehelices form shorter loops. The loops of the extracellular re-gions may interact with each other through a network of hy-drogen and disulfide bonds to form a lid-like structure. Be-cause the extracellular region between transmembrane helices9 and 10 is conserved between mammals, Drosophila and C.elegans, we propose that substrate specificity is determined byconserved cysteines, positively charged residues located be-tween the conserved cysteine residues, and the loop length ofextracellular regions. Cysteines have been shown to be in-volved in the transport activity of dopamine and serotonintransporters (Chen et al. 2000, Chen et al. 1997). Based on thesimilarities in the extracellular loop regions and phylogeneticanalysis (Suppl. Fig. 3B), we predict that the novel mouseSLC21 members may share anion uptake specificities ofSLC21A12 [OATP-E]. SLC21A12 can transport estradiol-17ß-glucuronide, ß-lactam antibiotic benzylpenicillin, and prosta-

Figure 4 (continued)

Exploration of Novel Motifs from Mouse cDNA

Genome Research 373www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

Figure 5 (A) ClustalW sequence alignment of motif MDS00113. (The SWISS-PROT/TrEMBL accessions of the sequences are as follows:(HSA)DKFZP434B1517, Q9UFF7; (HSA)IMAGE4304203, AAH09972; (HSA)FLJ11937 [FLJ21739], Q9H6X6; (HSA)FLJ11937, Q9HAA1; (HSA)-FLJ00109, Q9H7I0; DMEL)CG15609, Q9V7X1; (HSA)KIAA1668 [FLJ14966], BAB55422; (MMU)4921517J23Rik, Q9D5U9; (HSA)KIAA0819 [frag-ment], O94909; (DMEL)CG11259 [variant], AAK93415; DMEL)CG11259, Q9VU34; (MMU)EST57332, 4930438G05; (MMU)EST57332 [variant],4933434L05; (HSA)KIAA1668, Q9BY92; (HSA)KIAA1668 [Unknown], Q9BVL9; (HSA)DJ1014D13.3, Q9UH43; (HSA)FLJ23471, Q9H5F9; (RNO)-Fra1, P10158 (HAS)FRA1 [variant], CAC50237; (HSA)FRA1, P15407; and (MMU)Fra1, P48755. The putative translations of segment EST573322and its variant (clone ID 4930438G05 and 4933434L05) were taken directly from the FANTOM1 database because the sequence has no DDBJaccession number. (B) Topology-based multiple sequence alignment of 13 MDS00113 representatives. Regions that were predicted as coiled coilwere aligned in the upper part. The lower part of the alignment contains the noncoiled regions. The positions of motif MDS00113 and leucineheptad repeats, the predicted coiled coil regions, and the coiled coil trigger sequence (xxLExchxcxccx) are indicated. Letters a to g indicate helixpositions. NLS symbolizes the nuclear localization signal. (C) Domain map of MDS00113 containing putative proteins. The abbreviations andInterPro accessions are as follows: M, MDS00113; L, leucine heptad repeat; N, nuclear localization signal; C, coiled coil; CH, calponin-homologydomain, IPR001715; LIM, LIM domain, IPR001718; DER, delayed-early response, IPR002259; B, basic-leucine zipper, IPR001871; PNDO, pyridinenucleotide-disulfide oxidoreductase; PRR, proline-rich region, IPR000694; and L7/12, ribosomal L7/L12 C-terminal domain, IPR000206.

Kawaji et al.

374 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

glandin E2 (Tamai et al. 2000). SLC21A12 is expressed in vari-ous cancer cells but not in normal blood cells.

While revising this paper we noticed that the Pfam Re-lease 6.6 (August 2001) includes two domains of the SLC21family (OATP_N, PF03132; OATP_C, PF03137). In an hmmp-fam search against the Pfam Release 6.6, 58 and 47 of the 60proteins are predicted to have (according to the Pfam gather-ing method) the OATP_N domain and the OATP_C domain,respectively. The graphical view of MDS00148 in the MDSmotif database visualizes the multidomain information. TheOATP_N domain overlaps with the C-terminus of theMDS00148 motif in 1, 3, 4, 6, and 12 residues for 25, 1, 16, 2,and 1 sequences, respectively. The Pfam OATP_N motif ap-pears to be the N-terminal extension of the MDS00148 motifthat was originally detected on September 28, 2000. No over-laps have been detected for OATP_C. Because the filtering ofmotif candidates against known conserved regions was con-ducted before the release of Pfam 6.6, the partially overlap-ping MDS00148 motif candidates were originally not re-moved. To counter the problems arising from updates ofother motif databases, the motif discovery pipeline (Fig. 1) hasbeen retrofitted with an “update screening” step.

Leucine Zipper-Like MotifMotif MDS00113 includes 20 members (Fig. 5A) with con-served sequences of 43 amino acids length that carry either aleucine zipper signature characteristic for Fos related antigen1 (FRA1) or a leucine zipper-like motif (16 members). We ana-lyzed 13 representative members in detail (Fig. 5B). Ten of 13members contain a leucine heptad repeats. The program “pat-tern search” of ANTHEPROT V5.0rel.1.0.5 (Geourjon et al.1995) calculated the theoretical frequency of the heptad re-peat L-x6-L-x6-L-x6-L as 1.07E-5 using PROSITERel.16.0 (Hof-mann et al. 1999). Because the leucine repeat could have oc-curred by chance, it would be risky to infer the leucine zipperfunction of DNA binding. Therefore, we reanalyzed the se-quences for features of known and characterized leucine zip-pers occurring in transcription factors. Sequences were evalu-ated for (1) alpha helical coiled-coil region (O’Shea et al. 1989)with mostly 3,4-hydrophobic repeat of apolar amino acids at

positions a and d of the helix, (2) an overlap of the coiled-coilregion with the leucine heptad repeat, (3) coiled-coil triggersequences [IVL]-X-[DE]-I-X-[RK]-X and [IVL]-[DE]-X-I-X-[RK]-X (Frank et al. 2000) or the 13-residue trigger motif xxLEx-chxcxccx (Kammerer et al. 1998, Wu et al. 2000), (4) a basicDNA binding region preceding the heptad repeat, and (5)nuclear localization signals (Hicks 1995). Applying these cri-teria, only the FRA1 sequences qualified as basic leucine zip-per with DNA binding function (Landschulz et al. 1988 and1989). Eight sequences contain a nuclear localization signal.Three proteins (FLJ21739, FLJ11937, and DKFZP434B1517)were predicted by PSORTas cytoplasmic proteins, whereas allhave a “COILS” (Lupas et al. 1991) predicted coiled-coil re-gion. Eleven sequences bear a coiled-coil trigger sequence ortrigger motif. Only the three FRA1 sequences have a coiled-coil region and a trigger motif overlapping with the heptadrepeat, whereas EST573322, 4921517J23Rik, DJ1014D13.3,CG15609, FLJ00109, and FLJ11937 contain a potential tan-dem coiled-coil region that is separated by the heptad repeat(Fig. 5C). Drosophila protein CG11259 has a tandem coiled-coil region with a leucine heptad repeat located in the firstcoiled-coil. The basic region of these seven sequences is eithernonexistent or does not immediately precede the heptad re-peat (Fig. 5B).

The conservation of structural features among evolution-ary unrelated sequences indicates domain shuffling. Giventhe sequence conservation with FRA1, it is conceivable thatan ancestral functional basic leucine zipper region was sub-jected to recombination and mutation events that degener-ated the basic leucine zipper. Domain mapping using the In-terPro Scan program provided further evidence for domainshuffling. We detected a calponin homology, LIM, delayedearly response, pyridine nucleotide-disulfide oxidoreductase,ribosomal L7/L12, and Pro-rich domains (Fig. 5C). The dis-persed location of LIM and calponin homology domainsamong FLJ21739, FLJ11937, DKFZP434B1517, CG15609, andCG11259 does not indicate any ancestral relationship.

Combining the results, we propose that the tandemcoiled-coil containing proteins bind to proteins, rather thanDNA, in a similar fashion as the group D basic helix-loop-

Figure 5 (continued)

Exploration of Novel Motifs from Mouse cDNA

Genome Research 375www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

helix (HLH) proteins (Atchley and Fitch 1997). Group D HLHproteins, which lack the basic region in the first helix, canform protein–protein dimers that act, for example, as a nega-tive dominant regulator of transcription factor MyoD on DNAbinding. The effect of the other domains on the proposedprotein binding function of these multidomain proteins re-mains unclear.

False Positive MotifsFrom 42 false positives we describe seven cases that were notobvious from visual inspection and required detailed analysis.Motif MDS00118 contains 4931406F04Rik, testis expressedgene 11 (Tex21-pending, 4931412D23RIK, 4931421K24RIK),and amyotrophic lateral sclerosis 2 (juvenile) chromosomeregion candidate 3 (ALS2CR3). 4931406F04Rik and Tex21-pending show 95% identity over the entire sequence length.There is no sequence similarity between Tex21-pending andALS2CR3 except for 51 amino acid region of MDS00118.Closer inspection of the region revealed that only 10(21) of 26residues within MDS00118 (10–35) were identical (similar).The identity (similarity) was limited to short stretches of[IVL]-D-L or [KR]-E-E-[LI] that are believed to be a low com-plexity region that was not masked by the SEG filter.

Motif candidate MDS00131 is identical to a 3 � 37 resi-dues tandem repeat reported in mouse nucleolar RNA helicaseII DDX21 (Valdez and Wang 2000). The motif candidate is afalse positive because each repeat unit contains the knownsignature sequence of a mononuclear localization signal. Thesecond false positive, MDS00102, represents the immuno-globulin variable-like N-domains reported in sequences of thecarcinoembryonic antigens (Keck et al. 1995).

Secondary structure analysis and fold predictions of se-quences containing motifs MDS00121, MDS00129,MDS00137, and MDS00139 (Suppl. Fig. 4) did not produceany conclusive results. We therefore went back to DNA leveland checked the ORF prediction and for repeat elements. ARepeatMasker (Smit and Green 1997) search revealed thatthe DECODER (Fukunishi and Hayashizaki 2001) predictedORFs contain B1 (MDS00122), B2 (MDS00121), intracisternalA-particle (IAP) LTR (MDS00137), and mammalian apparentLTR-retrotransposon (MDS00139) repeat elements, respec-tively. Therefore, the sequences constitute transcribed repeatsor repeat containing transcripts. We report the finding as awarning regarding data processing and demonstration of thelimits of semiautomated annotation approaches.

RIKEN clone 2610016M14 is of biological interest be-cause it carries a potential in-frame inserted IAP LTR in theDNA mismatch repair geneMsh3 (Watanabe et al. 1996). IAPsare endogenous proviral elements that originated from defec-tive retroviruses. IAPs are highly expressed in transformed celllines and during normal mouse embryogenesis. There areabout 1000 IAP-related elements per haploid mouse genome,but only a few IAPs are transcribed (Mietz et al. 1992). TheIAP-LTR insertion is located seven base pairs downstream ofexon 5, relative to the mouse genomic Msh3 sequence, andmay have resulted in exon skipping (Wandersee et al. 2001).Loss of Msh3 function causes a partial mismatch repair defectand may enhance tumorigenesis but does not result in cancerpredisposition (de Wind et al. 1999).

CONCLUSIONMotif discovery can be performed with relative ease on ge-nome scale. However, the exploration of new motifs or sub-

motifs for potential biological functions is a time-consumingprocess depending on human expert knowledge. This casestudy enabled us to derive valuable process information andrules for semiautomation that will not only aid the nextround of functional annotation of mouse cDNA clones (FAN-TOM) but also the initial steps of protein–protein interaction,regulatory, and active site target selections in a drug discoveryprocess.

METHODS

Preparation of a Nonredundant Sequence SetThe FANTOM cDNA collection contains 21,076 clones. Wehave used DECODERto predict the ORF of each cDNA se-quence, which yielded 21,050 potential coding sequences. Asthe clone set showed some redundancies that could lead tofalse positive motifs, we clustered the putative translationsusing DDS (Huang et al. 1997) and ClustalW (Thompson etal. 1994). From each cluster, we selected the longest sequenceas representative of the cluster. As a result, we obtained 15,631nonredundant sequences.

Extraction of Homologous Sequences and ClusteringSequences were compared against each other with BLASTPofthe NCBI-Toolkit (Altschul et al. 1997) applying an E-value of0.1 and using the SEGfilter option (Wootton 1993) to removelow-complexity regions. The BLASTPresults of each compari-son were then analyzed with a clustering algorithm (Matsudaet al. 1999) to extract homologous groups of sequences. Theclustering algorithm is based on graph theory. Briefly, eachsequence is considered as a vertex. If the similarity betweenany pair of sequences exceeds a user-defined threshold (E-value 0.1, in our case), an arc is drawn between the sequencepair or the two vertices.

The resulting graph is also covered by subgraphs whosevertices are connected with at least a fraction P (a user-definedratio; here P = 40%) of the other vertices. The groups of sub-graphs may overlap with each other if some sequences sharetwo or more homologous regions with a different set of se-quences, for example, multidomain-containing sequences.The method is equivalent to complete-linkage clustering if Pis set to 100%. In contrast, single-linkage clustering requiresonly one arc to any member in a group and P becomes virtu-ally 0% when the number of members is large.

Detection of Homologous Regions in SubgraphGroupsHomologous regions in groups were detected by anothergraph-theoretic method. We extracted all fixed-length subse-quences (blocks) and performed ungapped pairwise align-ments among all block pairs. The length of the initial blockwas 20 amino acids. The alignment score was measured usingthe BLOSUM50 score matrix. A block graph was constructed byregarding blocks as vertices. The vertices were connected byweighted arcs if the blocks showed at least the user-definedsimilarity (BLOSUMscore 50). Highly connected componentsin the block graph were detected with a maximum-densitysubgraph (MDS) algorithm (Matsuda 2000). Here, density is agraph-theoretic term that is defined as the ratio of the sum ofthe similarity scores between blocks to the number of blocks.Homologous regions longer than 20 amino acids were ob-tained by combining overlapping blocks that were detectedby this method.

Filtering Out Known Conserved RegionsThe detected blocks were screened for known conserved re-gions detected by HMMER2.1.1 (Washington University,http://hmmer.wustl.edu/) in Pfam (Release 5.5), BLASTP in

Kawaji et al.

376 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

ProDom (Release 2000.1), and InterPro Scan in InterPro(Release 2.0) databases. Blocks that overlapped with at leastone residue of known motifs or domains were discarded. Theremaining blocks were labeled with the original discoverydate and subjected to further overlap screening with the up-dates of ProDom, InterPro and Pfam. If a previously novelmotif candidate overlapped with a newly reported motif ofthe updated motif databases, the motif candidate was rein-spected manually to evaluate the significance of the overlapregarding motif extension or submotif.

HMM Search against RIKEN andSWISS-PROT/TrEMBL SequencesOf 21,076 sequences, 1908 are EST assemblies that were notsubmitted to DDBJ (see also http://fantom.gsc.riken.go.jp/doc/faq.html), and 8703 of 19,168 submitted sequences donot have a coding sequence (CDS) assignment in GenBankRelease 125. The nonredundant SWISS-PROT/TrEMBL data-base (SPTR) (Bairoch et al. 2000) composed of SWISS-PROT40.0, TrEMBL 18.0, and TrEMBL_new of October 23, 2001contains 10,465 translated sequences of the FANTOM1 set. Atotal of 10,603 translated FANTOM1 sequences are not inSPTR. To expand the number of motif members, we con-structed (HMMs) from the conserved blocks and searchedwith HMMERthe SPTR database, 10,603 DECODER-predictedFANTOM1 translations, and 1908 DECODER-predicted trans-lations of the EST assemblies.

MDS Motif DatabaseThe results of the HMMsearches and accessory informationsuch as HMM scores, E-values, protein names, motif align-ments, and chromosomal mapping information were storedin flat files. A graphical user interface and Perl/CGI scriptsfacilitate the search and display of motifs and false positives.The MDS motif database is accessible at URL http://motif.ics.es.osaka-u.ac.jp/MDS/.

Extended Sequence AnalysesSecondary structure analyses of sequences were performedwith the locally installed ANTHEPROT V5.0rel.1.0.5 software,DSCpackage (King et al. 1997), and on the external Predict-Protein server (Rost 1996). Functional sites, for example,phosphorylation and N-glycosylation sites were predictedwith ANTHEPROTfrom the PROSITE database (Hofmann et al.1999). The locally installed PSORT2program was used to pre-dict the cellular localization of proteins. Chromosomal local-ization information was retrieved from the FANTOM map ofRIKEN clones on human genome and mouse chromosomesand LocusLink (Pruitt et. al. 2001). Alignments were per-formed with the locally installed ClustalW 1.8 program,postprocessed with the coloring software MView 1.41 (Brownet al. 1998) and edited by hand to improve the alignmentquality. For the analysis of SLC21 family sequences, we con-structed a phylogenetic tree by the maximum-likelihoodmethod using MOLPHY ProtML(Adachi and Hasegawa 1996).The tree was obtained by the “Quick Add Search” using theJones-Taylor-Thornton model (Jones et al. 1992) of aminoacid substitution and retaining the 300 top ranking trees (op-tions -jf -q -n 300). Bootstrap values of the tree were calculatedby analyzing 1000 replicates using the resampling of the es-timated log-likelihood (RELL) method (Kishino et al. 1990).

ACKNOWLEDGMENTSWe thank Wolfgang Fleischmann (European BioinformaticsInstitute), Richard Baldarelli (MGI, The Jackson Laboratory)for valuable comments, and Tadashi Ogawa (Computer/Informatics Facilities, RIKEN GSC) for technical assistance.This work was supported in part by ACT-JST (Research andDevelopment for Applying Advanced Computational Science

and Technology) of Japan Science and Technology Corpora-tion (JST) to H.M. and Y.H.

The publication costs of this article were defrayed in partby payment of page charges. This article must therefore behereby marked “advertisement” in accordance with 18 USCsection 1734 solely to indicate this fact.

REFERENCESAasland, R., Gibson, T.J., and Stewart, A.F. 1995. The PHD finger:

Implications for chromatin-mediated transcriptional regulation.Trends Biochem. Sci.. 20: 56–59.

Abe T., Kakyo, M., Sakagami, H., Tokui, T., Nishio, T., Tanemoto,M., Nomura, H., Hebert, S.C., Matsuno, S., Kondo, H., et al.1998. Molecular characterization and tissue distribution of a neworganic anion transporter subtype (oatp3) that transports thyroidhormones and taurocholate and comparison with oatp2. J. Biol.Chem. 273: 22395–22401.

Adachi, J. and Hasegawa, M. 1996. MOLPHY version 2.3: Programs formolecular phylogenetics based on maximum likelihood, ComputerScience Monographs, no. 28. The Institute of StatisticalMathematics, Tokyo. (ftp://ftp.ism.ac.jp/pub/ISMLIB/MOLPHY).

Aguado, B. and Campbell, R.D. 1998. Characterization of a humanlysophosphatidic acid acyltransferase that is encoded by a genelocated in the class III region of the human majorhistocompatibility complex. J. Biol. Chem. 273: 4096–4105.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z.,Miller, W., and Lipman, D.J. 1997. Gapped BLAST andPSI-BLAST: A new generation of protein database searchprograms. Nucleic Acids Res. 25: 3389–3402.

Apweiler, R. Attwood, T.K., Bairoch, A., Bateman, A., Birney, E.,Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., etal. 2000. InterPro—an integrated documentation resource forprotein families, domains, and functional sites. Bioinformatics16: 1145–1150.

Atchley, W.R. and Fitch, W.M. 1997. A natural classification of thebasic helix-loop-helix class of transcription factors. Proc. Natl.Acad. Sci. 94: 5172–5176.

Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT proteinsequence database and its supplement TrEMBL in 2000. NucleicAcids Res.28: 45–48.

Banks, R.E., Dunn, M.J., Hochstrasser, D.F., Sanchez, J.C., Blackstock,W., Pappin, D.J., and Selby, P.J. 2000. Proteomics: Newperspectives, new biomedical opportunities. Lancet356: 1749–1756.

Bateman A., Birney, E., Durbin, R., Eddy, S.R., Howe, K.L., andSonnhammer, E.L. 2000. The Pfam protein families database.Nucleic Acids Res. 28: 263–266.

Bork, P. and Koonin, E. 1996. Protein sequence motifs. Curr. Opin.Struct. Biol. 6: 366–376.

Bork P. and Ouzounis, C. 1995. Ready for a motif submission? Aproposed checklist.Trends Biochem. Sci. 20: 104.

Brown, N.P., Leroy, C., and Sander, C. 1998. MView: Aweb-compatible database search or multiple alignment viewer.Bioinformatics 14: 380–381.

Chen, N., Ferrer, J.V., Javitch, J.A., and Justice, J.B. 2000.Transport-dependent accessibility of a cytoplasmic loop cysteinein the human dopamine transporter. J. Biol. Chem. 275: 1608–1614.

Chen, J.G., Liu-Chen, S., and Rudnick, G. 1997. External cysteineresidues in the serotonin transporter. Biochemistry 36: 1479–1486.

Corpet, F., Servant, F., Gouzy, J., and Kahn, D. 2000. ProDom andProDom-CG: Tools for protein domain analysis and wholegenome comparisons. Nucleic Acids Res. 28: 267–269.

de Wind N., Dekker, M., Claij, N., Jansen, L., van Klink, Y., Radman,M., Riggins, G., van der Valk, M., van’t Wout, K., and te Riele, H.1999. HNPCC-like cancer predisposition in mice throughsimultaneous loss of Msh3 and Msh6 mismatch-repair proteinfunctions. Nat. Genet. 23: 359–362.

Frank S., Lustig, A., Schulthess, T., Engel, J., and Kammerer, R.A.2000. A distinct seven-residue trigger sequence is indispensablefor proper coiled-coil formation of the human macrophagescavenger receptor oligomerization domain. J. Biol. Chem. 275:11672–11671.

Fukunishi, Y. and Hayashizaki, Y. 2001. Amino-acid translationprogram for full-length cDNA sequences with frame-shift errors.Physiol. Genomics 5: 81–87.

Geourjon, C. and Deleage, G. 1995. ANTHEPROT 2.0: Athree-dimensional module fully coupled with protein sequenceanalysis methods. J. Mol. Graph. 13: 209–212, 199–200.

Exploration of Novel Motifs from Mouse cDNA

Genome Research 377www.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from

Gunduz, M., Ouchida, M., Fukushima, K., Hanafusa, H., Etani, T.,Nishioka, S., Nishizaki, K., and Shimizu, K. 2000. Genomicstructure of the human ING1 gene and tumor-specific mutationsdetected in head and neck squamous cell carcinomas. Cancer Res.60: 3143–3146.

Helbing, C., Veilette, C., Riabowol, K., Johnston, R.N., andGarkavetsev, I. 1997. A novel candidate tumor suppressor, ING1,is involved in the regulation of apoptosis. Cancer Res. 57:1255–1258.

Henikoff, S., Greene, E.A., Pietrokovski, S., Bork, P., Attwood, T.K.,and Hood, L. 1997. Gene families: The taxonomy of proteinparalogs and chimeras. Science 278: 609–614.

Hicks, G.R. and Raikhel, N.V. 1995. Protein import into the nucleus:An integrated view. Annu. Rev. Cell. Dev. Biol. 11: 155–188.

Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. 1999. ThePROSITE database, its status in 1999. Nucleic Acids Res. 27:215–219.

Hohenester, E., Maurer, P., and Timpl, R. 1997. Crystal structure of apair of follistatin-like and EF-hand calcium-binding domains inBM-40. EMBO J. 16: 3778–3786.

Huang, X., Adams, M.D., Zhou, H., and Kerlavage, A.R. 1997. A toolfor analyzing and annotating genomic sequences. Genomics 46:37–45.

Jones, D.T., Taylor, W.R., and Thornton, J.M. 1992. The rapidgeneration of mutation data matrices from protein sequences,Comput. Appl. Biosci. 8: 275–282.

Kammerer, R.A., Schulthess, T., Landwehr, R., Lustig, A., Engel, J.,Aebi, U., and Steinmetz, M.O. 1998. An autonomous folding unitmediates the assembly of two-stranded coiled coils. Proc. Natl.Acad. Sci. 95: 13419–13424.

Keck, U., Nedellec, P., Beauchemin, N., Thompson, J., andZimmermann, W. 1995. The cea10 gene encodes a secretedmember of the murine carcinoembryonic antigen family and isexpressed in the placenta, gastrointestinal tract and bonemarrow. Eur. J. Biochem.. 229: 455–464.

King, R.D., Saqi, M., Sayle, R., and Sternberg, M.J. 1997. DSC: Publicdomain protein secondary structure predication. Comput. Appl.Biosci. 13: 473–474.

Kishino, H., Miyata, T., and Hasegawa, M. 1990. Maximumlikelihood inference of protein phylogeny, and the origin ofchloroplasts, J. Mol. Evol. 17: 368–376.

Landschulz, W.H., Johnson, P.F., and McKnight, S.L. 1988. Theleucine zipper: A hypothetical structure common to a new classof DNA binding proteins. Science 240: 1759–1764.

———1989. The DNA binding domain of the rat liver nuclearprotein C/EBP is bipartite. Science 243: 1681–1688.

Lewin, T.M., Wang, P., and Coleman, R.A. 1999. Analysis of aminoacid motifs diagnostic for the sn-glycerol-3-phosphateacyltransferase reaction. Biochemistry 38: 5764–5771.

Loewith, R., Meijer, M., Lees-Miller, S.P., Riabowol, K., and Young,D. 2000. Three yeast proteins related to the human candidatetumor suppressor p33ING1 are associated with histoneacetyltransferase activities. Mol. Cell. Biol. 20: 3807–3816.

Lupas, A., Van Dyke, M., and Stock, J. 1991. Predicting coiled coilsfrom protein sequences. Science 252: 1162–1164.

Matsuda, H., Ishihara, T., and Hashimoto, A.. 1999. Classifyingmolecular sequences using a linkage graph with their pairwisesimilarities. Theoretical Computer Science 210: 305–325.

Matsuda, H. 2000. Detection of conserved domains in proteinsequences using a maximum-density subgraph algorithm. IEICETrans. Fundamentals Electron. Commun. Comput. Sci. E83-A:713–721.

Mietz, J.A., Fewell, J.W., and Kuff, E.L. 1992. Selective activation of adiscrete family of endogenous proviral elements in normalBALB/c lymphocytes. Mol. Cell Biol. 12: 220–228.

Mortlock, D.P., Sateesh, P., and Innis, J.W. 2000. Evolution ofN-terminal sequences of the vertebrate HOXA13 protein. Mamm.Genome 11: 151–158.

Nakai, K. and Horton, P. 1999. PSORT: A program for detectingsorting signals in proteins and predicting their subcellularlocalization. 1999. Trends Biochem. Sci. 24: 34–36.

Noe, B., Hagenbuch, B., Stieger, B., and Meier, P.J. 1997. Isolation ofa multispecific organic anion and cardiac glycoside transporterfrom rat brain. Proc. Natl. Acad. Sci.. 94: 10346–10350.

Ollmann M., Young, L.M., Di Como, C.J., Karim, F., Belvin, M.,Robertson, S., Whittaker, K., Demsky, M., Fisher, W.W.,Buchman, A., et al. 2000. Drosophila p53 is a structural andfunctional homolog of the tumor suppressor p53. Cell 101:91–101.

O’Shea, E.K., Rutkowski, R., and Kim, P.S. 1989. Evidence that theleucine zipper is a coiled coil. Science 243: 538–542.

Pascual, J., Martinez-Yamout, M., Dyson, H.J., and Wright, P.E. 2000.Structure of the PHD zinc finger from human Williams-Beurensyndrome transcription factor. J. Mol. Biol. 304: 723–729.

Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBIgene-centered resources. Nucleic Acids Res. 29: 137–140.

RIKEN Genome Exploration Research Group Phase II Team andFANTOM Consortium. 2001. Functional annotation of afull-length mouse cDNA collection. Nature 409: 685–690.

Rost, B. 1996. PHD: Predicting one-dimensional protein structure byprofile based neural networks. Methods Enzymol.. 266: 525–539.

Saito, A., Furukawa, T., Fukushige, S., Koyama, S., Hoshi, M.,Hayashi, Y., and Horii, A.J. 2000. p24/ING1-ALT1 andp47/ING1-ALT2, distinct alternative transcripts of p33/ING1.Hum. Genet. 45: 177–181.

Skowyra, D., Zeremski, M., Neznanov, N., Li, M., Choi, Y., Uesugi,M., Hauser, C.A., Gu, W., Gudkov, A.V., and Qin, J. 2001.Differential association of products of alternative transcripts ofthe candidate tumor suppressor ING1 with the mSin3/HDAC1transcriptional corepressor complex. J. Biol. Chem. 276:8734–8739.

Smit, A.F.A and Green, P. 1997. RepeatMasker athttp://ftp.genome.washington.edu/RM/RepeatMasker.html

Tamai, I., Nezu, J., Uchino, H., Sai, Y., Oku, A., Shimane, M., andTsuji, A. 2000. Molecular identification and characterization ofnovel members of the human organic anion transporter (OATP)family. Biochem. Biophys. Res. Commun. 273: 251–260.

Tatusov, R.L., Mushegian, A.R., Bork, P., Brown, N.P., Hayes, W.S.,Borodovski, M., Rudd, K.E., and Koonin, E.V. 1996. Metabolismand evolution of Haemophilus influenzae deduced from a wholegenome comparison with Escherichia coli. Current Biology 6:279–291.

Thompson, J.D., Higgins, D.G., and Gibson T.J. 1994. CLUSTAL W:Improving the sensitivity of progressive multiple sequencealignment through sequence weighting, position-specific gappenalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680.

Tsuji, A. and Tamai, I. 1999. Organic anion transporters. Pharm.Biotechnol.. 12: 471–491.

Valdez, B.C. and Wang, W. 2000. Mouse RNA helicase II/Gu: cDNAand genomic sequences, chromosomal localization, andregulation of expression. Genomics 66: 184–194.

Wandersee, N.J., Roesch, A.N., Hamblen, N.R, de Moes, J., van DerValk, M.A., Bronson, R.T., Gimm, J.A., Mohandas, N., Demant,P., and Barker, J.E. 2001. Defective spectrin integrity andneonatal thrombosis in the first mouse model for severehereditary elliptocytosis. Blood 97: 543–550.

Watanabe, A., Ikejima, M., Suzuki, N., and Shimada, T. 1996.Genomic organization and expression of the human MSH3 gene.Genomics 31: 311–318.

West, J., Tompkins, C.K., Balantac, N., Nudelmann, E., Mengs, B.,White, T., Bursten, S., Coleman, J., Kumar, A., Singer, J.W., et al.1997. Cloning and expression of two human lysophosphatidicacid acyltransferase cDNAs that enhance cytokine-inducedsignaling responses in cells. DNA Cell Biol. 16: 691–701.

Wootton, J.C. and Federhen, S. 1993. Statistics of local complexityin amino acid sequences and sequence databases. Comput. Chem.17: 149–163.

Wu, K.C., Bryan, J.T., Morasso, M.I., Jang, S.I., Lee, J.H., Yang, J.M.,Marekov, L.N., Parry, D.A., and Steinert, P.M. 2000. Coiled-coiltrigger motifs in the 1B and 2B rod domain segments arerequired for the stability of keratin intermediate filaments. Mol.Biol. Cell. 11: 3539–3558.

Yamashita, A., Kawagishi, N., Miyashita, T., Nagatsuka, T., Sugiura,T., Kume, K., Shimizu, T., and Waku, K. 2001. ATP-independentfatty acyl-coenzyme A synthesis from phospholipid: CoenzymeA-dependent transacylation activity toward lysophosphatidicacid catalyzed by acyl-coenzyme A:lysophosphatidic acidacyltransferase. J. Biol. Chem. 276: 26745–26752.

Zeremski, M., Hill, J.E., Kwek, S.S., Grigorian, I.A., Gurova, K.V.,Garkavtsev, I.V., Diatchenko, L., Koonin, E.V., and Gudkov, A.V.1999. Structure and regulation of the mouse ING1 gene. Threealternative transcripts encode two PHD finger proteins that haveopposite effects on p53 function. J. Biol. Chem.. 274:32172–32181.

Received April 23, 2001; accepted in revised form December 14, 2001.

Kawaji et al.

378 Genome Researchwww.genome.org

Cold Spring Harbor Laboratory Press on April 16, 2008 - Published by www.genome.orgDownloaded from