Mouse genetics is, by its very nature, a collaborative field of scientific investigation, because the interpretation of data collected by any one scientist is highly dependent on data collected by others. High-resolution genetic maps are often formed through the integration of results obtained in many individual studies, and as each new result is published, it can be swept up into a system of databases (see Appendix B). Large-scale integration has been possible only because all mouse geneticists speak the same language. The definition of this language is provided by the International Committee on Standardized Nomenclature for Mice, which has been in existence since 1939. This committee is charged with the task of establishing and updating rules and guidelines for genetic nomenclature. The continued functioning of this committee is critical because, as the analysis of the genome becomes ever more sophisticated, new genetic entities become apparent, and these must be named in a standard fashion.
For a complete description of the "Rules and Guidelines for Gene Nomenclature," one should consult the Lyon and Searle book (Committee on Standardized Genetic Nomenclature for Mice, 1989), and updates published regularly in Mouse Genome. Here, I will briefly review the salient features of this nomenclature system with a focus on the naming of newly defined genes and loci. Once an investigator has chosen a new name and symbol for a locus, the chair of the Committee should be contacted for confirmation that the rules have been followed properly and that the name and symbol do not conflict with others already in use.
There is no rhyme or reason to the names given to the original inbred strains derived at the beginning of the century. The name of the famous BALB/c line was derived by joining the name of the investigator (Bagg) with the color of the mouse (albino).
Bagg's ALBino became BALB. Other famous strains have names based on animal numbers; for example, female no. 57 (from Abbie Lathrop's farm) gave rise to both the C57BL/6 and C57BL/10 strains, which are commonly abbreviated as B6 and B10, respectively. New inbred strains can be named freely by their originators as long as certain rules are followed: the name should be brief, and it should begin with a capital letter followed by other capital letters (preferably) or numbers. Strains with a common origin that have been separated prior to the F20 generation must be given separate symbols, although these symbols can indicate their relationship to each other. All names should be registered with the appropriate contact person, who is indicated prominently in the current issue of Mouse Genome.
Substrains can arise whenever two or more colonies of an established inbred strain are maintained in isolation from each other for a sufficient period of time to allow detectable genetic differences to become fixed. There are three specific instances where substrain formation can be considered to have occurred: (1) when branches of an inbred strain are separated before the F40 generation when residual heterozygosity is still likely, (2) when a branch has been maintained separately from other branches for 100 or more generations, and (3) when genetic differences from other branches are uncovered. Such differences can be caused by any one or more of three factors: residual heterozygosity at the time of branching, mutation, or contamination.
Substrains are indicated by appending a slash (/) to the strain symbol followed by an appropriate substrain symbol, for example, DBA/1 and DBA/2. A laboratory registration code is often included within the substrain designation, for example, C57BL/6J and C57BL/10J are two substrains of C57BL that are both maintained at the Jackson Laboratory (indicated with a J). On the other hand, a different nomenclature has been formulated recently to distinguish the same strain maintained without any apparent genetic differences by two or more laboratories. In this case, the "@" character is appended to the strain symbol followed by the laboratory registration code. For example, the SJL strain maintained by the Jackson Laboratory would be symbolized as SJL@J.
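The slash and "@" conventions just described can be illustrated with a small parser. This is only a sketch under stated assumptions: the names `parse_strain` and `StrainDesignation` are hypothetical, a designation is assumed to contain at most one of the two separators, and no attempt is made to split a laboratory code out of a substrain symbol such as 6J.

```python
from typing import NamedTuple, Optional


class StrainDesignation(NamedTuple):
    """Hypothetical record for the parts of a strain designation."""
    strain: str
    substrain: Optional[str]  # text after "/", e.g. "6J" in C57BL/6J
    holder: Optional[str]     # laboratory code after "@", e.g. "J" in SJL@J


def parse_strain(designation: str) -> StrainDesignation:
    # "@" marks the same strain held by a particular laboratory.
    if "@" in designation:
        strain, holder = designation.split("@", 1)
        return StrainDesignation(strain, None, holder)
    # "/" marks a substrain of the strain.
    if "/" in designation:
        strain, substrain = designation.split("/", 1)
        return StrainDesignation(strain, substrain, None)
    # No separator: a plain strain symbol.
    return StrainDesignation(designation, None, None)
```

For instance, `parse_strain("C57BL/6J")` separates the strain symbol C57BL from the substrain symbol 6J, whereas `parse_strain("SJL@J")` records J as the holding laboratory rather than as a substrain.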
In the first set of rules for distinguishing gene symbols laid down by the Committee on Mouse Genetics Nomenclature in 1940, it was stated that "the initial letter of the mutant gene symbol shall be the same as the initial letter of the mutant gene, e.g. d for dilution. Additional letters shall be added to the initial letter if necessary to distinguish it from symbols already in use" (Snell, 1941, p. 242). With over three thousand independent loci identified as of 1993, the necessity of using symbols that contain more than one letter is now obvious. In fact, the recent explosion in gene and locus identifications in the mouse has brought about a re-evaluation of the entire basis for the naming of chromosomal entities. At the time of this writing, a final consensus has not yet been reached. Thus, investigators are cautioned to contact members of the International Committee on Standardized Nomenclature for Mice before settling on a name for a new genetic entity.
Each mouse locus is given a unique name and a unique symbol. In devising new names, investigators should consider their suitability for inclusion into databases. Thus, names should be limited in length to fewer than 40 characters (including spaces), and should not include Greek letters or Roman numerals. The symbol is a highly abbreviated version of the name. In published articles, locus symbols (but not names) are always set in italic font. Symbols always begin with a letter followed by any combination of letters or Arabic numbers without internal white space. In the past, symbols were typically three to eight characters in length. Today, database considerations set a preferred maximum number of characters at 10, although this rule is frequently broken.
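The database-oriented constraints on names and symbols lend themselves to a simple automated check. The sketch below is illustrative only: `check_locus` is a hypothetical helper, the pattern tolerates internal hyphens because a few symbol types retain them, and Roman numerals are not detected because they cannot be distinguished reliably from ordinary letters.

```python
import re
import unicodedata
from typing import List

# A letter first, then letters or Arabic digits; internal hyphens are
# tolerated here (an assumption) for the special cases that keep them.
SYMBOL_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)*")


def check_locus(name: str, symbol: str) -> List[str]:
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    if len(name) >= 40:
        problems.append("name should be fewer than 40 characters")
    if any("GREEK" in unicodedata.name(ch, "") for ch in name):
        problems.append("name should not include Greek letters")
    if not SYMBOL_RE.fullmatch(symbol):
        problems.append("symbol must start with a letter and contain "
                        "only letters and Arabic digits")
    elif len(symbol) > 10:
        problems.append("symbol exceeds the preferred maximum of 10 characters")
    return problems
```

A call such as `check_locus("Esterase 3", "Es3")` returns an empty list, while a name containing a Greek letter or a symbol beginning with a digit is reported.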
Loci that are members of a related series of some kind are given the same primary stem and symbol followed by a distinguishing number or letter. Thus, the third esterase gene to be defined is named "Esterase 3", with the symbol Es3, and the second homeo box gene cluster to be identified is named "homeo box B cluster" with the symbol Hoxb. In the past, a hyphen was often used to separate the numeral designation from the body of the gene symbol, e.g. Es-3. This practice has now been discontinued and hyphens have been deleted from all symbols except in the special cases discussed just below.
When one member of a series has been further duplicated into a closely linked cluster of related genes, a number can be appended to the cluster name; individual genes in the Hoxb cluster will be named homeo box B1, homeo box B2, etc., with symbols Hoxb1, Hoxb2, etc. For clusters that were initially named with an appended number, like Lamb1, the individual symbols are formed by appending a hyphen and a number to the cluster symbol to obtain Lamb1-1, Lamb1-2, etc.
All loci can be broadly separated into two classes. The first class includes loci known to be functional or homologous to functional loci. With few exceptions, these loci are genes or pseudogenes. The second class includes sequences identified solely on the basis of DNA variation. Members of this latter class are referred to as anonymous loci because their function or lack thereof is unknown. The rules for naming each of these classes of loci follow below.
Gene names should convey in a concise form, and as accurately as possible, the character by which the gene is recognized. Genes can be named according to an expressed phenotype (retinal degeneration or shiverer), an enzyme or protein name or function (glyoxalase-1, hemoglobin alpha chain, or octamer binding transcription factor 1), a pattern of expression (t-complex testes-expressed-1), a combination of these (myosin light chain alkali-fast skeletal muscle), or by homology to genes characterized in other organisms (homeo box A, homeo box B, etc.; retinoblastoma). Except in the case of genes that are first characterized through a recessive mutation, names and symbols should begin with an upper case letter. With all symbols, all letters that follow the initial character should be lower case.
Whenever a mouse gene is characterized based on homology to a gene already named in another species, the mouse homolog should be given essentially the same name and symbol. Of course, one should always check the mouse gene databases (see Appendix B) to be certain that the symbol has not already been assigned. In the translation from human to mouse symbols, characters beyond the first should be converted from upper to lower case.
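The case conversion from a human symbol to its mouse counterpart is mechanical and can be sketched in one line. The function name `human_to_mouse_symbol` is hypothetical, and the sketch does not check the mouse databases for a symbol that is already assigned; that step remains the investigator's responsibility.

```python
def human_to_mouse_symbol(human_symbol: str) -> str:
    """Keep the first character as given; lower-case everything after it."""
    return human_symbol[:1] + human_symbol[1:].lower()
```

Applied to an upper-case human symbol such as CFTR, the function yields Cftr.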
All pseudogenes are defined by homology to known genes. Their symbol is a combination of the known gene name (as a stem), and the pseudogene designation (ps) followed by a serial number. Thus, the third alpha globin pseudogene has been given the name "hemoglobin alpha 3 pseudogene", with the symbol Hba-ps3.
When new loci are uncovered by hybridization with known genes, the functionality of the new locus is usually unknown. In these cases, where the locus could be either a functional gene or a pseudogene, it should be named with a "related sequence" symbol (rs). Thus, if a new locus is uncovered by cross-hybridization with a probe for the Plasminogen gene (symbolized Plg), it would be named "Plasminogen related sequence-1" and would be symbolized as Plg-rs1. If a new locus is uncovered with a probe for one member of a series of loci, the rs symbol is appended without the hyphen. Thus, a locus related to Ela1 would be symbolized as Ela1rs1.
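The pseudogene and related-sequence conventions can be sketched as two small functions. Both names are hypothetical, and one simplification should be noted: a gene is treated as "one member of a series" whenever its symbol ends in a digit, which is an assumption made for illustration and not part of the rules themselves.

```python
def pseudogene_symbol(gene_symbol: str, serial: int) -> str:
    """Stem, hyphen, 'ps', serial number: Hba -> Hba-ps3."""
    return f"{gene_symbol}-ps{serial}"


def related_sequence_symbol(gene_symbol: str, serial: int) -> str:
    """Append 'rs' plus a serial number to the symbol of the probe gene."""
    # Assumption: a trailing digit marks a member of a series (e.g. Ela1),
    # in which case the hyphen is dropped.
    separator = "" if gene_symbol[-1].isdigit() else "-"
    return f"{gene_symbol}{separator}rs{serial}"
```

With these definitions, a probe for Plg gives Plg-rs1, and a probe for Ela1 gives Ela1rs1.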
Anonymous DNA loci are named in a straightforward manner. The symbol should begin with the character "D" (for DNA), followed by an integer representing the chromosomal assignment, followed by a two to three letter registration code representing the laboratory or scientist that described the locus, followed by a unique serial number given to the locus to distinguish it from others on the same chromosome described by the same investigator. For example, the twenty-third anonymous locus mapped to chromosome 14 by the Pasteur Institute would be given the symbol D14Pas23. The name for this locus would include all of this information in longhand form: "DNA segment chromosome 14 Pasteur 23". This DNA locus nomenclature system should be used for all loci defined only as DNA segments including, but not limited to, microsatellites, minisatellites and RFLPs (see Chapter 8). To obtain a unique laboratory or investigator registration code, please contact the Institute for Laboratory Animal Resources, USA National Academy of Sciences, Washington, D.C.
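Because the anonymous locus symbol is a fixed concatenation of parts, it can be both assembled and taken apart programmatically. The following is a minimal sketch: the function names and the regular expression are illustrative, and the pattern assumes chromosome assignments of one or two digits, X, or Y.

```python
import re
from typing import Optional, Tuple

# "D", chromosome, two to three letter registration code, serial number.
ANON_LOCUS_RE = re.compile(r"D(\d{1,2}|X|Y)([A-Za-z]{2,3})(\d+)")


def anonymous_locus_symbol(chromosome: str, lab_code: str, serial: int) -> str:
    return f"D{chromosome}{lab_code}{serial}"


def anonymous_locus_name(chromosome: str, lab_name: str, serial: int) -> str:
    # Longhand form of the same information.
    return f"DNA segment chromosome {chromosome} {lab_name} {serial}"


def parse_anonymous_locus(symbol: str) -> Optional[Tuple[str, str, int]]:
    """Return (chromosome, lab code, serial), or None if the form differs."""
    match = ANON_LOCUS_RE.fullmatch(symbol)
    if match is None:
        return None
    chromosome, lab_code, serial = match.groups()
    return chromosome, lab_code, int(serial)
```

Here `anonymous_locus_symbol("14", "Pas", 23)` produces D14Pas23, and `parse_anonymous_locus("D14Pas23")` recovers the chromosome, registration code, and serial number.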
Mouse homologs of anonymous DNA loci first mapped in humans are named in a somewhat different format in order to allow the connection between the two species to be perfectly transparent.
In these cases, the symbol should still begin with the character "D" and the mouse chromosomal assignment, but this should now be followed by the character "h", the chromosomal assignment of the human homolog, and its identification number. For example, suppose a probe to the human locus D17S111, the 111th single-copy (S) anonymous locus mapped to human chromosome 17, is used to identify a mouse homolog on Chr 1. The corresponding mouse homolog will be named D1h17S111.
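The construction of a mouse homolog symbol from a human anonymous locus symbol can be sketched as follows. The function name is hypothetical, and the sketch assumes that the human symbol itself begins with "D", which is dropped before the remainder is appended after "h".

```python
def mouse_homolog_symbol(mouse_chromosome: str, human_symbol: str) -> str:
    """Combine the mouse chromosome with the human locus designation."""
    if not human_symbol.startswith("D"):
        raise ValueError("expected a human anonymous DNA locus symbol")
    # Drop the leading "D" of the human symbol and prefix it with "h".
    return f"D{mouse_chromosome}h{human_symbol[1:]}"
```

For the example in the text, `mouse_homolog_symbol("1", "D17S111")` gives D1h17S111.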
In the case of a gene defined initially by a mutant phenotype, the symbol for the first defined mutant allele becomes both the gene symbol and the symbol for that allele. The corresponding wild-type allele is indicated by a + sign. For example, an animal heterozygous at the tf locus with a wild-type and the defining mutant allele would have a genotype symbolized as +/tf. In this case, the context is sufficient to indicate the association of the + symbol with the tf locus. When the context is not sufficient to indicate association, the wild-type allele of a specific locus should have the locus symbol appended to it as a superscript. Thus, the wild-type allele at the tf locus can also be designated as + with tf set as a superscript.
In all other cases, alleles are designated by the locus symbol followed by an allele-defining symbol that is usually one or two characters in length and set in superscript, with the entire expression set in italics. This rule also applies to mutant alleles beyond the first one that are uncovered at a phenotypically defined locus. For computer presentation with only ASCII (text format) code, the allele designation can be set off from the locus symbol by prefixing it with a * or by enclosing it in angle brackets; for example, the d allele of Hbb (Hbb with a superscript d) becomes Hbb*d or Hbb<d>.
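The two ASCII conventions for setting off an allele designation can be sketched as a pair of helper functions. The names `ascii_allele` and `split_ascii_allele` are hypothetical; the sketch simply formats and splits strings and does not validate the locus symbol.

```python
from typing import Tuple


def ascii_allele(locus: str, allele: str, style: str = "angle") -> str:
    """Render a superscripted allele designation in plain ASCII."""
    if style == "angle":
        return f"{locus}<{allele}>"
    if style == "star":
        return f"{locus}*{allele}"
    raise ValueError("style must be 'angle' or 'star'")


def split_ascii_allele(text: str) -> Tuple[str, str]:
    """Recover (locus, allele) from either ASCII form."""
    if text.endswith(">") and "<" in text:
        locus, allele = text[:-1].split("<", 1)
        return locus, allele
    if "*" in text:
        locus, allele = text.split("*", 1)
        return locus, allele
    raise ValueError("no allele designation found")
```

Both `ascii_allele("Hbb", "d")` and `ascii_allele("Hbb", "d", style="star")` describe the same allele, and `split_ascii_allele` returns the locus and allele parts from either form.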
The simplest means for assigning allele names is through a series of lower case letters, beginning with a. Thus, the Hba-ps4 gene has alleles Hba-ps4<a>, Hba-ps4<b>, etc. (shown here in the angle-bracket form, with the superscripted allele letter in brackets). In many cases, it can be useful to provide information within the allele symbol. For example, a M. spretus-specific allele may be given the designation s, as in DXPas4<s>. This type of nomenclature can be extended to alleles associated with the common inbred strains such as B6 (signified by the b allele) and DBA (signified by the d allele) as well as the subspecies musculus (m), castaneus (c), and domesticus (d).
New "random" mutations at previously characterized genes are denoted by a superscript m followed by a serial number and the 1-3 letter code representing the laboratory or scientist that described the new allele. When specific mutations are generated by the gene targeting technologies (see Section 6.4), the same nomenclature applies except that the superscript m is preceded by a superscript t. Thus, the third knockout allele created at Princeton University by gene targeting at the Cftr locus would be designated as Cftr<tm3Pri> (shown here in the angle-bracket form, with tm3Pri as the superscript). To obtain a unique laboratory or investigator registration code, please contact the Institute for Laboratory Animal Resources, USA National Academy of Sciences, Washington, D.C.
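The allele designation for a new mutation is again a simple concatenation, which the following sketch makes explicit. The function names are hypothetical, and the full symbol is rendered in the angle-bracket ASCII form since plain text cannot carry a superscript.

```python
def mutation_allele(serial: int, lab_code: str, targeted: bool = False) -> str:
    """'m' (or 'tm' for a targeted mutation), serial number, lab code."""
    prefix = "tm" if targeted else "m"
    return f"{prefix}{serial}{lab_code}"


def mutant_symbol_ascii(locus: str, serial: int, lab_code: str,
                        targeted: bool = False) -> str:
    """Locus symbol with the allele designation in angle brackets."""
    return f"{locus}<{mutation_allele(serial, lab_code, targeted)}>"
```

The Princeton example corresponds to `mutant_symbol_ascii("Cftr", 3, "Pri", targeted=True)`.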
The experimental introduction of foreign DNA into the germ line of a mouse results in the creation of a new transgene locus at the site of integration. The official symbol for a transgene locus has five parts. First is the designation Tg for transgene. Second is a letter indicating the mode by which the transgene was inserted; N is used for nonhomologous insertion, R for insertion with a retroviral vector, and H for homologous recombination. With the standard production of transgenic mice by embryo injection, N would be used; with homologous recombination in embryonal stem cells followed by chimera formation to rescue the transgene into the germ line, H would be used; for transgenic animals produced by retroviral infection of embryos, R would be used.
The third part of the symbol contains a mnemonic, of six characters or fewer, that describes the salient features of the transgene insert written within parentheses. If the insert includes a defined gene, the gene symbol should be incorporated into the mnemonic without hyphens. Other standard abbreviations for use within the mnemonic include: An for anonymous sequence; Nc for noncoding sequence; Rp for reporter sequence; Et for enhancer trap; Pt for promoter trap; and Sn for synthetic sequence. The fourth part of the symbol is an investigator-assigned 1-5 digit number. The fifth and last part is the laboratory code. An example of this nomenclature is as follows. Castle has injected mouse embryos with a construct containing the Pgk2 coding sequence as a reporter. He names the transgene locus present in the fourth line that he recovers TgN(RpPgk2)4Cas.
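The five-part transgene symbol can be assembled and checked with a short function. This is a sketch only: `transgene_symbol` and `INSERTION_MODES` are hypothetical names, and the checks cover just the constraints stated above (mode letter, mnemonic length and hyphens, and a 1-5 digit number).

```python
# Mode letters and their meanings, as described in the text.
INSERTION_MODES = {
    "N": "nonhomologous insertion",
    "R": "insertion with a retroviral vector",
    "H": "homologous recombination",
}


def transgene_symbol(mode: str, mnemonic: str, number: int, lab_code: str) -> str:
    """Build Tg + mode + (mnemonic) + number + laboratory code."""
    if mode not in INSERTION_MODES:
        raise ValueError("mode must be N, R, or H")
    if not 1 <= len(mnemonic) <= 6:
        raise ValueError("mnemonic must have six characters or fewer")
    if "-" in mnemonic:
        raise ValueError("mnemonic must not contain hyphens")
    if not 1 <= number <= 99999:
        raise ValueError("number must have 1 to 5 digits")
    return f"Tg{mode}({mnemonic}){number}{lab_code}"
```

Castle's fourth line in the example corresponds to `transgene_symbol("N", "RpPgk2", 4, "Cas")`, which gives TgN(RpPgk2)4Cas.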
When the insertion of a transgene at a particular site results in a new mutation through the disruption of a gene present normally in the genome, this mutation should be named independently of the transgene locus itself. The rationale for this rule is that the contents of the transgene are independent of the locus uncovered through insertional mutagenesis. However, the mutant allele associated with the transgene should incorporate the transgene symbol as the superscripted allele designation. For example, if Castle's construct became inserted into the Hbb locus in the fifth line that he derived, the new Hbb allele that was created would be called Hbb<TgNRpPgk25Cas> (shown here in the angle-bracket form, with the transgene designation as the superscript). Notice that the parentheses have been removed from the allele symbol. If a new mutation has been induced at a previously unidentified locus, the mutant phenotype should be used to name the new locus.
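Deriving the insertional allele symbol from a transgene symbol amounts to removing the parentheses and attaching the result to the disrupted locus. The sketch below uses a hypothetical function name and renders the superscript in the angle-bracket ASCII form; removing only the parentheses from TgN(RpPgk2)5Cas leaves TgNRpPgk25Cas.

```python
def insertional_allele(locus: str, transgene: str) -> str:
    """Name the mutant allele created by a transgene insertion."""
    # The parentheses of the transgene symbol are dropped in the allele.
    designation = transgene.replace("(", "").replace(")", "")
    return f"{locus}<{designation}>"
```

For Castle's fifth line inserted at Hbb, the call is `insertional_allele("Hbb", "TgN(RpPgk2)5Cas")`.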
In this section, I have only touched upon those issues of nomenclature that will be of most concern to the majority of molecular biologists involved in studies of the mouse genome. In fact, the nomenclature rules developed for the mouse are rather extensive and are discussed in much greater detail in the Lyon and Searle compendium (Committee on Standardized Genetic Nomenclature for Mice, 1989), with additions and changes published regularly in Mouse Genome. As a final note, one must keep in mind that mouse genetic nomenclature will continue to evolve with the field as a whole. As new types of genetic elements and inter-relationships are uncovered, it will be the charge of the Nomenclature Committee to keep the rules internally consistent and up to date.
See also the Nomenclature Rules and Guidelines at MGI.