Progress towards deciphering the genetic cause of human diseases has recently been empowered by next generation sequencing technologies. In typical analyses of whole exome sequences, the number of non-reference coding variants is ~25,000. Prevalent pipelines, focusing on identifying rare and deleterious variants (irrespective of phenotype), sift these SNPs and indels by rarity, by predicted damage and evolutionary conservation in the encoded protein, and by segregation in more than one affected individual. The resulting short-list contains dozens or even hundreds of variants; the researcher/clinician then faces the hurdle of connecting one (or just a few) variant-carrying genes to the patient's phenotype. We submit (with examples) that this last step of effective zooming-in on the target can be achieved by judiciously identifying often subtle connections between: a) genes in the short list, and b) keywords that describe the disease/symptoms.
VarElect is a new Variant Election facility within GeneCards, for phenotype-dependent variant prioritization, leveraging the rich information within GeneCards and MalaCards. Users submit phenotype/disease keywords related to a sequenced individual, as well as a list of genes that reflect variations (typically 100-300). VarElect acts jointly on the gene list and phenotype keywords, and produces a list of prioritized, scored, and contextually annotated genes. VarElect infers direct as well as indirect links between genes and phenotypes. An example of an indirect GeneCards-based inference of GeneA to Phenotype PhenX is when PhenX is found to be associated with GeneB, which in turn shares a pathway with GeneA. Such gene-to-gene relationships are also formed (among others) by interaction networks, paralogy relations, domain-sharing, and mutual publications. MalaCards, in turn, allows one to produce a comprehensive phenotype search expression by using its data about diseases, underlying symptoms, and their relationships. The degree of mutual linking is quantified via endogenous search scores. VarElect provides a robust algorithm for ranking genes within a short list, and pointing out their likelihood to be related to a disease, enabling the researcher/clinician to perform the last decision step in deep sequencing runs in a fast and objective manner.
Algorithm
Phenotypes are searched in conjunction with variant-related genes (see figure below), yielding a gene list in which each gene is scored by its hits for any of the input terms. VarElect creates two prioritized lists. The first holds direct results: genes in the input list with hits for any of the phenotypes. The second holds indirect results: genes in the input list with no hits for any of the phenotypes, but which a GeneCards search connects to other genes that do have hits. After finding indirect connections (e.g. via a "hit" in, say, Interactions), VarElect creates a new score to produce the final prioritized list.
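The split into direct and indirect results can be illustrated with a minimal Python sketch. It is not VarElect's implementation: the function name, the two dictionaries standing in for GeneCards search results, and the toy gene symbols are all assumptions made for illustration.

```python
def split_direct_indirect(input_genes, phenotype_hits, related_genes):
    """Partition input genes into direct and indirect result lists.

    input_genes    -- gene symbols submitted by the user
    phenotype_hits -- hypothetical dict: gene symbol -> score of its hits
                      for the phenotype keywords
    related_genes  -- hypothetical dict: gene symbol -> set of genes that a
                      search for that gene returns
    """
    direct, indirect = {}, {}
    for gene in input_genes:
        if gene in phenotype_hits:
            # The gene itself has hits for the phenotype keywords.
            direct[gene] = phenotype_hits[gene]
            continue
        # No hits of its own: look for related genes that do have hits.
        linked = sorted(g for g in related_genes.get(gene, set())
                        if g in phenotype_hits)
        if linked:
            indirect[gene] = linked
    return direct, indirect


# Toy data (hypothetical symbols and scores).
hits = {"GENEB": 4.2, "GENEC": 1.5}
related = {"GENEA": {"GENEB", "GENED"}, "GENEE": {"GENED"}}
direct, indirect = split_direct_indirect(["GENEA", "GENEC", "GENEE"], hits, related)
print(direct)    # genes with their own phenotype hits
print(indirect)  # genes linked to the phenotype only through other genes
```

In this toy run GENEC is a direct result, GENEA is an indirect result through GENEB, and GENEE is dropped because its only related gene has no phenotype hits.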
A gene will be considered as related to one of the variant-related genes if it comes up as a result of a GeneCards textual search for that gene. In some cases, genes may have hundreds of indirect relations to the requested phenotypes. Showing such a long list of results can be counter-productive for the user. Therefore, the prioritization score will include only the top 5 (or fewer) indirect gene-to-phenotype relations.
The results will be ordered by score in descending order, and the final score for the variant-related gene will be an exponentially decaying sum of those scores. This means that the scores of the indirect relations may have a greater impact on the prioritization than the number of indirect relations.
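A short Python sketch makes the effect of the decaying sum concrete. The text does not give the decay formula, so the geometric weighting and the decay factor of 0.5 used here are assumptions; only the limit of five relations and the descending order come from the description above.

```python
def indirect_score(relation_scores, top_n=5, decay=0.5):
    """Combine indirect relation scores into one score for a gene.

    Scores are sorted in descending order, cut to the top_n best, and
    summed with geometrically shrinking weights (1, decay, decay**2, ...).
    The decay value is an illustrative assumption.
    """
    ranked = sorted(relation_scores, reverse=True)[:top_n]
    return sum(score * decay ** rank for rank, score in enumerate(ranked))


one_strong = indirect_score([8.0])        # a single high-scoring relation
many_weak = indirect_score([1.0] * 7)     # seven weak relations, only five counted
print(one_strong, many_weak)
```

Under these assumptions the single strong relation outscores the seven weak ones, which is the behaviour the paragraph describes: the strength of the relations matters more than their number.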
Scoring
Each resultant gene is scored by first boosting the relevance scores that the search engine (Solr) assigns to direct hits, and then adding scores for indirect associations, adjusted to avoid overpowering the direct scores. We are experimenting with gene/phenotype algorithmic tweaks, as well as with normalizations and attribute-specific refinements.
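As a rough illustration of this two-part score, the sketch below boosts a direct relevance score and adds a down-weighted indirect contribution. The linear form, the function name, and both constants are assumptions; the actual boost and adjustment used by VarElect are not stated here.

```python
def gene_score(direct_relevance, indirect_contribution,
               direct_boost=2.0, indirect_weight=0.25):
    """Boost the direct relevance score, then add a damped indirect part.

    direct_boost and indirect_weight are hypothetical values chosen so that
    indirect evidence cannot overpower direct evidence of similar size.
    """
    return direct_boost * direct_relevance + indirect_weight * indirect_contribution


print(gene_score(3.0, 10.0))  # direct relevance 3.0, indirect contribution 10.0
```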
How to use VarElect
Basic VarElect Search
- Enter the list of GeneCards gene symbols related to your variants in the text box marked "Gene Symbols". The list may be delimited by commas or any white space (spaces, tabs, new lines, etc.).
- Enter the list of phenotype/disease keywords related to a sequenced individual in the text box marked "Phenotype Keywords". By default, a result gene will appear if it is related to any of the listed keywords. Use quotes to match multi-word phrases. You may use a combination of AND and OR boolean expressions (capitalized) to further specify the search.
- Click "Search" to find direct and indirect relations between genes and phenotypes.
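The input conventions in the steps above can be mimicked when preparing a gene list or keyword expression ahead of time. The following Python sketch is only an illustration: both helper names are hypothetical, and joining keywords with OR is an assumption that mirrors the default "any keyword" behaviour described above.

```python
import re


def parse_gene_symbols(text):
    """Split a gene list on commas and any white space, dropping empty items."""
    return [s for s in re.split(r"[,\s]+", text.strip()) if s]


def build_phenotype_query(keywords):
    """Quote multi-word phrases and join all keywords with a capitalized OR."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return " OR ".join(terms)


genes = parse_gene_symbols("BRCA1, TP53\tCFTR\nTTN,,MYH7 ")
query = build_phenotype_query(["ataxia", "hearing loss"])
print(genes)
print(query)
```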
The results are the prioritized list of variant-related genes entered by the user, separated into directly related and indirectly related genes. Each gene is annotated with data retrieved from the GeneCards database and with the calculated score of its relationship to the entered phenotype keywords.
If there are any, the genes directly related to the phenotypes (via GeneCards textual search) will be displayed first, and the indirectly related genes will be in the second tab. Note that the scores of direct and indirect relations are not comparable, since they are calculated differently.
Clicking the '+' sign next to each result will give a list of the genes indicating a direct/indirect relation to the phenotypes. For each of these genes, clicking the '+' sign will give detailed information on how the connection was made, both to the phenotype and to the variant-related gene.
To save loading time, the detailed information about textual searches is retrieved only on demand, so it may take a few seconds to display. However, the data is retrieved asynchronously, so it does not interfere with continued browsing of the results.