Integration of data from multiple sources: a fusion of epidemiology and bioinformatics with applications to complex diseases

The continuous accumulation of biomedical data has made it imperative to combine them in an integrative analysis. Meta-analysis is a statistical method, first applied in the field of psychology, in which a set of original studies is synthesized and the potential diversity across them is explored. In medical research it was initially applied to randomized clinical trials, but it is now considered a valuable tool for combining observational studies, diagnostic studies and pharmacogenomic studies, as well as gene-disease association studies. Meta-analysis has also proven useful in summarizing the results of high-throughput experiments such as genome-wide association studies (GWAS) and gene expression studies using microarrays. A single meta-analysis that addresses one treatment comparison for one outcome, or one gene-disease association, may offer a short-sighted view of the evidence even if it is perfectly done with perfect data. Such an analysis may suffice for decision-making if there is only one treatment choice for the condition, only one outcome of interest, and the research results are perfect. Usually, however, there are several alternative treatments that need to be compared. Likewise, in genetic epidemiology, complex diseases are generally considered to be influenced by a large number of genes as well as by environmental factors, all of which are thought to have additive or synergistic effects on disease development and progression.
Figure 1: Schematic representations of (A) an umbrella review encompassing 13 comparisons involving 8 treatment options (7 active treatments and a placebo) and (B) a network with the same data. Each treatment is shown by a node of different colour, and comparisons between treatments are shown with links between the nodes. Each comparison may have data from several studies that may be combined in a traditional meta-analysis
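As a rough illustration of the traditional meta-analysis of a single comparison mentioned above, the following Python sketch pools study-level effect estimates with the DerSimonian-Laird random-effects estimator. The choice of estimator, the function name and the numbers in the usage note are assumptions made for illustration only; the project text does not specify a particular method or data set.

```python
import math


def random_effects_meta(effects, variances):
    """Pool study effects (e.g. log odds ratios) with the
    DerSimonian-Laird random-effects estimator.

    effects   -- per-study effect estimates
    variances -- per-study within-study variances (squared standard errors)
    Returns (pooled effect, its standard error, tau^2, Cochran's Q).
    """
    k = len(effects)
    # Inverse-variance (fixed-effect) weights
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    # Cochran's Q measures the diversity (heterogeneity) across studies
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    # Method-of-moments estimate of the between-study variance tau^2
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    return pooled, se, tau2, q
```

A call such as `random_effects_meta([0.10, 0.30, 0.25], [0.02, 0.03, 0.01])` (hypothetical log odds ratios and variances) returns the pooled estimate, its standard error, the estimated between-study variance and the Q statistic. When the studies agree exactly, the between-study variance is zero and the result reduces to the fixed-effect estimate.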

In this project, we propose an integrated framework that bridges the data and methods of genetic epidemiology with the methods of computational biology and bioinformatics. Genetic epidemiology, which is the focus of our proposal, is a relatively new discipline that emerged from the fusion of traditional epidemiology with genetics. In this discipline, epidemiological methods are applied in order to uncover the potential role of genetic variants in the aetiology of disease. This emerging and rapidly developing field studies the genetic elements of diseases as well as the joint effects of genetic factors and environmental determinants in large populations. As data accumulate, the collection and organization of large amounts of biological data and the development of specialized, publicly available databases have provided the means for integrating genetic epidemiology with bioinformatics. The Human Genome Project, high-throughput methods and the revolution in bioinformatics have boosted genetic association studies in recent years, offering a huge number of potential risk factors (genes and their variants) that could be implicated in various human diseases. The prior expertise of the members of our research team in developing bioinformatics tools and databases has also played an important role in the conception of the project and will help ensure its success. The interdisciplinary approach that we propose for integrating bioinformatics with genetic epidemiology will cover a broad area of research, including genetic association studies, meta-analysis, the development of statistical and mathematical methodology, the development of software, and the functional or structural validation of the findings using molecular biology techniques. All these tasks will be oriented towards understanding the molecular basis of common multifactorial diseases.
Figure 2: Graphical representations of the commonly encountered types of associations in genetic epidemiology. (A) Gene-disease association (upper), gene-phenotype association (middle) and Mendelian randomization (bottom), with the dotted arrow representing the indirectly inferred association of the phenotype with the disease (see the main text). (B) A gene that is associated with two diseases (upper) and two genes in linkage disequilibrium that are associated with a single disease (lower). The latter case has been dealt with in a meta-analysis setting only very recently, whereas the former has not yet been completely addressed in the general case
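The indirect inference in the Mendelian randomization panel can be sketched with the simple ratio (Wald) estimator, in which the gene-disease effect is divided by the gene-phenotype effect. This is only one common way of making that inference and is not stated in the project text. The function name and arguments are hypothetical, and the standard error uses a first-order approximation that ignores the uncertainty in the gene-phenotype estimate.

```python
def wald_ratio(beta_gene_disease, se_gene_disease, beta_gene_phenotype):
    """Ratio (Wald) estimate of the phenotype-disease effect.

    beta_gene_disease   -- estimated effect of the variant on the disease
    se_gene_disease     -- standard error of that estimate
    beta_gene_phenotype -- estimated effect of the variant on the phenotype
    Returns (estimate, approximate standard error).
    """
    if beta_gene_phenotype == 0:
        raise ValueError("the variant must be associated with the phenotype")
    estimate = beta_gene_disease / beta_gene_phenotype
    # First-order approximation: treats the gene-phenotype effect as fixed
    se = abs(se_gene_disease / beta_gene_phenotype)
    return estimate, se
```

For instance, with hypothetical inputs of a gene-disease log odds ratio of 0.10 (standard error 0.02) and a gene-phenotype effect of 0.5 units per allele, the function yields an estimated log odds ratio of about 0.2 per unit of phenotype, with an approximate standard error of 0.04.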

Complex or multifactorial diseases are generally considered to be influenced by a large number of genes as well as by environmental factors, all of which are thought to have additive or synergistic effects on disease development and progression. Type 2 diabetes mellitus (T2DM) and essential hypertension (EH) are two complex diseases that impose a significant burden on the world population and have many devastating complications. The two conditions appear to be interrelated, sharing risk factors such as obesity. It has been shown that each condition is a risk factor for developing the other, and both lead to cardiovascular disease. Presumably, T2DM develops when a diabetogenic lifestyle is superimposed upon a susceptible genotype. Evidence from large prospective studies indicates that T2DM adversely affects all components of the cardiovascular system, from the microvasculature to the heart, and is thus an established predictor of cardiovascular disease. The increased risk is attributed partly to the pernicious effects of persistent hyperglycemia on the vasculature and partly to other coexisting metabolic risk factors. Many factors that will be investigated in this project lead to increased blood pressure, including genetic variations, obesity, high alcohol intake, high salt intake, aging and perhaps a sedentary lifestyle, stress, and low potassium and calcium intake.
Figure 3: Complete causal models of disease development. (A) Huntington Disease, (B) Phenylketonuria, (C–F) Hypothetical examples for complex diseases. White areas refer to genetic factors and grey areas to environmental factors. (G) Schematic (and incomplete) presentation of pathways that are involved in coronary heart disease (CHD). Potential interactions between the risk factors have been omitted. The dotted circle indicates unmeasured or unknown intermediate factors in other pathways

Project goals

The goals of the proposed research are:
  1. To extend the already established methods of multivariate meta-analysis in order to include data for multiple outcomes or for multiple risk factors.
  2. To provide a unified view and methodology for integrating data from multiple sources (genetic data, environmental factors, gene expression data and so on).
  3. To provide software that will permit the analysis to be carried out even by non-specialists and to disseminate the methodology to a wider audience.
  4. To apply the methods to common diseases of multifactorial origin, such as diabetes, hypertension, stroke and myocardial infarction, and to validate the results experimentally.



Pantelis Bagos,
May 31, 2015, 6:32 AM