Large-scale data fusion (Faculty of Computer and Information Science, University of Ljubljana)

Authors: Marinka Žitnik, Blaž Zupan

The increasing prevalence of antibiotic-resistant bacteria urges us to explore alternative strategies for treating bacterial infections. Nature abounds with species that are highly resistant to pathogenic bacteria, including the popular model organism Dictyostelium, a bacterial predator. Uncovering its bacterial resistance pathways could improve our understanding of core resistance mechanisms and lead us to drugs that target such pathways in humans. But even in Dictyostelium, experiments are expensive and time-consuming: our partners at Baylor College of Medicine took five years to discover four bacterial resistance genes.

To speed up the discovery process, we have developed a computational approach that can consider a vast array of data sets, including data on mutant-based phenotypes, gene expression, protein interactions, gene functional annotations, literature, drugs, and effects of drug treatments.

We represent each data set as a matrix, then collectively compress the matrices to remove noise and retain only the prevailing data patterns. The resulting compressed data system is then used to identify genes with a high probability of exhibiting a target phenotype. For bacterial resistance, we predicted a role for nine new genes, and for eight of them our predictions were confirmed in the wet lab. Whereas discovering the initial four genes took five years, this time our partners needed only one month of lab time.
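
The idea of compressing several matrices together and then ranking genes can be illustrated with a small sketch. This is not the method from the publications listed below; it is a deliberately simplified stand-in that assumes every data set is a gene-by-feature matrix over the same genes, concatenates them, and uses a truncated SVD as the compression step. The function name `fuse_and_rank`, the toy matrices, and the choice of cosine similarity to known ("seed") genes are all assumptions made for illustration.

```python
import numpy as np

def fuse_and_rank(matrices, seed_idx, rank=2):
    """Jointly compress gene-by-feature matrices, then rank genes by
    similarity to known seed genes in the compressed space (illustrative only)."""
    # Scale each matrix to unit Frobenius norm so no single data set dominates.
    scaled = [m / (np.linalg.norm(m) or 1.0) for m in matrices]
    stacked = np.hstack(scaled)  # genes x (all features from all data sets)

    # Truncated SVD keeps the `rank` strongest shared patterns; the rest is
    # treated as noise and discarded.
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    latent = u[:, :rank] * s[:rank]  # compressed profile for each gene

    # Score every gene by cosine similarity to the mean seed-gene profile.
    seed_profile = latent[seed_idx].mean(axis=0)
    norms = np.linalg.norm(latent, axis=1) * np.linalg.norm(seed_profile)
    scores = latent @ seed_profile / np.where(norms == 0, 1.0, norms)

    # Candidate genes, best first, with the seed genes themselves left out.
    seeds = set(seed_idx)
    order = [int(i) for i in np.argsort(-scores) if int(i) not in seeds]
    return scores, order

# Toy data (hypothetical): 6 genes, two data sets. Genes 0-2 share one
# pattern, genes 3-5 another; a little noise is added on top.
rng = np.random.default_rng(0)
expression = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)
interactions = np.array([[1, 0, 0]] * 3 + [[0, 1, 1]] * 3, dtype=float)
expression += 0.05 * rng.standard_normal(expression.shape)
interactions += 0.05 * rng.standard_normal(interactions.shape)

# Genes 0 and 1 play the role of already known resistance genes.
scores, order = fuse_and_rank([expression, interactions], seed_idx=[0, 1])
print("candidates, best first:", order)
```

In this toy setting gene 2 shares the seeds' pattern in both data sets, so it ranks ahead of genes 3 to 5. A real fusion system would also have to handle matrices that relate different object types (genes, drugs, literature terms), which simple concatenation cannot express.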

Žitnik M, Zupan B (2015) Data Fusion by Matrix Factorization, IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(1): 41-53.
Žitnik M, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B (2015) Gene Prioritization by Compressive Data Fusion and Chaining, PLoS Computational Biology, 11(10): e1004552.