Open research data and software

Research software impact

Mature open source projects are listed below. See GitHub for bleeding-edge code, and the brief memo on open licensing of scientific material.

Digital Humanities and Computational Social Science

rOpenGov R package ecosystem for open data and computational social science (R/GitHub); includes 20+ R packages. Introduced at the NIPS Machine Learning Open Source Software workshop in 2013. Tools for Eurostat open data, the Sotkanet database of the National Institute for Health and Welfare (Finland), the PX-Web API used by statistical authorities in many countries, Finnish geospatial data, Finnish open government data, and more.

COMHIS open data analytics infrastructure of the Helsinki Computational History Group. Includes tools for the Finnish national bibliography, the Swedish national bibliography, the English Short Title Catalogue (ESTC), the Heritage of the Printed Book database (CERL), and the Octavo service.

Datavaalit Parliamentary monitoring infrastructure for Finnish election and parliamentary data. Funded by Sitra 2012-2013. Double-winner of Apps4Finland 2012.

Machine learning and ecological models

DMT Dependency Modeling Toolkit. Probabilistic tools for dependency analysis between multiple data sources (R/CRAN). Probabilistic PCA, factor analysis, CCA, regularized variants, dependency-based dimensionality reduction etc. ICML/MLOSS workshop, Israel 2010.

earlywarnings Methods for identifying critical transitions between ecosystem states from time series data (CRAN/GitHub). Co-developer. Runner-up in the WICI Data Challenge 2013 (Waterloo Institute for Complexity and Innovation).

Microbial ecology

microbiome R toolkit for microbiome analysis (R/Bioconductor).

Tutorial collection for microbiome analytics

Neutral models Accelerated fitting of neutral models.

CONCOCT Metagenomic contig binning

Functional genomics

netresponse Modeling context-specific activation patterns in genome-wide interaction networks (R/Matlab). Originally applied to study transcriptional responses in genome-scale interaction networks across organism-wide collections of gene expression data. doi:10.18129/B9.BIOC.NETRESPONSE

RPA (R/Bioconductor). Scalable probabilistic method for preprocessing short oligo microarray data. NAR 2013. doi:10.18129/B9.BIOC.RPA

pint Probabilistic data integration for DNA/RNA data in functional genomics (R/Bioc). MLSP 2009.

intcomp Benchmarking for integrative cancer gene discovery algorithms. Briefings in Bioinformatics 2012.


Full agreement texts for academic publisher agreements. The agreements were released by the FinELib consortium of research libraries following our FOI request in April 2018.

Scientific journal subscription costs in Finland 2010-2016; MoE/ATT. The data set was released by Finnish Ministry of Education (Open Science and Research Initiative) following my Freedom of Information Request (summary). Finland became the first country to systematically collect and release this information.

Human Gut Microbiota Atlas (Data Dryad); genus-level profiling of the human gut microbiota for 1006 Western adults; Lahti et al. Nature Communications 5:4344, 2014

Gut microbiota profiling in African-American diet swap experiment; Data Dryad; O’Keefe et al. Nature Communications 6:6342, 2015

Probiotics intervention data with high-throughput profiling of the gut microbiome and serum lipidome (RData format); from Associations between the human intestinal microbiota, Lactobacillus rhamnosus GG and serum lipids indicated by integrated analysis of high-throughput profiling data. Lahti et al. PeerJ 1:e32, 2013