Collaborative data analysis using SWISH DataLab
The SWISH DataLab addresses two of the main bottlenecks of data science: bringing data from different sources together, and cleaning and selecting the data that is relevant for further analysis.
SWISH unites SWI-Prolog and R behind a web-based IDE that resembles Jupyter notebooks. The platform allows multiple data scientists to work on the same data simultaneously, and rule sets can be reused and shared between users. This makes it easier for data scientists to provide more complex data transformation steps to domain experts.
Most pipelines use a general-purpose programming language such as Python to clean the data and ingest it into a linked data store or RDBMS. The relevant data is then selected and appropriate machine learning is applied. In contrast, SWISH data management is based on Prolog, a relational and logic-based language. External data sources, such as RDBMS, Linked Data, CSV files, XML files and JSON, are made available through a mixture of two mechanisms: adaptors, which expose the data in Prolog’s relational model without transferring it, and ingestion, which loads the data into Prolog. Using the data in one unified framework, without having to transfer it first, makes it simpler to bring the data together.
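The difference between an adaptor and ingestion can be illustrated with a small sketch. It is written in Python rather than Prolog and does not use any SWISH API; the sample data and the names `person_rows`, `load_residence` and `lives_in` are hypothetical and exist only for this illustration.

```python
import csv
import io
import json

# Hypothetical sample data standing in for two external sources.
CSV_TEXT = "id,name\n1,Alice\n2,Bob\n"
JSON_TEXT = '[{"person": 1, "city": "Amsterdam"}, {"person": 2, "city": "Utrecht"}]'


def person_rows(csv_text):
    """Adaptor style: yield (id, name) tuples on demand instead of storing them."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield int(row["id"]), row["name"]


def load_residence(json_text):
    """Ingestion style: load every (person, city) tuple into memory up front."""
    return [(rec["person"], rec["city"]) for rec in json.loads(json_text)]


def lives_in(csv_text, residence):
    """Join both relations on the person id, much like a rule with a shared variable."""
    city_of = dict(residence)
    return [(name, city_of[pid]) for pid, name in person_rows(csv_text) if pid in city_of]


residence = load_residence(JSON_TEXT)
result = lives_in(CSV_TEXT, residence)
print(result)
```

Once both sources are available as relations of tuples, the join no longer needs to know which one was read lazily and which one was loaded in advance.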
Subsequently, declarative rules define a clean and coherent view on the data that is targeted at analysis. Because Prolog is logic-based, this view is modular, concise and declarative, which makes it easy to maintain. SWI-Prolog’s tabling extension provides the same termination properties as Datalog, as well as the same order independence of rules, within the subset that Prolog shares with Datalog. Tabling also caches results. At the same time, users have access to the more general Prolog language to code transformations that Datalog does not support. According to Wikipedia, “In recent years, Datalog has found new application in data integration, information extraction, …”. SWISH adds collaboration, as well as Turing completeness for transformations that Datalog cannot express, in one coherent environment.
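The termination property mentioned above can be illustrated with a minimal sketch. It is written in Python, not Prolog, and it uses bottom-up fixpoint evaluation, which is a different technique from SWI-Prolog's tabling but reaches the same answers for this Datalog-style program. The function name `transitive_closure` and the example edges are hypothetical. The point is that a recursive reachability rule over cyclic data terminates, because each derived fact is recorded once and never derived again.

```python
def transitive_closure(edges):
    """Evaluate the Datalog-style rules
         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).
    bottom-up until no new facts appear (semi-naive evaluation).
    """
    path = set(edges)   # all facts derived so far
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        # Extend only the facts found in the last round by one more edge.
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2} - path
        path |= new
        delta = new
    return path


# A cyclic graph: plain depth-first recursion over these rules could loop forever.
edges = {("a", "b"), ("b", "c"), ("c", "a")}
closure = transitive_closure(edges)
print(sorted(closure))
```

The set `path` also acts as a cache of results: once the closure has been computed, each lookup is a set membership test.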
- The SWISH DataLab can be configured to allow both authenticated users and anonymous users with limited access rights.
- Notebooks and programs are stored in a Git-like repository and are fully versioned.
- Results can be reproduced reliably by creating a snapshot of a query and all relevant programs.
- Data views defined in SWISH can be downloaded as CSV and accessed through a web-based API.
- The platform can be deployed on your laptop as well as on a server.
The SWISH DataLab provides a high-level platform to select and combine data sources in multiple workflows, while using tools that are commonly used by data analysis professionals.
Everything you need to get started with the SWISH DataLab is available as open source software: