What is the data science and AI tech radar?

The tech radar visualizes an evaluation of available data science technologies. The individual rings in the radar chart indicate the evaluations of technologies. From inside to outside these categories are:

What is the purpose?

Choosing technologies is hard. Technology choices have to be based on the specific requirements of individual projects. Still, it is possible to provide some general guidance. For example, some technologies are not very suitable for data science and can therefore be ruled out in most cases. Similarly, there are evergreen technologies whose application is nearly risk-free. This visualization aims to give you an overview of data science technologies and how they should be handled when starting a new project.

How are the decisions made?

This data science tech radar is maintained by Matthias Döring. The decisions are made based on his personal and professional experiences with the technologies. In order to integrate the knowledge and experience from the data science community, please get in touch if you feel that any technology is missing/misplaced. If you make a contribution, you will of course receive an honorable mention in the acknowledgments section.

Acknowledgments

This tech radar is based on the Zalando Tech Radar, which in turn is based on the pioneering work of ThoughtWorks.

Frameworks

Flask

Flask is a minimalist framework for developing web applications with Python. It is particularly suitable for the development of APIs.

gRPC

gRPC is a remote procedure call protocol that utilizes Google's Protobuf format for data serialization. If you are interested in using Protobuf, consider this guide.

Node.js

Node.js is an open-source JavaScript runtime environment for servers. It is typically combined with frontend libraries such as AngularJS, React, or Vue.js. Because these libraries are well developed, Node.js is the preferred platform for applications with complex user interfaces.

Protobuf

Protobuf is a mechanism for data serialization that is developed by Google. Thanks to its compact binary format and its explicitly defined data structures, it offers advantages over JSON. Protobuf can be used together with the gRPC protocol.
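
To illustrate why a binary format with an explicitly defined structure can be more compact than JSON, here is a minimal sketch that uses only Python's standard library. It does not use Protobuf itself: the struct layout is merely a stand-in for a schema, and the record fields (user_id, score, active) are hypothetical.

```python
import json
import struct

# Hypothetical record; the field names are illustrative only.
record = {"user_id": 12345, "score": 0.87, "active": True}

# Text encoding: the field names are repeated in every message.
json_bytes = json.dumps(record).encode("utf-8")

# Binary encoding with a fixed, explicit layout (a stand-in for a schema):
# little-endian unsigned 32-bit integer, 64-bit float, 1-byte boolean.
LAYOUT = struct.Struct("<Id?")
binary_bytes = LAYOUT.pack(record["user_id"], record["score"], record["active"])


def decode(payload: bytes) -> dict:
    """Decode a payload using the layout shared by sender and receiver."""
    user_id, score, active = LAYOUT.unpack(payload)
    return {"user_id": user_id, "score": score, "active": active}


print(len(json_bytes), len(binary_bytes))
print(decode(binary_bytes))
```

The binary payload is smaller because both sides agree on the layout in advance, so no field names have to travel with the data. The price is that the payload cannot be read without that shared definition.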

scikit-learn

scikit-learn is a Python machine-learning library that offers a multitude of models, ranging from decision trees and support vector machines to k-means clustering. Note that, for deep learning, TensorFlow should be used.

Shiny

Shiny is an R framework for building reactive web applications. It is ideally suited for situations in which R code already exists and should be made available to a larger user base in such a way that users can interact with the data. Shiny is well-suited for prototyping or for building small applications that are used internally. For larger applications or applications where availability and robustness are key, other frameworks such as Node.js or Flask should be considered.

Swagger (OpenAPI)

Swagger is a framework for web API development. It is widely known for Swagger UI, a frontend for accessing REST API documentation written according to the Swagger/OpenAPI specifications. Since data science applications are often deployed in the form of APIs, this is a must-have.

Tableau

Tableau is a commercial dashboarding solution. If you want to integrate data from various sources and visualize trends in the data, Tableau is probably the way to go.

TensorFlow

Google's TensorFlow is the dominant framework for deep learning. TensorFlow models can be served in production environments using TensorFlow Serving.

Infrastructure

Ansible

Ansible is a lightweight, open-source tool for the management of compute infrastructure. If you have several servers for which you want to perform configuration tasks such as user management, software installations, and updates, you should consider Ansible.

Apache

The Apache web server (httpd) has been the dominant web server for many years. Nowadays, however, Nginx should be preferred over Apache.

AWS Lambda

AWS Lambda is a serverless platform for event-driven data processing tasks. Using AWS Lambda, you can define a function that is executed whenever a given event takes place. AWS Lambda is ideally suited for short, event-driven workloads that do not require a compute instance that runs 24/7.
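
A function for AWS Lambda could look roughly like the following sketch. The handler(event, context) signature follows the convention of Lambda's Python runtime; the shape of the event (a list of records with a value field) and the response body are assumptions made for illustration.

```python
import json


def handler(event, context):
    """Entry point that the Lambda runtime would call once per event."""
    # Assumed event shape: {"records": [{"value": <number>}, ...]}
    records = event.get("records", [])
    total = sum(record.get("value", 0) for record in records)
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(records), "total": total}),
    }


# Local invocation with a hypothetical event; no AWS account is needed.
sample_event = {"records": [{"value": 2}, {"value": 3}]}
response = handler(sample_event, None)
print(response)
```

Because the handler is a plain function, it can be tested locally before it is deployed and wired to an event source.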

Chef

Chef is a tool for the automated management of IT infrastructure. Similarly to Puppet, Chef is more complex than Ansible.

Docker

Within a very short time span, Docker has become an essential tool for software development and deployment. Docker is a containerization technology that packages an application together with its low-level system libraries and dependencies into images, which can be hosted online. These images can then be run as containers on any system that runs Docker, without having to worry about dependency management.

Kubernetes

Kubernetes (K8s) is a platform for orchestrating containers, such as those built with Docker. K8s is particularly suitable when multiple containers that interact with each other have to be managed. Red Hat's OpenShift is an extension of K8s that is geared towards enterprise applications.

Nginx

Nginx is a powerful reverse proxy. Because it offers some advantages over the Apache web server, Nginx should be the preferred choice.

Puppet

Puppet is a tool for infrastructure management and delivery. Similarly to Chef, Puppet is more complex than Ansible.

Data Management

Apache Airflow

According to Apache, Airflow is a platform created by the community to programmatically author, schedule and monitor workflows. Workflows are defined in terms of directed acyclic graphs (DAGs). For a cloud-native solution, consider AWS Data Pipeline, a managed ETL (extract, transform, load) service.
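
The idea of defining a workflow as a DAG can be sketched with Python's standard library alone. This is not the Airflow API: the task names are hypothetical, and graphlib.TopologicalSorter merely stands in for a scheduler that runs each task after its dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical ETL workflow: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def run_workflow(deps):
    """Visit the tasks in an order that respects all dependencies."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        executed.append(task)  # a real scheduler would run the task here
    return executed


order = run_workflow(dependencies)
print(order)
```

Since the graph must be acyclic, a workflow with circular dependencies is rejected (graphlib raises a CycleError) instead of running forever.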

Apache Cassandra

Cassandra is a NoSQL database. It is the most popular wide-column (i.e. 2D key-value) storage solution.

Apache Hadoop

Hadoop is a framework for managing big data. It is based on a map-reduce strategy. Its storage layer, HDFS (Hadoop Distributed File System), replicates data across nodes in order to improve data availability. Hadoop can be combined with Kafka. A cloud-native alternative to the Apache suite of big-data tools is AWS EMR.
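
The map-reduce strategy can be illustrated with a word count in plain Python. This is a single-process sketch with made-up input documents; it shows only the three phases (map, shuffle, reduce), not Hadoop's distribution of work across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data is big", "data science"]
pairs = [pair for document in documents for pair in map_phase(document)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Because the map step handles each document independently and the reduce step handles each key independently, both steps can be spread across many machines.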

Apache Kafka

Kafka is a distributed platform for big data that combines messaging, streaming, and durable storage of event logs. Kafka can be combined with Hadoop.

Apache Spark

Apache Spark is an engine for large-scale data processing that supports both batch and stream data. It is frequently used for processing data before it is used in machine learning/data science applications. Spark can be used together with Kafka.

AWS S3

S3 (Simple Storage Service) is a durable, highly available cloud object storage solution available through AWS. It is ideal for applications such as logging. To reduce costs, data that are not frequently accessed can be automatically transferred to cheaper storage options such as AWS Glacier.

AWS EFS

AWS EFS (Elastic File System) is a file system that elastically scales both performance and available space according to user demands. EFS is a great tool when multiple applications need to quickly access file-system data from a central location.

The ELK Stack

The ELK stack consists of Elasticsearch, Logstash, and Kibana. Elasticsearch indexes a collection of documents and allows users to search within the document collection. Logstash is an application that ingests log files, structures them, and stores them to disk. Logstash can be combined with Elasticsearch. Kibana allows for visualizing data based on Elasticsearch.
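
How indexing makes a document collection searchable can be sketched with a small inverted index. This is a simplified illustration in plain Python, not the Elasticsearch API; the log messages are made up, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical log messages keyed by document id.
logs = {
    1: "error connecting to database",
    2: "user login successful",
    3: "database connection restored",
}
index = build_index(logs)
print(search(index, "database"))
print(search(index, "database error"))
```

The index is built once when documents are ingested, so a query only has to look up its tokens rather than scan every document.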

MongoDB

MongoDB is the most widely used NoSQL document store. If you are looking for a cloud-native approach, consider AWS DynamoDB.

MySQL

MySQL is probably still the most widely known RDBMS (relational database management system). Since the future of MySQL is uncertain, the GPL-licensed MariaDB should be used instead.

Presto

Presto is a SQL engine for big data applications. Using Presto, data can be retrieved from various sources such as Hadoop, AWS S3, MySQL, MongoDB, and Kafka.

RabbitMQ

RabbitMQ is a message broker that is frequently used as middleware connecting independent applications. RabbitMQ is particularly useful for microservice architectures. A cloud-native alternative to RabbitMQ is AWS SQS, a queuing solution that can be combined with AWS SNS.
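
The role of a message broker as middleware can be sketched in a single process. In this illustration, queue.Queue from Python's standard library stands in for a broker-managed queue; it is not RabbitMQ or its client API, and the messages are hypothetical.

```python
import queue
import threading

broker = queue.Queue()  # stand-in for a queue managed by a broker
processed = []


def producer(messages):
    """Publish messages without knowing who will consume them."""
    for message in messages:
        broker.put(message)
    broker.put(None)  # sentinel that tells the consumer to stop


def consumer():
    """Receive messages in order and process them independently."""
    while True:
        message = broker.get()
        if message is None:
            break
        processed.append(message.upper())


worker = threading.Thread(target=consumer)
worker.start()
producer(["order created", "order shipped"])
worker.join()
print(processed)
```

Producer and consumer share only the queue, which is what allows independent applications to be developed, deployed, and scaled separately.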

Redis

Redis is the dominant database when it comes to in-memory caching. It can also be used as a message broker. An alternative to Redis is Memcached. Redis should be considered for any application that needs to store large numbers of objects in memory, e.g., for storing session data or for implementing a queue.
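
Storing session data in memory typically involves entries that expire after some time. The following sketch shows that idea with a plain Python class; TTLCache and its methods are hypothetical names, and the class runs in a single process rather than talking to a Redis server.

```python
import time


class TTLCache:
    """In-memory key-value store whose entries expire after a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so that expiry can be tested
        self._data = {}

    def set(self, key, value, ttl_seconds):
        """Store a value together with its expiry time."""
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key, default=None):
        """Return the value, or the default if it is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop the expired entry on access
            return default
        return value


cache = TTLCache()
cache.set("session:42", {"user": "alice"}, ttl_seconds=30)
print(cache.get("session:42"))
```

A dedicated in-memory store becomes worthwhile once several application instances need to share such data, which a process-local dictionary cannot provide.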

Languages

Bash

Bash is a shell language that excels at working with files and programs at the Unix system level. Many data applications require the successful execution of several programs or data wrangling, for which Bash is ideal. Ideally, Bash scripts should be relatively small. For complex file system tasks, other solutions should be considered.

C

C is a low-level language that is particularly suited for all applications that work closely with the hardware of a system, for example operating systems such as Linux or embedded systems. However, it is not a language that is very useful for data science.

C++

Similarly to Java, C++ is a well-established high-level programming language. However, in contrast to Java, C++ is geared towards fine-grained control over system resources. For example, C++ does not have automatic garbage collection but requires developers to think about how they manage memory. While C++ allows for very efficient and stable software, development is typically considerably slower than in other languages due to a lack of supporting libraries, long compile times, and the intricacies of the language. However, if performance is a prime goal of the software, C++ is probably the way to go. It is no coincidence that the core of TensorFlow is written in C++. While C++ is not suitable for data science tasks, it excels when engineering large-scale AI systems for which performance is key.

Go

Go is a minimalist, compiled language that was developed by Google: it is performant, type-safe, and transparent. There is currently little support for data exploration, which means that Go is not suitable for typical data science tasks. However, Go is well-suited for infrastructure management tasks, for example for model deployment. So, while Go will probably never become a typical language for data scientists, it should be considered for AI engineering tasks.

Java

Java has been the most widely used high-level programming language for a long time now. Due to its virtual machine, Java applications are independent of the system architecture, rendering Java code highly portable. Java has an active community and has great ecosystems for overall software engineering, for example, due to highly developed IDEs that are very convenient to use. However, Java does not really support the data science process. Thus, Java likely only plays a role if models have to be integrated into an existing Java application.

JavaScript

JavaScript is currently the dominant language when it comes to web development, particularly frontend development. So, when you are thinking about bringing your data science application to a customer, it would be worthwhile to use a JavaScript web application (e.g. Express + React) for this purpose. For machine learning, JavaScript enables one to run deep models on the client side, which can be useful if you know that clients can handle the load.

Julia

Julia is a language that is gaining more and more traction in the data science community. The main reasons for this are its usability and its C-like performance. However, this performance is only possible because Julia is a compiled language. Despite its support in Jupyter notebooks, Julia will probably never become the go-to language for data exploration due to the lag that is caused by compilation. But, if you have demanding calculations, you should definitely evaluate Julia.

Kotlin

Similarly to Java, Kotlin is based on the Java Virtual Machine. Kotlin, however, is a leaner language than Java. For example, it does not enforce object-oriented development. Moreover, Kotlin is compatible with Android devices, which makes it great for the mobile market. However, none of these areas is very relevant for data science.

Matlab

Matlab is a commercial language that is geared towards mathematical calculations, which is why it has collected some users in the data science sphere. However, with Julia there is a strong, freely available competitor on the market.

PHP

Once the star of web development, PHP has fallen from grace in recent years. This is mainly due to the wide adoption of JavaScript over PHP.

Python

Python is the most widely used scripting language out there. There is very good support in terms of external libraries. For example, the TensorFlow, scikit-learn, numpy, and pandas packages are particularly suitable for data science.

R

R is a programming language that natively supports vectorized operations. Thus, R is ideal for all applications in which multi-dimensional data are relevant. Due to R's popularity in the statistical community, there is a plethora of libraries for visualizing, exploring, and modeling data. Overall, R is not as generally applicable and not as robust as other languages, which is why alternatives such as Python should be considered for production systems.

Rust

Rust is a compiled language that is geared towards robustness. Similarly to C++, it allows low-level management of resources, which will take up a lot of development time. So if you want to move fast, Rust is not for you. There are a few early-stage machine learning libraries available for Rust, so if you are looking for performance and stability, it may be worth a shot.

Scala

Scala is an object-oriented language for the Java Virtual Machine. For data science applications, Scala is mainly useful in connection with big data frameworks such as Hadoop or Spark. Otherwise, Scala is not suitable for data science applications.

TypeScript

TypeScript is a variant of JavaScript that adds a static type system. The goal of TypeScript is to make JavaScript applications more stable. However, the question is whether the additional burden during the implementation is worth it. If you want to move fast, JavaScript is preferable.