The tech radar visualizes an evaluation of available data science technologies. Each ring in the radar chart represents one evaluation category, ordered from the inside of the chart to the outside.
Choosing technologies is hard. Technology choices have to be based on the specific requirements of individual projects. Still, it is possible to provide some general guidance. For example, some technologies are not very suitable for data science and can therefore be ruled out in most cases. Similarly, there are evergreen technologies whose application is nearly risk-free. This visualization aims to give you an overview of data science technologies and how they should be handled when starting a new project.
This data science tech radar is maintained by Matthias Döring. The decisions are based on his personal and professional experience with the technologies. To integrate the knowledge and experience of the data science community, please get in touch if you feel that any technology is missing or misplaced. If you make a contribution, you will of course receive an honorable mention in the acknowledgments section.
Protobuf (Protocol Buffers) is a mechanism for data serialization developed by Google. Due to its binary format and the explicit definition of data structures, it is more compact and less error-prone than JSON. Protobuf can be used together with the gRPC protocol.
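As a sketch, a schema is defined in a `.proto` file; the `Measurement` message below is a made-up example:

```proto
syntax = "proto3";

// Hypothetical message describing a sensor measurement.
// Each field has an explicit type and a unique field number.
message Measurement {
  string sensor_id = 1;
  double value = 2;
  int64 timestamp = 3;  // e.g., Unix epoch seconds
}
```

Running the `protoc` compiler on such a schema generates serialization code for languages such as Python, Java, and Go.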
scikit-learn is a Python machine-learning library that offers a multitude of models, ranging from decision trees and support vector machines to k-means clustering. Note that, for deep learning, TensorFlow should be used.
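A minimal sketch of the scikit-learn workflow, here fitting a decision tree on a toy dataset (the data are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two features per sample, binary labels (illustrative only).
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 0, 1]

# Fit a decision tree and predict on new points.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)
predictions = clf.predict([[1, 1], [0, 0]])
```

The same `fit`/`predict` interface is shared by virtually all scikit-learn estimators, which makes it easy to swap models.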
Shiny is an R framework for building reactive web applications. It is ideally suited for situations in which existing R code should be made available to a larger user base in such a way that users can interact with the data. Shiny is well-suited for prototyping or for building small applications that are used internally. For larger applications, or applications where availability and robustness are key, other technologies such as Node.js or Flask should be considered.
Swagger is a framework for web API development. It is widely known for Swagger UI, a frontend for browsing REST API documentation written according to the Swagger/OpenAPI specification. Since data science applications are often deployed in the form of APIs, this is a must-have.
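As a sketch, a minimal OpenAPI document for a hypothetical prediction endpoint might look as follows (the path and descriptions are made up for illustration):

```yaml
openapi: "3.0.0"
info:
  title: Prediction API   # hypothetical service
  version: "1.0"
paths:
  /predict:
    post:
      summary: Return a model prediction for the posted features
      responses:
        "200":
          description: Prediction result
```

Swagger UI renders such a document as interactive documentation in which each endpoint can be tried out directly in the browser.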
Tableau is a commercial dashboarding solution. If you want to integrate data from various sources and visualize trends in the data, Tableau is probably the way to go.
Ansible is a lightweight, open-source tool for the management of compute infrastructure. If you have several servers for which you want to perform configuration tasks such as user management, software installations, and updates, you should consider Ansible.
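Configuration tasks are described declaratively in YAML playbooks; a minimal sketch (the host group and package are hypothetical):

```yaml
# playbook.yml: ensure nginx is installed and running
# on all hosts in the (hypothetical) "webservers" group.
- hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is installed
      apt:
        name: nginx
        state: present
    - name: Ensure nginx is running
      service:
        name: nginx
        state: started
```

Because tasks describe a desired state rather than commands, re-running the playbook is safe: Ansible only changes what deviates from that state.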
AWS Lambda is a serverless platform for event-driven data processing tasks. Using AWS Lambda, you define a function that is executed whenever a given event takes place. AWS Lambda is ideally suited for short, event-driven workloads that do not require a compute instance running 24/7.
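In Python, such a function is just a handler that receives the triggering event; the sketch below uses a made-up event shape (real event structures depend on the trigger, e.g., S3 or API Gateway):

```python
def handler(event, context):
    """Entry point invoked by AWS Lambda for each incoming event."""
    # Hypothetical event: {"name": "..."}.
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}!"}
```

Since the handler is a plain function, it can be called locally for testing, e.g., `handler({"name": "Ada"}, None)`.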
Chef is a tool for the automated management of IT infrastructure. Like Puppet, Chef is more complex than Ansible.
Within a very short time span, Docker has become an essential tool for software development and deployment. Docker is a containerization technology that packages an application together with its low-level system libraries and dependencies into images, which can be hosted online. These images can then be run as containers on any system, without having to worry about dependency management.
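An image is built from a Dockerfile; a sketch for a hypothetical Python service might look like this:

```dockerfile
# Hypothetical image for a small Python service.
FROM python:3.11-slim
WORKDIR /app
# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "app.py"]
```

Building with `docker build -t myservice .` and running with `docker run myservice` yields the same environment on any Docker host.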
Kubernetes (K8s) is a container orchestration platform that is based on Docker. K8s is particularly suitable when multiple containers that interact with each other have to be managed. Red Hat's OpenShift is an extension of K8s that is geared towards enterprise applications.
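Workloads are described declaratively in YAML manifests; a minimal Deployment sketch (names and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server          # hypothetical deployment name
spec:
  replicas: 3                 # K8s keeps three replicas running at all times
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: example/model-server:latest   # hypothetical image
```

Applying this manifest with `kubectl apply -f deployment.yml` lets K8s continuously reconcile the cluster towards the declared state, restarting containers as needed.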
Puppet is a tool for infrastructure management and delivery. Similarly to Chef, Puppet is more complex than Ansible.
According to Apache, Airflow is a platform created by the community to programmatically author, schedule, and monitor workflows. Workflows are defined in terms of directed acyclic graphs (DAGs). For a cloud-native solution, consider AWS Data Pipeline, a managed ETL (extract, transform, load) service.
Cassandra is a NoSQL database. It is the most popular wide-column (i.e., two-dimensional key-value) storage solution.
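The wide-column model can be sketched in CQL, Cassandra's SQL-like query language (the table and columns are hypothetical):

```sql
-- Rows are grouped by a partition key (sensor_id); within a partition,
-- columns are ordered by the clustering key (ts).
CREATE TABLE sensor_readings (
  sensor_id text,
  ts        timestamp,
  value     double,
  PRIMARY KEY (sensor_id, ts)
);
```

Queries that restrict the partition key (e.g., all readings of one sensor) are fast, since each partition lives on a known set of nodes.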
Hadoop is a framework for managing big data. It is based on a map-reduce strategy. Its storage layer, HDFS (Hadoop Distributed File System), replicates data across nodes in order to improve data availability. Hadoop can be combined with Kafka. A cloud-native alternative to the Apache suite of big-data tools is AWS EMR.
S3 (Simple Storage Service) is a durable, highly available cloud object storage solution available through AWS. It is ideal for applications such as logging. To reduce costs, data that are not frequently accessed can be automatically transferred to cheaper storage classes such as AWS Glacier.
AWS EFS (Elastic File System) is a file system that elastically scales both performance and available space according to user demands. EFS is a great tool when multiple applications need to quickly access file-system data from a central location.
The ELK Stack
The ELK stack consists of Elasticsearch, Logstash, and Kibana. Elasticsearch indexes a collection of documents and allows users to search within that collection. Logstash is an application that ingests log files, structures them, and forwards them to a data store, typically Elasticsearch. Kibana visualizes data stored in Elasticsearch.
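Searches are expressed in Elasticsearch's JSON query DSL; a sketch of a full-text search against a hypothetical `logs` index (shown in the Kibana Dev Tools request format):

```json
POST /logs/_search
{
  "query": {
    "match": { "message": "connection timeout" }
  }
}
```

The `match` query analyzes the search text and returns documents ranked by relevance, which is what makes Elasticsearch convenient for ad-hoc log exploration.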
Presto is a SQL engine for big data applications. Using Presto, data can be retrieved from various sources such as Hadoop, AWS S3, MySQL, MongoDB, and Kafka.
RabbitMQ is a message broker that is frequently used as middleware connecting independent applications. RabbitMQ is particularly useful for microservice architectures. A cloud-native alternative to RabbitMQ is AWS SQS, a queuing solution that can be combined with AWS SNS.
Redis is the dominant database when it comes to in-memory caching. It can also be used as a message broker. An alternative to Redis is Memcached. Redis should be considered for any application that needs to store large numbers of objects in memory, e.g., for storing session data or for implementing a queue.
Bash is a shell language that excels at working with files and programs at the Unix system level. Many data applications require the successful execution of several programs or some data wrangling, for which Bash is ideal. Ideally, Bash scripts should remain relatively small. For complex file-system tasks, other solutions should be considered.
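A small data-wrangling sketch: counting the most frequent values in the second column of a CSV file (the file and its contents are made up for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a small example CSV (illustrative data).
printf 'id,color\n1,red\n2,blue\n3,red\n' > /tmp/data.csv

# Skip the header, extract column 2, count unique values, sort by frequency.
tail -n +2 /tmp/data.csv | cut -d, -f2 | sort | uniq -c | sort -rn
```

Chaining small, single-purpose tools through pipes like this is exactly the kind of task where Bash beats a general-purpose language in brevity.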
C is a low-level language that is particularly suited for all applications that work closely with the hardware of a system, for example operating systems such as Linux or embedded systems. However, it is not a language that is very useful for data science.
Similarly to Java, C++ is a well-established high-level programming language. However, in contrast to Java, C++ is geared towards fine-grained control over system resources. For example, C++ does not have automatic garbage collection but requires developers to think about how they manage memory. While C++ allows for very efficient and stable software, development is typically considerably slower than in other languages due to a lack of supporting libraries, long compile times, and the intricacies of the language. However, if performance is a prime goal of the software, C++ is probably the way to go. It is no coincidence that the core of TensorFlow is written in C++. While C++ is not suitable for data science tasks, it excels when engineering large-scale AI systems for which performance is key.
Go is a minimalist, compiled language that was developed by Google: it is performant, type-safe, and transparent. However, there is currently little support for data exploration, which means that Go is not suitable for typical data science tasks. Go is, however, well-suited for infrastructure management tasks, for example model deployment. So, while Go will probably never become a typical language for data scientists, it should be considered for AI engineering tasks.
Java has been the most widely used high-level programming language for a long time now. Due to its virtual machine, Java applications are independent of the system architecture, rendering Java code highly portable. Java has an active community and a great ecosystem for overall software engineering, for example due to highly developed IDEs that are very convenient to use. However, Java does not really support the data science process. Thus, Java likely only plays a role if models have to be integrated into an existing Java application.
Julia is a language that is gaining more and more traction in the data science community. The main reason for this is its usability combined with its C-like performance. However, this performance is only possible because Julia is a compiled language. Despite its support for Jupyter notebooks, Julia will probably never become the go-to language for data exploration due to the latency caused by compilation. But if you have demanding calculations, you should definitely evaluate Julia.
Similarly to Java, Kotlin is based on the Java Virtual Machine. Kotlin, however, is a leaner language than Java. For example, it does not enforce object-oriented development. Moreover, Kotlin is compatible with Android devices, which makes it great for the mobile market. However, all of these areas are not very relevant for data science.
Matlab is a commercial language that is geared towards mathematical calculations, which is why it has attracted some users in the data science sphere. However, with Julia there is a strong, freely available competitor on the market.
Python is the most widely used scripting language out there. It has very good support through external libraries. For example, the TensorFlow, scikit-learn, numpy, and pandas packages are particularly suitable for data science.
R is a programming language that natively supports vectorized operations. Thus, R is ideal for all applications in which multi-dimensional data are relevant. Due to R's popularity in the statistical community, there is a plethora of libraries for visualizing, exploring, and modeling data. Overall, R is not as generally applicable and not as robust as other languages, which is why alternatives such as Python should be considered for productive systems.
Rust is a compiled language that is geared towards robustness. Similarly to C++, it allows for low-level management of resources, which can take up a lot of development time. So if you want to move fast, Rust is not for you. There are a few early-stage machine-learning libraries available for Rust, so if you are looking for performance and stability, it may be worth a shot.
Scala is an object-oriented language for the Java Virtual Machine. For data science applications, Scala is mainly useful in connection with big data frameworks such as Hadoop or Spark. Otherwise, Scala is not commonly used for data science applications.