Essential resources for data engineers
This resource list has moved to scling.com. The contents below are left as a reference, but will not be updated.
This is a curated list of recommended reading and viewing on scalable data processing. It is primarily aimed at software architects and developers, but there is also material for people in leadership positions and for data scientists, in particular in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.
If you wish to discuss the contents or report any broken links, please do so via email to lalle@mapflat.com. Also feel free to send any material you really think should be in here.
Happy reading!
The big picture
End to end
Building scalable data pipelines
Data pipelines from zero to solid. End to end overview of how to build a data platform, including data ingestion, data pipelines, and serving computational results.
Building a data pipeline from scratch
Avoiding big data anti-patterns
The next data engineering architecture: Beyond the lake and the corresponding blog post How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. Insights about scaling data processing environments beyond homogeneous data platforms, and proposed solution patterns. Rare and useful advice for companies that have come far in their data journey.
The profession of solving (the wrong problem). Keeping focus on the business value without getting distracted by technology, with real-world examples of typical failures.
Architecture and patterns
Big data - principles and best practices of scalable realtime data systems
Questioning the lambda architecture
System architectures for personalization and recommendation
Streaming analytics with Spark, Kafka, Cassandra, and Akka
Business perspective, leadership
The 5 stages of grief on the road to big data
The 10 worst big data practices
Head to tail
Data collection
The Log: What every software engineer should know about real-time data’s unifying abstraction
Staying in sync: From transactions to streams
Getting data out of databases: a surprisingly tricky problem
Exactly-once streaming from Kafka
Kafka reliability - when it absolutely, positively has to be there
Infrastructure at scale: Apache Kafka, Twitter Storm & Elastic Search
Devices and timestamps: seriously though, WTF?
Batch processing
Second generation “workflow managers” for big data. Explanation of workflow orchestration.
Managing containerized data pipeline dependencies with Luigi
Stream processing
Applications in the emerging world of stream processing (slides). Good summary of how to build stream processing applications.
Building real-time data-driven products (slides). Holistic view of building stream processing applications, architectural variants, and the tradeoffs involved.
The world beyond batch: Streaming 101
The world beyond batch: Streaming 102
Stream processing, event sourcing, reactive, CEP… and making sense of it all. This blog post is also part of Martin Kleppmann’s free book Making sense of stream processing, which also encompasses his two entries in the Data collection section.
Apache Kafka, Samza, and the Unix philosophy of distributed data
Dataflow: A unified model for batch and streaming data processing. Google still dominates data processing technology, and looking at what they do usually gives a glimpse into the future of open source technology. Apache Beam is still a bit young, but the semantics described in the presentation are likely to be the next architectural step, as an alternative to the lambda and kappa architectural patterns.
Streaming big data & analytics for scale
Data product serving, NoSQL
Cassandra data modeling best practices. This is one of the first documents on data modelling for Cassandra. It predates the Cassandra Query Language (CQL), but in order to use CQL efficiently, it is necessary to understand and adapt data models to the underlying structures.
Time series stream processing with Spark and Cassandra
Components and comparisons
Insight data engineering ecosystem: An interactive map
Choosing an HDFS data storage format: Avro vs. Parquet and more
Hadoop file formats: it’s not just CSV anymore
Dataflow/Beam & Spark: A programming model comparison
Apache showdown: Flink vs. Spark
Comparison of various streaming technologies
Real-time stream processing at InMobi (Storm & Spark Streaming comparison)
Picking the right SQL-on-Hadoop tool for the job
Creating value
Approximate algorithms
Some important streaming algorithms you should know about
Realtime personalization and recommendation with stream mining
Scalable real-time processing techniques - how to almost count
Probabilistic sketching @ research.neustar.biz
Acceptably inaccurate: probabilistic data structures (slides)
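For a flavour of what "almost counting" means in practice, here is a minimal sketch of a Count-Min sketch, one well-known approximate frequency counter. It is an illustration only and is not taken from any of the resources above. The class name, the `width` and `depth` parameters, and the sample events are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using fixed memory (illustrative only)."""

    def __init__(self, width=1000, depth=5):
        # width and depth are assumed values; larger values reduce overcounting.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive one hash function per row by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum across rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Hypothetical event stream.
sketch = CountMinSketch()
events = ["play", "pause", "play", "skip", "play"]
for event in events:
    sketch.add(event)
```

The estimate for an item is never lower than its true count and never higher than the total number of events added. That is the kind of bounded inaccuracy these resources discuss trading for constant memory.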
Data science, machine learning
Interactive recommender systems
Deep learning for high performance time-series databases
10 more lessons learned from building machine learning systems
Best practices for machine learning engineering
Production Ready Data-Science with Python and Luigi. The steps from data science model to production pipeline.
Guide towards algorithm explainability in machine learning. Slides and code. Strategies for handling bias and looking into the black box of machine learning models.
Practices
Productivity, test, quality, monitoring
Test strategies for data processing pipelines (slides). How to build automated regression test suites for stream processing and batch processing data pipelines.
The mechanics of testing large data pipelines
Effective testing for Spark programs
Spark and Spark Streaming unit testing
Goods: organizing Google’s datasets. Google has higher ambitions for dataset structure than most companies need. The paper, however, gives insight into the kind of entropy that creeps into data processing systems, and examples of the structure and tools necessary to keep the chaos under control.
Schema & semantics
The unified logging infrastructure for data analytics at Twitter
Schema evolution in Avro, Protocol Buffers and Thrift. Every data platform or pipeline should have a strategy for schema evolution. This article describes the details of how schemas evolve, and the difference between the three formats.
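To make the idea of a schema evolution strategy concrete, the following is a simplified sketch of a reader that tolerates both older and newer records by applying defaults and ignoring unknown fields. It is loosely modeled on the reader/writer schema resolution described for Avro, but it does not use the Avro library or reproduce its actual rules. The schema, the field names, and the `resolve` helper are all hypothetical.

```python
# Sentinel marking a field that has no default and must be present.
REQUIRED = object()

# Hypothetical reader schema, version 2: field name -> default value.
# "country" was added in v2 with a default, so v1 records remain readable.
READER_SCHEMA_V2 = {
    "user_id": REQUIRED,
    "country": "unknown",
}


def resolve(record, reader_schema):
    """Project a record onto the reader's schema.

    - Fields missing from the record take the reader's default
      (lets a new reader consume old data).
    - Fields unknown to the reader are dropped
      (lets an old reader consume new data).
    """
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is not REQUIRED:
            resolved[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return resolved


# A record written with the older v1 schema (no "country" field).
old_record = {"user_id": 17}
# A record written with a hypothetical newer schema that adds "referrer".
new_record = {"user_id": 18, "country": "SE", "referrer": "ad"}

from_old = resolve(old_record, READER_SCHEMA_V2)
from_new = resolve(new_record, READER_SCHEMA_V2)
```

In this sketch, adding a field with a default is a safe change, while adding a required field would break reading of old data. Real formats differ in exactly which changes are safe, which is the subject of the article above.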
Scala
Why is there a section here on Scala? Because Scala is rising as the preferred language for scalable data processing. The primary reason for this is not technical, but cultural; successful data-driven products rise out of a collaboration between data scientists and software engineers. The day to day activities of the former group involve model tinkering and experimentation, and the rituals and boilerplate involved in backend languages such as Java inhibit quick experimentation. The latter group, however, is concerned with operational stability, and languages frequently used for experimental purposes, such as Python and R, tend to be perceived as insufficiently rigid and lacking ecosystem support for quality assurance and operations.
Scala is the middle ground where these two worlds meet. It is succinct and expressive enough for experimental purposes, but also statically typed and standing on the JVM platform, providing the quality and operations ecosystems. The lion’s share of innovation in data processing is therefore expressed in Scala, and it is a matter of time before most data-driven companies adopt it. It is possible to stick to Java or Python for data-driven products, but such decisions come at the cost of forgoing most of the innovation that happens in the data processing open source world.
Scala is powerful and well suited for data processing, but it comes with risks: ample ammunition to shoot yourself in the foot, and an opinionated community that can be both elitist and cryptic. It is wise to collect input and advice from several different types of sources when adopting Scala.
Twitter Scala school. A guide to learning Scala from scratch.
Strategic Scala style: Practical type safety
Transitioning to Scala. Pragmatic advice for teams adopting Scala.
Building a company on Scala. Likewise pragmatic advice for companies using Scala.
Moving a team from Scala to Golang. This is a cautionary tale of Scala adoption gone wrong; it provides insight into cultural risks that have to be managed. These risks exist in all teams, but Scala provides soil for them to bloom.
Don’t fear the implicits: Everything you need to know about typeclasses. Comprehensive explanation of typeclasses, one of the most powerful Scala constructs.
Privacy
Privacy by design. My bag of tricks and patterns for protecting users’ privacy and complying with GDPR in a big data environment.
Building privacy-protected data systems
How to prepare for proposed EU data protection regulation
Adapting your company to comply with EU privacy regulations (in Swedish)
Performance
Everyday I’m shuffling - tips for writing better Spark programs
Resource sites
A self-study list for data engineers and aspiring data architects
Last week in stream processing & analytics
10 completely free resources for sharpening your skills in Hadoop
The full list in one page of data science resource
So you are interested in deep learning
Papers we love (on distributed systems)
How To Become a Data Engineer. Resources for learning data engineering.