Green coding - Confluent: Sustainability through data streaming
This is a guest post for the Computer Weekly Developer Network written by Peter Pugh-Jones in his capacity as director of financial services at Confluent.
Confluent is a full-scale data streaming platform that enables users to access, store and manage data as continuous, real-time streams. Built by the original creators of Apache Kafka, Confluent expands the benefits of Kafka with enterprise-grade features, while removing the burden of Kafka management and monitoring.
As TechTarget reminds us, Apache Kafka is a distributed publish-subscribe messaging system that receives data from disparate source systems and makes that data available to target systems in real time.
Pugh-Jones writes in full as follows…
It’s no secret that big data can come with a big environmental impact.
For every gigabyte stored in the cloud, around seven kilowatt-hours of energy are used. That means that if a business stores ten terabytes of data, it’s creating a carbon footprint equivalent to 500 kg of CO2.
But it’s not just storage: processing, analysis and software development all come with their own environmental burdens. Here’s the good news, though: data-driven technologies are also helping to change the world for the better.
Systematic sustainability
Digital data is helping to streamline processes, drive environmental decision-making and uncover new, more sustainable ways of working.
Developers are also pushing for their own solutions. The concept of ‘green coding’ — a practice that prioritises efficiency and sustainability in software development — is gaining traction, as is a push for greater awareness of tech supply chains and ethical data storage decisions.
So how can businesses adopt greener coding practices and where does data streaming fit into this sustainable future?
Batch vs real-time streaming
When it comes to managing data, businesses must choose between batch processing (processing large volumes at scheduled intervals) and data streaming (continuously processing data in real time as it arrives).
A common misconception is that continuous streaming must be less efficient because it requires constant processing power. The logic goes that, like a TV left on standby, running something continuously is worse than switching it on only when needed.
In reality, the opposite is true. Batch processing demands such regular, intensive spikes in processing power that it’s significantly less efficient than a continuous low-level stream. To continue the television analogy, it’s the equivalent of turning a TV on and off 500 times a minute rather than leaving it on standby; the result is far more energy consumed.
In contrast, data streaming processes a constant flow of data at all times. Rather than waiting for a mass of messages to accumulate and then processing them in one huge spike of CPU power, you’re handling one message at a time and using a far smaller amount of energy at any given moment. In terms of resource consumption, the result is a gently flowing stream rather than a sudden tidal wave.
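To make that ‘one message at a time’ pattern concrete, here’s a minimal sketch of the streaming side of the comparison, using the standard Apache Kafka consumer client for Java. The broker address, topic name and group id are illustrative placeholders, and the processing step stands in for real work; the point is that records are handled in a steady trickle as they arrive, rather than accumulating for one huge scheduled spike.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamingConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "green-coding-demo");       // placeholder group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events")); // hypothetical topic name

            while (true) {
                // Short, steady polls: each record is processed the moment
                // it arrives, keeping CPU usage low and even rather than
                // letting work pile up for a periodic batch run.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value());
                }
            }
        }
    }

    // Stand-in for whatever per-message work the application does.
    private static void process(String value) {
        System.out.println("processed: " + value);
    }
}
```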
Providing predictable processing
Streaming data doesn’t just benefit from lower CPU usage; it also makes for more predictable processing overall. By switching from sudden spikes in processing to a continuous, predictable flow, data streaming allows organisations to better forecast their requirements. Take Apache Flink, an open source stream-processing framework and a popular technology for those who opt for real-time data streaming.
With Flink Actions (the operations applied to data streams when using Apache Flink), users can enable not only real-time data processing but also real-time analytics. These analytics help organisations develop a clear and reliable understanding of their usage, with less need to build in contingency for sudden, unexpected spikes. By lowering this need for unused cloud contingency, organisations can increase efficiency, reduce costs and ultimately develop more sustainable processing.
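By way of illustration, here’s a minimal Flink job sketch using the open source DataStream API for Java (this is generic Flink rather than the managed Flink Actions feature itself, and the source elements and job name are placeholders). Each element flows through the pipeline and is transformed as it arrives; there is no batch window to wait for.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingJobSketch {
    public static void main(String[] args) throws Exception {
        // Entry point for a Flink job: locally this starts a mini-cluster;
        // in production it attaches to the running Flink cluster.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in source; in a real deployment this would be a Kafka topic
        // or another continuous event source.
        env.fromElements("reading-1", "reading-2", "reading-3")
                // Each record is transformed the moment it passes through
                // the operator, in one small step per event.
                .map(reading -> "processed: " + reading)
                .returns(Types.STRING) // type hint needed for Java lambdas
                .print();

        env.execute("green-streaming-sketch"); // placeholder job name
    }
}
```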
Frameworks like Apache Flink — managed via a data-streaming platform (Ed: could our erstwhile author here be referring to Confluent, perchance?) — also come with the added benefit of being serverless.
Serverless suitability
This serverless nature means that computing resources can be scaled up or down based on workload. As the volume of data being processed fluctuates, the infrastructure adjusts in real-time to handle the load efficiently without human intervention.
Not only does this lead to more responsive, efficient and cost-effective data processing, it also aligns with the principles of green coding by minimising idle computational resources. The infrastructure is used only when active processing occurs, limiting energy waste.
This approach also encourages developers to focus on writing efficient, event-driven functions that consume fewer resources — without having to account for the bottlenecks associated with batch processing. This shift in focus can ultimately help to promote sustainability throughout the entire software development lifecycle.
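To make that event-driven style concrete, here’s a deliberately vendor-neutral sketch (the class, method and event names are hypothetical and not tied to any particular serverless platform): a small, stateless handler that a serverless runtime would invoke once per event, scaling to zero between invocations so no process sits idle.

```java
// Hypothetical event-driven function: a serverless platform would call
// handle() once per incoming event and release the compute afterwards,
// so resources are consumed only while real work is being done.
public class EnrichEventHandler {

    // Stateless and scoped to a single event: no polling loop and no
    // idle daemon burning CPU while waiting for a batch window to fill.
    public String handle(String event) {
        // Placeholder transformation standing in for real enrichment logic.
        return event.trim().toLowerCase() + " | enriched";
    }

    // Local usage example: simulate the platform delivering one event.
    public static void main(String[] args) {
        System.out.println(new EnrichEventHandler().handle("  SensorReading-42 "));
    }
}
```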
Caveat emptor
Despite all of these benefits, there is of course one caveat.
While green coding and data streaming can only get you so far, much of what’s been described in this article still depends on your choice of cloud suppliers and the providers behind your data storage.
Having said that, more and more of these suppliers are coming to realise the huge benefits of going green. Not only is sustainability the right thing to do, it’s also highly marketable, saves on energy and saves on cost. Leading serverless providers often use renewable energy sources to power their data centres, further reducing the carbon footprint of Apache Flink and similar serverless computing frameworks. This shift towards green energy is integral to sustainable computing practices.
For those working in high-event data-driven industries, the combination of data streaming and a green cloud provider represents a major step towards a more sustainable, data-driven future.