Controlling the cloud with the industrial edge
October 18th, 2023
Cloud adoption across companies operating in physical industries - manufacturing, energy & more - has historically lagged other industries. This isn’t by accident, but by design. Protecting factory operations from cybersecurity intrusions, challenges in connecting to legacy equipment, regulatory requirements - there are plenty of critical reasons why cloud adoption can take time. And where it has happened, it has mainly been on the pure-play IT side. Think ERP, finance, accounting & HR. Core manufacturing operations, understandably, have required more time.
This is beginning to change. The last three decades or so have seen an entire universe of companies spring up focused primarily on data collection in core operational processes. In manufacturing speak, they are called data historians, and unless you work in the space, you have probably never come across them. Common examples include AVEVA (formerly Wonderware), FactoryTalk (Rockwell Automation), Factry, and OSI PI (also owned by AVEVA). At their core, they are time series databases designed for ingesting and compressing large volumes of operational logs & production data, with some visualization & reporting functions to boot.
Unfortunately, it’s not all roses and sunshine with this type of architecture. While often adept at data storage & retrieval, historians lack advanced data pipeline & analytics capabilities. Often the data needs to be extracted into other services where the real analysis & inference are conducted.
The cloud comes to the rescue…
Cloud services such as Azure or AWS overcome some of these limitations of traditional data historians by providing effectively unbounded analytical capability. Sure, it doesn’t matter if your data historian accrues extensive operational logs where 98% of the data is irrelevant or unusable. Stream or batch that data to one of the myriad cloud databases, pipelines, warehouses or lakes. Build a cloud-based processing framework to transform, reshape & extract relevant features. Feed this into a predictive maintenance ML model, also hosted in the cloud. Log the results (in the cloud) and triage alerts to other (cloud-based) systems. Job done.
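As a rough sketch of that “ship everything” pattern (the delivery stream name and record shape below are made up for illustration), the producing side can be as simple as pushing every raw reading into a managed stream and letting the cloud do the rest:

```python
import json
import boto3

# Hypothetical Kinesis Data Firehose delivery stream feeding a data lake or warehouse.
firehose = boto3.client("firehose", region_name="eu-west-1")

def ship_to_cloud(record: dict) -> None:
    """Push a single raw historian record into the cloud pipeline."""
    firehose.put_record(
        DeliveryStreamName="historian-raw-stream",  # assumed name
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

# Every reading is shipped, relevant or not - the filtering, feature
# extraction and ML inference all happen later, in the cloud.
ship_to_cloud({"sensor_id": "line-3-temp", "value": 81.4, "ts": 1697630400})
```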
Problem solved, right?
…until the cloud doesn’t
The sheer volume of data collected can come at a very significant cost for manufacturers, where bottom-line margin really matters. Manufacturers tend to have net profit margins in the single digits to low double digits, depending on the industry. Heavy investments in cloud infrastructure, if not properly monitored, can have a significant impact on profitability.
Even smaller companies can be hamstrung by rocketing cloud costs, especially if the focus is on AI or more compute-hungry analytics. As one anecdotal example amongst many, we’ve come across startups in the recycling sector using computer vision for quality control of waste passing through recycling facilities. Often the cost of streaming images to the cloud, where ML models conduct their inference & classification, exceeds the revenue potential of each customer.
This can render the entire business model - or, for larger companies, the ROI of advanced data analytics - unviable. So what can we do about it?
The trick is to distil relevant & highly informative data points outside the perimeter of cloud-based analytical services. This is more than just calculating summary statistics on historic data to then stream to the cloud. It is about moving the bulk of the analytical workload closer to the source of data generation. Anomaly detection, rolling window analysis, ML model inference … all are examples of compute that can take place on the shop floor in real time. In doing so, companies can cut cloud costs in two ways: a) minimising the volume of data transfer & storage; and b) reducing compute cost for analysis & inference performed in cloud services.
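To make that concrete, here is a minimal, hypothetical sketch of the kind of logic that can run on a gateway device - a rolling window over recent readings, where only anomalous values ever need to leave the site:

```python
from collections import deque
from statistics import mean, stdev

class EdgeAnomalyFilter:
    """Rolling-window z-score check that runs entirely on the gateway.
    Only readings flagged as anomalous need to be sent to the cloud."""

    def __init__(self, window_size: int = 300, threshold: float = 3.0):
        self.window = deque(maxlen=window_size)  # e.g. last 5 minutes at 1 Hz
        self.threshold = threshold

    def process(self, value: float) -> bool:
        """Return True if the reading should be escalated to the cloud."""
        is_anomaly = False
        if len(self.window) >= 30:  # wait for a minimal baseline
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        self.window.append(value)
        return is_anomaly

# Usage: stream readings through the filter; forward only the flagged ones.
detector = EdgeAnomalyFilter()
for reading in [70.1, 70.3, 69.8] * 20 + [95.0]:
    if detector.process(reading):
        print(f"anomaly detected, forwarding {reading} to the cloud")
```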
A significant benefit of the cloud relative to traditional on-prem services has been convenience & ease of use. By moving to the cloud, companies have historically been able to effectively outsource complex IT infrastructure maintenance, reducing overhead. This pendulum is beginning to swing in the opposite direction, driven by some powerful tailwinds:
Ever more powerful hardware. Off-the-shelf commodity hardware today is not only cost-effective but incredibly performant across all measures - memory, bandwidth, storage etc. We can now even run large ML models on compact edge hardware (e.g. Nvidia Jetson).
Edge runtime infrastructure. Deploying, updating and securing on-prem IT services used to be a difficult endeavour. Runtime solutions today (e.g. AWS Greengrass, Azure IoT Edge) have significantly improved the ability to run software applications on edge devices, with security considerations baked in (e.g. outbound-only connections, no port forwarding etc). There is still room for improvement (this is where Ferry’s deployment integration with AWS & Azure comes in!), but it is a significant improvement on what came before.
Software containers. Most applications today are deployed and run in containers - packages of software that contain all of the necessary elements to run in any environment, Docker being the most common. Their value is that engineers can build whatever application they want, package it once, and run it repeatedly across a variety of environments (cloud & edge alike), as the sketch below illustrates. Containers are key in bringing modern software development practices designed for the cloud to on-prem services.
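As a hedged illustration of that portability (the image name, mount path and restart policy are entirely made up), the same packaged application can be launched identically on a cloud VM or an industrial PC via the Docker SDK for Python:

```python
import docker

# Connect to the local Docker daemon - the same call works on a cloud VM,
# an industrial PC, or a Raspberry Pi acting as a gateway.
client = docker.from_env()

# Run a packaged edge application; the image carries everything it needs,
# so the identical artefact can be deployed on-prem or in the cloud.
container = client.containers.run(
    "ghcr.io/example/edge-archiver:1.0",  # hypothetical image
    detach=True,
    restart_policy={"Name": "unless-stopped"},
    volumes={"/data/buffer": {"bind": "/app/buffer", "mode": "rw"}},
)
print(f"started container {container.short_id}")
```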
Let’s work through an example
Inspired by examples we’ve seen in the agriculture sector, we simulate a case where a large-scale cattle producer is trying to make use of the vast amount of data that their feeders are collecting. Streaming that data straight to the cloud would be prohibitively expensive. Is there a way we can use edge compute to compress this information & store it in the cloud?
We start with a simple data emitter application that imitates the feeder machine. In this example, we use AWS Greengrass to deploy a simple script that publishes a record of data (temperature, humidity, feed consumption, timestamp) every second. We deployed it using Ferry to a Raspberry Pi we had lying around in our office. The data is published locally on the device via the MQTT protocol.
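A simplified sketch of what such an emitter could look like (using the paho-mqtt 1.x client against a local broker; the topic name, port and value ranges are illustrative, and a real Greengrass deployment would handle broker configuration & authentication differently):

```python
import json
import random
import time

import paho.mqtt.client as mqtt

# Connect to a local MQTT broker running on the device
# (port and authentication will differ in a production setup).
client = mqtt.Client()
client.connect("localhost", 1883)
client.loop_start()

TOPIC = "feeders/feeder-01/telemetry"  # assumed topic naming

while True:
    # Simulated feeder readings, one record per second.
    record = {
        "timestamp": int(time.time()),
        "temperature": round(random.uniform(18.0, 24.0), 2),
        "humidity": round(random.uniform(40.0, 60.0), 2),
        "feed_consumption": round(random.uniform(0.0, 1.5), 3),
    }
    client.publish(TOPIC, json.dumps(record))
    time.sleep(1)
```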
The next step is to write a data archiver that imitates an industrial gateway device (e.g. an industrial computer or PC) receiving this high-frequency data, performing some compression, and then sending the compressed files to the cloud. In this case, we store 5 minutes’ worth of data locally on the device in a CSV file, which we then compress to a Parquet file. Parquet is a powerful open-source column-oriented data file format built for efficient data storage and retrieval.
Once the data is compressed into a Parquet file, we upload it every 5 minutes to an S3 bucket in AWS and clear the local files on the device.
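A stripped-down sketch of that archiver logic (the bucket name, file paths and the use of pandas + boto3 are assumptions for illustration; pandas needs pyarrow installed to write Parquet):

```python
import os
import time

import boto3
import pandas as pd

BUCKET = "feeder-archive"          # assumed bucket name
CSV_PATH = "/tmp/feeder_buffer.csv"  # where the subscriber buffers raw readings
WINDOW_SECONDS = 5 * 60

s3 = boto3.client("s3")

def archive_window() -> None:
    """Compress the last 5 minutes of buffered readings and ship them to S3."""
    df = pd.read_csv(CSV_PATH)

    parquet_path = f"/tmp/feeder_{int(time.time())}.parquet"
    df.to_parquet(parquet_path, compression="snappy")

    # Upload the compressed file, then clear the local buffer.
    s3.upload_file(parquet_path, BUCKET, os.path.basename(parquet_path))
    os.remove(parquet_path)
    os.remove(CSV_PATH)

while True:
    time.sleep(WINDOW_SECONDS)
    if os.path.exists(CSV_PATH):
        archive_window()
```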
Each Parquet file comes out at ~25KB, which is highly compressed. In this case, across 576 files we had over 172k rows of data amounting to just ~14MB.
Of course, this can be enhanced. We could run some aggregation during the 5-minute window for a rolling average, and only upload the summary statistics to AWS. We could write a simple model with thresholds for key variables (e.g. temperature) that triggers an alert to a different service whilst capturing a window of data before and after the incident. We could store the data in a local SQL database (that is periodically flushed) with ETL data pipelines that aggregate data from multiple machines, run SQL queries or even an anomaly detection model, and pass the results to a destination - a minimal sketch of that last variant follows below. With the right tools, the possibilities are endless.
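A minimal sketch of that SQL-based variant, using SQLite (the table and column names are illustrative):

```python
import sqlite3

# Illustrative local buffer: readings from multiple machines land in one table.
conn = sqlite3.connect("/tmp/edge_buffer.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS readings (
           machine_id TEXT, ts INTEGER, temperature REAL, feed_consumption REAL)"""
)

# Per-machine 5-minute aggregates - this summary (a handful of rows) is what
# gets forwarded to the cloud, rather than every raw reading.
summary = conn.execute(
    """SELECT machine_id,
              (ts / 300) * 300      AS window_start,
              AVG(temperature)      AS avg_temp,
              MAX(temperature)      AS max_temp,
              SUM(feed_consumption) AS total_feed
       FROM readings
       GROUP BY machine_id, window_start"""
).fetchall()

# After forwarding the summary, the raw rows can be flushed.
conn.execute("DELETE FROM readings")
conn.commit()
```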
Not a binary choice
Unlike Neo in The Matrix, we’re not faced with a red pill/blue pill dilemma. The manufacturing operations of tomorrow will mesh together both edge & cloud services for an optimal balance of compute, cost & control. There are varying technical architectures for exactly how to do this, but that is a topic for our next article.
Meanwhile, we at Ferry provide analytical & data pipeline tools for industrial companies to build powerful data-driven workflows connecting frontline operations to enterprise applications. Edge compute is our world: we strive to give companies working in manufacturing, energy & more the best toolkit to go beyond data collection to build effective analysis & make informed operational decisions. And in doing so, strike the right balance between the cloud and on-prem.
In the example above, we used Ferry’s code tools to write a custom edge application imitating an edge data pipeline. We complement this with no-code data pipeline tools that can interface with a broad range of industrial systems, helping anyone quickly build & deploy powerful analytics and get the best value from their operational data.
If you’re interested in learning more, you can find more info on our website here, our public docs here, or reach out to me at dominic@deployferry.io!