Data engineering - TetraScience: From lab to enterprise, what scientific data teaches us
This is a guest post for the Computer Weekly Developer Network written by Justin Pront, senior director of Product at TetraScience.
TetraScience is known for its Tetra Scientific Data and AI Cloud and is the only vendor-neutral, open, cloud-native platform purpose-built for science applications – the company pledges to liberate, unify and transform raw data into more-than-FAIR, AI-native data.
Pront writes in full as follows…
When a major pharmaceutical company recently tried to use AI to analyse a year of bioprocessing data, they hit a wall familiar to every data engineer: their data was technically “accessible” but practically unusable.
Instrument readings sat in proprietary formats, critical metadata lived in disconnected systems and simple questions like “What were the conditions for this experiment?” required manual detective work across multiple databases.
Scientific data stress test
This scenario highlights a truth I’ve observed repeatedly: scientific data represents the ultimate stress test for enterprise data architectures. While most organisations grapple with data silos, scientific data pushes these challenges to their absolute limits: multi-dimensional numerical sets from a dizzying array of sensitive instruments, unstructured notes written by bench scientists, inconsistent key-value pairs and workflows so complex that even the shortest ones run to 40 steps.
The challenges faced by data engineers in life sciences offer valuable lessons that could benefit any data-intensive enterprise. Three key principles from scientific data engineering stand out: the shift from file-centric to data-centric architectures, the importance of preserving context from source through every transformation and the need for unified data access patterns that serve both immediate and future analysis needs.
Source data streams
In life sciences, data typically starts as physical measurements from lab instruments, each with its own proprietary format and margin of error. Scientists need this data for immediate analysis, AI/ML training, regulatory filings and long-term research. This parallels challenges in manufacturing, finance and other sectors where data originates from diverse sources and must serve both immediate needs and downstream analysis and data products.
The first lesson involves moving beyond file-based thinking. Like many business users, scientists traditionally view files as their primary data container. However, files segment information into limited-access silos and strip away crucial context. While this works for the individual scientist analysing their assay results to get data into their Electronic Lab Notebook (ELN) or Laboratory Information Management System (LIMS), it makes aggregate analysis, exploratory work and AI/ML engineering time- and labour-intensive. Modern data engineering should focus on the information itself, preserving the relationships and metadata that make data valuable. This means using platforms that capture and maintain data lineage, quality metrics and usage context.
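To make that concrete, here is a minimal sketch of a data-centric record in Python. The field names and values are hypothetical, not taken from any particular platform; the point is simply that a measurement travels with its lineage and metadata rather than being stranded in a file.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class InstrumentReading:
    """A single measurement carried with the context that makes it reusable."""
    value: str                    # raw value exactly as the instrument reported it
    unit: str
    instrument_id: str            # which instrument produced it
    source_file: str              # lineage: the file it was parsed from
    parser_version: str           # lineage: the transformation that produced this record
    acquired_at: datetime
    metadata: dict = field(default_factory=dict)   # e.g. sample ID, operator, method

# Illustrative values only
reading = InstrumentReading(
    value="7.40",
    unit="pH",
    instrument_id="ph-meter-03",
    source_file="runs/2024-06-01/batch_17.raw",
    parser_version="ph-parser 1.2.0",
    acquired_at=datetime(2024, 6, 1, 9, 30, tzinfo=timezone.utc),
    metadata={"sample_id": "S-1042", "operator": "jdoe"},
)
```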
The second lesson centres on data integrity during transformations.
Even minor data alterations in scientific work, such as omitting a trailing zero in a decimal reading, can lead to misinterpretation or invalid conclusions. This drives the need for immutable data acquisition and repeatable processing pipelines that preserve original values while enabling different data views.
In our heavily regulated environment, we’ve treated data integrity as non-negotiable, from acquisition at the file or source system through transformation and analysis.
Similar principles apply in financial reporting, healthcare records or any domain where data accuracy directly impacts decision-making.
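As an illustration of the same idea, the short Python sketch below (the values are hypothetical) keeps the acquired reading immutable and checksummed, while derived views are computed from it rather than written over it. Note how the trailing zero survives in the raw text and in a Decimal view, but not in a float.

```python
from decimal import Decimal
from hashlib import sha256

# Raw value captured exactly as reported, plus a checksum so later steps
# can prove it has not been altered.
raw_value = "1.20"                       # trailing zero preserved as text
raw_digest = sha256(raw_value.encode()).hexdigest()

# Derived views are computed from the raw value, never written back over it.
as_decimal = Decimal(raw_value)          # Decimal("1.20") keeps the precision
as_float = float(raw_value)              # 1.2 -- the trailing zero is gone

# A repeatable pipeline re-runs the same pure transformation and verifies
# the input is still the value that was originally acquired.
assert sha256(raw_value.encode()).hexdigest() == raw_digest
print(as_decimal, as_float)              # 1.20 1.2
```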
Accessibility vs. utility
The third lesson addresses the tension between immediate accessibility and future utility. Scientists want (and need) seamless access to data in their preferred analysis tools, so they fall back on generalised desktop tooling such as spreadsheets or local visualisation software. That’s how we end up with more silos.
Organisations need cloud data sets co-located with their analysis tools so scientists keep the same quick analysis, while the entire enterprise benefits from data that is prepped and ready for advanced applications, AI training and regulatory submissions. Data lakehouses built on open storage formats like Delta and Iceberg emerged in response to these needs, offering unified governance and flexible access patterns.
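For illustration, here is a minimal sketch using the open-source deltalake (delta-rs) Python bindings, one of several ways to work with these open table formats. The table path and column names are assumptions for the example, not details from any specific deployment.

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Hypothetical parsed instrument readings
readings = pd.DataFrame({
    "sample_id": ["S-1042", "S-1043"],
    "ph": [7.40, 7.18],
    "instrument_id": ["ph-meter-03", "ph-meter-03"],
})

# Write once to an open Delta table (a local path here; an s3:// or abfss://
# URI would work the same way with the right storage credentials).
write_deltalake("./lakehouse/readings", readings, mode="append")

# Any engine that speaks Delta (Spark, DuckDB, pandas, BI tools) can then
# read the same governed copy instead of a private desktop extract.
df = DeltaTable("./lakehouse/readings").to_pandas()
```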
However, the underlying principles – preserving context, ensuring data integrity and enabling diverse analytical workflows – apply far beyond scientific domains and use cases.
These challenges have led to practical solutions that any organisation can adopt:
- Data processing architectures that ingest files and quickly move them into SQL tables for analysis (see the sketch after this list)
- Unified data storage formats that combine the flexibility of data lakes with the performance of warehouses
- Platforms that can make data stored in the cloud accessible to specialised desktop analysis tools
- Metadata management systems that track data lineage and usage patterns
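As a sketch of the first pattern above, here is how a hypothetical ingestion step might land files into SQL tables using DuckDB; the paths and schema are assumptions for illustration only.

```python
import duckdb

con = duckdb.connect("lab.duckdb")

# Land parsed instrument output (e.g. CSV files dropped by an ingestion job)
# directly into a SQL table that analysts and pipelines can query immediately.
con.execute("""
    CREATE OR REPLACE TABLE readings AS
    SELECT * FROM read_csv_auto('landing/readings/*.csv')
""")

# From here the data is queryable like any other warehouse table.
rows = con.execute(
    "SELECT instrument_id, avg(ph) FROM readings GROUP BY instrument_id"
).fetchall()
```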
These principles and solutions will become increasingly relevant as more industries adopt AI and face stricter regulatory requirements.
The scientific approach to data engineering, emphasising context preservation, data integrity and flexible access patterns, offers a blueprint for managing complex data ecosystems in any domain.