1  Introduction

Traditional biological research typically follows an exploratory approach, where hypotheses about unknown biological phenomena are developed and tested through repeated observation. With the advent of next-generation sequencing technology in the mid-2000s, it has become increasingly feasible to correlate phenotypic observations with corresponding genetic data on a large scale. This development, facilitated by advanced bioinformatics tools that can manage and interpret vast datasets, has enabled large-scale, quantitative analysis of gene expression data across a wide variety of organisms, including humans. A key innovation in this field is reverse engineering technology, which tracks gene expression over time to elucidate gene interactions and regulatory relationships, offering valuable insights into complex biological networks.

However, the vast amount of data generated by high-throughput methods only provides static snapshots of dynamic biological processes, making it challenging to fully understand their complexity. To address this, researchers have adopted a bottom-up approach, first identifying relationships between genes and then introducing those genes into cells to manipulate biological functions directly. This field is referred to as synthetic biology, where researchers aim to build biological systems that function in a predefined manner. Synthetic biology applies engineering principles1 such as standardization, abstraction, and modularization to the design and manipulation of biological systems.

Standardization refers to the appropriate characterization and creation of standardized languages2, protocols, and biological parts, such as genes, regulatory elements, enzymes, and chassis, that can be used interchangeably across different projects. The goal is to create a catalog of “biological parts” that function reliably and predictably when combined, enabling faster and replicable assembly of complex biological processes3. A critical aspect of synthetic biology is the strategic assembly of DNA components that function optimally within cells. This process involves identifying and quantifying DNA parts, then carefully engineering them into functional genetic circuits using computer-aided design tools. The Registry of Standard Biological Parts, maintained by iGEM, is a prominent example of this concept, offering a library of pre-characterized genetic elements that can be assembled to form new synthetic pathways or organisms4. Overall, this promotes reproducibility5, collaboration, and biosafety6.

Abstraction involves simplifying the complex, intricate details of biological systems by focusing on higher-level functions and processes7. By abstracting biological components, synthetic biologists can design and manipulate genetic circuits without needing to fully understand or model every molecular interaction in a system. This allows them to target specific functionality in terms of hierarchies8 and modular units9, for example treating a gene separately from its accompanying regulatory elements, or a whole protein separately from the individual domains that serve different purposes. In doing so, the overall design process becomes more efficient because the complexity inherent to living organisms is managed rather than modeled in full.

Modularization is the practice of breaking down a biological system into smaller, self-contained units or “modules” that can be independently designed, tested, and assembled into more complex systems10. Each module typically performs a specific function, such as gene expression, signal transduction, or metabolic conversion11. This modular approach allows for greater flexibility and scalability in engineering, as researchers can mix and match modules to create a wide range of customized and diverse systems. Once optimized in this manner, biological parts from one organism can also be deployed in other, less tractable chassis12.

The Design-Build-Test-Learn (DBTL) cycle, a foundational methodology of synthetic biology, also adopts engineering principles to enhance reproducibility and efficiency in biological research. Yet, the Build and Test phases pose considerable challenges due to their labor-intensive and time-consuming nature. For instance, the assembly of three DNA fragments often requires a week per circuit for a skilled graduate student. Furthermore, optimizing the expression of a gene with promoter, RBS, and terminator parts, with three variants of each, necessitates creating and testing 3 × 3 × 3 = 27 distinct DNA combinations. This process could extend a graduate student’s workload to four to six months for the construction of an optimal genetic circuit.
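To make this combinatorial growth concrete, the short sketch below enumerates such a design space in Python. The part identifiers are illustrative examples of commonly used registry parts, not the specific parts discussed here.

```python
# Illustrative sketch: enumerating candidate circuit designs from three
# variants each of promoter, RBS, and terminator. Part names are examples
# of commonly used registry parts, not the specific parts referenced above.
from itertools import product

promoters   = ["J23100", "J23106", "J23114"]   # example promoter variants
rbss        = ["B0030", "B0032", "B0034"]      # example RBS variants
terminators = ["B0010", "B0012", "B0015"]      # example terminator variants

designs = list(product(promoters, rbss, terminators))
print(len(designs))  # 3 * 3 * 3 = 27 combinations, each built and tested
```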

On an industrial scale, the Build and Test stages present similarly formidable challenges. The development and commercialization by Amyris of Artemisinin, a treatment for malaria, demanded an investment of $15 million and ten years, equivalent to roughly 150 person-years. Similarly, DuPont’s production of 1,3-propanediol required 575 person-years, highlighting the extensive resource commitment in synthetic biology projects13. These instances underline the significant time and labor investments needed to harness the benefits of synthetic biology, explaining why the field, which emerged in the early 2000s, took about a decade to prove its value.

The incorporation of robotics into biology has significantly reduced the manual labor and time constraints traditionally associated with biological experiments. This advancement was pioneered by Dr. Ross D. King in 200414, who developed the robotic scientists Adam and Eve to autonomously investigate yeast metabolism and enzyme discovery15. Despite their innovative potential, the application of these robotic scientists has faced limitations due to the need for specialized automation equipment tailored for specific research projects. This requirement often limits the flexibility needed to conduct a wide array of complex biological experiments, leading to significant costs and efforts to adapt commercial equipment for various research needs.

However, the fundamental principles of synthetic biology, including standardized parts, assembly techniques, and an engineering-focused DBTL strategy, have demonstrated significant compatibility with automation and robotics. This synergy has not only enhanced research but also found practical industrial applications. A notable example is LanzaTech’s production of ethanol from greenhouse gases using its Clostridia biofoundry (cBioFab)16, which combines cell-free systems with machine learning technology to achieve remarkable industrial breakthroughs17,18 in producing 1-hexanol and hexanoic acid (>100x improvement).

A biofoundry integrates the entire DBTL cycle, from initial DNA design to testing cells modified with that DNA, by systematically linking automated hardware and software to streamline the process from gene synthesis to the final product. A key strategy in constructing an efficient biofoundry is the careful design of each DBTL stage to ensure seamless operation and eliminate bottlenecks, significantly improving development speed and throughput. For instance, Ginkgo Bioworks has shown considerable efficiency gains, with its throughput increasing two to four times annually between 2014 and 2020. In 2017, the company achieved a milestone when the cost of testing per strain became lower than that of manual handling19. Amyris, having launched its biofoundry in 2011, has successfully commercialized 15 new substances at a consistent rate over the last seven years20, marking a productivity increase of more than twentyfold compared to its earlier efforts with Artemisinin. These developments underscore the transformative impact of biofoundries on synthetic biology, providing scalable and cost-efficient solutions for bioproduct development. By optimizing the bioproduct development process, biofoundries substantially reduce the time and cost associated with bringing new products to market, representing a significant advancement in biotechnological innovation.

1.1 Considerations required for biofoundry development

1.1.1 Hardware

The concept of a biofoundry often evokes images of automated robotics operating within a biological laboratory. Given this association, automation is deemed essential for the development of biofoundries. However, the deployment of advanced automation machinery, including robotic arms, in fully automated biofoundries requires careful attention to planning21. While devices like liquid handlers are crucial for high-throughput well plate-based experiments, incorporating robotic arms with other automated devices, such as automated thermal cyclers or automated incubators, significantly raises both initial and ongoing maintenance costs. Adopting a semi-automated DBTL cycle is one of the strategies that offer maximum flexibility at minimal cost.

Currently, the availability of automated equipment is restricted to a handful of global companies. To broaden the options for users and ensure the provision of high-quality services, ongoing development of such equipment is essential. This necessitates long-term and continuous investment, supported by governmental policies, especially given the current limited market availability. One way to reduce costs is to make use of community labs and open DIY hardware22. However, these rely on the availability of materials and are difficult to sustain for long-term projects. The emphasis in hardware development should be on enhancing the connectivity and flexibility of devices to boost the overall efficiency of the DBTL cycle, rather than focusing solely on individual device specifications.

Furthermore, an increase in hardware throughput does not guarantee an improvement in overall biofoundry performance. One of the major reasons automated equipment may remain underutilized or idle in universities or research institutions is that data management and analysis fail to keep pace with the hardware’s throughput. In this context, the workflows and software discussed in the subsequent sections are pivotal considerations prior to hardware deployment.

1.1.2 Workflows

Before delving further into biofoundry development, it is necessary to define some key terms associated with biofoundry operation for ease of understanding. We introduce two fundamental terms: ‘workflow’ and ‘unit process.’ A unit process is the smallest protocol executed by an automated machine, while a workflow is a sequence of unit processes organized to produce a desired product. The terms ‘protocol’ and ‘process’ are employed broadly. Both unit processes and workflows can be formulated as standard operating procedures (SOPs), which are detailed instructions designed to standardize and enhance the reproducibility of operations.
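As a minimal sketch of how these two terms might map onto data structures in a biofoundry's software, the Python example below models a unit process and a workflow; the field names and methods are illustrative assumptions, not the schema of the software described in this thesis.

```python
# Minimal sketch of 'unit process' and 'workflow' as data structures.
# Field names and methods are illustrative assumptions, not a real schema.
from dataclasses import dataclass, field

@dataclass
class UnitProcess:
    name: str            # e.g. "pcr_setup" or "heat_shock_transformation"
    instrument: str      # automated machine that executes the protocol
    duration_min: int    # nominal run time, useful later for scheduling
    sop_id: str          # reference to the associated SOP document

@dataclass
class Workflow:
    name: str
    steps: list[UnitProcess] = field(default_factory=list)

    def total_duration(self) -> int:
        """Sum of nominal unit-process run times (ignores transfers)."""
        return sum(step.duration_min for step in self.steps)
```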

In constructing a biofoundry, workflows are developed by re-optimizing established manual protocols. Manual protocols frequently overlook or oversimplify sample preparation steps that are assumed to be obvious. Conversely, automated protocols demand more detailed specifications, including the quantity, position, and handling of reagents and equipment, especially when using multi-well plates.

Developing a workflow involves assessing the level of automation feasible with the available equipment and consumables. The initial level may resemble manual operations, progressing to the use of liquid handlers capable of managing experiments based on 96- or 384-well plates. Utilizing high-throughput equipment not only enhances experiment throughput but also improves reproducibility. Establishing quantitatively measurable metrics within this tiered system of automation is crucial for ensuring the reliability and interoperability of workflows.

To enhance a biofoundry’s efficiency and productivity, it is essential to reduce the operational costs associated with workflows and to enable the simultaneous running of multiple workflows. This challenge can be tackled in two ways. One approach is consolidating samples onto a single plate when different researchers request the same workflow. Alternatively, when different workflows are requested, they can be integrated by considering the run times of each unit process and scheduled as a single workflow. These scheduling strategies, more suitable for semi-automated biofoundries, are essential for enhancing biofoundry performance, given the limited number of machines. Additionally, waste in research, often due to avoidable design flaws or incomplete experiments, is a significant concern. Over 50% of research remains unpublished, half of which is attributable to preventable errors23. An integrated biofoundry can minimize these mistakes by facilitating more informed decisions from a broader range of experiments.
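As a minimal sketch of the first strategy, under the assumptions of a 96-well plate capacity and a simple list of requests, the Python example below pools identical workflow requests from different researchers onto shared plates; the function and field names are hypothetical.

```python
# Sketch of the sample-consolidation strategy: requests for the same workflow
# are pooled onto shared plates. The 96-well capacity, input format, and
# function names are assumptions for illustration only.
from collections import defaultdict

def consolidate_requests(requests, plate_capacity=96):
    """requests: list of (workflow_name, sample_id) tuples from researchers."""
    by_workflow = defaultdict(list)
    for workflow_name, sample_id in requests:
        by_workflow[workflow_name].append(sample_id)

    plates = []
    for workflow_name, samples in by_workflow.items():
        # Split each pooled sample list into as few plates as possible.
        for i in range(0, len(samples), plate_capacity):
            plates.append((workflow_name, samples[i:i + plate_capacity]))
    return plates  # each entry corresponds to one plate-scale run
```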

1.1.3 Key Gaps in Workflows

  • Fragmented Workflows: Many biofoundries have disjointed processes across different stages of research and production, making it challenging to maintain a seamless flow from design to implementation.

  • Lack of Standard Operating Procedures (SOPs): Inconsistent or absent SOPs can lead to variability in experiments, making reproducibility difficult and complicating data comparison.

  • Integration of Automation: While automation is essential for scaling processes, many workflows do not fully integrate automated systems, leading to inefficiencies and increased manual work.

  • Data Management and Sharing: Ineffective data management practices can result in difficulties in data access and sharing among team members, hampering collaboration and progress.

  • Feedback Loops: There may be insufficient mechanisms for feedback between different stages of the workflow, preventing insights from experimental results from being effectively integrated into future designs.

  • Cross-Disciplinary Collaboration: Workflows often lack structures that promote collaboration across disciplines, which is crucial in a multidisciplinary field like synthetic biology.

  • Real-Time Monitoring: Many workflows do not incorporate real-time monitoring and analytics, which can help in making timely adjustments and decisions during experiments.

  • Scalability Protocols: Procedures for scaling up from lab-scale experiments to large-scale production are often underdeveloped, posing challenges for commercialization.

  • Training and Onboarding: Inefficient onboarding processes for new personnel can slow down productivity and lead to misunderstandings about protocols and workflows.

1.1.4 Software

Despite the increasing necessity for constructing biofoundries, the availability of software specifically designed for biofoundry operations remains limited. While some existing solutions for Electronic Laboratory Notebooks (ELN) or Laboratory Information Management Systems (LIMS)24 are available, they often do not meet the unique requirements of biofoundry operations comprehensively. Furthermore, the cost of these solutions can be prohibitive for use in an extensively scaled biofoundry environment with a large team.

Developing software for biofoundry operations is indispensable. Creating software that coordinates biological experiments across various stages of the DBTL cycle requires the collaborative, intensive efforts of IT engineers and biologists. To address this challenge, we advocate for a ‘rapid prototyping and soft integration strategy.’ This approach emphasizes the quick development of specific functionalities necessary for biofoundry operations and their integration with existing tools, like SnapGene and GitLab. Modern frameworks for developing web-based applications, such as Python’s Streamlit or R’s Shiny, are invaluable for rapidly producing streamlined software applications. Integrated Development Environments (IDEs) like Visual Studio Code (VSCode) or RStudio are instrumental in managing the entire DBTL cycle, thereby significantly boosting the efficiency and synchronization of biofoundry operations. This approach of ‘rapid prototyping and soft integration’ underscores the importance of continuous software maintenance and updates to support equipment and technological advancements, alongside evolving scientific interests.
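In the spirit of this rapid-prototyping approach, the sketch below shows how a minimal Streamlit page could expose experimental results to a team; the file name, column assumptions, and dashboard content are hypothetical rather than the software developed in this thesis.

```python
# streamlit_app.py -- minimal rapid-prototyping sketch; the dashboard content
# and CSV assumptions are hypothetical. Run with: streamlit run streamlit_app.py
import pandas as pd
import streamlit as st

st.title("Biofoundry run dashboard (prototype)")

uploaded = st.file_uploader("Upload plate-reader results (CSV)", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.dataframe(df)                      # quick tabular view for the team
    numeric_cols = df.select_dtypes("number").columns.tolist()
    if numeric_cols:
        col = st.selectbox("Column to plot", numeric_cols)
        st.bar_chart(df[col])             # one-line visualization of a metric
```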

Concerning the functional requirements of biofoundry software, a major challenge in biofoundry design is operational cost. It is imperative to reduce the consumption of consumables, time, and labor in individual workflows. Initially, efforts should concentrate on optimizing workflows within a semi-automated system. Software that persistently monitors the availability of automated equipment and materials plays a vital role in maximizing resource use and coordinating the execution of workflows from various users. Moreover, given the extensive number of samples processed by a biofoundry, its operational software must manage a significantly larger volume of equipment, materials, experiments, and operations than a conventional laboratory. This necessitates a seamless system to ensure equipment availability, smooth supply of materials, and swift data analysis for designing subsequent high-throughput experiments. Furthermore, constructing an operational system capable of controlling automated devices from diverse manufacturers requires APIs that can translate the interactions between user-designed workflows and automated equipment into a universally understandable language, such as JSON.
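As a small sketch of what such a vendor-neutral representation could look like, the example below serializes a single unit-process command to JSON; the schema, field names, and device identifiers are assumptions for illustration rather than an established standard.

```python
# Sketch of translating a user-designed unit process into a vendor-neutral
# JSON payload; the schema, field names, and device IDs are assumptions.
import json

unit_process = {
    "workflow": "promoter_library_assembly",
    "step": "dispense_mastermix",
    "device": "liquid_handler_01",
    "labware": {"source": "reservoir_A1", "destination": "pcr_plate_96"},
    "parameters": {"volume_ul": 15, "wells": ["A1", "A2", "A3"]},
}

payload = json.dumps(unit_process, indent=2)
print(payload)  # handed to a device-specific driver or API for execution
```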

While these endeavors can be demanding and might necessitate collaboration with equipment vendors, initiating software development that monitors equipment status and suggests optimized workflows in a semi-automated system is feasible. The adaptability and compatibility offered by such a system are crucial for enhancing accessibility and interoperability among biofoundry facilities.

1.1.5 Key Gaps in Software

  • Data Integration: Many biofoundries utilize disparate systems for data collection and analysis, leading to challenges in integrating and interpreting data from various sources.
  • User-Friendly Interfaces: Existing software often lacks intuitive interfaces, making it difficult for non-experts to use advanced tools and access data effectively.
  • Collaboration Tools: There is often a shortage of robust collaboration platforms that facilitate communication and project management among multidisciplinary teams.
  • Standardized Workflows: A lack of standardized software workflows can lead to inconsistencies in experimental processes and results, hindering reproducibility.
  • Simulation and Modeling Tools: Advanced tools for simulating biological processes and modeling systems biology are often limited or not user-friendly, which restricts their use in design and optimization.
  • Automated Data Analysis: There is a need for more sophisticated software solutions for automated data analysis and interpretation, particularly for large datasets generated by high-throughput experiments.
  • Visualization Tools: Enhanced visualization software is often needed to effectively represent complex biological data and results, making them more accessible to researchers and stakeholders.
  • Version Control: Effective version control systems for experimental protocols, data, and software tools are often not well-integrated, leading to potential errors and confusion.
  • Interoperability: Many biofoundry software tools lack interoperability, making it difficult to exchange data and functionalities between different platforms and tools.
  • Regulatory Compliance: Software solutions that aid in ensuring compliance with regulatory requirements are often lacking, making it difficult to manage documentation and quality control.

1.2 Objectives of the Thesis

Considering the gaps in biofoundry development, we aim to address the following across the chapters.

  1. Integration of High-Throughput Tools, Computational Models and Biofoundries: To develop and validate computational models and experimental tools that can predict biosensor/protein/DNA activity, stability, and function based on sequence data. By integrating these tools into biofoundry workflows, the design process can be streamlined, the predictability of outcomes improved, and the tools made accessible to other biofoundries.
  2. Designing effective experimental strategies: To develop reproducible, standardized, and data-centric experimental methods that facilitate the integration of computational tools and models. Most current computational models rely on publicly available data, which suffers a high attrition rate due to publication survivorship bias and often lacks sufficient experimental negative data. It is therefore important to design experiments that maximize the amount of data collected without compromising on quality.
  3. Artificial Intelligence Applications: To explore the use of artificial intelligence algorithms in analyzing large datasets generated from high-throughput techniques, employing techniques such as deep learning to uncover hidden patterns and correlations that can inform the design of novel proteins with desirable properties.
  4. Optimization of Biosensor Activity: To utilize high-throughput experimental and computational tools to optimize various parameters in a sensor protein’s activity, such as signal and background.

The development of workflows and the encompassing software will focus on protein engineering; however, they can be applied to the engineering and optimization of other biological parts.

1.3 Thesis Outline

Chapter 2: Computational Tools for Streamlined Biofoundry Workflows

Chapter 2 addresses the development of workflows, software, and tools to streamline processes common in synthetic biology and protein engineering using biofoundries. Here, tools for data visualization, analysis, logging, sharing, modeling, and machine operability are presented. Case studies in the following chapters illustrate the successful application of these tools in biofoundries, demonstrating their impact on data management and project outcomes. The chapter focuses on tools programmed in Python, with user interfaces developed in Streamlit, which provide a foundation for the subsequent chapters.

Chapter 3: Data Collection for AI-based Protein Engineering

Chapter 3 examines the importance of data collection when employing AI. It discusses high-throughput methods such as biosensors, which enable real-time monitoring of protein activity; Fluorescence-Activated Cell Sorting (FACS), which facilitates rapid screening of protein variants; and long-read sequencing, which provides one-shot comprehensive genomic information. The chapter emphasizes the integration of data from these diverse sources to enhance AI model training, and discusses different experimental and deep learning strategies, while also addressing challenges related to data variability and quality. The versatility of the method is demonstrated by engineering the transcription factor DmpR and the enzyme MPH as a proof of concept.

Chapter 4: Biofoundry based production and multiplexed identification of mutants

Chapter 4 discusses the innovative use of biofoundries for the rapid production and identification of protein mutants (1–3 site-directed mutations or newly generated proteins), focusing on an automated workflow that can construct and analyze more than 2,034 samples simultaneously using Nanopore sequencing technology. The chapter highlights the efficiency of automated systems in mutant construction, which leverage synthetic biology tools for high-throughput experimentation. It emphasizes the advantages of Nanopore sequencing, allowing real-time sequencing of long DNA strands for multiplexed identification of genetic variations. This integration not only accelerates the screening process but also expands opportunities in protein engineering. Challenges such as data analysis and mutant characterization accuracy are acknowledged, while future advancements in sequencing and automation promise to enhance these workflows further. The chapter also discusses the use of structure prediction models to analyze the functionality of predicted proteins.

Chapter 5: Assembling Long DNA Sequences for Protein Engineering

Chapter 5 focuses on critical techniques for assembling long DNA sequences, which are essential for the development of novel proteins or organisms. Here, Dsembler, a computational tool aimed at identifying the oligomer combinations most likely to yield a successful long DNA assembly, is introduced.

Chapter 6: Conclusion

Chapter 6 presents a comprehensive analysis of the results, aiming to deliver a clear understanding of the overall research findings. The chapter summarizes the overall conclusions of the study, emphasizing the significance of cohesive architecture and interdisciplinary collaboration in advancing biofoundry development.