lakeFS 1.0: Revolution in Data Versioning by Treeverse

lakeFS 1.0 is a milestone release from Treeverse that lets data teams track changes to their data much as developers use Git and platforms like GitHub to track changes to code. With this release, the team behind the open-source project has ushered data version control into a new era.

The Journey of lakeFS

Launched in 2020, the lakeFS project has come a long way, steadily expanding its capabilities. The core idea: offer an open-source solution for version control over data lakes built on object storage. By 2021, Treeverse had secured a whopping $23 million in funding, signaling substantial industry support for its vision.

In 2022, Treeverse launched lakeFS Cloud, a managed version control service for data in the cloud. The innovation isn’t going unnoticed, either: heavyweights like Lockheed Martin, Volvo, and Arm have all hopped on the lakeFS train.

But that’s not all. lakeFS 1.0 is built to work alongside other data lake technologies, including favorites like Databricks and the fast-growing open-source Apache Iceberg table format.

lakeFS: Not Just Another Data Version Control Tool

Have you ever heard of Git? It’s the version control system that tracks changes to code, used universally through platforms like GitHub. lakeFS, inspired by Git, aims to do the same for the data in data lakes.

Einat Orr, the Co-founder and CEO of Treeverse, explained the depth of lakeFS’s capabilities. While other technologies might allow versioning of tables or schemas, lakeFS ensures comprehensive version control over the entire data lake. This means it’s possible to version complete data workflows and pipelines. Plus, it preserves essential metadata about each version, promoting integration and reproducibility.
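
To make that concrete, here is a minimal sketch of Git-style operations against a data lake, using the high-level lakefs Python SDK (pip install lakefs). The repository name, paths, and commit metadata are placeholders, and exact method names may vary between SDK versions:

```python
import lakefs

repo = lakefs.repository("example-repo")  # hypothetical repository

# Creating a branch is a metadata-only operation; no objects are copied.
experiment = repo.branch("dedup-fix").create(source_reference="main")

# Write a new version of a file on the branch, isolated from main.
experiment.object("datasets/events.parquet").upload(data=b"...parquet bytes...")

# Commit with metadata so the version carries its own context.
experiment.commit(
    message="Recompute events with deduplication fix",
    metadata={"pipeline": "events-etl", "run_id": "2024-07-01"},
)

# Merge the validated change back into main.
experiment.merge_into(repo.branch("main"))
```

Because a branch is essentially a set of metadata pointers, creating one is effectively free regardless of how much data the lake holds, which is what makes this workflow practical at data lake scale.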

However, Treeverse isn’t pitching lakeFS as a rival to Databricks or Apache Iceberg. Instead, it positions lakeFS as a complementary technology that layers additional capabilities on top of them. Interestingly, Orr highlighted that lakeFS also plays well with data orchestration tools like Apache Airflow, Prefect, and Dagster, an integration that further amplifies its value in data pipeline management.
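
As a hedged sketch of what that orchestration integration can look like, the hypothetical Airflow DAG below gives each pipeline run its own isolated lakeFS branch, commits the run’s output, and merges it back only on success. lakeFS also publishes a dedicated Airflow provider with ready-made operators; this sketch sticks to a plain TaskFlow DAG and the same assumed lakefs SDK as above, with the repository name and paths as placeholders:

```python
# Assumes Airflow 2.4+ (for the `schedule` argument) and the
# high-level `lakefs` SDK; repository name and paths are hypothetical.
import re
from datetime import datetime

from airflow.decorators import dag, task
import lakefs

REPO = "example-repo"  # hypothetical repository


@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def versioned_pipeline():
    @task
    def create_branch(run_id=None) -> str:
        # Airflow injects run_id; sanitize it into a valid branch name.
        name = "run-" + re.sub(r"[^0-9A-Za-z_-]", "-", run_id)
        lakefs.repository(REPO).branch(name).create(source_reference="main")
        return name

    @task
    def transform(branch_name: str) -> str:
        # Write the pipeline's output to the run's branch and commit it.
        branch = lakefs.repository(REPO).branch(branch_name)
        branch.object("reports/summary.csv").upload(data=b"total,42\n")
        branch.commit(message="Pipeline output",
                      metadata={"dag": "versioned_pipeline"})
        return branch_name

    @task
    def publish(branch_name: str) -> None:
        # Promote the run's output into main in a single merge.
        repo = lakefs.repository(REPO)
        repo.branch(branch_name).merge_into(repo.branch("main"))

    publish(transform(create_branch()))


versioned_pipeline()
```

If a task fails, main is never touched; the run’s branch simply holds the partial output for debugging.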

AI, Machine Learning, and lakeFS

The power of lakeFS isn’t restricted to traditional data versioning alone. Its potential in the realm of AI and machine learning is turning heads.

Take the new lakeFS local capability, for instance. Even when the full dataset lives in the lake, data scientists can now use lakeFS to version the data they work with on their own machines. This is particularly handy during the model development and testing phases; as Orr elaborates, researchers often work on their local systems during development, making this feature a boon.
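
In practice this capability is surfaced through the lakectl command-line tool (its lakectl local subcommands sync a lakeFS path with a local directory). As a rough Python illustration of the underlying idea rather than the feature itself, the sketch below pins a dataset to a lakeFS reference and materializes it on disk for local experimentation, again assuming the high-level lakefs SDK; the repository, prefix, and directory are placeholders:

```python
# Illustration of the idea behind lakeFS local, not the feature itself.
# Listing/reading method names approximate the SDK and may differ.
from pathlib import Path

import lakefs

repo = lakefs.repository("example-repo")
# Pinning an immutable commit ID instead of "main" makes the local
# copy exactly reproducible.
ref = repo.ref("main")

local_dir = Path("data/train")

# Materialize every object under a prefix into the local directory,
# preserving the relative layout.
for info in ref.objects(prefix="datasets/train/"):
    target = local_dir / Path(info.path).relative_to("datasets/train")
    target.parent.mkdir(parents=True, exist_ok=True)
    with ref.object(info.path).reader(mode="rb") as src:
        target.write_bytes(src.read())
```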

Speaking of the road ahead, Orr shared a glimpse of Treeverse’s vision. The company is in the early stages of adding data version control for vector database technologies. “Our vision is to be the version control tool running over all your data sources. We aim to let you version control your data pipelines, regardless of the data’s location,” Orr remarked.

“We have a large base of installations and really a product that reflects what people need for data version control over a data lake,” said Einat Orr during her chat with VentureBeat.

The Significance of Data Versioning

As the world dives deeper into the digital age, the importance of data continues to grow. Organizations are not just collecting data but striving to make the most of it. And with vast amounts of data comes the inevitable need for better organization, categorization, and versioning. But why is data versioning so critical?

Think of the countless times developers rely on version control systems like Git: they allow for systematic review of changes, efficient collaboration, and the ability to revert to previous versions when errors occur. Translate this into the world of data, with the intricate and expansive nature of data lakes, and the importance of such control becomes evident.

The Challenges in Data Versioning

One might wonder: if versioning is so pivotal, why isn’t every organization doing it? The reality is that data versioning, especially in massive data lakes, is far from straightforward. The challenges lie in managing metadata, ensuring data integrity across versions, and minimizing the storage costs associated with keeping multiple versions of the data.

Moreover, compliance adds another layer of complexity for businesses in highly regulated industries. Regulations often dictate how data must be managed, stored, and archived, so versioning isn’t just a technical challenge but a regulatory one.

How lakeFS is Addressing the Challenges

lakeFS isn’t just another tool on the market. What sets it apart is its comprehensive approach to data versioning: offering version control over the entire data lake addresses many of these challenges at once. It ensures data integrity, provides precise metadata management, and, just as importantly, integrates seamlessly with the data tools and systems organizations already use, which makes adoption smoother.
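
One concrete example of that integration story is lakeFS’s S3-compatible gateway: existing S3 clients and tools can talk to a lakeFS repository by treating the repository as a bucket and prefixing object keys with a branch or commit reference. A minimal boto3 sketch, with the endpoint, credentials, repository, and paths all placeholders:

```python
import boto3

# Point a standard S3 client at the lakeFS server (placeholder URL
# and credentials). The repository acts as the bucket; keys are
# prefixed with a branch or commit reference.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="<lakefs-access-key-id>",
    aws_secret_access_key="<lakefs-secret-access-key>",
)

# Read datasets/events.parquet as of the main branch.
resp = s3.get_object(Bucket="example-repo", Key="main/datasets/events.parquet")
payload = resp["Body"].read()

# Writes target a branch the same way; the change stays uncommitted
# until a commit is made through lakeFS itself.
s3.put_object(Bucket="example-repo", Key="dev/datasets/events.parquet", Body=payload)
```

Because the version lives in the key prefix, the same client can read from main and write to a development branch without any lakeFS-specific code.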

Furthermore, as data science and AI become integral parts of business strategies, having a tool like lakeFS that caters to the nuanced needs of these domains—like local versioning for model testing—is invaluable.

The emergence of tools like lakeFS signifies the evolving landscape of data management. As organizations grapple with ever-larger volumes of data, robust version control systems will be a game-changer, and integrating these tools with AI and machine learning workflows points to a future of more seamless data management and utilization.
