
Instead of rainbow trout, this lake is stocked with massive amounts of research data in all shapes, sizes, and formats. If you’re new to the data lake concept, or if you’ve heard the term tossed around in SMART FIRES circles and wondered what it was, this article is for you.
So… What Is a Data Lake?
A data lake is a scalable digital storage area that can hold any kind of research data, whether raw or cleaned, tiny or huge, structured or unstructured. Unlike a database, which needs tidy, well-organized information, a data lake happily stores research files exactly as they are. The SMART FIRES data lake is designed to support a large, multi-institutional research community working across disciplines, file types, and computational needs.
Where This Lake Lives
Our data lake lives on Blackmore, Montana State University’s high-performance research storage infrastructure. It’s designed to handle fast, heavy, and growing research workloads:
- 80 gigabits per second of throughput for high-speed data access
- Nightly full backups to protect project data
- An offsite backup for disaster resilience
Think of Blackmore as the secure vault that holds the lake—one built for scale, speed, and ease of access.
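To put the 80 gigabits per second figure in perspective, here is a small back-of-the-envelope sketch in Python. It assumes decimal units (1 TB = 10^12 bytes) and treats the full 80 Gbit/s as available to a single transfer, which is an idealization: real transfers share bandwidth and are usually limited by networks and disks elsewhere along the path. The function name is purely illustrative.

```python
def ideal_transfer_seconds(size_bytes: float, throughput_gbit_per_s: float = 80.0) -> float:
    """Best-case time to move size_bytes at the given throughput.

    Assumption: the whole link is available to one transfer, with no
    protocol overhead. Treat the result as a lower bound, not a forecast.
    """
    bits = size_bytes * 8                           # bytes -> bits
    return bits / (throughput_gbit_per_s * 1e9)     # Gbit/s -> bit/s


if __name__ == "__main__":
    one_terabyte = 1e12  # decimal terabyte, in bytes
    print(f"{ideal_transfer_seconds(one_terabyte):.0f} s")  # prints "100 s"
```

In other words, a 1 TB dataset could in principle move in under two minutes at full speed; actual times will be longer.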
Why SMART FIRES Needs a Data Lake
SMART FIRES involves more than 50 researchers across six Montana universities, working in four major interdisciplinary thrust areas. That means lots of:
- Data (imagery, models, field measurements, social science data, code)
- Formats (CSV, NetCDF, TIFF, Python scripts, logs, and more)
- Collaborators across institutions
- Back-and-forth between raw data, high-performance computing (HPC) processing, and analysis
This diversity creates challenges in coordination, consistency, access, sharing, protection, and long-term storage, all of which the data lake is designed to address.
Here’s what it enables:
- Cross-Institution Collaboration - The lake provides a secure, shared space accessible to project participants across different institutions, eliminating access barriers caused by local storage, institutional silos, differing authentication and permissions, or email-based file transfers.
- Scalable Storage for Large and Growing Datasets - The project will generate huge datasets over its five-year span. The data lake can expand to meet those needs.
- Smooth Integration With High-Performance Computing - The data lake integrates with Tempest, Hellgate, and other HPC systems through Globus, allowing researchers to automatically pull data into a compute job and push results back when processing finishes. This keeps workflows manageable and saves time (and headaches).
- Long-term Preservation and Public Access - SMART FIRES has a data management plan that includes archiving and sharing results in public data repositories. The data lake is the backbone that will make that possible.
So How Do We Access This Lake?
You don’t need a kayak—just an account with Globus, the secure research data transfer platform.
Most Montana University System researchers can log in with their institutional credentials. Researchers whose institutions do not have a Globus subscription can create a free Globus account and then be granted access.
You can access and manage data using:
- The Globus Web App (point-and-click uploads and downloads)
- The Globus CLI (necessary for HPC workflows)
- Python or JavaScript SDKs (for programmatic workflows, automation, or scripts)
Researchers can even embed data transfers directly into compute jobs—automatically pulling raw data to Tempest, running a batch job, and sending processed results back to the lake.
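As a rough illustration of that pull, process, and push pattern, the Python sketch below assembles the two Globus CLI `globus transfer` commands a job script might run, without executing them. Everything specific here is a placeholder we made up for the example: the collection IDs, the paths, and the helper names are hypothetical, not the project's actual configuration. A real job would also need to wait for each transfer to finish (for example with `globus task wait`) before moving to the next step.

```python
import shlex

# Hypothetical placeholders: substitute the real Globus collection UUIDs.
LAKE_COLLECTION = "LAKE-COLLECTION-UUID"
HPC_COLLECTION = "HPC-COLLECTION-UUID"


def transfer_command(src_collection, src_path, dst_collection, dst_path, label):
    """Build the argument list for one recursive `globus transfer` call."""
    return [
        "globus", "transfer",
        f"{src_collection}:{src_path}",
        f"{dst_collection}:{dst_path}",
        "--recursive",          # copy the whole directory tree
        "--label", label,       # human-readable name shown in the task list
    ]


def plan_round_trip(raw_path, scratch_path, results_path):
    """Return [pull, push] commands for a lake -> HPC -> lake workflow.

    Assumes the batch job reads from <scratch>/input and writes to
    <scratch>/output; that layout is an illustrative convention only.
    """
    pull = transfer_command(LAKE_COLLECTION, raw_path,
                            HPC_COLLECTION, f"{scratch_path}/input",
                            "pull raw data")
    push = transfer_command(HPC_COLLECTION, f"{scratch_path}/output",
                            LAKE_COLLECTION, results_path,
                            "push results")
    return [pull, push]


if __name__ == "__main__":
    # Example paths are invented for demonstration.
    for cmd in plan_round_trip("/raw/site_a", "/scratch/job42", "/results/site_a"):
        print(shlex.join(cmd))  # print the command instead of running it
```

Printing the commands first makes it easy to check them by eye; once they look right, the same lists could be handed to `subprocess.run` inside the job script.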
Behind the Scenes
Managing the data lake is a joint effort.
Montana State University IT handles:
- Identity and access management
- Integration and technical infrastructure
- HPC support and training
- Security and backup systems
The Montana State University Library handles:
- Data curation
- Metadata standards
- Preservation planning
- Training and quality assurance
It’s a partnership designed to support both technical resilience and good data stewardship, the foundations of research integrity and public trust in research.
How It Fits Into the Data Lifecycle
Research data management isn’t linear—researchers frequently revisit earlier stages of the research process as methods evolve or new insights emerge. The data lake supports this iterative cycle by providing a stable, centralized environment for:
- Raw data storage
- Iterative cleaning and processing
- HPC analysis
- Collaborative data stewardship
- Active data sharing and future reuse
This structure supports both day-to-day science and long-term project goals.
In Short: Why the Data Lake Matters
The SMART FIRES data lake helps us:
- Work together across institutions
- Store and access massive datasets
- Process data efficiently on HPC
- Protect and document research
- Prepare for future sharing, publication, and preservation
No fishing poles required—just better science, smoother collaboration, and a system designed to grow with us.