“Data (…) is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake.” (Big data architectures)
Big data architectures are often designed backward: engineers start with the data sources (the inputs). The bigger the data, and the faster it needs to be processed (real time), the more exciting it is for engineers to design the pipelines.
The right approach is the opposite: you need to start with the outputs. Are you powering a dashboard? About what? For which period? Which level of granularity? How often will users check the dashboard?
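One way to make these questions concrete is to write the answers down as a small specification before designing any pipeline. The sketch below is only an illustration: `DashboardSpec`, its fields, and the one-hour threshold in `suggest_processing_mode` are hypothetical names and assumptions, not part of any framework.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DashboardSpec:
    """Hypothetical description of a dashboard, written before any pipeline exists."""
    subject: str               # what the dashboard is about
    period: timedelta          # how far back one view looks
    granularity: timedelta     # time span covered by one data point
    check_interval: timedelta  # how often users look at the dashboard


def suggest_processing_mode(spec: DashboardSpec) -> str:
    """Illustrative rule of thumb (an assumption, not a standard):
    if users check the dashboard at most once an hour, batch is enough."""
    if spec.check_interval >= timedelta(hours=1):
        return "batch"
    return "streaming"


def points_per_view(spec: DashboardSpec) -> int:
    """Number of data points a single dashboard view has to serve."""
    return spec.period // spec.granularity


orders = DashboardSpec(
    subject="daily orders",
    period=timedelta(days=30),
    granularity=timedelta(days=1),
    check_interval=timedelta(days=1),
)
print(suggest_processing_mode(orders))  # batch
print(points_per_view(orders))          # 30
```

Here the answers about period, granularity, and checking frequency already rule out a real-time pipeline for this dashboard, before a single data source has been looked at.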
Perhaps you are combining and de-duplicating data to create a source of truth for different entities across your data lake.
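As a minimal sketch of that kind of output, the function below merges records describing the same entity into a single record. The field names (`email`, `updated_at`) and the merge policy (the most recent non-empty value wins) are assumptions made for this example; a real data lake would need its own matching and conflict rules.

```python
from typing import Dict, Iterable


def build_source_of_truth(records: Iterable[dict]) -> Dict[str, dict]:
    """Merge records that share a normalized email into one record per entity."""
    merged: Dict[str, dict] = {}
    # Process the oldest records first so that newer values overwrite older ones.
    for record in sorted(records, key=lambda r: r["updated_at"]):
        key = record["email"].strip().lower()
        entity = merged.setdefault(key, {})
        # Ignore empty values so that a sparse newer record does not erase known data.
        entity.update({k: v for k, v in record.items() if v not in (None, "")})
        entity["email"] = key  # store the normalized form
    return merged


# Two hypothetical systems describing the same customer.
crm = {"email": "Ann@Example.com", "name": "Ann", "phone": None, "updated_at": 1}
billing = {"email": "ann@example.com ", "name": "", "phone": "555-0100", "updated_at": 2}

customers = build_source_of_truth([billing, crm])
print(customers["ann@example.com"]["name"])   # Ann
print(customers["ann@example.com"]["phone"])  # 555-0100
```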
Or you might want to predict something. Which metrics do you need to predict? How often? With what accuracy?
As is true of most product design work, you need to start with the end user: what problem are you trying to solve, and how are you going to present the solution to the user? The final UI will determine how you need to orchestrate your big data architecture.