Moving legacy data from a relational database to the cloud is becoming a common problem.
Because relational databases like MySQL degrade in performance as the data grows, they are often demoted to simple configuration storage or debugging tools.
If you want to collect massive amounts of data, you need a distributed cloud-based tool like Apache Hadoop.
The framework is simple and consists of four steps:
- The user submits a request from a browser.
- The web server receives the request and submits a request for data migration.
- A migration engine receives the request and spawns a task.
- The task connects to the relational database and copies the data to the cloud-based storage.
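The four steps above can be sketched end to end. Everything in this sketch is illustrative: the handler and engine names are hypothetical, an in-memory queue stands in for the web server's request path, and a dictionary stands in for the cloud-based storage.

```python
import queue
import threading

# In-memory queue standing in for the web server -> migration engine hop.
requests = queue.Queue()

def web_server(table_name):
    """Steps 1-2: receive a browser request and submit a migration request."""
    requests.put({"table": table_name})

def migration_task(request, cloud_store):
    """Step 4: connect to the source and copy rows to the target store."""
    source_rows = [("alice", 1), ("bob", 2)]        # stands in for a SQL SELECT
    cloud_store[request["table"]] = list(source_rows)  # stands in for a cloud write

def migration_engine(cloud_store):
    """Step 3: receive the request and spawn a task."""
    request = requests.get()
    task = threading.Thread(target=migration_task, args=(request, cloud_store))
    task.start()
    task.join()

cloud_store = {}
web_server("users")
migration_engine(cloud_store)
print(cloud_store)  # {'users': [('alice', 1), ('bob', 2)]}
```

In a real system the queue would be a message broker or an HTTP API, and the task would run on a cluster rather than a local thread, but the hand-offs between the four steps are the same.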
The process relies heavily on the cloud database's features to support full imports, partial imports, real-time sync, and so on.
The migration engine will be a series of ETL Spark jobs (written in Python, Scala, or Java) that connect to the relational database, extract the data, and save it in the target database.
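In production this would be a Spark job reading over JDBC and writing to cloud storage; as a minimal stand-in, the extract-transform-load shape can be shown with only the standard library. Everything here is a placeholder: an in-memory SQLite table plays the relational database, a local JSON file plays the cloud target, and the table and column names are made up.

```python
import json
import os
import sqlite3
import tempfile

# Extract: a SQLite table stands in for the source relational database.
# A Spark job would instead use spark.read over a JDBC connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
rows = conn.execute("SELECT id, name FROM users").fetchall()

# Transform: reshape rows into records; Spark would do this on a DataFrame.
records = [{"id": row_id, "name": name.upper()} for row_id, name in rows]

# Load: a local JSON file stands in for the cloud target (e.g. S3 or HDFS),
# where Spark would write Parquet or another columnar format.
target = os.path.join(tempfile.mkdtemp(), "users.json")
with open(target, "w") as f:
    json.dump(records, f)

with open(target) as f:
    print(json.load(f))  # [{'id': 1, 'name': 'ALICE'}, {'id': 2, 'name': 'BOB'}]
```

The same three phases map directly onto the Spark job: the JDBC read is the extract, the DataFrame operations are the transform, and the write to cloud storage is the load.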
The draft doesn't provide many details, but it gives you a clear idea of the process you need to follow. If you want to see a real-world implementation of this, check out the AWS Glue managed ETL service.