Data ingestion is the process of connecting to multiple data sources and transporting the data from each source into a single repository, typically a database, data warehouse, or data lake. Once the data is in the central repository, it can be accessed and analyzed by anyone in the organization with access rights. Data ingestion can occur in batches on a schedule, or it can occur in real time, with a steady flow of data from the source system into the central repository.
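The two modes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the repository is an in-memory list, and the names ingest_batch, ingest_stream, nightly_export, and live_events are hypothetical, not part of any particular product.

```python
from typing import Iterable, Iterator


def ingest_batch(source_rows: list, repository: list) -> int:
    """Batch mode: load everything the source accumulated since the last scheduled run."""
    repository.extend(source_rows)
    return len(source_rows)


def ingest_stream(events: Iterable[dict], repository: list) -> Iterator[dict]:
    """Real-time mode: append each record as soon as the source emits it."""
    for event in events:
        repository.append(event)
        yield event


repository = []  # hypothetical stand-in for a database, warehouse, or lake

# A scheduled job hands over a whole export at once.
nightly_export = [{"id": 1, "source": "crm"}, {"id": 2, "source": "crm"}]
loaded = ingest_batch(nightly_export, repository)

# A live feed delivers records one at a time.
live_events = iter([{"id": 3, "source": "web"}, {"id": 4, "source": "web"}])
for _ in ingest_stream(live_events, repository):
    pass
```

In a real deployment the batch function would typically be triggered by a scheduler and the streaming function would read from a message queue or change feed; both are simplified away here.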
Although the term data ingestion is often used interchangeably with data integration, the two are not the same. Data ingestion imports the data into the new repository in its raw form. With data integration, the data is transformed as part of the process of moving it from the source system, typically through an ETL (Extract, Transform, Load) process. In addition, in some architectures, integrating data means the data stays in the source systems but is accessible through a centralized application, like a search engine.
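To make the distinction concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the central repository. The table names (raw_events, orders), the field names, and the sample record are assumptions made up for illustration.

```python
import json
import sqlite3


def ingest_raw(conn, records):
    """Ingestion: store each record exactly as it was received."""
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)",
        [(json.dumps(r),) for r in records],
    )


def integrate_etl(conn, records):
    """Integration (ETL): transform each record before loading it."""
    rows = [(r["Email"].strip().lower(), float(r["amount"])) for r in records]
    conn.executemany("INSERT INTO orders (email, amount) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")  # hypothetical central repository
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.execute("CREATE TABLE orders (email TEXT, amount REAL)")

source_records = [{"Email": " Ana@Example.com ", "amount": "19.90"}]
ingest_raw(conn, source_records)     # untouched copy of the source data
integrate_etl(conn, source_records)  # cleaned and typed on the way in

raw_payload = conn.execute("SELECT payload FROM raw_events").fetchone()[0]
clean_row = conn.execute("SELECT email, amount FROM orders").fetchone()
```

The raw table keeps the stray whitespace and string-typed amount exactly as the source sent them, while the ETL path stores a normalized email and a numeric amount.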
The Benefits of Data Ingestion
The most significant benefit of data ingestion is that you can get data into a central repository quickly because no transformation processes are necessary when you move it from the source system. Once the data is in the repository, it can be cleaned to ensure it’s consistent and correct. At this point it can also go through any transformation processes necessary.
Centralizing data is also key for analytics systems that need to examine all of the data together to derive common themes and insights.
For example, a customer data platform (CDP) ingests data from source systems such as marketing automation, CRM, ERP, web analytics, social media, and others. Once in the CDP, the data is cleansed by automating actions such as resolving identities, deduplicating profiles, resolving discrepancies between data, and discarding inaccurate data. The cleansed data is then available to analytics engines, including machine learning (ML) processes, and delivered back to external systems that need it for campaigns and programs.
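A simplified sketch of that cleansing step might look like the following. It assumes, purely for illustration, that profiles are matched on a normalized email address, that records without a plausible email are discarded, and that the first non-empty value wins when sources disagree. Real CDPs use far richer identity resolution; the function and field names here are hypothetical.

```python
def normalize_email(email):
    """Identity key: trimmed, lowercase email (an assumed matching rule)."""
    return email.strip().lower()


def deduplicate_profiles(profiles):
    """Merge profiles that share an email and drop records that lack a usable one."""
    merged = {}
    for profile in profiles:
        email = profile.get("email")
        if not email or "@" not in email:
            continue  # discard inaccurate data
        key = normalize_email(email)
        target = merged.setdefault(key, {"email": key, "sources": []})
        target["sources"].append(profile["source"])
        for field, value in profile.items():
            # Resolve discrepancies with a simple rule: first non-empty value wins.
            if field not in ("email", "source") and value and field not in target:
                target[field] = value
    return list(merged.values())


profiles = [
    {"email": "Ana@Example.com", "source": "crm", "name": "Ana Silva"},
    {"email": " ana@example.com", "source": "web", "name": "", "city": "Lisbon"},
    {"email": "not-an-email", "source": "social"},
]
cleaned = deduplicate_profiles(profiles)
```

Here the CRM and web records collapse into a single profile that carries fields from both sources, and the malformed social record is dropped.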
Challenges with Data Ingestion
Ensuring that data is ingested into a central location securely is critical, especially when it’s customer data or other proprietary and confidential company information. The process of moving the data from source to destination must be secured. And once the data is in the new repository, it also needs to be adequately secured so that only the right analytics tools, systems, and people have access to it.