ETL by API can be done well, but the reality is most of the time it is not. Why?
APIs were designed to back applications and allow for communication between two different systems. When REST (Representational State Transfer) became popular, the systems on either side of the API tended to be user applications. That makes sense considering API stands for Application Programming Interface.
As Data Engineering and Analytics have grown into disciplines with teams devoted entirely to them, more teams are finding reasons to interact with APIs. Often those teams have not had much exposure to how APIs work or what problems they were designed to solve.
Here are some things to keep in mind when data engineers are working with APIs.
- An API’s result set is not a table. One API call may retrieve or update (the GET/PUT REST verbs) one row from a table, but it can also respond with an entire data model. There is no agreed-upon standard for how APIs deal with data. For example, one company may have an API whose GET /customer/id=1 endpoint responds with a flattened JSON structure that looks similar to a row in a database table. Another company may have an API with the same endpoint that responds with a hierarchical JSON structure containing several lists of items such as addresses, phone numbers, family members, and so forth. Each of those “entities” is probably stored in its own table within the source database. That API would be pulling data from any number of tables and representing them as one “model”, often without the unique IDs from the source database tables. The relationships are established through the hierarchical structure of the JSON response without exposing the foreign keys to the code consuming the API. Please note: many APIs do return information about foreign keys, but that pattern is seen more often with internal APIs used within a company and less often with APIs consumed by outside parties.
```json
{
  "customerId": "CUST-001234",
  "firstName": "Jane",
  "lastName": "Doe",
  "email": "[email protected]",
  "dateOfBirth": "1987-04-12",
  "phoneNumbers": [
    {
      "type": "mobile",
      "number": "+1-555-123-4567"
    },
    {
      "type": "home",
      "number": "+1-555-987-6543"
    }
  ],
  "addresses": [
    {
      "type": "home",
      "street": "123 Oak Street",
      "city": "Springfield",
      "state": "IL",
      "postalCode": "62704",
      "country": "USA"
    },
    {
      "type": "work",
      "street": "500 Business Plaza",
      "city": "Springfield",
      "state": "IL",
      "postalCode": "62701",
      "country": "USA"
    }
  ]
}
```
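Before a hierarchical response like this can be loaded into a warehouse, it typically has to be split back into table-shaped rows, with the parent ID reattached to each child record since the nesting is the only expression of the relationship. A minimal sketch of that flattening step follows; the table and column names are illustrative, not from any real API.

```python
def flatten_customer(payload: dict) -> dict:
    """Split one hierarchical customer document into table-shaped row lists."""
    customer_id = payload["customerId"]

    customer_row = {
        "customer_id": customer_id,
        "first_name": payload.get("firstName"),
        "last_name": payload.get("lastName"),
        "email": payload.get("email"),
    }

    # The child entities carry no foreign keys of their own; we synthesize
    # the relationship from the nesting by attaching the parent customer_id.
    phone_rows = [
        {"customer_id": customer_id, "type": p.get("type"), "number": p.get("number")}
        for p in payload.get("phoneNumbers", [])
    ]
    address_rows = [
        {"customer_id": customer_id, **a}
        for a in payload.get("addresses", [])
    ]

    return {"customers": [customer_row], "phones": phone_rows, "addresses": address_rows}
```

Note the `payload.get(..., [])` defaults: as discussed below, absent entities are often dropped from the response entirely, so the flattening code cannot assume every key is present.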
- An API is concerned with the size of the transmission. The “model” of the request/response often does not include every field that could be associated with an “entity”. For example, with the GET /customer/id=1 endpoint above, if a customer has no phone numbers in the database, the entire phoneNumbers node is often omitted from the JSON response. The model definition within the application code is what fully represents the customer; the JSON responses do not necessarily reflect every possible field associated with that entity. This is important for data engineers to understand because inspecting one API response, or even several, is usually not enough to understand the complete model. Rely on the API documentation, but be aware that it is often incomplete or out of date, and talk with the people at the company that maintains the API.
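One pragmatic (and admittedly imperfect) way to approximate the full model is to union the top-level keys observed across many responses, then compare that set against the documentation. A sketch, with no assumptions beyond plain dict responses:

```python
def observed_schema(responses):
    """Union the top-level keys seen across many API responses.

    Because individual responses omit empty fields, no single response
    reveals the full model; accumulating keys over a large sample gets
    closer, though it still cannot prove a field doesn't exist.
    """
    keys = set()
    for response in responses:
        keys.update(response.keys())
    return sorted(keys)
```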
- Loading a data warehouse or data lake by hitting an API through a series of GET requests is rarely the best approach. If you need to pull ten million rows from an online application and your plan is to ask for every single record GET request by GET request (RBAR, or row by agonizing row), you will quickly find that you cannot accomplish that goal in a reasonable amount of time. You may also find that the third-party API you are consuming implements queueing or throttling, allowing you to extract only so much data in a given window. Again, rely on the API documentation, and you may also need a conversation with the company maintaining the API to find out whether they offer an undocumented API or some other way to make a bulk request for the data instead.
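When throttling is unavoidable, extraction code usually needs client-side backoff rather than hammering the endpoint. The sketch below assumes the API signals throttling with HTTP 429; real APIs vary (some use a Retry-After header or custom error bodies), so treat the status check as an assumption to verify against the actual documentation. `fetch` stands in for whatever function performs the GET.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(), retrying with exponential backoff while throttled.

    fetch is assumed to return a (status_code, body) tuple; 429 is assumed
    to be the throttling signal. Both are assumptions to check against the
    real API's documentation.
    """
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:  # not throttled: hand the result back
            return status, body
        # Exponential backoff: wait base_delay, 2x, 4x, ... before retrying.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("rate limit not cleared after retries")
```

Even with backoff in place, RBAR extraction of millions of rows stays slow; backoff only keeps you from being blocked, it does not make the approach fast.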
- NoSQL / document databases exist because people want to save complex hierarchical JSON without worrying about the full model underlying that JSON structure. For many use cases, it’s a fancy way to kick the can down the road and delay the hard work of deciding what data is really needed. It’s no coincidence that these types of databases are expensive to extract data from.
- APIs routinely release new versions. Schema changes are expected as part of a version upgrade; ideally they should not occur outside of one, but pipelines that consume an API should not count on that.
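Because schema changes do sometimes land outside a version upgrade, a cheap guard is to compare each response's top-level fields against the set the pipeline was built for. A minimal sketch, assuming the customer model shown earlier; the expected-field set is whatever your own pipeline handles:

```python
# Fields the pipeline was built against; hypothetical, matching the
# customer example above rather than any real API.
EXPECTED_FIELDS = {
    "customerId", "firstName", "lastName", "email",
    "dateOfBirth", "phoneNumbers", "addresses",
}


def schema_drift(response: dict) -> set:
    """Return top-level fields in the response the pipeline doesn't know about."""
    return set(response) - EXPECTED_FIELDS
```

Logging or alerting on a non-empty result turns a silent schema change into a visible one, instead of letting unknown fields drop on the floor.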