Performance & Tech-Debt Focused Developers
Data Engineers VS Hierarchical JSON
Data Engineers VS Hierarchical JSON

Data Engineers VS Hierarchical JSON

My boss was a data focused leader who had spent the bulk of his career working with data warehouses. I spent the majority of my career up to this point working with transactional systems that were upstream from the analytical systems like data warehouses. That included working with API’s and applications.

There was a project that boss put me on. I wouldn’t be acting as the data architect because my boss decided he wanted one of the guys to do that part. I didn’t think that was the best decision, but there were other more interesting and harder problems that I was better suited for on that project.

The client used SAP Ariba and had been for a long time. That software system had a ton of data that could be accessed by an API, and for our purposes, only by the API. I inquired about other possible solutions, and there were none.

My boss had been involved in the sale process, but I was not in those meetings so I’m not sure what he promised. He had no experience working with languages like C# or Java and he had never consumed an API.

I was handed an impossible task: the instructions were to create code that would ingest data from the API. That would’ve been easy enough, but I only had 4 weeks to do it, and he wanted me to write code that would be able to ingest data from ANY API, with ANY STRUCTURE, and it would always work. This boss had managed to do the same thing conceptually with data by working with metadata and designing pipelines that weren’t tightly coupled to any specific table or file. But APIs are different- they have different problems they’re trying to solve, and only the simplest JSON would resemble a tabular structure. My boss had very little understanding of why and how hierarchical JSON could be used. He also had no understanding of how JSON responses will omit key-value pairs entirely in order to save space. He insisted that I would build it his way, and when I resisted, he wanted me to seek guidance from a BI developer who also had never consumed an API.

They tried to tell me to “flatten” the JSON. I knew that wouldn’t work with this data – the JSON structure was too big, complicated, and nested. They could not understand what I was trying to tell them in words, mainly because they didn’t have the experience working with JSON like this. I did a reluctant POC using the response from a single SAP Ariba API call and I flattened it. It created a tabular structure that resembled a table with very bad column names and that table had over 10k columns because of it. Essentially every entry in a list inside that hierarchical JSON got turned into a column and the name of the column reflected the position of the entry in that list. It was horrific. I had to point out to them that their idea was a dead end because SQL Server tables can only have 1024 columns. (This table could not be a “wide – table” with “sparse” columns, which has a limit of 30k)

I went back to my project and I built it in a way that would allow the customer to easily extend the pattern after I left. A model based on the JSON structure for the SAP Ariba call was required in my code, and the data architect on that project accused me of “hard-coding” it. He didn’t understand the difference between hard-coded and tightly-coupled apparently. Using a model in an API has great power, but these were not API developers in the slightest. The same guy had tried to consume these APIs (which had authentication in place) within data factory, and quickly got overwhelmed with the difficulty of that task. Some people will waste so much time trying to force a low – code tool to do something that is simple to do with an actual coding language like C#. They’re afraid of learning a coding language, but aren’t afraid of banging their head on the wall for months. Seems rather silly to be honest.

I finished the design of my azure function that would interact with that API and put the data into the data warehouse. I taught the client’s developers how to troubleshoot and extend the code that I had built. The developers gave my solution high praise, they thought what I had built was brilliant. My boss, however, was offended, probably because I refused to build the thing he promised that simply did not exist.

After that project and after leaving that company, I did think that I could build what he was asking for, but it would’ve taken longer than 4 weeks. It definitely wouldn’t have been flattening any hierarchical JSON structures though. It would be re-tabularizing data that was tabularized in a database to begin with. It’d create many tables from one JSON structure, and those tables would need to drift over time UNLESS the API had good documentation and a full and complete JSON model could be shown, without omitting any key-value pairs.

Perhaps one day I will create something like this. It’s on my own backlog of random ideas I’ve collected over the years, but I think I’d rather just go play with my dogs.

If you are interested in learning some important things data engineers should know about APIs, look here: ETL by API