Five Data Model Mistakes Common to Startups

Hyperfocus on the first customers and the MVP can often lead to these five common data-modeling mistakes

Posted by Tejus Parikh on May 9, 2017

It’s not surprising to come across a floundering B2B startup in the middle of a significant code rewrite. New features to attract customers would seem like the logical approach in such times, but new features may be nearly impossible to build because of basic data-model mistakes made early in the life of the company. Most developers are familiar with the maxim that a bug in production is more costly than a bug in development. Likewise, poor data structures underneath production data can be crippling in critical situations.

Data modeling is easy to gloss over when building an MVP, since the long-term impact doesn’t matter if nobody likes what is being built. Yet there is some value in considering whether the solution will hold up in the event of success. For the five common modeling mistakes I’ve come across in my career, that trade-off is usually worth it.

Assuming people belong to one organization or company

This is easily the most common mistake I’ve seen, and it is often inadvertently driven by early customer feedback. The first customers of an early startup are usually smaller organizations with simpler structures, so an abstraction in which a user has a permission for exactly one organization makes sense.

This abstraction often works for the first few hundred customers. However, when the startup moves upmarket and goes after larger organizations, user permissions start to get more complicated. Eventually a conglomerate will need key people to have access to multiple subsidiaries, or two independent customers will share the same consultant (possibly because that consultant referred you). Furthermore, a user might be an administrator in one account but a normal user in another.

The end result is that one of the most critical aspects of the app (and all the assumptions built on top of it) will need to be reworked at the same time company traction is improving. Starting with the idea that users can belong to multiple organizations is much easier in the long term.
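Concretely, the fix is a membership table between users and organizations that also carries the per-organization role. Here is a minimal sketch in SQLite; the table and column names are my own illustrative choices, not a prescription:

```python
import sqlite3

# Minimal sketch: users and organizations are many-to-many, and the role
# lives on the membership, so a user can be an admin in one organization
# and a normal user in another.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);

CREATE TABLE organizations (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);

-- Membership is its own table rather than a column on users.
CREATE TABLE memberships (
    user_id         INTEGER NOT NULL REFERENCES users(id),
    organization_id INTEGER NOT NULL REFERENCES organizations(id),
    role            TEXT NOT NULL DEFAULT 'member',
    PRIMARY KEY (user_id, organization_id)
);
""")
```

With this shape, the shared consultant and the multi-subsidiary conglomerate both become extra rows in the membership table instead of schema changes.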

Assuming users will only ever log in one way

This is another case of your MVP test group not having the characteristics of the larger market. Authentication code is a pain, since it’s not just a form to set a password. You need password recovery methods and the ability to change credentials, meaning more screens and emails. Even though there are many libraries out there that provide the fundamentals, they all still require customization for a consistent experience.

Hence the temptation to let users log in with Facebook, Salesforce, LinkedIn, or some other third-party service. Except none of these services has 100% penetration. A large subset of users that could use your product won’t be able to without a custom authentication mechanism, which means that if you started with a third party, you’ll eventually find yourself supporting multiple authentication methods.
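One way to leave room for both from the start is to keep login methods in a table of their own instead of hard-coding a single credential onto the user record. A rough sketch, where names like identities and provider are my own illustrative assumptions:

```python
import sqlite3

# Sketch: one row per login method, so a password login and any number of
# third-party logins can coexist on the same user.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);

-- provider is 'password', 'facebook', 'salesforce', 'linkedin', ...
CREATE TABLE identities (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(id),
    provider      TEXT NOT NULL,
    provider_uid  TEXT NOT NULL,  -- the external id, or the email for passwords
    password_hash TEXT,           -- only populated when provider = 'password'
    UNIQUE (provider, provider_uid)
);
""")
```

Adding a new login method then means adding a new provider value, not restructuring the user table.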

Schemas when you should be schemaless and vice versa

Working with the same technologies for years will expose their serious fundamental problems. Relational databases, and the structures built on top of them, are known for not scaling horizontally and for being difficult to use with unstructured data. Data stores inspired by Google’s Bigtable paper have become increasingly popular as a solution to these limitations.

The mistake here is that most data models are still relational, and most companies aren’t handling the volume of data that Google, Facebook, or Twitter are. Furthermore, the decreasing cost of computing makes vertically scaling a database more palatable.

The key here is that I’m referring to most problems, not all. If you are doing what Google does (processing disparate documents from multiple sources), then a schemaless datastore makes sense even at low volumes.

The bottom line is you have to match the technology to the data, not base the decisions only on what you know or what’s trendy.
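To make the distinction concrete, here is a toy illustration; the records are invented purely to show the two shapes of data involved:

```python
# An invoice looks the same every time, so a relational row (and the joins,
# indexes, and constraints that come with it) fits naturally:
invoice = {"id": 42, "customer_id": 7, "amount_cents": 1999, "paid": True}

# Documents pulled from multiple sources rarely share fields, which is where
# a schemaless store earns its keep:
crawled_docs = [
    {"source": "pdf",   "title": "Q3 report",     "pages": 12},
    {"source": "tweet", "text": "shipping it",    "retweets": 5},
    {"source": "email", "subject": "re: invoice", "thread_length": 9},
]
```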

Adhering to data-modeling doctrine

Similar to the point above, even within a specific data technology there are best practices and established doctrine. An example from the relational database world is normalization.

There are multiple normal forms, but the general principle is that when designing a data model, one should avoid repeating data. Think of your credit card number. When you get a new one, you have to change it in every online store that has it. If the world were normalized, you’d update your wallet once and everyone would be able to use the new value.

Normalization makes sense when storage is expensive or the data is mutable. If storage is cheap (as it is now) and the data is “write once, read many”, de-normalization may be the way to go. Otherwise simple queries might require multiple joins across large tables, which is never cheap.
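A rough sketch of that trade-off, with invented table names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: the card number lives in exactly one place. Updating it once
-- updates it everywhere, but every read of an order needs a join.
CREATE TABLE cards (
    id     INTEGER PRIMARY KEY,
    number TEXT NOT NULL
);
CREATE TABLE orders_normalized (
    id      INTEGER PRIMARY KEY,
    card_id INTEGER NOT NULL REFERENCES cards(id),
    total   INTEGER NOT NULL
);

-- De-normalized: each order carries its own copy. Reads are a single lookup,
-- but the repeated value cannot be changed in one place, which is fine for
-- write-once, read-many data.
CREATE TABLE orders_denormalized (
    id          INTEGER PRIMARY KEY,
    card_number TEXT NOT NULL,
    total       INTEGER NOT NULL
);
""")
```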

No doctrine is one size fits all.

Leaving constraints to the application layer

Ruby on Rails has historically been guilty of this problem. For too long the official approach was to specify relationships only in Ruby code, with the constraints enforced by application validations. That holds up while the MVP is the only application connected to the database.

No surprise, this does not hold true for long. Eventually some SQL will be run from the command line, somebody will write a bulk uploader in a pared-down environment, or half the team will decide that Erlang, Scala, Rust, or Node is the future and build some new product with it. If the correct constraints are enforced in the datastore, at least you can have some confidence that your data is not getting screwed up.
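A small sketch of what pushing constraints into the datastore looks like, with illustrative table names (SQLite here, but the DDL is nearly identical on other relational databases):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on for the connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE accounts (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE invoices (
    id           INTEGER PRIMARY KEY,
    account_id   INTEGER NOT NULL REFERENCES accounts(id),
    amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0)
);
""")

# A writer that bypasses the application layer (raw SQL, a bulk loader, a
# service in another language) still cannot create an orphaned invoice:
try:
    conn.execute("INSERT INTO invoices (account_id, amount_cents) VALUES (999, 100)")
except sqlite3.IntegrityError as error:
    print("rejected by the database:", error)
```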

Conclusions

All of these mistakes are understandable given the uncertainties of the earliest stages of a startup. Yet the major takeaway from all of these points is to think about how your data is going to be used not just today, but in the future. After all, the true goal of any milestone is to get to the next one. A little extra effort in the planning stages can save weeks of effort down the line.

Fixing a data-modeling issue is almost always doable, but at great cost. Changing code is just one aspect; manually changing production data is downright scary. There are enough issues that can derail a startup, and there’s no reason to let poor data modeling be one of them.


Tejus Parikh

I'm a software engineer who writes occasionally about building software, software culture, and tech-adjacent hobbies. If you want to get in touch, send me an email at [my_first_name]@tejusparikh.com.