Akeneo: The Turducken of Product Information Management
You know the Thanksgiving treat that’s a favorite of protein lovers, the Turducken; a chicken stuffed inside a duck stuffed inside a Turkey? Akeneo, the Product Information Manager (PIM), similarly employs a turducken architectural strategy in order to manage complexity and maintain application performance.
A PIM allows a business to maintain a list of its products and all the information about each product in an application specially designed to make doing so, easy and efficient. Large organizations may have millions of products, and each of those products has its own specific data.This product data, or attributes, can often number anywhere between 10-100, which can quickly add up to thousands of different attributes.
Creating a relational database structure to handle thousands of columns to store attribute values may be possible, but it can become a very complex solution that is not portable and performs poorly. Let's examine this performance issue in detail.
More Data Means More Work
Performance in this context, is the ability to maintain web application response times, while also applying batch-processed rules, data import, and data exports. Performance is directly related to the amount of work that needs to be performed. Regardless of the type of database technology employed, more data means more work must be performed.
Specifically, in relational database technology, this means:
- The more columns in the table, the more work per row is required to query or manipulate the data
- The more rows in a table, the more work is required to query or manipulate the data
- The more tables in the data model, the more joins must take place, thus, more work is required to query or manipulate data
Accordingly, anything we can do in the architectural design of an application to reduce these, the less work it will require, resulting in a better performing application.
Beware of Column Limitations
In the case of a PIM, where an extremely large number of attributes in the form of columns is required, and a large number of products in the form of rows, using a column for each product attribute is not a viable solution.
Your first solution to model the data for a PIM might look like the following:
But there's a significant problem. The four dominant Relational Database Management Systems (RDBMS) that I work with regularly: MySQL, Oracle, PostgreSQL, and SQL Server, all have different column and row limits.
- MySQL using InnoDB has a limit of 1,000 columns per table, and limits on row size
- Oracle is also limited to 1,000 columns per table
- PostgreSQL may have up to 1,600 columns, but the actual limit is once again based on row size
- SQL Server has a 1,024 column limit with a row size limitation
So, the real column limits are actually smaller in practice when you take into consideration, the column sizes. To work around this number of columns limitation, we can add a level of abstraction as in the entity-relationship diagram (ERD) below:
Define the Product Type
Now we define a product type (known as family in Akeneo). For each product type, we specify a list of attributes. Then we create a product that has a product type, which determines which attributes are added to a child table: product_attributes.
With this solution, we can define as many attributes to a product type as needed, without increasing the number of columns. Instead, we increase the number of rows for each attribute. If we have 1M products, with, on average, 15 attributes each, we will have 1M rows in the products table and 15M rows in the product_attributes table. At that size, the performance of the database will be degraded, because the product_attributes table is quite large. Larger than b-tree type indexes can compensate for.
In addition, the database can only support one data type, so we'll have to store everything as the most flexible database data type, a VARCHAR, i.e. a string. This means we'll need to add more code in our middle-tier to convert values to the proper data-type, resulting (once again) in more work.
To move past the “one database data type for all attributes,” we can add another layer of abstraction to our model:
Now we have a table for each attribute data type. This is good, because we can store things in the correct data type, reducing the work to complete data type conversions, and since the attribute values are now stored in more tables, the total number of rows per table has been reduced. But, there's a new problem: we've gone from three relational joins in the second solution, to nine joins. Joins are more work. So, we traded row counts and data type conversions for joins. This solution, while better, is far from ideal.
JSON to The Rescue
What next? JSON to the rescue. Akeneo’s innovative approach to solving this complexity, and therefore the performance problem too, is to store product attribute values in a JSON document that in turn is stored in the raw_values column of the pim_catalog_product table. This leads to a document database, traditionally called a NoSQL database, inside a relational database, traditionally called a SQL database.
This innovation has significant performance implications. We've reduced the row count of the largest table to 1M rows, and reduced the number of joins to two. But, we did sacrifice the ability to do an indexed search on any of a product's attribute values. Storing product data as a JSON document does not come without its price. In order to search products efficiently, Akeneo employs the use of ElasticSearch, a separate search engine to index the product attribute values.
Internally, Akeneo's Product Query Builder (PQB), uses ElasticSearch to identify a list of product IDs that match a given query, then uses Doctrine to query the underlying MySQL database. This approach maintains web application performance for end-users and provides a practically unlimited number of attributes for each product.
Stop Doing Extra Work
So, why not just use a NoSQL database? Using our examples above, the product_types and product_type_attributes tables would be one object graph document database, while products and product_attributes would be another.
One thing document databases do not do well is joining separate document databases. Joining the information would become extra work done in the middle-tier. When you consider the entire set of subject areas for a given application, you would be talking about 15 - 20 document databases. Using a NoSQL database alone is not once again a viable solution. It’s the combination of the three database technologies, the Turducken architecture with relational, document, and search engine: MySQL, a JSON column, and ElasticSearch, that makes Akeneo's solution perform well.
Hence, by using the right combination of technologies, Akeneo provides a friendly and efficient user experience for managing the enrichment of product information that is also very performant.