Wednesday, August 3, 2016

Reads vs writes in production database of web application

Almost a year has passed since my last blog post, but no - I'm not dead (yet). I've recently been focused on a production webapp we previously developed. This is the first time I've decided to stay with a project at this stage. It's interesting, but it certainly doesn't leave much time for writing on the blog :)

I've recently been thinking about the CQRS-like patterns we partially used in this application, and the pros and cons of migrating further towards full CQRS, which is mostly about separating reads from writes. I previously assumed that in a typical application "we may have probably 100 times more reads than writes", which would justify separating these two models. I wanted to verify this later in production, but I forgot. Now we are about to start a big redesign of the whole application, and I've just remembered about it. So, how does it look on a real production system? This is data from the postgres pg_stat_database view:

tup_returned   | 58956902163
tup_inserted   | 3348809
tup_updated    | 4557408
tup_deleted    | 245501

Does it look like I made a big mistake previously, and we actually have ~7200 times more reads than writes? It will be fun to investigate that in the near future...
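A quick sanity check of that ratio from the counters above (note that tup_returned counts all rows read by scans, including Postgres-internal ones, so it is a rough upper bound for "reads"):

```java
public class ReadWriteRatio {
    public static void main(String[] args) {
        // Counters copied from the pg_stat_database output above.
        long tupReturned = 58_956_902_163L; // rows read by scans
        long tupInserted = 3_348_809L;
        long tupUpdated  = 4_557_408L;
        long tupDeleted  = 245_501L;

        long writes = tupInserted + tupUpdated + tupDeleted; // 8,151,718
        System.out.println("reads per write: " + (tupReturned / writes)); // prints 7232
    }
}
```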

Saturday, September 19, 2015

How to avoid subclasses join in Hibernate model

This article is about the multiple joins Hibernate generates when fetching a superclass in the JOINED inheritance mapping strategy, and more generally about how I significantly improved our system's performance using some simple patterns. There's a lot of material on the internet about this (usually recommending the wrong approach), so here I'll present my working solution to this problem.

A few words about motivation. The Hibernate team claims that avoiding joins is unnecessary, because joins in modern databases are fast. That may indeed not be a problem when, say, 4 subclasses produce 4 joins on each query. But what I personally really like about Hibernate is that you can truly model your domain with classes and have it mapped to the database automatically. Truly modeling your domain usually means a lot of super- and subclasses, and you may end up with, for example, 30-40 joins on a simple from BaseEntity query, or even hit the 61-join limit (if you use MySQL).

And here comes the problem. First and foremost, even if a join is fast, I don't want 40 joins on simple queries. Even if it didn't affect performance (and after my tests on a big system, it does), I'd like to be able to debug queries in plain SQL, and I don't want queries spanning 5 screens. So you start adjusting the model to the queries, which means your real-world model is no longer a real-world model - it becomes a Hibernate-oriented model. That's the second thing I hate to have in a project. I like to keep things clean.

OK, now we can delve into the problem.

Joined strategy defaults


Let's consider the following simple model: a Super entity with two subclasses, SubA and SubB, using the JOINED inheritance mapping strategy (the class diagram image from the original post is not available here).

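Since the class diagram is missing, here is a minimal sketch of what such a hierarchy could look like in code. The class and property names are taken from the table listings below; JOINED normally needs no discriminator, but the classname column suggests one was configured:

```java
import javax.persistence.DiscriminatorColumn;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;

@Entity
@Inheritance(strategy = InheritanceType.JOINED)
@DiscriminatorColumn(name = "classname") // explains the classname column below
public class Super {
    @Id @GeneratedValue
    private Long id;
    private String superProp = "superProp";
}

@Entity
class SubA extends Super {
    private String subAProp = "subAProp";
}

@Entity
class SubB extends Super {
    private String subBProp = "subBProp";
}
```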
This is how it's reflected in database (I use postgresql for this example):

aero=> \d super;
              Table "public.super"
   Column   |          Type          | Modifiers 
------------+------------------------+-----------
 classname  | character varying(255) | not null
 id         | bigint                 | not null
 super_prop | character varying(255) | 
Indexes:
    "super_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "suba" CONSTRAINT "fk_5yjvt9lkf5b24nyxv59n3kbgj" FOREIGN KEY (id) REFERENCES super(id)
    TABLE "subb" CONSTRAINT "fk_q4aajuitkvsk93mn07t66l0pi" FOREIGN KEY (id) REFERENCES super(id)

aero=> \d suba;
              Table "public.suba"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 subaprop | character varying(255) | 
 id       | bigint                 | not null
Indexes:
    "suba_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_5yjvt9lkf5b24nyxv59n3kbgj" FOREIGN KEY (id) REFERENCES super(id)

aero=> \d subb;
              Table "public.subb"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 subbprop | character varying(255) | 
 id       | bigint                 | not null
Indexes:
    "subb_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_q4aajuitkvsk93mn07t66l0pi" FOREIGN KEY (id) REFERENCES super(id)

So, we have a superclass table and two subclass tables (the JOINED inheritance mapping strategy).

I put some data there:

sessionFactory.getCurrentSession().save(new SubA());
sessionFactory.getCurrentSession().save(new SubB());

And we can observe how this data is distributed in the database:

aero=> select * from super; select * from suba; select * from subb;
 classname | id | super_prop 
-----------+----+------------
 SubA      |  1 | superProp
 SubB      |  2 | superProp
(2 rows)

 subaprop | id 
----------+----
 subAProp |  1
(1 row)

 subbprop | id 
----------+----
 subBProp |  2
(1 row)

So, as expected, the data is distributed across all the tables, depending on the subclass type.

Now I ask Hibernate for the list of superclasses using a from Super HQL query, and the final SQL built by Hibernate looks as follows:

select [...] from SUPER super0_ left outer join SUBA super0_1_ on super0_.id=super0_1_.id left outer join SUBB super0_2_ on super0_.id=super0_2_.id

The example above illustrates the problem in question (now just imagine we have 50 subclasses here, of course :) ).

Explicit polymorphism


People on the internet often try to avoid the multiple-joins problem using explicit polymorphism. To be completely honest, I also tried this Hibernate feature to check whether it solves the problem. The misunderstanding about this feature probably stems from the Hibernate javadoc, which claims that explicit polymorphism means: This entity is retrieved only if explicitly asked.

So if we don't ask for the SubA class, for example, we should fetch only instances of Super. Let's try it.

I tested two models: the first with explicit polymorphism declared on the superclass, and the second with explicit polymorphism declared on the subclass (the annotated class diagrams from the original post are not available here).
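As the diagrams are missing, this is roughly what the two variants could look like; the annotation placement is the only difference, and the entity bodies are as in the JOINED example (sketch, not the original code):

```java
import javax.persistence.Entity;
import org.hibernate.annotations.Polymorphism;
import org.hibernate.annotations.PolymorphismType;

// Variant 1: explicit polymorphism on the superclass.
@Entity
@Polymorphism(type = PolymorphismType.EXPLICIT)
public class Super {
    /* id, superProp as before */
}

// Variant 2: explicit polymorphism on the subclass instead
// (in variant 2 the superclass carries no @Polymorphism annotation).
@Entity
@Polymorphism(type = PolymorphismType.EXPLICIT)
class SubA extends Super {
    /* subAProp as before */
}
```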
Neither influences the table model or the data. And for both, in response to from Super we get:

select [...] from SUPER super0_ left outer join SUBA super0_1_ on super0_.id=super0_1_.id left outer join SUBB super0_2_ on super0_.id=super0_2_.id

It just doesn't work as expected.

So, what is explicit polymorphism for? It looks like the only application of this feature is the so-called lightweight class pattern, and it may be used only when two or more classes are mapped to the same table. It doesn't solve our N-joins problem.

Single table + secondary tables


So how do we solve the problem of N joins? I use the following pattern, involving the SINGLE_TABLE inheritance mapping strategy and secondary tables (the class diagram image from the original post is not available here).
Everything is mapped to a single table, but for each derived class I define a secondary table (using the @SecondaryTable annotation) and tell Hibernate that this table should be fetched with an additional select (using the @Table annotation). The additional hassle is that every property of a subclass needs to be marked with @Column (or @JoinColumn, for single-ended associations) to tell Hibernate to put it in the secondary table. This really should happen automatically (when the whole class has a secondary table definition), but it's one of the many things the Hibernate team refuses to do.
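A minimal sketch of that mapping, with names assumed from the tables shown below (SubB would be analogous to SubA):

```java
import javax.persistence.Column;
import javax.persistence.DiscriminatorColumn;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;
import javax.persistence.SecondaryTable;
import org.hibernate.annotations.FetchMode;
import org.hibernate.annotations.Table;

// Hierarchy root: every entity row lands in the "super" table.
@Entity
@Inheritance(strategy = InheritanceType.SINGLE_TABLE)
@DiscriminatorColumn(name = "classname")
public class Super {
    @Id @GeneratedValue
    private Long id;
    private String superProp;
}

// Subclass properties go to a secondary table, fetched by a separate select.
@Entity
@SecondaryTable(name = "suba")                       // JPA: extra table joined by id
@Table(appliesTo = "suba", fetch = FetchMode.SELECT) // Hibernate: fetch it with a select
class SubA extends Super {
    @Column(table = "suba") // has to be repeated for every subclass property
    private String subAProp;
}
```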

Let's take a glance at the database model:

aero=> \d super; \d suba; \d subb;
              Table "public.super"
   Column   |          Type          | Modifiers 
------------+------------------------+-----------
 classname  | character varying(255) | not null
 id         | bigint                 | not null
 super_prop | character varying(255) | 
Indexes:
    "super_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "suba" CONSTRAINT "fk_5yjvt9lkf5b24nyxv59n3kbgj" FOREIGN KEY (id) REFERENCES super(id)
    TABLE "subb" CONSTRAINT "fk_q4aajuitkvsk93mn07t66l0pi" FOREIGN KEY (id) REFERENCES super(id)

              Table "public.suba"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 subaprop | character varying(255) | 
 id       | bigint                 | not null
Indexes:
    "suba_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_5yjvt9lkf5b24nyxv59n3kbgj" FOREIGN KEY (id) REFERENCES super(id)

              Table "public.subb"
  Column  |          Type          | Modifiers 
----------+------------------------+-----------
 subbprop | character varying(255) | 
 id       | bigint                 | not null
Indexes:
    "subb_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_q4aajuitkvsk93mn07t66l0pi" FOREIGN KEY (id) REFERENCES super(id)

The really good thing here is that the database model doesn't change at all, so if you experience problems with multiple joins, you can easily convert your JOINED hierarchy mapping to this model.

To be completely thorough, let's take a look at the data:

aero=> select * from super; select * from suba; select * from subb;
 classname | id | super_prop 
-----------+----+------------
 SubA      |  1 | superProp
 SubB      |  2 | superProp
(2 rows)

 subaprop | id 
----------+----
 subAProp |  1
(1 row)

 subbprop | id 
----------+----
 subBProp |  2
(1 row)

And finally at the from Super HQL query translated to SQL:

select [...] from SUPER super0_
select super_1_.subaprop as subaprop1_83_ from SUBA super_1_ where super_1_.id=1
select super_2_.subbprop as subbprop1_84_ from SUBB super_2_ where super_2_.id=2

There are no joins here anymore.

Is this right?


If we don't want a big join, we need to remember that Hibernate still has to load all the entities somehow. Here, we replaced the join with subclass fetches in secondary selects, which is the well-known "N+1 select problem". So is it right to use this pattern or not?

I'd say: it depends. If you have 4-5 subclasses, the join would probably be faster. If you have 50 joins, it probably wouldn't be. Moreover, if you hit the 61-join limit in MySQL, you cannot execute the query at all, so this becomes the only sensible solution.

What I really do when I see such big joins in a project is measure the performance of both approaches, and choose the better solution for the real database, with real data and the real queries we execute on the underlying model. I cannot say that "a lot of joins" is an antipattern, or that "N+1 select" is an antipattern. It just depends on the conditions. For example, if you fetch data for a view with pagination, you usually fetch no more than 10-20 entities in a single query, so the N+1 select problem is negligible. If you want to fetch a large amount of data, it probably will be a problem - but then there's probably also something wrong with your app if it needs to pull that much data from the database.

So everything should be tested and used thoughtfully (which can be said about nearly everything in software development :) ).

Further improvements


I wrote this article after a huge refactor of the system in my current project. Converting from JOINED to the model proposed above was only one of the things I refactored. It turned out that uncontrolled eager fetching of various associations can also produce unbelievable queries. You may say that if one has uncontrolled eager fetching, one also has a bad model. That's true; this is why the refactor was needed - to convert a bad model into the best-performing one. I always follow the rule that we focus on the business value of the system first, and the optimization stage comes at the end.

So what more did I do to optimize the model? First and foremost, I now consider eager fetching of collections a bad pattern. But this is not a problem, because all to-many associations are lazy by default, and if they become eager, it's the developer's choice, so he should know what he's doing and why. The real problem is single-ended associations. All single-ended associations are eager by default, and they can produce really unwanted SQL.

You can control this "join depth" using the hibernate.max_fetch_depth property. If you set it to 0, Hibernate won't make any deep joins when fetching your entities. But this doesn't make your single-ended associations lazily fetched. They will still be fetched (at least the first level of single-ended associations relative to the entity you are fetching), just via additional selects. To really make them lazy, you also need the additional @LazyToOne annotation. That really turns the objects at these association ends into lazy proxies (you then have quite ugly lazy proxies there, but this can be overcome using Hibernate bytecode instrumentation at compilation time).
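A sketch of such a lazy single-ended association; the Invoice/Customer entities are hypothetical, introduced only to illustrate the annotation placement:

```java
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import org.hibernate.annotations.LazyToOne;
import org.hibernate.annotations.LazyToOneOption;

@Entity
public class Invoice {
    @Id
    private Long id;

    // The JPA LAZY hint alone is not always enough for to-one associations;
    // Hibernate's @LazyToOne makes the association end a real lazy proxy.
    @ManyToOne(fetch = FetchType.LAZY)
    @LazyToOne(LazyToOneOption.PROXY)
    private Customer customer;
}

@Entity
class Customer {
    @Id
    private Long id;
}
```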

So, is it right to have everything lazy, and how does it influence query performance? No, it's not right to put it that way. I just don't believe in good performance when you declare once and for all: "this entity should have these relations fetched lazily, and those relations will be used frequently and should be fetched eagerly". It simply depends on how the given entity is used. Sometimes, when you design your model, it seems you can say something like this (usually thinking about fetching the list for the view where this entity is displayed), but then it turns out that you also use lists of the same entities in various other services, and you just don't need those eagerly fetched associations in those queries.

So I believe it's just not OK to define at the model level what should be fetched eagerly and what should not. There's a different place to do it - in repository queries. For each query you write, you usually know its exact usage, and you can define appropriate joins in HQL, or by using setFetchMode() when you use the Criteria API. And that is the right place to define it.
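For example, a per-query eager join with the Criteria API could look like this (the repository and entity names are hypothetical; the mapping itself stays lazy):

```java
import java.util.List;
import org.hibernate.FetchMode;
import org.hibernate.Session;

public class InvoiceRepository {
    // For a view that shows invoices together with their customers,
    // join eagerly in this one query only.
    @SuppressWarnings("unchecked")
    public List<Invoice> findAllWithCustomers(Session session) {
        return session.createCriteria(Invoice.class)
                .setFetchMode("customer", FetchMode.JOIN) // overrides the lazy mapping here
                .list();
    }
}
```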

So what I did during this refactoring can be summarized in these activities:
  1. Convert some entity hierarchies from JOINED to SINGLE_TABLE with secondary tables, to control the subtree fetching.
  2. Make everything lazy.
  3. Define appropriate joins at the repository-query level, depending on the query usage.
After this work, our high-load database test showed about a 560% performance boost.

Friday, August 28, 2015

Statistics queries with time range for PostgreSQL and Hibernate

A few days ago I was asked to implement charts of some financial data using a JS charting library, in an application running on a Postgres database. There's a lot of financial data stored in the database every minute, and I just needed to show it on charts.

OK, but charts are nice when you can restrict the amount of data returned from the server to some limited number of items. Returning 1000 items would both make the chart very heavy and unreadable, and consume a lot of network traffic. I assumed 50 as the average number of items that display nicely on such charts and are also acceptable from the traffic point of view.

It would be nice to split the time-series data into periods that always yield no more than 50 items, and then return only such statistics to the client side. And here I discovered a very nice and applicable Postgres function: date_trunc.

Equipped with this weapon, I figured out a universal solution for calculating such time-based statistics for a Postgres database using Hibernate. First I designed a very simple enum to calculate the right time-grouping criterion (the original code snippet is not preserved in this copy):
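Since the original snippet is missing, here is a hypothetical reconstruction of such an enum; the idea is that each constant maps to a field name understood by date_trunc():

```java
// Sketch only - the names are assumptions, not the original code.
public enum TimeGranularity {
    MINUTE, HOUR, DAY, WEEK, MONTH, YEAR;

    /** Field name accepted by postgres date_trunc(field, source). */
    public String dateTruncField() {
        return name().toLowerCase();
    }
}
```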

It would also be nice to have this working with any entity type, and to be able to customize the grouping query with Hibernate criteria, because that is the right abstraction to build the SQL query step by step from filters hypothetically associated with chart filters (the original code snippet is not preserved in this copy):
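A hedged sketch of what such a generic grouping query could look like with the Criteria API; all names (repository, columns, properties) are assumptions, only sqlGroupProjection and date_trunc are the actual mechanisms described above:

```java
import java.util.List;
import org.hibernate.Criteria;
import org.hibernate.Session;
import org.hibernate.criterion.Projections;
import org.hibernate.type.StandardBasicTypes;
import org.hibernate.type.Type;

public class ChartStatsRepository {
    // Groups rows of any entity by a date_trunc'ed timestamp column and sums
    // a numeric property; callers can add Criteria restrictions (chart filters)
    // before executing the query.
    public List<Object[]> timeStats(Session session, Class<?> entityClass,
                                    String truncField,  // e.g. "hour", "day"
                                    String timeColumn, String valueProperty) {
        Criteria criteria = session.createCriteria(entityClass);
        criteria.setProjection(Projections.projectionList()
            .add(Projections.sqlGroupProjection(
                "date_trunc('" + truncField + "', {alias}." + timeColumn + ") as period",
                "period",
                new String[] { "period" },
                new Type[] { StandardBasicTypes.TIMESTAMP }))
            .add(Projections.sum(valueProperty)));
        return criteria.list();
    }
}
```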

It's also quite simple to autodetect the best data granulation for the chart (e.g. whether it is an hour, a day, a month, or something else), depending on the chart's base time range.