Friday, June 8, 2012

Joins are for lazy people?


I recently had a discussion with another developer who claimed to me that JOINs (SQL) are useless. This is technically true but he added that using joins is less efficient than making several requests and link tables in the code (C# or Java).



For him joins are for lazy people that don't care about performance. Is this true? Should we avoid using joins?


Source: Tips4all

22 comments:

  1. No, we should avoid developers who hold such incredibly wrong opinions.

    In many cases, a database join is several orders of magnitude faster than anything done via the client, because it avoids DB roundtrips, and the DB can use indexes to perform the join.

    Off the top of my head, I can't even imagine a single scenario where a correctly used join would be slower than the equivalent client-side operation.

    Edit: There are some rare cases where custom client code can do things more efficiently than a straightforward DB join (see comment by meriton). But this is very much the exception.

    ReplyDelete
  2. It sounds to me like your colleague would do well with a no-sql document-database or key-value store. Which are themselves very good tools and a good fit for many problems.

    However, a relational database is heavily optimised for working with sets. There are many, many ways of querying the data based on joins that are vastly more efficient than lots of round trips. This is where the versatilty of a rdbms comes from. You can achieve the same in a nosql store too, but you often end up building a separate structure suited for each different nature of query.

    In short: I disagree. In a RDBMS, joins are fundamental. If you aren't using them, you aren't using it as a RDBMS.

    ReplyDelete
  3. Well, he is wrong in the general case.

    Databases are able to optimize using a variety of methods, helped by optimizer hints, table indexes, foreign key relationships and possibly other database vendor specific information.

    ReplyDelete
  4. No, you shouldnt.

    Databases are specifically designed to manipulate sets of data (obviously....). Therefore they are incredibly efficient at doing this. By doing what is essentially a manual join in his own code, he is attempting to take over the role of something specifically designed for the job. The chances of his code ever being as efficient as that in the database are very remote.

    As an aside, without joins, whats the point in using a database? he may as well just use text files.

    ReplyDelete
  5. If "lazy" is defined as people who want to write less code, then I agree. If "lazy" is defined as people who want to have tools do what they are good at doing, I agree. So if he is merely agreeing with Larry Wall (regarding the attributes of good programmers), then I agree with him.

    ReplyDelete
  6. Ummm, joins is how relational databases relate tables to each other. I'm not sure what he's getting at.

    How can making several calls to the database be more efficient than one call? Plus sql engines are optimized at doing this sort of thing.

    Maybe your coworker is too lazy to learn SQL.

    ReplyDelete
  7. "This is technicaly true" - similarly, a SQL database is useless: what's the point in using one when you can get the same result by using a bunch of CSV files, and correlating them in code? Heck, any abstraction is for lazy people, let's go back to programming in machine code right on the hardware! ;)

    Also, his asssertion is untrue in all but the most convoluted cases: RDBMSs are heavily optimized to make JOINs fast. Relational database management systems, right?

    ReplyDelete
  8. The last company I worked for didn't use SQL joins either. Instead they moved this work to application layer which is designed to scale horizontally. The rationale for this design is to avoid work at database layer. It is usually the database that becomes bottleneck. Its easier to replicate application layer than database. There could be other reasons. But this is the one that I can recall now.

    Yes I agree that joins done at application layer are inefficient compared to joins done by database. More network communication also.

    Please note that I'm not taking a hard stand on avoiding SQL joins.

    ReplyDelete
  9. Yes, You should.

    And you should use C++ instead of C# because of performance. C# is for lazy people.

    No, no, no. You should use C instead of C++ because of performance. C++ is for lazy people.

    No, no, no. You should use assembly instead of C because of performance. C is for lazy people.

    Yes, I am joking. you can make faster programs without joins and you can make programs using less memory without joins. BUT in many cases, your development time is more important than CPU time and memory. Give up a little performance and enjoy your life. Don't waste your time for little little performance. And tell him "Why don't you make a straight highway from your place to your office?"

    ReplyDelete
  10. Sounds like a classic case of "I can write it better." In other words, he's seeing something that he sees as kind of a pain in the neck (writing a bunch of joins in SQL) and saying "I'm sure I can write that better and get better performance." You should ask him if he is a) smarter and b) more educated than the typical person that's knee deep in the Oracle or SQL Server optimization code. Odds are he isn't.

    ReplyDelete
  11. Without joins how are you going to relate order items to orders?
    That is the entire point of a relational database management system.
    Without joins there is no relational data and you might as well use text files
    to process data.

    Sounds like he doesn't understand the concept so he's trying to make it seem they are useless. He's the same type of person who thinks excel is a database application.
    Slap him silly and tell him to read more about databases. Making multiple connections and pulling data and merging the data via C# is the wrong way to do things.

    ReplyDelete
  12. I don't understand the logic of the statement "joins in SQL are useless".
    Is it useful to filter and limit the data before working on it? As you're other respondants have stated this is what database engines do, it should be what they are good at.

    Perhaps a lazy programmer would stick to technologies with which they were familiar and eschew other possibilities for non technical reasons.

    I leave it to you to decide.

    ReplyDelete
  13. Let's consider an example: a table with invoice records, and a related table with invoice line item records. Consider the client pseudo code:

    for each (invoice in invoices)
    let invoiceLines = FindLinesFor(invoice)
    ...


    If you have 100,000 invoices with 10 lines each, this code will look up 10 invoice lines from a table of 1 million, and it will do that 100,000 times. As the table size increases, the number of select operations increases, and the cost of each select operation increases.

    Becase computers are fast, you may not notice a performance difference between the two approaches if you have several thousand records or fewer. Because the cost increase is more than linear, as the number of records increases (into the millions, say), you'll begin to notice a difference, and the difference will become less tolerable as the size of the data set grows.

    The join, however. will use the table's indexes and merge the two data sets. This means that you're effectively scanning the second table once rather than randomly accessing it N times. If there's a foreign key defined, the database already has the links between the related records stored internally.

    Imagine doing this yourself. You have an alphabetical list of students and a notebook with all the students' grade reports (one page per class). The notebook is sorted in order by the students' names, in the same order as the list. How would you prefer to proceed?


    Read a name from the list.
    Open the notebook.
    Find the student's name.
    Read the student's grades, turning pages until you reach the next student or the last page.
    Close the notebook.
    Repeat.


    Or:


    Open the notebook to the first page.
    Read a name from the list.
    Read any grades for that name from the notebook.
    Repeat steps 2-3 until you get to the end
    Close the notebook.

    ReplyDelete
  14. No, not only are joins better optimized in database code that ad-hoc C#/Java; but usually several filtering techniques can be applied, which yields even better performance.

    ReplyDelete
  15. He is most certainly wrong. While there are definite pros to data manipulation within languages like C# or Java, joins are fastest in the database due to the nature of SQL itself.

    SQL keeps detailing statistics regarding the data, and if you have created your indexes correctly, can very quickly find one record in a couple of million. Besides the fact that why would you want to drag all your data into C# to do a join when you can just do it right on the database level?

    The pros for using C# come into play when you need to do something iteratively. If you need to do some function for each row, it's likely faster to do so within C#, otherwise, joining data is optimized in the DB.

    ReplyDelete
  16. I suspect he has a limited view on what databases should be used for. One approach to maximise performance is to read the entire database into memory. In this situation, you may get better performance and you may want to perform joins if memory for efficiency. However this is not really using a database, as a database IMHO.

    ReplyDelete
  17. I will say that I have run into a case where it was faster breaking the query down and doing the joins in code. That being said, it was only with one particular version of MySQL that I had to do that. Everything else, the database is probably going to be faster (note that you may have to optimize the queries, but it will still be faster).

    ReplyDelete
  18. He is wrong, joins are what competent programmers use. There may be a few limited cases where his proposed method is more efficient (and inthose I would probably be using a Documant database) but I can't see it if you have any deceent amount of data. For example take this query:

    select t1.field1
    from table1 t1
    join table2 t2
    on t1.id = t2.id
    where t1.field2 = 'test'


    Assume you have 10 million records in table1 and 1 million records in table2. Assume 9 million of the records in table 1 meet the where clause. Assume only 15 of them are in table2 as well. You can run this sql statement which if properly indexed will take milliseconds and return 15 records across the network with only 1 column of data. Or you can send ten million records with 2 columns of data and separately send another 1 millions records with one column of data across the network and combine them on the web server.

    Or of course you could keep the entire contents of the database on the web server at all times which is just plain silly if you have more than a trivial amount of data and data that is continually changing. If you don't need the qualities of a relational database then don't use one. But if you do, then use it correctly.

    ReplyDelete
  19. I've heard this argument quite often during my career as a software developer. Almost everytime it has been stated, the guy making the claim didn't have much knowledge about relational database systems, the way they work and the way such systems should be used.

    Yes, when used incorrectly, joins seem to be useless or even dangerous. But when used in the correct way, there is a lot of potential for database implementation to perform optimizations and to "help" the developer retrieving the correct result most efficiently.

    Don't forget that using a JOIN you tell the database about the way you expect the pieces of data to relate to each other and therefore give the database more information about what you are trying to do and therefore making it able to better fit your needs.

    So the answer is definitely: No, JOINSaren't useless at all!

    ReplyDelete
  20. Unless I've seriously misunderstood, the logic in the question is very flawed

    If there are 20 rows in B for each A, a 1000 rows in A implies 20k rows in B.
    There can't be just 100 rows in B unless there is many-many table "AB" with 20k rows with the containing the mapping.

    So to get all information about which 20 of the 100 B rows map to each A row you table AB too. So this would be either:


    3 result sets of 100, 1000, and 20k rows and a client JOIN
    a single JOINed A-AB-B result set with 20k rows


    So "JOIN" in the client does add any value when you examine the data. Not that it isn't a bad idea. If I was retrieving one object from the database than maybe it makes more sense to break it down into separate results sets. For a report type call, I'd flatten it out into one almost always.

    In any case, I'd say there is almost no use for a cross join of this magnitude. It's a poor example.

    You have to JOIN somewhere, and that's what RDBMS are good at. I'd not like to work with any client code monkey who thinks they can do better.

    Afterthought:

    To join in the client requires persistent objects such as DataTables (in .net). If you have one flattened resultset it can be consumed via something lighter like a DataReader. High volume = lot of client resources used to avoid a database JOIN.

    ReplyDelete
  21. What's the alternative the OTHER developer is providing.

    We can correctly comment only seeing the other option he/she is providing.

    Because - Laziness is relative and so is Performance.
    But options are comparative, the solution he is bringing may be real good. Just let us know.

    ReplyDelete
  22. This is "technically true" only in one case which is not used often in applications (when all the rows of all the tables in the join(s) are returned by the query). In most queries only a fraction of the rows of each table is returned. The database engine often uses indexes to eliminate the unwanted rows, sometimes even without reading the actual row as it can use the values stored in indexes. The database engine is itself written in C, C++, etc. and is at least as efficient as code written by a developer.

    ReplyDelete