Link here

Link here

Outsmarted by LINQ-to-SQL

LINQ, SQL 4 Comments »

(and how to read a SQL execution plan)

I’m using LINQ-to-SQL on a current project, which is mostly a pretty positive experience (ignoring the odd frustrating limitation) - it’s incredibly easy to set up.

When using any object-relational mapper (ORM), LINQ-to-SQL, NHibernate or whatever, you can’t just blindly trust the SQL queries they’re generating. Hopefully, the queries will be as finely tuned as if you lovingly hand-crafted them yourself, but what if they’re not? Do you care if your production database server melts down? Sensibly, you’ll keep SQL Profiler open and scrutinise each new type of query.

Shock and horror

Following that advice, when I started with LINQ-to-SQL I used SQL Profiler to see what it was getting up to. For example, I had everyone’s favourite Customer-Orders relationship:

schema

… and I was doing a query to find the customers who have never ordered anything:

ExampleDBDataContext context = new ExampleDBDataContext();
 
// The query
var customers = from c in context.Customers
		where c.Orders.FirstOrDefault() == null
		select c;
 
foreach (Customer c in customers)
	Console.WriteLine(c.Name);

It worked, but I was appalled to see it generate the following SQL:

SELECT [t0].[CustomerID], [t0].[Name], [t0].[CreatedDate]
FROM [dbo].[Customers] AS [t0]
WHERE NOT (EXISTS(
    SELECT TOP (1) NULL AS [EMPTY]
    FROM [dbo].[Orders] AS [t1]
    WHERE [t1].[CustomerID] = [t0].[CustomerID]
    ))

Smell that dirty subselect! As someone who’s been brought up on the mantra "subselects are bad; always use joins!", I wanted to write to Microsoft and educate them that the correct SQL would be:

SELECT [t0].[CustomerID], [t0].[Name], [t0].[CreatedDate]
FROM [dbo].[Customers] AS [t0]
LEFT OUTER JOIN [Orders] o ON t0.[CustomerID] = o.[CustomerID]
WHERE o.[OrderID] IS NULL

Mmm, that’s much cleaner and nicer. Lovely elegant joins. Stupid LINQ-to-SQL…

I think you can guess what’s coming

Let’s put my beliefs to the test. When there are 100 customers and 1000 orders, the two methods’ execution plans look like this:

ExecutionPlans-smalldata-annotated

Notice the "query cost" values (smaller is better).

LINQ-to-SQL’s method does a "stream aggregation" to get a list of distinct CustomerIDs from the Order table, then left-anti-semi join excludes any Customer rows which match one of those IDs. My "left outer join" method, on the other hand, joins all Customer-Order pairs, then has a filter to exclude any joined rows that have a CustomerID.

LINQ-to-SQL’s method wins slightly, but only very slightly. The near-identical performance is not surprising since they both scan all 100 customers and all 1,000 orders.

More data, more data

Repeating the experiment, but now with 100 customers and 1,000,000 orders, the execution plans change to:

ExecutionPlans-largedata-annotated

Agh! My elegant method is about 200 times slower than LINQ-to-SQL’s clumsy subselect! But why?

The query plan for my method remains identical, so now it has to scan all 1,000,000 order rows, joining them to customer records, and filtering out any customers with orders.

The query plan for LINQ-to-SQL’s method has changed, so now for every Customer record, it just subselects the TOP 1 matching Order record and does a left-anti-semi join (so Customers are included only if no matching Order was found). This means it doesn’t have to look through all 1,000,000 rows - it can bail out as soon as the first matching Order is found. Assuming you have a CustomerID index on the Order table, this is very fast indeed.

Conclusions

Obviously, the conclusion isn’t "LINQ to SQL is always right", or "subselects are always better than JOINs". I am merely admitting that LINQ to SQL isn’t as dumb as I thought, nor am I as clever as I thought. Oh, and scrutinising LINQ to SQL using SQL Profiler isn’t always enough; you sometimes need to inspect those execution plans too.

Geolocation is easier (and cheaper) than you think

Geolocation, SQL No Comments »

Most of the time when you’re surfing the web, or creating web applications, you don’t expect real geography to be involved. Historically it’s been tricky to identify with any accuracy or reliability the physical location of your visitors, and might even be said to contravene the spirit of the web. But if what if you really do want to take action based on where they’re coming from?

I discovered today that it’s really easy. Geolocation is the term used for the conversion of an IP address to a real-world geographic location. For example:

216.239.59.103 maps to:

City Latitude Longitude
Mountain View,
California, USA
37.3956 -122.076

There are a whole bunch of companies selling geolocation services - usually either as web services, or as downloadable databases containing IP addresses and geographic data.

Free geolocation data

If you’re looking for something free, check out MaxMind GeoLite City. You can download a huge pair of CSV files (one for IP blocks, one for corresponding location data) from their website and import them into SQL Server, and you’ll end up with a schema like this:

asd

Now, there are 4 billion possible IP addresses, so rather than explicitly listing every one as a separate row (which really would hurt SQL Server) the MaxMind people have split up the IP address space into blocks which correspond to a single geographic location. To help SQL Server find which block an IP address is in, they have defined a one-to-one mapping from IP addresses to BIGINT values, as such:

A.B.C.D <=> A*256^3 + B*256^2 + C*256 + D

 

So now you can (almost) instantly obtain geographic data for any IP address by defining a SQL user defined function (UDF) such as:

CREATE FUNCTION [GetGeocodingData]
    (@a tinyint, @b tinyint, @c tinyint, @d tinyint)
RETURNS @ReturnTable TABLE (
	City VARCHAR(255),
	Region VARCHAR(255),
	Country VARCHAR(255),
	Latitude FLOAT,
	Longitude FLOAT
)
AS
BEGIN
	INSERT @ReturnTable
	SELECT 	lo.City,
		lo.Region,
		lo.Country,
		lo.Latitude,
		lo.Longitude
	FROM Blocks bl
	JOIN Locations lo on lo.LocationID = bl.LocationID
	WHERE (CAST(@a AS BIGINT)*256*256*256 + @b*256*256 + @c*256 + @d)
	  BETWEEN bl.[BlockStart] AND bl.[BlockEnd]

	RETURN
END
GO

In this example, the parameters @a, @b, @c, @d correspond to the four parts of the IP address (and should therefore be integers in the range 0-255).

Site Meter