Count Distinct?

JaedenRuiner

Well-known member
Joined
Aug 13, 2007
Messages
340
Programming Experience
10+
I'm trying to think how I would have done it elsewhere, but I can't seem to resolve a little distinct issue in my T-SQL.

Given the nature of the DISTINCT keyword, when selecting rows, it acts somewhat like a "unique" key for all columns specified in the Selection. Now I know someone is going to come up and say "Distinct is not a key" so please ignore them and pay attention. To Explain what I mean: Each row in a Select Distinct must be completely unique from all other rows. Therefore, in a 3 column table:
VB.NET:
	Col1	Col2	Col3
	 1	 2	 3
	 1	 2	 2
	 1	 2	 1
	 1	 2	 3  <- not distinct
So in truth it DOES behave in a similar manner to a Unique key constraint around ALL columns specified just for the immediate purpose of selection.

Now that that is out of the way...how do you count it?

I can do:
VB.NET:
Select Count(Distinct col1)
and that works just fine. but for the example above that would return a value of 1 and my count would be off.
VB.NET:
Select Count(distinct col1, col2, col3)
returns an error about the "," being present.

Thus, I come to this: How can I get the Distinct Count for a DISTINCT subset of columns out of a table. (I ask, because this table has 16 columns, and I only need 6, but those 6 together must be distinct or else it throws my insert out of whack. I noticed the same issue with Oracle, that even though I tell it to Insert Into one table from another, it can often insert duplicates of the same row for no reason. I need to guarantee that the rows inserted from this table are Distinct, and I'd REALLY like to have the count of those rows before hand.)

Thanks
 
How about

VB.NET:
SELECT
  TABLENAME.COL1,
  TABLENAME.COL2,
  TABLENAME.COL3
FROM
  TABLENAME
GROUP BY
  TABLENAME.COL1,
  TABLENAME.COL2,
  TABLENAME.COL3

which should give you all your distinct rows, and then to get your COUNT value, just pick any one of the columns :-

VB.NET:
SELECT
  COUNT (DISTINCTROWSTABLE.COL1)
FROM
(
SELECT
  TABLENAME.COL1,
  TABLENAME.COL2,
  TABLENAME.COL3
FROM
  TABLENAME
GROUP BY
  TABLENAME.COL1,
  TABLENAME.COL2,
  TABLENAME.COL3
) DISTINCTROWSTABLE
 
yea...

but that is a level of indirection I wanted to avoid. Guess it must be done.

<edit>
As well i found, and i don't know if this is T-SQL or the SqlCommand class but they won't allow parenthetical FROM statements...
VB.NET:
SELECT Count(*)
FROM (
SELECT DISTINCT [In Pool], [Location], [X-Ref], [In Pool1], [Location1],
[X-Ref1], [Ship To], [X-Ref2], [In Pool2], [Name 1], [Your refer#],
[Material Description], [Matl Xref], [Date], [Quantity], [BUn]
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0','Excel 8.0;Database=c:\db\rexam1.xls', [Sheet1$])
) as Table
WHERE (([X-Ref2] <> '') AND 
([Your refer#] <> '') AND 
(Len([Ship To]) < 10) AND 
([Matl Xref] IN ('100202', '100210', '100253', '100254')))

Error: Incorrect Syntax Near WHERE.
Every time no matter the format or the setup, the sql engine has an issue with parenthetical sub-queries.

<Edit2> Okay, found the issue, but that is really ana-...meticulous in its syntax requirements....I must have an alias for the subquery.

But I think this will do it...now that I've got it figured I can resolve my Counting issue.
Thanks
 
Last edited:
Now that that is out of the way...how do you count it?
Count what? The number of distinct rows in a DISTINCT query resultset is always the number of rows in the query resultset.

If you did (Oracle syntax):

INSERT INTO x
SELECT DISTINCT * FROM y

The operation returns 237 when you ExecuteNonQuery(), and that is how many distinct rows were inserted: 237

How can I get the Distinct Count for a DISTINCT subset of columns out of a table.
Group them, and count the result of the grouping. Avoid using the word distinct, unless it's for COUNT(DISTINCT)

(I ask, because this table has 16 columns, and I only need 6, but those 6 together must be distinct or else it throws my insert out of whack.
INSERT INTO x
SELECT DISTINCT a,b,c,d,e,f FROM sixteen_cols

Distinct takes a hash of all the columns you select, not all the columns in the table. It's not a key. It's exactly the same as writing:

INSERT INTO x
SELECT a,b,c,d,e,f FROM sixteen_cols GROUP BY a,b,c,d,e,f



it can often insert duplicates of the same row for no reason.
Your join (if the query contains one) or the where clause specification is faulty and or not comprehensive enough to guarantee that the rows from the nested select are distinct w.r.t. all selected values. Be more careful how you write your queries; adding a DISTINCT in at the fial step to remove a faulty cartesian product has "nasty hack" written all over it :/



I need to guarantee that the rows inserted from this table are Distinct, and I'd REALLY like to have the count of those rows before hand.)

SELECT COUNT(*) FROM (SELECT 1 FROM sixteen_cols GROUP BY a,b,c,d,e,f)

While you complain of "adding a level of indirection" you cannot do this op any other way; you must instruct the database to group all the data up (it'll probably use a hash table), and then count the number of entires. A COUNT(DISTINCT col1) will do this under the hood, as will COUNT(*) FROM(...GROUP BY col1)

You should not assume that the query the DB chooses to run will look anything like the query you write, or in the order you wrote it.
 
You should not assume that the query the DB chooses to run will look anything like the query you write, or in the order you wrote it.
That's always been my problem, :) Same with oracle, I don't want the DB thinking it knows better. Nothing doing, just that my ego problem does involve a slight disagreement with the concept a computer can think better than I. In Oracle there were hints, so I could avoid nested loops, and other such bogglesome plans that pretty much always slowed my queries down. But yea, it is a good practice to remember the DB tries to do its best with what you provide.

Thanks for the pointers.
 
That's always been my problem, :) Same with oracle, I don't want the DB thinking it knows better.
"Asking whether a computer can think is like asking whether a submarine can swim"
Cliches aside, a lot of very expensive, clever beards were hired by the company in question and they used all the insider information about the DB that you will never have access to, to write an optimizer that makes choices you wouldnt or couldnt. If youre not one of those clever beards, then I'd pretty much say that the DB does know how to deliver its data better than you do, so youre gonna have to re-think your question. Youre coming along after the fact and saying "oh, my query runs slow so the DB must be crap" and trying to micromanage everything without taking time out for a true appreciation as to how the database is storing and retrieving its data. There's a hypocrisy in there;you say you want better control over a system but youre not interested in learning how the system works.

I could avoid nested loops
And if the CBO has chosen nested loops as being the best way to complete a query, who are you to argue? It's chosen that method for reasons that you do not and cannot know because you don't have access to all the stats about where and how the data is stored.. One only has to look at the remedials who post to AskTom to actually start to looks at an Oracle database as something really quite awesome; it astounds me that it takes somewhere in the region of 9 seconds to scan millions of transactions and work out the average number of days, times and amounts that have been transacted since the first activation, for every card in the credit scheme we manage.. That query joins 5 tables with nested loops too.


other such bogglesome plans that pretty much always slowed my queries down
But therein lies your limitation. Youre working off a simple rule: "nested loops are bad, i must always force Oracle to use something else" whereas the beards at Oracle have programmed things to be a lot more clever. Hints can't make a silk purse out of a sow's ear. If it were a horse, feed it something it will eat and watch it run, rather than whipping it repeatedly because it's too slow on a diet of straw and dust.
 
Last edited:
And if the CBO has chosen nested loops as being the best way to complete a query, who are you to argue?
the user that has to wait 30 mintues for a query that can be completed in 45 seconds when using a hash or merge method instead of a nested loop. Computers aren't perfect and neither are those clever beards. No matter how much "perfection" anyone claims is in their product...their lying. (Myself included).

As well I have seen plans come out of Databases that some how believe there is a Cartesian join, when the exact same query two days earlier executed and did not come up with that. Trusting the system is like getting into an airplane without a pre-flight check. Just because something is designed to work doesn't mean it always will.

Granted, sometimes using a WHERE clause join is easier for the DB plan to dissect over a INNER JOIN clause, as well a DISTINCT versus a GROUP BY. All these thing can shift and cause the plan to be seconds, minutes, or hours faster or slower dependent upon the whim of code I can't determine because it's built in to the DB architecture. The Answer: test multiple ways until you find the fastest one for the approach. I'll often run the same query 3 or 4 different ways syntactically to find the best approach based upon the database and how it is habitually responding. (and that doesn't even bring up the issue of caching, which often skews the plan anyway).

I trust the database to do what it can, and yes the speed and efficiency that many modern database possess is indeed astounding, But i take empirical evidence over whitepapers. When I run a query, and the execution plan comes up in the vicinity of 5 million seconds to execute, i tweak it here and there, sometimes with hints, sometimes with alterations to syntax, and suddenly i get the same results in a 10th the time.
There's a hypocrisy in there;you say you want better control over a system but youre not interested in learning how the system works.
Quite untrue. I want better control over the system, and I do learn how the system works, and then I take advantage of my doubt in the systems abilities. every program, application, and system i have ever used I use it at the default level first. I learn what it does and how it does it, and then see if I can one-up them. Sometimes I can, a lot of times I can't. But that isn't gonna stop me from trying. :)

But therein lies your limitation. Youre working off a simple rule: "nested loops are bad, i must always force Oracle to use something else"
At that company, on their server, it was a rule. It wasn't my limitation, it was the servers. Whether the limitation of the server, or the complexity of their database, the Nested Loop damn near quadrupled any query request. Where as HASH and MERGE tested in differenc combinations was ALWAYS faster. Now, since I no longer work at that company I have not needed to use that pre-determination any more, and Nested Loops are perfectly competent methods of execution, and I haven't used a on a query since leaving that place. But again, just because something is "clever" sometimes those clever designs outsmart themselves. I'll trust my eyes, ears, and my own mind thank you very much. When the system does what the whitepaper promises, I'll stick with it...when it doesn't, i'll force it into submission. *shrug*
 
As well I have seen plans come out of Databases that some how believe there is a Cartesian join, when the exact same query two days earlier executed and did not come up with that.
Plan and execution are two different things. If you dont follow rules that eliminate cartesian products then the addition of data to a system will result in a cartesian join.


Just because something is designed to work doesn't mean it always will.
Garbage in, garbage out. The only inconsistent in the system, I'm afraid, is the human pushing the buttons.

Granted, sometimes using a WHERE clause join is easier for the DB plan to dissect over a INNER JOIN clause, as well a DISTINCT versus a GROUP BY.
Do you actually believe some of the stuff you write? As noted before, unless youre one of the clever beards who wrote the Oracle CBo, please refrain from stating "facts" about what a db finds "easier"

All these thing can shift and cause the plan to be seconds, minutes, or hours faster or slower
Plan speed is necessarily seconds or less.

The Answer: test multiple ways until you find the fastest one for the approach.
Who says fastest is best? Your fast query might have shocking resource consumption; using all the temp tablespace on a busy multiuser system will often be the straw that breaks the camel's back. Youre trying to summarise points that take pages of white paper to discuss, into a few lines of forum post and all youre achiving is painting a picture of ignorance for yourself..

When the system does what the whitepaper promises
I doubt any the systems you use, in the way you use them, ever will
 
Garbage in, garbage out. The only inconsistent in the system, I'm afraid, is the human pushing the buttons.
And that doesn't include the Human pushing the buttons that designed the microchip? Or the Human that designed the robot that built the microchip? Or the Human that Put the microchip into the computer? Or the Human that write the OS? Or the Human that wrote the DataBase?

There are Humans all the way through the process, it appears you only wish to point out the hypocrisy in me, by pointing out the "loose nut behind the keyboard", without realizing that the same hypocrisy exists in you by claiming the system is flawless. The creation is as flawed as its creator: Computers are flawed because humans are flawed. Period.

Do you actually believe some of the stuff you write? As noted before, unless youre one of the clever beards who wrote the Oracle CBo, please refrain from stating "facts" about what a db finds "easier"
Yes, I do, because I have evidence and proof to back it up. I don't mean a whitepaper, or spec sheet, or what they told me it would do. I have recurrent events over time of the system - all systems - failing to do what they "claimed" they could do. That is a fact, and just because it disagrees with the "company line" does not negate that one bit. You apparently have a lot of faith in designers you never met, who frankly, I do believe are very good at their job, but until some miracle occurs to make humankind perfect, I will stick to the fact that those clever beards, as clever as they are, are just as flawed as you or I. To put it another way, I'd more willing to trust you in your development than those clever beards, because they are a complete unknown to me, where I at least have some reference to how you approach a problem.

Plan speed is necessarily seconds or less.
According to whom?

Who says fastest is best? Your fast query might have shocking resource consumption; using all the temp tablespace on a busy multiuser system will often be the straw that breaks the camel's back.
"Slow is steady, steady is smooth, smooth is fast." I'm sure you've heard that before. If you can correlate in metaphor the nature of all conceptually "compiled" languages, I am a least afforded the freedom to create a metaphor of "speed" being more than just alacrity of execution. Speed could more or less be related to efficiency, and using shocking amounts of resources is never efficient.

Youre trying to summarise points that take pages of white paper to discuss, into a few lines of forum post and all youre achiving is painting a picture of ignorance for yourself..
That is a valid opinion. I would disagree, and so would many others, but eh, that's the joy of being alive. :)

I doubt any the systems you use, in the way you use them, ever will
Well, it's a good thing you don't gamble on that bet. *chuckle*

Its like I've said before, though I have tried to refrain due to this forum being a question forum, not a pulpit for either you or me, but it seems that at every turn there are many here who point out my flaws as I ask for help with respect and courtesy, helping me find the things I need to find, while others seem adamant in attacking me as a person, and playing a game of one-ups-man-ship, an "I know more than you" tit for tat childish game because you've read a white paper or two.

Well, frankly I don't need your validation on the matter. I was programming before .NET, and Flash, and web applications, and before everybody had a PC in their house. I was programming before these "advanced database systems" even existed, and the ones that did took up an entire floor of a building to do what we do with Access on a single desktop. I was developing my own database systems, and I had to write my own "Framework" because they weren't any out there. Every menu, button, control, had to be made because Windows 2.0 and 3.1 didn't have the Rapid Application Designers where everything was pre-made, and that doesn't even count all the DOS programs which don't even have a graphics library - I had to write that too. I've had to program a protected mode interface, and design the entire graphic presentation, as well as hardware handlers, input handlers, and had to deal with VESA bank switching every 64K of memory because there was no linear addressing.

So please, I know exactly what those "clever beards" do and where they came from. I never claimed I would do any better, but having created systems similar to theirs I would not do much worse. I have only ever postulated that it is prudent to test the limits of a system in order to find where those clever beards failed and where they succeeded. You don't buy a house without doing your due diligence; you don't buy a car without making sure it runs. The clever beards you seem so adamant to support are not just the super genius MIT Grads in the basement designing this stuff, but they are also the MBA's; the slick and devious marketing reps who are the individuals writing those whitepapers, and make them 4 pages thick to confuse people into thinking the system is cooler than it is, just like a back lot used-car salesmen. When I buy a car I talk to the mechanics who have worked on that make and model, I talk to people who have owned the vehicle for several years, I don't ask the saleman, because hey, salesmen lie - go figure:confused:.

So until you have actually talked to a "clever beard" not associated with Sales Marketing or public relations, I believe I will trust my experience over your faith. In the end, I have been respectful of people on this forum, and most have been respectful of me, I will continue to do so, even for those who are incapable of reciprocating.

Cheers
 
Back
Top