Select your font size 
 
about us products & services consulting & support news & events contact us
Paul Meagher shows how to use a database query to calculate conditional probability.

Conditional Probability and SQL - PEI

print this article 
 

P(A | B) can be mapped onto database-query operations. For example, the probability of cancer given a positive test result, P(+cancer | +test), can be obtained by issuing this SQL query then doing some tallies on the result set like this:

SELECT cancer_status FROM Data WHERE test_status='+test'

If I gather information about how several boolean-valued tests co-vary with a boolean-valued diagnosis (like that of cancer or not cancer), then I can perform slightly more complex queries to study how diagnostically useful other factors are in determining whether a patient has cancer, such as in the following:

SELECT cancer_status FROM Data WHERE genetic_status='+' AND age_status='+' AND biopsy_status='+'

In the case of detecting e-mail spam, I might be interested in computing P(+spam | title_word='viagra' AND title_word='free'), which could be viewed as a directive to issue the following SQL query:

SELECT spam_status FROM Emails WHERE email_title LIKE 'viagra' AND email_title LIKE 'free' 

After enumerating the number of e-mails that are spam and have "viagra" and "free" in the title (like so):

count_emails(spam_status='+spam' AND email_title LIKE 'viagra' AND email_title LIKE 'free')

and dividing by the overall number of e-mails with the words "viagra" and "free" in the title:

count_emails(email_title LIKE 'viagra' AND email_title LIKE 'free')

I might arrive at the conclusion that the appearence of these words in the title strongly and specifically co-varies with the message being spam (after all, 18/18 = 100 percent) and this rule might be used to automatically filter such messages.

In Bayes spam filtering, you need to initially train the software in which e-mails are spam and which are not. One can imagine storing spam_status information with each e-mail record (for example, email_id, spam_status, email_title, or email_message) and doing the previous queries and counts on this data to decide whether to forward a new e-mail into your inbox.



Page:   1  2  3  4  5  6  7  8  9  10  11 Next Page: Frequency versus probability format

The content shown in this page was first published by IBM developerWorks and is reprinted with permission from Paul Meagher (www.datavore.com)


Most Recent Website and Regional Updates

 Timing Upgrades - Factors Affecting Time Between Purchases for Tech Toys
It is possible to understand client purchase decisions by performing a regression analysis. By forming strategies based on the results, companies can optimize strategic programs to maximize profits.

 
 Personal Shopping Assistants - Turning the Table Against Merchant Databases
Consumers can use technology to watch the merchants who already have been watching them. But to do this, they need a champion.

 
 Operations Research
Links to pages related to Operations Research, which is the methodical study of how to do things better.

 
 Operations Research: Avoiding the Taint of Corporate Espionage
A $200M corporate espionage lawsuit against Westjet uncovers the possibility that Operations Research firms may not fully protect client data. A mechanism is suggested to ensure two competitors do not share the same Operations Research team.

 
 Reviewing the Audit of Management Consulting Engagements in Government
Review of Audit of Management Consulting Engagements in Government (2000/01 Report 4), as well as the 2002 and 2003 follow-ups from the Office of the Auditor General of British Columbia.

 
 Competition in Content Distribution Raises Value of Creativity
Content distributors facing competition on all sides turn to creative types for product differentiation. Using stable cash flow, distributors might acquire creative teams. Instead, a risk mitigating joint venture alternative is proposed.

 
 Adaptive Management
How the Forest Practices Branch of the British Columbia Ministry of Forests describes Adaptive Management, with links to further information about business process optimization.

 
 Macintosh Technical Support
Transparen provides remote technical support for Macintosh, Apple, iBook, and PowerBook.

 
 Linux-based Web Service Solutions
Integrate disparate systems using Web Services, where such services are available in other than XML.

 

Google
 
Web transparen.com

Contact Information

Related Information

 
   
 
E C M | © 2003-2007 Transparen Corp.      

Standardized Services: Data Recovery Service / Creative Services / Premium Web Hosting Services / System Administration Tech Support Services
Recent Projects: Full-Service Mortgage and Financing Company / System to manage flights from Vancouver to Tofino / Photo exchange verification service
Our Vancouver BC Server Proudly Hosts: automated parking and revenue control systems, leafside lane at southlands, cost effective alternative power sources, the photo genie, pacific forage bag supply, sunburst medical, neosonic design, roger mahler photography - passionate, intriguing, desirable, the connection between east and west, affordable flights to victoria and tofino, low interest mortgage brokers in vancouver, richmond, surrey, toronto, mortgage brokers in calgary
* Alberton * Ascension * Belfast * Borden * Breadalbane * Cavendish * Central Bedeque * Charlottetown * Cornwall * Emerald Junction * Foxley River * Freeland * Georgetown * Hampton * Hebron * Hunter River * Kensington * Knutsford * Miminegash * Montague * Morell * Mt. Stewart * Murray Harbour * Murray River * North Cape * North Rustico * O'Leary * Orwell * Souris * Stanhope * Stratford * Summerside * Tignish * Tyne Valley * Avonlea