The world of table joins

So far we have considered joins based on a foreign key.
In the example from the last section, the field rantrade.stock was
declared as a foreign key into the table stock, so its values are
drawn from the key column stock.mystock.
That was done in the definition for rantrade:
stock: `stock$n?stocks; ....

and the key of stock was declared as follows in the definition for stock:
stock:([mystock: stocks] ....

Sometimes, we want to do equi-joins where one table is keyed,
but the other table may have values not in the key.
For example, see the file equijoin.q:


stocks: `x1`x2`x3`x4`x5;
states: `ny`ca`wa`ca`ca;

stock:([mystock: stocks] place: states);

stockcoolness:([mystock: 1 _ reverse stocks] coolness:`y`y`n`n);
stockcoolness2:([]mystock: 1 _ reverse stocks; coolness:`y`y`n`n);


show select from stock ij stockcoolness

ij = "inner join"; it joins on equality based on the key of the right table.
Rows of the left table with no match are dropped, so ij never produces nulls.

show select from stock lj stockcoolness

Note that x5 has no coolness value, so its coolness is null. These semantics
(in which left-table rows whose key values are missing from the right table
get nulls) are known as a left outer join.
http://code.kx.com/wiki/Reference/lj
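
To see the difference between ij and lj on something small, here is a
minimal sketch. The tables left and right are made-up toy tables (they are
not part of equijoin.q); only the right one is keyed.

```q
/ toy tables (hypothetical, for illustration only)
left:([] k:`a`b`c; v:1 2 3)
right:([k:`a`b] w:10 20)

show left ij right   / 2 rows: only a and b have a match in right
show left lj right   / 3 rows: c is kept and its w is null (0N)
```

The row for c disappears under ij and survives with a null under lj.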

http://code.kx.com/wiki/Reference/SystemCommands

Engineering note: For ij and lj, the right table must be keyed.
The reason for this requirement on the right
table is that it permits quasi-linear processing.
Execution can proceed through the left table and perform lookups on
the right table via a data structure built on its key.
This is not as good as a foreign key join (where lookups
on the foreign key table occur by array indexing), but almost.
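
The lookup idea in the note can be tried by hand. This is only a sketch of
the idea, not a claim about how q implements lj internally; right and probe
are made-up names. Indexing a keyed table with a table of key values
returns the matching value rows, with nulls where a key is missing.

```q
/ toy keyed table and a table of keys to look up (hypothetical names)
right:([k:`a`b] w:10 20)
probe:([] k:`a`c`b)

show right probe     / one row per probe key; w is 10, 0N, 20
```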


Exercise 1. (Easy) Without trying it, try to predict what would happen if 
we did
	show select from stockcoolness lj stock
Now try it.


There are two markets with different trades.
For each trade t1 in the first table F, we want the trade t2 in the
second table S on the same symbol whose time is the latest one that
precedes or is equal to the time of t1.
If several records of S share that time, t2 is the last of them in row order.
If the second table is ordered by time, then t2 is simply the most recent
trade whose time is equal to
or preceding that of t1 and that joins with t1 on symbol.
For this, we use an as-of join.
https://code.kx.com/trac/wiki/Reference/aj



n: 30;
stocks: `ibm`hp`amaz`goog`aapl;
rantrade:([]stock: n?stocks; price: 20 + n?380.0;amount: 100*(1+n?1000);
  time: 10:00:00.000 + n?00:01:00.000);

rantrade2:([]stock: n?stocks; price2: 20 + n?380.0;amount2: 100*(1+n?1000);
  time: 10:00:00.000 + n?00:01:00.000);

res: aj[`stock`time;rantrade;rantrade2]

But note that in the semantics:
"if there are several
matching records in the second table, 
the items of the last (in row order) matching record 
are appended to those of first table"
That's why we would want the second table to be ordered by time.
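
A tiny example makes the "last matching record" rule concrete. The tables
f and s below are made up for illustration; s is already in time order and
has two ibm records with the same time.

```q
/ toy tables (hypothetical); s is sorted by time
f:([] stock:`ibm`ibm`goog; time:10:00:05 10:00:10 10:00:07)
s:([] stock:`ibm`hp`ibm`ibm; time:10:00:01 10:00:02 10:00:04 10:00:04; price2:1 4 2 3f)

r:aj[`stock`time;f;s]
show r               / price2 is 3 3 0n
```

Both ibm trades pick up the later of the two 10:00:04 records (price2 = 3),
and goog, which never appears in s, gets a null.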

Engineering note: The use case for aj by its nature implies
that neither table can be key based.
However, suppose that we group the right table by stock in an order-preserving
way (to keep the time order). 
Then the join can again be linear time. 
(It is speculation on my part that this is what aj is doing.)
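
Here is a hand-rolled sketch of that speculation for a single lookup. It is
not meant as a description of how aj is really implemented; the table s
and the names idx, i and row are made up. group keeps row order inside
each group, so a binary search (bin) on the times of one stock finds the
latest record at or before a given time.

```q
/ toy right table (hypothetical), sorted by time
s:([] stock:`ibm`hp`ibm`ibm; time:10:00:01 10:00:02 10:00:04 10:00:04; price2:1 4 2 3f)

idx:group s`stock                       / stock -> row numbers, in row order
i:idx`ibm                               / rows for ibm: 0 2 3
row:i[(s[`time] i) bin 10:00:05]        / latest ibm row at or before 10:00:05
show s row                              / the record with price2 = 3
```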


Exercise 2. (Easy) Make it so each trade 
of the first table is linked to the "correct" trade of the second table.
All trades take place between 10 AM and 10:01 AM.
	q asofjoin.q
Look at the patterns for some stock, say aapl, and explain which
trades are being linked.

Sometimes, we want to link one trade with some aggregate on other
trades that occur in a window of time.
For this, we use the window join (wj).
Look that up here
https://code.kx.com/trac/wiki/Reference/wj
and try to solve the following exercise.
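
Before attempting the exercise, it may help to see the shape of a wj call
on toy data. This is a sketch based on the documented argument layout
wj[w;c;t;(q;(f;col))]; the tables trades and quotes and the two-second
window are made up. According to the documentation, wj also considers the
record prevailing when the window opens, while wj1 looks only at records
inside the window.

```q
/ toy tables (hypothetical); quotes sorted by sym and time, with `p# on sym
trades:([] sym:2#`ibm; time:10:00:05 10:00:20; price:100 101f)
quotes:update `p#sym from ([] sym:4#`ibm; time:10:00:01 10:00:04 10:00:06 10:00:19; price2:50 51 52 53f)

w:-2 2+\:trades`time                    / windows: [time-2s; time+2s]
r:wj[w;`sym`time;trades;(quotes;(min;`price2))]
r1:wj1[w;`sym`time;trades;(quotes;(min;`price2))]
show r                                  / may also see the quote prevailing at window start
show r1                                 / price2 is 51 53: only quotes inside each window
```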

Exercise 3. (Middle) There are two markets 
with trades on various symbols. We want
to find the prices and amounts of the trades on the same symbols
in the two markets with the following semantics.
For each trade x1 from the first market (on symbol s at time t), take the 
last x2 from the second market
such that both x1 and x2 pertain to the same symbol s and either
x2 and x1 occur within one second of one another or, if there is no trade
on s within one second of x1, x2 is the last trade on s in the
second market before time t.
All trades take place between 10 AM and 10:01 AM.
Use the wj primitive.
	q windowjoin.q

Exercise 4: Compare the results of the asof join with the
window join and try to find differences and explain them.
(Hint: look at the joins pertaining to hp.)

Exercise 5: Suppose you want to go through your holdings at some time
and find the information about each stock at a time that is as close
as possible to a given time.
e.g. if you put in 10:00:30.000, you'd want
stock| time2                                   
-----| ----------------------------------------
aapl | 167.8413 58100 10:00:27.493 10:00:30.000
hp   | 387.5511 22200 10:00:27.532 10:00:30.000
amaz | 244.9222 99800 10:00:47.787 10:00:30.000
goog | 380.9905 93500 10:00:15.882 10:00:30.000
	q nearestjoin.q
 


Now if you are interested in a bit of a challenge...
The following came up on the q list:



I have a list of trades sorted by time

For each there is time, side (buy sell), price, size, open or close trade

How can i group the opening trade with the closing trade?

here is an example:

OPEN MSFT BUY 100 59.99
CLOSE MSFT SELL -100 60.00
OPEN MSFT BUY 200 61.00
CLOSE MSFT SELL -150 61.00
CLOSE MSFT SELL -50 61.00

i basically want the open to be grouped with the subsequent closes up to 
its size. In the second example above the open was filled by two separate orders.
There could also be the same scenario where there are multiple opens to one
close but they still follow sequentially.

=========

myopenclose: ("SSFFT"; enlist ",") 0: `openclose.csv

openclose: ("SSFFT"; enlist ",") 0: `openclose.csv;

show openclose

Note that these are ordered in time, but not by stock.
So, first order them by stock.

You need not restrict yourself
to grouping on stock.
You can group on more stuff.
In particular, you might like to group on stocks and trades
within that stock whose amount sums to 0. 
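
One possible way to find such groups, sketched on the five MSFT sizes from
the example above (a single stock, already in time order): a group is
closed whenever the running sum of sizes returns to 0. The names size,
closed and grp are made up for this sketch.

```q
size:100 -100 200 -150 -50
closed:0=sums size          / 01001b: running position is flat after rows 1 and 4
grp:sums prev closed        / 0 0 1 1 1: group number for each row
show size group grp         / sizes collected per group
```

With several stocks, the same computation would be done within each stock.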

Let's try to rewrite the open-close query to get all the purchases and
sales together.
	q opencloseadvanced.q

Exercise 6 (hard)
Call a "bundle" in a stock a consecutive series of trades where the price
either stays the same or increases with each trade.
(Example: if price is 1 2 3 4 2 3 4 5 4 3 2 1 1 1 2 3
then the bundles will be 1 2 3 4, 2 3 4 5, 4 alone, 3 alone, 2 alone, and then 1 1 1 2 3)
Find the weighted average price per bundle per stock.
	q bundle.q
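
As a hint only, here is one possible way to label the bundles on the
example price list above (a single stock; the name bundle is made up). A
new bundle starts wherever the price drops below the previous price.

```q
price:1 2 3 4 2 3 4 5 4 3 2 1 1 1 2 3
bundle:sums price<prev price    / bundle number for each trade
show price group bundle         / six bundles, matching the example
```

The weighted average per bundle per stock is left as the exercise.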

Here is another related challenge 
for more practice with this concept.
We may not be interested in the data about individual stocks,
but rather about categories of stocks.
In our generated data, the amounts are generated as follows:
amount: 100*(1+n?1000)
Thus, we may be interested in statistics about those in the 
low range (0-9999 inclusive), middle (10,000-49,999) and high (50,000 and above).

Let's start by observing the following.

x: til 20
0 5 13 bin x

Notice that the first group (the zeroes) are between 0 and 4 inclusive,
the second group between 5 and 12 inclusive.
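
To check the group sizes without counting by eye, one can group on the
result of bin. This is a small sketch on the same x; the name b is made up.

```q
x:til 20
b:0 5 13 bin x
show count each group b     / bucket 0 has 5 items, bucket 1 has 8, bucket 2 has 7
```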

This doesn't require x to be sorted; only the left argument
(the list of bucket boundaries) has to be sorted.
For example,
x: 20 ? 16
0 5 13 bin x

What is interesting is that we can group based on this idea.
Hint: Think about 0 10000 50000 bin amount



Exercise 7. (Middle) We want to find the average price of
stocks in which the amounts fall between 1 and 9999 (low), 
10000 to 49999 (middle), and more (high).
	q rangegroup.q



So far, we have dealt with small tables.
If we have big tables, we may want to put each column in 
a different file. 
https://code.kx.com/trac/wiki/Cookbook/SplayedTables
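
As a minimal sketch of the mechanics on a three-row table: the directory
tmpsplay and the table t are made-up names. .Q.en enumerates the symbol
column against a sym file in the directory, and set with a trailing slash
on the file name writes one file per column.

```q
/ toy table (hypothetical names: t, tmpsplay)
t:([] stock:`ibm`hp`ibm; price:10 20 30f)

`:tmpsplay/t/ set .Q.en[`:tmpsplay;t]   / enumerate symbols, then splay to disk
t2:get `:tmpsplay/t/                    / read the splayed table back
show select avg price by stock from t2
```

Note that this writes files under the current directory.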

Exercise 8.  (Middle) Generate a random table of 100,000 rows
like the one in problem 4 and save it as a splayed table below
a directory called splaydir.
	q genandsplay.q / to store
	q ./splaydir / to retrieve


To retrieve the table, type 
q ./splaydir
and try the following queries:
\t select avg amount by stock from rantrade
\t select avg amount, max time, avg price by stock from rantrade

An alternative to starting on the directory is to use the "get" function
http://code.kx.com/wiki/Reference/get
t: get `:splaydir/rantrade/


Here is how to insert into the table (even though this uses
the keyword upsert which is usually used for updates to keyed tables;
there are no keyed tables that are splayed).

Start q from inside smarties4
Within the interpreter:

\cd splaydir
`:rantrade/ upsert (`:sym?`amaz; 102.5; 1000; 10:25:00.000)

Should get back:
`:rantrade/


You need the reference to sym because symbol columns in a splayed table
are stored enumerated against the sym file;
symbols by themselves don't go into splayed tables.

