
Large Data, Performance, Machine Learning, and Specialized
Language Design



Tickerplants

Design a server that receives data as pairs from many clients
and simply logs each datum as it arrives in an append-only logfile called tmpout.
The server also holds the data in memory so it can be searched.
	rm tmpout
	q server.q -p 1234
	q client.q

Let's look at the server:

=========

/ q server.q -p 1234

h: hopen `:tmpout / open the logfile handle for appending
global: () / in-memory copy of all appended data


/ append data to file; this is the function that will be called from the client
appendto:{[data]
  global,: enlist data;
  h enlist data;
  "done"}
  
"Remember to type q server.q -p 1234"

==========

Now let's look at client.q

========

h: hopen`::1234; / connect to the server on localhost port 1234

h"appendto[\"dennis; new york\"]";
h"appendto[\"arthur; palo alto\"]";
h"appendto[\"mike; new jersey\"]";
h"appendto[\"simon; switzerland\"]";


========

Note that we can also execute appendto calls from Studio,
which is simpler; e.g., in the Studio window type:
appendto["bob; desmoines"]

Much easier for testing purposes.

Exercise 1. (Middle)
Modify the server so that it routes data whose first element
is arthur to a secondary server that logs all messages addressed to arthur.
All other data should be logged by the primary server.
The logfiles are called tmpsecondary and tmpprimary.
Here is the setup. Note that the secondary server must be started before
the primary.
	rm tmpsecondary
	rm tmpprimary
	q secondaryserver.q -p 1236
	q primaryserver.q -p 1234
	q client.q
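A rough sketch of the routing logic the primary server needs, assuming the secondary listens on port 1236 (as in the setup above) and messages look like "arthur; palo alto":

```q
/ hypothetical sketch of primaryserver.q; details are up to you
hsec: hopen `::1236;            / handle to the secondary server
h: hopen `:tmpprimary;          / the primary's own logfile

appendto:{[data]
  name: first "; " vs data;               / first element of the pair
  $["arthur" ~ name;
    hsec ("appendto"; data);              / forward arthur's messages
    h enlist data];                       / log everything else locally
  "done"}
```

The secondary server can reuse the original appendto unchanged, writing to tmpsecondary instead of tmpout.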


Performance Notes:

From Jeff Borror's Q for Mortals:
using commas between where subclauses in q-sql is usually
faster than using & because the commas perform sequential
reduction from left to right, whereas subclauses joined with &
are each evaluated against the whole table.
When cascading subclauses, put the most restrictive one first.
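For example, assuming the rantrade table used later in these notes has price and amount columns, you can time the two forms with \t:

```q
/ comma: the amount test only runs on rows that survive the price test
\t select from rantrade where price > 200, amount < 400

/ &: both comparisons are evaluated over the entire table
\t select from rantrade where (price > 200) & amount < 400
```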

Use grouping (the `g# attribute) on non-unique columns in order to get a hash table
for those columns.
Grouping is like a non-clustered index.
Use parting (the `p# attribute) to separate on the key field.
Parting is like a clustered index.
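As a small sketch of applying these attributes (t here is an invented table):

```q
t: ([] stock:`ibm`aapl`ibm`msft`aapl; price:100 200 101 50 201);

/ grouped: builds a hash table on stock (non-clustered index analogue)
update `g#stock from `t;

/ parted: equal values must be contiguous, so sort before applying it
t: update `p#stock from `stock xasc t;
```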

Exercise 2: Try the following and see what you can conclude about the performance.
	q perfexamp.q

Note that parting undoes the effect of grouping.

Also note that foreign key joins are faster than inner joins.
	q perfjoin.q


Large data:
http://code.kx.com/wiki/Startingkdbplus/hdb
http://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables

Partitioning is needed when even individual columns are too big to fit into
memory.
Each partition has the full table schema but only some
of the records.
q genandpartition.q

Then
q
\l partdb
Now you can access rantrade more or less normally,
e.g. using select.
Put the date constraint first in the where clause; it prevents
searches across partitions that cannot match.

Associative aggregates (avg, sum, count, etc.) work well on partitioned data.
There are many restrictions on how partitioned tables can be updated.

select from rantrade where date = ...

Try 
q loadpartition.q

If there are many accesses to a partitioned table, we might want
to spread out the I/O across disks. For that, you use segmentation.
http://code.kx.com/wiki/JB:KdbplusForMortals/segments#a1.4.0Overview

Parts explosion: case study

A frequent problem in inventory applications is to find out
which subparts a given part depends on. 

Consider the following parent child relationship:


part,subpart,qty
car1, motor1,1
car1, motor2,1
car2, motor3,1
car2, exterior2,1
motor1, piston12,1
motor2, piston8,1
motor3, piston4,1
piston12, light1,4
car1, light1,5
car2, light2,5

As you can see, this is not a pure hierarchy because light1
is both directly a part of car1 and a part of piston12, which is part
of motor1, which is part of car1.
We want to know how many times each subpart appears in a given part.
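The data above can be loaded as a q table to experiment with (the table and column names here are my own):

```q
bom: ([] part:   `car1`car1`car2`car2`motor1`motor2`motor3`piston12`car1`car2;
        subpart: `motor1`motor2`motor3`exterior2`piston12`piston8`piston4`light1`light1`light2;
        qty:     1 1 1 1 1 1 1 4 5 5)

/ direct (one-level) subparts of car1
select from bom where part=`car1
```

The exercise then amounts to following part -> subpart edges transitively while multiplying the qty values along each path.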

Exercise 3 (Optional hard).
Write a function finddesc that finds the constituent parts and 
how many are needed of a given part.
Can you scale this up and make it really fast?
   q partexp.q

Exercise 3a(Optional Continuation).
Write a function findanc that finds all ancestors
of a given part.
   q partexpanc.q


Language extensions:

Put an SQL implementation into your q directory
by copying s.k into that directory.
You can get it from www.kx.com/q/s.k.
Then generate random data and try the query

s) select stock, avg(price) from rantrade where price > 200 group by stock

This SQL layer works fine for single-table queries, but
you should use the special-purpose q joins for multi-table queries.
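To illustrate what such a special-purpose join looks like, here is a left join (lj) on two small invented tables:

```q
trades: ([] stock:`ibm`aapl`ibm; price:100 200 101);
places: ([stock:`ibm`aapl] place:`ny`ca);    / keyed on stock

trades lj places    / joins place onto each trade by the key column stock
```

lj looks up each trade's stock in the keyed table and appends the matching place column, which is both terser and typically faster than a generic SQL join.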

stock:([stock: stocks] place:`ny`ca`wa`ca`ca);


Exercise 4 (easy). (i) Write the equivalent kdb code for the above and time
the two of them. (ii) Do the same query as above, but join in stock.place.
	q sql.q

Machine Learning

Much large data analysis comes down to machine learning.
One wants to learn the properties of the data so one can
act on it.

A fundamental problem in machine learning is how to deal with 
a large amount of unlabeled data. 
The basic idea is to cluster it, so that one can reduce the problem
to understanding what to do with the centroids of the clusters.
A fundamental such algorithm is k-means.

K-means has the following pseudo-code:

Given a number k and a set of data points S,
choose k points in S as the initial centroids C
until no change to C
 for every point p in S,
  assign p to the centroid c in C that is closest to p
  (we say that p is in the cluster of c)
 for each centroid c
  c' := find the center (arithmetic average) of all points in the cluster of c
  replace c by c'
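On one-dimensional data the update step can be sketched compactly in q (a rough sketch, not a full solution to the exercise below; note that empty clusters silently drop out, which is part of the trouble discussed later):

```q
/ one k-means update on 1-D data: X is the data, C the current centroids
step:{[X;C]
  d: abs X -\: C;              / distance from every point to every centroid
  a: {x?min x} each d;         / index of the nearest centroid for each point
  value avg each X group a}    / new centroid = mean of its cluster

/ iterate until the centroids stop changing
kmeans:{[X;C] (step[X]/) C}
```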


Exercise 5 (Hard). Write the k-means algorithm and test it on the
following one dimensional data.
X: (til 10), (40+ til 20), (90 + til 10);
Start with four centroids at points 0, 1, 2, 3.
See what you get.
	q kmeans.q

Is there any problem with the above solution?
Let's draw the result.
K-means reaches a local minimum but not necessarily a global one.
Look at what happens if we place the initial centroids elsewhere.

Exercise 6 (Middle). If you weren't satisfied with the results
of the previous program, then try an approach that randomizes the
choice of initial centroids several times and then chooses the best result.
Best here means that the total distance from points to their assigned
centroids should be as low as possible.
	q kmeanspermute.q

Bayesian Processing

A Bayes net is a directed acyclic graph in which nodes represent
events and edges represent direct probabilistic dependencies.
Given a Bayes network along with its probability tables, we want to construct
an appropriately weighted set of assignments to all variables.
Given that, we can then ask questions like prob(condition x | condition y),
e.g. given a conclusion, what is the probability of the evidence?
What we would do is find all examples in the generated sample that have the
conclusion and compute how often the evidence holds among them.
The question is how to generate a sample.

It's not so hard.
Start with the roots of the Bayesian network and
assign values based on their probabilities. Then for any non-root, use the
probabilities conditioned on the values already assigned to its parents.
So we will find a topological sort of the network and then generate
data in that order.
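A hypothetical sketch of the sampling step in q, using the probabilities for M and R from the relationships table below:

```q
n: 10000;
M: 0.6 > n?1.0;                  / root M: true with probability 0.6
/ non-root R with parent M; probs 0.6 0.3 (order: M false, then M true)
R: ((0.6 0.3) M) > n?1.0;        / index the probability row by M's samples
avg R where M                    / frequency of R among M-true samples, near 0.3
```

Indexing the probability vector by the parents' sampled Boolean values picks the right conditional probability for every sample at once; with several parents the index is computed from all 2^k parent combinations.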

Exercise 7 (Hard). You are given a
Bayes net in the format nodeid, parentids, probabilities in the presence
or absence of each parent (so for k parents there will be 2^k probabilities).
Note that the order of the probabilities
is all false, then all false except the last one true,
all the way up to all true.
For example, in the line:
 (`L; `M`S; (0.2 0.1 0.1 0.05));
L depends on M and S. If M and S are both false, then the probability
of L is 0.2. If M is false but S is true, then 0.1. If M is true but S is false,
then 0.1. If M and S are both true, then 0.05.

relationships: ((`S; (); (0.3));
 (`L; `M`S; (0.2 0.1 0.1 0.05));
 (`T; enlist `L; (0.8 0.3));
 (`M; (); (0.6));
 (`R; enlist `M; (0.6 0.3)))

	q bayessamp.q
Find the probability of M given that T is true. 
Also find the bootstrap probability.
Find the probability of M given that T is true and R is false.


Parsing

This is a topic for experienced q programmers. 

Aquery is syntactic sugar over q that looks a bit more
like SQL and declares the sort order explicitly rather than relying
on whatever order the data happens to be in.
We want to write a parser that will allow us to create an input
file with a line like this:
a) select price as myprice, time as mytime  from rantrade assuming time where price > 200 and amount < 400 group by stock;
and have it translated to the following:

select myprice: price, mytime:  time by stock from (`time xasc rantrade) where price > 200, amount < 400 ;


Here is the basic algorithm:
select [func] c1, ... cn from tab [assuming c1, c2...]
where where_clause group by group_clause
--> select
--> for each ci: if it is just a column, print the column, but if it is
"ci as x", then translate it to x: ci
(we assume the same functions are available on both sides)
--> if group_clause is not empty, then emit "by ", group_clause
--> "from tab" doesn't change, but if there is an assuming clause, wrap the
table in xasc or xdesc
--> where_clause: add an extra level of parens around each subclause,
where a subclause is delimited by an and/or, and replace and by a comma.
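The where-clause rewriting can lean on ssr; a naive sketch that ignores parenthesized subclauses and or:

```q
wc: "price > 200 and amount < 400";
"(" , ssr[wc; " and "; "),("] , ")"
/ -> "(price > 200),(amount < 400)"
```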



Exercise 8. (Middle) Write a translator from aquery on a single table to kdb.
It should allow aquery statements to be prefaced by "a)".
They will then be translated into kdb statements, e.g.
	q aquery.q hasaquery.q
will produce hasaquery_ready.q.
The aquery statements in question must each appear on one line
and may include assuming statements and group by, but no having.

The trouble with the above solution is that it redoes the sorting
every time.
For a very wide table (i.e., one with many columns), that results in
slow performance.

Exercise 9. (Optional, Middle) 
Enhance the translator to suggest what to do if a table
is very wide and there is an assuming clause.
	q aquery2.q hasaquery.q
will produce hasaquery_ready.q.

https://code.kx.com/trac/wiki/DotQ

Exercise 10. (Optional Hard) Enhance the 
translator further to allow multiple table
foreign key joins. To do this, the user must give us information
in the form of statements such as
ac) foreignkey[rantrade.stock; stock.mystock]
This means that the foreign key of rantrade.stock is stock.mystock.
Note that in the table design, if there is
foreignkey[R.X; S.Y], then Y must be the key of S,
X and S must have the same name, and the query to follow must have
a clause R.X = S.Y.
	q aquery3.q foreignkeyaquery.q
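For background, a kdb foreign key is an enumeration over a keyed table, and dot notation then reaches the referenced columns. A small invented setup:

```q
stock: ([stock:`ibm`aapl`msft] place:`ny`ca`wa);       / keyed table
trade: ([] stock:`stock$`ibm`aapl`ibm; price:100 200 101);  / fkey column

select price, stock.place from trade   / dot notation follows the foreign key
```

This is why foreign-key joins in the earlier performance note are fast: the enumeration stores indices into the keyed table, so no lookup is needed at query time.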


Projects: 
1. Use different indices, splaying, and partitioning on a benchmark
of queries of interest to you.
Generate simulated data and try to figure out a high performance solution.
2. Online NAV
3. Better clustering than k means.
4. Anything on aquery.q

======

As an option, let me introduce you to Studio for kdb+.
This was written by Charlie Skelton and gives you a window from which
you can call various servers that are local or not.

Setup:
1. Go to
https://code.kx.com/trac/wiki/studioforkdb%2B
You can use anonymous/anonymous as username and password.

2. Download the zip file to your q directory.

3. Unzip it.

Run it by:

1. cd q

2. Suppose you want to start a server, say at port 1234:
q -p 1234

3. java -jar studio.jar

4. Now, add a server
Name: mytest
Host: localhost (in general, an IP address)
Port: 1234

5. Enter a username and password, if you have one on your local machine.

6. Within the q interpreter, you can type, say,
foo:{[x;y] x - 2*y}

7. In the top window of Studio, type
foo[5;6]
highlight it, and press Ctrl-E (or Cmd-E on a Mac) and it will execute.

So, studio provides a client execution environment.

=========
