python slice and sql every Nth row

I would like to retrieve every Nth row of a SQL table, and I would like this accessed via a python slice function.

A python slice allows access to a list (or any object that implements a __getitem__ method) by a start, stop, and step -- for example,

>>> foo = range(100)
>>> foo[5]
5
>>> foo[5:10]
[5, 6, 7, 8, 9]
>>> foo[5:10:2]
[5, 7, 9]
>>> 

The parameters of a slice behave identically to the builtin range() function,

>>> range(5,10,2)
[5, 7, 9]
>>> 

Given a python class that accesses a database table, you may want to implement slice functionality directly in that class. This can be done by implementing a __getitem__ method; a very simple __getitem__ looks something like this:

def __getitem__(self, key):
    curs = self.db.cursor()
    return list(curs.execute('SELECT * FROM %s WHERE ROWID = %s' %(self.table, key+1)).fetchall()[0])

However, this fails to account for slices and will only work for single-item retrievals. When a single key is provided, python passes that key to __getitem__; when slice parameters are provided, python passes a slice object. In other words, inspect the key to figure out whether a slice was passed in, e.g.,

def __getitem__(self, key):
    if isinstance( key, slice ):
        ... slice access, e.g., foo[1:5], foo[:5], foo[0:10:2]
    else:
        ... single element access, e.g., foo[5]

A slice object has three attributes: slice.start, slice.stop, and slice.step, corresponding to object[start:stop:step].
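
For example, at the interactive prompt (slice objects also provide an indices() method, which fills in the defaults and clamps against a given length):

>>> key = slice(5, 10, 2)
>>> key.start, key.stop, key.step
(5, 10, 2)
>>> slice(None, None, 2).indices(100)
(0, 100, 2)
>>> foo[5:10:2] == foo[slice(5, 10, 2)]
True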

Going back to the SQL case, we'll need to dynamically construct a SQL statement that will access every Nth row between a start and stop.

The simplest way to do this is with the modulo operator (available in all major relational databases); for example, object[5:10:2] in sqlite becomes

SELECT * 
FROM table
WHERE ROWID > 5 AND ROWID <= 10
  AND ROWID % 2 = (5+1) % 2
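
As a quick sanity check of the arithmetic at the python prompt (recall that element i lives at ROWID i+1, so the slice [5:10:2] should map to ROWIDs 6, 8, and 10):

>>> [rowid for rowid in range(6, 11) if rowid % 2 == (5 + 1) % 2]
[6, 8, 10]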

This approach allows you to retrieve every Nth row between an arbitrary start and stop. To put this into practice in python, a __getitem__ may be implemented as follows:

def __getitem__(self, key):
    curs = self.db.cursor()
    if isinstance( key, slice ):
        start = 0 if key.start is None else key.start
        stop = len(self) if key.stop is None else key.stop
        qstep = ''
        if key.step is not None:
            qstep = ' AND ROWID %% %s = %s ' %(key.step, (start+1) % key.step)
        res = curs.execute('SELECT * FROM %s WHERE ROWID > %s AND ROWID <= %s %s'
            %(self.table, start, stop, qstep)).fetchall()
    else:
        res = list(curs.execute('SELECT * FROM %s WHERE ROWID = %s' %(self.table, key+1)).fetchall()[0])
    curs.close()
    return res
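
To see this end to end, here is a minimal self-contained sketch using the builtin sqlite3 module -- the class and table names are made up purely for illustration, and a __len__ is included because the slice branch needs it:

import sqlite3

class SqliteRows(object):
    '''illustration only: list-like, read-only access to a single table'''
    def __init__(self, db, table):
        self.db, self.table = db, table

    def __len__(self):
        return self.db.execute('SELECT COUNT(*) FROM %s' % self.table).fetchone()[0]

    def __getitem__(self, key):
        curs = self.db.cursor()
        if isinstance(key, slice):
            start = 0 if key.start is None else key.start
            stop = len(self) if key.stop is None else key.stop
            qstep = ''
            if key.step is not None:
                qstep = ' AND ROWID %% %s = %s ' % (key.step, (start + 1) % key.step)
            res = curs.execute('SELECT * FROM %s WHERE ROWID > %s AND ROWID <= %s %s'
                % (self.table, start, stop, qstep)).fetchall()
        else:
            res = list(curs.execute('SELECT * FROM %s WHERE ROWID = %s'
                % (self.table, key + 1)).fetchall()[0])
        curs.close()
        return res

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE foo (n INT)')
db.executemany('INSERT INTO foo (n) VALUES (?)', [(i,) for i in range(100)])
foo = SqliteRows(db, 'foo')
print(foo[5])        # [5]
print(foo[5:10:2])   # [(5,), (7,), (9,)]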

The above code is part of a sqlite python wrapper used for parsing large csv files, available at github.com/timwarnock/csv_reports.

Posted in data arch., mysql, oracle, python

sqlplus, utility scripts

I would like to use sqlplus in development projects, and I would like it to be easy to use.

First off, I want command-history and tab-completion. The easiest approach is to use rlwrap, which uses the GNU readline library to wrap command-line programs such as sqlplus.

If you install rlwrap, you can then set an alias in your .bashrc (or similar)

alias sqlplus='rlwrap sqlplus'

That's it; sqlplus will now have GNU readline-compatible command-line history and editing! If you wish to run sqlplus directly without rlwrap (for loading scripts, etc.), simply bypass the alias with a backslash, i.e.,

$ \sqlplus

I would also like intuitive, terse, and easy-to-remember commands to perform basic tasks like inspecting tables and schemas, viewing and editing packages of stored procedures, etc.

I started a github repository [https://github.com/timwarnock/sqlpath] of utility scripts to make this easier.

Set your $SQLPATH environment variable to the directory containing these scripts, which allow for the following operations.

foo@DB> @show databases

SCHEMA
----------------
FOO
SPAM

2 rows selected.

Change to the SPAM schema,

foo@DB> 
foo@DB> @use spam
spam@DB> 

List all tables with 'abc' anywhere in the table name,

spam@DB> @show tables abc

TABLE_NAME
----------------
ABC_USERS
ABC_GROUPS
FOO_ABC_MAP

3 rows selected.

List all packages with 'abc' anywhere in the package name,

spam@DB> @show packages abc

NAME
----------------
PROCS_ABC
ABC_DATE

2 rows selected.

View the source code of a package of stored procedures,

spam@DB> @show code abc_date

TEXT
---------------------------------------------------------------------------------------
PACKAGE             "ABC_DATE" AS
  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE;
END;
PACKAGE BODY  "ABC_DATE" AS

  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE
  IS
    ID NUMBER;
  BEGIN
    SELECT D.ID INTO ID FROM DATES_DIM D WHERE SQL_DATE = TRUNC(IN_DATE);
    RETURN ID;
  END;
END;

13 rows selected.

View the explain plan output (also consider 'set autotrace on')

spam@DB> explain plan for SELECT D.ID FROM DATES_DIM D WHERE SQL_DATE = '07-feb-12';

Explained.

spam@DB> @explain

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------------------------
Plan hash value: 1171817064

------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |     1 |    13 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| DATES_DIM        |     1 |    13 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | DATESDIMSQLDTUNQ |     1 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("SQL_DATE"='07-feb-12')

Edit a package in vim and automatically recompile

spam@DB> @editcode abc_date
... 
... opens ABC_DATE.sql in vim
... 

Package created.

Package body created.

spam@DB> 

Please see the github repository [https://github.com/timwarnock/sqlpath] for periodic updates.

Posted in oracle, shell tips

screen and screenrc

I would like to use GNU screen as a window manager.

By default screen seems like little more than a persistent shell (that you can resume even after logging out). By itself, this is incredibly useful if you wish to access the exact same terminal session from different locations.

For example, on the command-line you can run screen or start a named session, i.e.,

$ screen -S sessionname

This starts a new shell, which you can detach from by pressing Ctrl-a and then d, i.e.,

^A d

Ctrl-a is the default command-key, you press Ctrl-a followed by a screen command. This is abbreviated as ^A in the screen manual.

By default, you'll also detach from screen by simply closing your terminal.

You can list existing screen sessions, and then re-attach to a screen session as follows,

$ screen -ls
There are screens on:
        23762.this      (Detached)
        23814.that      (Detached)
2 Sockets in /var/run/screen/S-timwarnock.

$ screen -r that

In this case I've re-attached to a screen session named "that".

If a screen session is already attached, and you want to take over the session, you can do the following,

$ screen -d -r sessionname

If desired, you can use the -D option which will detach and logout the other terminal. You can also use -R to re-attach to a session or create one if it doesn't exist.

$ screen -d -r
 ... detach the current attached terminal and then re-attach
$ screen -D -r
 ... and logout the other terminal
$ screen -D -R
 ... and if no session, create one
$ screen -D -RR
 ... and if more than one session, use the first one

I prefer the following command,

$ screen -dRR

This will connect to the first session, politely detach as needed, or create a new one if no screen session exists.

Once you're attached to screen, you can run multiple windows, split the screen multiple ways, and use various other tricks. At the very least, here are the most common commands for everyday use:

^A ? - help
^A c - create new window, or screen -t tabname
^A A - rename screen window
^A k - kill a screen window, or ^D (or exit) to properly exit
^A " - window list, change windows
^A n - next window, or ^A space
^A p - previous window
^A ^A - toggle to the previously displayed window
^A [0-9] - go to window [0-9]
^A esc - copy mode (for scrollback), ] or esc to abort

If you want to get more advanced, and see what screen can really do, you'll want to save a .screenrc in your home directory. Otherwise managing multiple windows and split-screens in a single screen session can be quite confusing.

Here is an example screenrc,

# ^A :source ~/.screenrc
bind , source "$HOME/.screenrc"

# no default message, no audible bell
startup_message off
vbell on

# more usable scrollback
scrollback 5000

# detach on hangup
autodetach on

# make bold work
attrcolor b ".I"
attrcolor i "+b"

# hardstatus and caption
hardstatus alwayslastline "%{= wk}%-w%{= Bw} %n %t %{-}%+w %-= %H [ %Y-%m-%d %c:%s ]"
caption splitonly "%{= wK}%-w%?%F%{= bw}%:%{= Wk}%? %n %t %{-}%+w %-= "

# resize
bind + resize +5
bind - resize -5
bind = resize =

# bindings
bind % screen -t top top

(Screenshot: the above screenrc with a resized split-screen.)

The important bits of the screenrc are the hardstatus and caption. The hardstatus sets the bottom line of the terminal; in this case, a window list highlighting the active window, the date and time, and the hostname of the machine running the screen session.

The hostname is also on my command-prompt, but it's useful for me to know the hostname running the screen session separate from wherever the current shell happens to be logged in.

The caption in this case is set to display only when the screen is split. You can split a screen multiple times, and the caption will appear underneath its respective window.

The commands to split a window are as follows,

^A S - split screen
^A Q - return to non-split view
^A Tab - move to the next window in split-display
^A :resize [n|+n|-n]

In the above screenrc, I've bound custom commands for resizing; this way you simply hit Ctrl-a followed by + or - to resize the active window (or = to return to the default size).

To copy+paste within screen, you enter "copy mode" and can view the scrollback buffer. Once in copy mode you hit enter to start selecting, and enter again to end your selection.

^A esc - copy mode, or ^A ]
^A ] - to paste

If you're using screen to manage several windows and want to be alerted when something changes on a given window, you can use monitoring. This simply alerts you of activity on a given window, or of silence when you were expecting activity.

^A M - monitor for activity
^A _ - monitor for silence (default 30 seconds)

Lastly, you can log a screen window or do a simple grab of the contents currently on display, i.e.,

^A h - screen-grab
^A H - logging
Posted in bash, shell tips

nvl, ifnull, nullif, isnull, coalesce

I would like to programmatically handle NULL values in a database.

If I add or multiply a known value to a NULL, the result is still NULL. This is an appropriate result, as a NULL value in a database schema means "unknown" or "nonexistent", so arithmetic between known values and NULL should produce NULL.

That said, if I want a sum of line-items I could filter out NULL values, e.g.,

SQL> SELECT SUM(col) FROM foo WHERE col IS NOT NULL;

However, this could get difficult to manage in larger queries.

In Oracle, we can use NVL to get the above result

SELECT SUM(NVL(col,0)) FROM foo;

In MySQL and SQLite, we can use IFNULL

SELECT SUM(IFNULL(col,0)) FROM foo;

In Postgres there is no NVL or IFNULL; note that NULLIF is not a substitute (NULLIF(a,b) does the reverse, returning NULL when a equals b), so in Postgres use COALESCE, described below.

And in Microsoft SQL Server we can use ISNULL

SELECT SUM(ISNULL(col,0)) FROM foo;

While I prefer NVL for its conciseness, I recommend COALESCE, as it works in each of the above database engines. COALESCE accepts two or more parameters and simply returns the first non-NULL parameter, e.g.,

SELECT SUM(COALESCE(col,0)) FROM foo;

Accepting multiple parameters can be very useful in cases where we're aggregating otherwise mutually exclusive columns, e.g.,

SELECT SUM(COALESCE(eggs.weight, spam.weight, 0)) AS lonely_weight
FROM eggs FULL OUTER JOIN spam ON eggs.id = spam.id
WHERE eggs.id IS NULL OR spam.id IS NULL;

In the above case, it's the sum weight of all eggs without spam, and spam without eggs.
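
As a quick check, sqlite (one of the engines above, and it supports both IFNULL and COALESCE) can be driven from python's builtin sqlite3 module:

import sqlite3

conn = sqlite3.connect(':memory:')
curs = conn.cursor()
curs.execute('CREATE TABLE foo (col NUMERIC)')
curs.executemany('INSERT INTO foo (col) VALUES (?)', [(10,), (None,), (5,), (None,)])

# arithmetic against NULL is NULL, unless COALESCE substitutes a known value
print(curs.execute('SELECT NULL + 1, COALESCE(NULL, 0) + 1').fetchone())    # (None, 1)

# SUM(COALESCE(col, 0)) treats the NULL rows as 0
print(curs.execute('SELECT SUM(COALESCE(col, 0)) FROM foo').fetchone()[0])  # 15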

Posted in data arch., mysql, oracle

oracle, limit results and pagination

I would like to limit query results in oracle to support pagination.

In postgres, mysql, and sqlite you can use the LIMIT and OFFSET modifiers, e.g.,

mysql> SELECT login FROM users ORDER BY login LIMIT 2 OFFSET 70001;
+-------------+
| login       |
+-------------+
| acrobat11   | 
| acrobat12   | 
+-------------+
2 rows in set (0.03 sec)

mysql>

In mysql and sqlite you can use a two-parameter version of LIMIT, and the format is

LIMIT offset, count

Which is identical to

LIMIT count OFFSET offset

In oracle you can leverage the magic ROWNUM pseudo-column, although the value of ROWNUM is assigned before an ORDER BY is applied (so the ORDER BY must live in a nested subquery). There is a lengthy and very informative Ask Tom article that covers ROWNUM as well as optimization information.

For reference, to use ROWNUM as a LIMIT expression the general form is,

SELECT * FROM (...) WHERE ROWNUM <= 10

And a pagination query using ROWNUM as both LIMIT and OFFSET is,

SELECT * FROM
  ( SELECT /*+ FIRST_ROWS(10) */  a.*, ROWNUM rnum
    FROM ( ... ) a
    WHERE ROWNUM < 20 )
WHERE rnum >= 10;

The oracle equivalent of the above mysql would be as follows,

foo_owner@FOO> SELECT login FROM
 2  ( SELECT /*+ FIRST_ROWS(2) */  a.login, ROWNUM rnum
 3    FROM ( SELECT login FROM users ORDER BY login ) a
 4    WHERE ROWNUM <= 70002 )
 5  WHERE rnum >= 70001;

LOGIN
----------------------------------------------------------------
acrobat11
acrobat12

2 rows selected.

foo_owner@FOO> 
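
To put the same pattern into python, a small helper can wrap any ORDER BY query in the nested ROWNUM form above -- a sketch only, with a made-up function name; in real code you would use bind variables rather than interpolating values into the SQL:

def paginate(sql, offset, limit):
    '''Wrap an ORDER BY query in the nested ROWNUM form shown above;
    offset/limit follow the LIMIT/OFFSET convention: skip `offset` rows,
    then return the next `limit` rows.'''
    template = ('SELECT * FROM '
                '( SELECT /*+ FIRST_ROWS(%(limit)s) */ a.*, ROWNUM rnum '
                '  FROM ( %(sql)s ) a '
                '  WHERE ROWNUM <= %(last)s ) '
                'WHERE rnum > %(offset)s')
    return template % {'limit': limit, 'sql': sql,
                       'last': offset + limit, 'offset': offset}

For example, paginate('SELECT login FROM users ORDER BY login', 20, 10) builds a statement that returns rows 21 through 30 of the ordered result.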

Posted in data arch., mysql, oracle

sqlplus, edit stored procedures

Previously, we discussed viewing stored procedures in sqlplus.

An Ask Tom article brought to my attention a very handy way to edit a package in sqlplus.

Set your $SQLPATH environment variable to wherever you wish to store useful scripts, including a login.sql -- make sure to set an editor in login.sql, e.g.,

define _editor=vim

In the Ask Tom example, simply create a getcode.sql and save it somewhere in your $SQLPATH -- for reference,

set feedback off
set heading off
set termout off
set linesize 1000
set trimspool on
set verify off
spool &1..sql
prompt set define off
select decode( type||'-'||to_char(line,'fm99999'),
               'PACKAGE BODY-1', '/'||chr(10),
                null) ||
       decode(line,1,'create or replace ', '' ) ||
       text text
  from user_source
 where name = upper('&&1')
 order by type, line;
prompt /
prompt set define on
spool off
set feedback on
set heading on
set termout on
set linesize 100

Then, in sqlplus, you can edit an existing procedure/package/function and update it in the database, e.g.,

foo_owner@FOO> @getcode PROCS_DATE
foo_owner@FOO> edit PROCS_DATE

foo_owner@FOO> @PROCS_DATE

Package created.

Package body created.

foo_owner@FOO> 
Posted in oracle

sqlplus, view stored procedures

I would like to view stored procedures and packages in sqlplus.

All stored procedures, functions, and packages can be found in the user_source, dba_source, or all_source views (depending on your access). Let's examine user_source.

foo_owner@FOO> desc user_source
 Name					   Null?    Type
 ----------------------------------------- -------- ----------------------------
 NAME						    VARCHAR2(30)
 TYPE						    VARCHAR2(12)
 LINE						    NUMBER
 TEXT						    VARCHAR2(4000)

foo_owner@FOO> 

Let's examine user_source.type to see what kind of objects are available to view,

foo_owner@FOO> SELECT DISTINCT type FROM user_source;

TYPE
------------
PROCEDURE
PACKAGE
PACKAGE BODY
FUNCTION
TYPE

5 rows selected.

foo_owner@FOO>

The user_source view (like dba_source and all_source) contains line-numbered source text for functions, procedures, and packages.

If you want a list of packages, you could simply run the following:

foo_owner@FOO> SELECT DISTINCT name FROM user_source WHERE type = 'PACKAGE';

NAME
------------------------------
PROCS_GIFT_SUBS
PROCS_TAX_EXCEPTION_REPORT_V1
PROCS_GIFT_SUBS_TAX
PROCS_FLASH_REPORT
PROCS_TAX_V1
PROCS_FISCAL
PROCS_EXCEPTION_REPORT
PROCS_TAX
...

If you'd like to view the source for a package (the package and package body), you can query the user_source.text as follows,

foo_owner@FOO> SELECT text FROM user_source WHERE name = 'PROCS_DATE' ORDER BY type,line;

TEXT
----------------------------------------------------------------------------------------------------
PACKAGE 	    "PROCS_DATE" AS
  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE;
  FUNCTION DATE_FROM_ID(IN_ID NUMBER) RETURN DATE RESULT_CACHE;
END;

PACKAGE BODY		 "PROCS_DATE" AS

  FUNCTION DATE_ID(IN_DATE DATE) RETURN NUMBER RESULT_CACHE
  IS
    ID NUMBER;
  BEGIN
    SELECT D.ID INTO ID FROM DATES_DIM D WHERE SQL_DATE = TRUNC(IN_DATE);
    RETURN ID;
  END;

  FUNCTION DATE_FROM_ID(IN_ID NUMBER) RETURN DATE RESULT_CACHE
  IS
    the_date DATE;
  BEGIN
    SELECT sql_date INTO the_date FROM dates_dim WHERE id = in_id;
    return the_date;
  END;

END;

foo_owner@FOO> 
Posted in oracle

sqlplus explain plan

I would like to view explain plan output in sqlplus.

One easy approach is to set autotrace. You can set autotrace on to show query statistics as well as an execution plan, or you can view only the execution plan, for example,

foo_owner@FOO> set autotrace on explain
foo_owner@FOO> SELECT COUNT(*) FROM tax_facts WHERE create_date <= TO_DATE('10-JAN-12', 'DD-MON-YY');

  COUNT(*)
----------
   3768447


Execution Plan
----------------------------------------------------------
Plan hash value: 3136400752

--------------------------------------------------------------------------------------
| Id  | Operation	      | Name	     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	     |	   1 |	   8 |	1988   (5)| 00:00:36 |
|   1 |  SORT AGGREGATE       | 	     |	   1 |	   8 |		  |	     |
|*  2 |   INDEX FAST FULL SCAN| TAXFCT_CRTDT |	3774K|	  28M|	1988   (5)| 00:00:36 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CREATE_DATE"<=TO_DATE('10-JAN-12','DD-MON-YY'))

foo_owner@FOO>

Alternatively, you can use EXPLAIN PLAN FOR in front of your sql statement, but viewing the plan output is less intuitive, e.g.,

foo_owner@FOO> EXPLAIN PLAN FOR  SELECT COUNT(*) FROM tax_facts WHERE create_date <= TO_DATE('10-JAN-12', 'DD-MON-YY');

Explained.

foo_owner@FOO> SELECT PLAN_TABLE_OUTPUT FROM TABLE(DBMS_XPLAN.DISPLAY());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3136400752

--------------------------------------------------------------------------------------
| Id  | Operation	      | Name	     | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      | 	     |	   1 |	   8 |	1988   (5)| 00:00:36 |
|   1 |  SORT AGGREGATE       | 	     |	   1 |	   8 |		  |	     |
|*  2 |   INDEX FAST FULL SCAN| TAXFCT_CRTDT |	3774K|	  28M|	1988   (5)| 00:00:36 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("CREATE_DATE"<=TO_DATE('10-JAN-12','DD-MON-YY'))

14 rows selected.

foo_owner@FOO> 
Posted in oracle

sqlplus, exploring schemas and data

I would like to get information about different schemas and tables in sqlplus. I would like this to be as easy as mysql; it's less intuitive, but almost as easy.

To get a list of schemas (similar to mysql "show databases"), you can run the following,

foo_owner@FOO> SELECT username FROM all_users ORDER BY username;

USERNAME
------------------------------
BOB_APP
BOB
FOO_OWNER
SYS
SYSTEM

5 rows selected.

foo_owner@FOO> 

To change to a different schema (similar to mysql "use database"), you can run the following,

foo_owner@FOO> ALTER SESSION SET CURRENT_SCHEMA = foo_owner;

Session altered.

foo_owner@FOO> 

To view the tables (similar to mysql "show tables"), you can run the following,

foo_owner@FOO> SELECT table_name FROM tabs;

TABLE_NAME
------------------------------
FOO_USERS
FOO_GROUPS
FOO_USER_GROUP
...

To view the columns of the table, it's the same as mysql, i.e.,

rpt_owner@FOO> DESC foo_user_group

 Name								   Null?    Type
 ----------------------------------------------------------------- -------- --------------------------------------------
 USER_ID							   NOT NULL NUMBER(38)
 GROUP_ID							   NOT NULL NUMBER(38)
 DATE_ID							   NOT NULL NUMBER(38)

And all the normal SQL you feel like running, e.g.,

foo_owner@FOO> SELECT COUNT(*) FROM foo_users;

  COUNT(*)
----------
    746202

foo_owner@FOO> 
Posted in oracle

sqlplus pagesize and linesize

I would like sqlplus output to be more readable.

Fortunately, you can adjust the pagesize and linesize variables on the fly, e.g.,

foo_owner@FOO> set pagesize 50000
foo_owner@FOO> set linesize 120

You can set pagesize to 0, which is very useful for scripted output, as it will not print any column headers and will only print the results. However, if you want to see column headers but not repeat them every nth line, set pagesize to something reasonably high.

The linesize defaults to 80, which is fine on a 1980s dumb terminal, but I prefer to set this to a more reasonable width to fit my terminal screen.

You can add both of these lines to your login.sql file (sqlplus will look for login.sql in the current directory or in the $SQLPATH environment variable)

-- format output
set pagesize 50000
set linesize 120
Posted in oracle, shell tips

sqlplus command prompt (sqlprompt)

I would like to change the sqlplus command-prompt to something more useful than

SQL>

You can modify the sqlprompt variable as follows,

SQL>
SQL> set sqlprompt "_USER'@'_CONNECT_IDENTIFIER> "
FOO_OWNER@FOO> 

You can also use variables such as _PRIVILEGE and _DATE, although all I really want is the user and schema. Unfortunately, with the above technique, if you ever connect to a database using a full connection string (rather than a tnsnames.ora entry), this prompt will be a bit unwieldy, e.g.,

$ sqlplus foo_owner/foo_pass@'(DESCRIPTION=(ADDRESS_LIST=(load_balance=on)(failover=on)(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)(PORT=1528)))(CONNECT_DATA=(SERVER=DEDICATED)(service_name=FOO)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))'

SQL*Plus: Release 10.2.0.4.0 - Production on Wed Jan 11 12:47:19 2012

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning option

SQL> 
SQL> set sqlprompt "_USER'@'_CONNECT_IDENTIFIER> "        
FOO_OWNER@(DESCRIPTION=(ADDRESS_LIST=(load_balance=on)(failover=on)(ADDRESS=(PROTOCOL=TCP)(HOST=10.1.1.11)(PORT=1528)))(CONNECT_DATA=(SERVER=DEDICATED)(service_name=FOO)(FAILOVER_MODE=(TYPE=SELECT)(METHOD=BASIC)(RETRIES=20)(DELAY=15))))> 

I love that I can connect without a tnsnames.ora entry, but I would like a simplistic prompt and I want it loaded every time I start sqlplus.

Fortunately, sqlplus will look in the current directory and in the $SQLPATH environment variable for a login.sql, and execute it automatically. For example, if you include the following in your login.sql

SET TERMOUT OFF 
DEFINE sqlprompt=none
COLUMN sqlprompt NEW_VALUE sqlprompt
SELECT LOWER(SYS_CONTEXT('USERENV','CURRENT_USER')) || '@' || SYS_CONTEXT('USERENV','DB_NAME') as sqlprompt FROM DUAL;
SET SQLPROMPT '&sqlprompt> '
UNDEFINE sqlprompt
SET TERMOUT ON

Then whenever you run sqlplus the prompt will be set accordingly, e.g.,

$ sqlplus foo_owner/foo_pass@FOO

SQL*Plus: Release 10.2.0.4.0 - Production on Wed Jan 11 12:47:19 2012

Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning option

foo_owner@FOO> show user
USER is "FOO_OWNER"
foo_owner@FOO> 
Posted in oracle, shell tips

reverse ssh tunnel

I would like ssh access to a protected host that is not directly accessible on the Internet but does have outbound access. This is a common scenario in corporate networks that often require a vpn for remote access; but in situations where vpn access is not available (e.g., I forgot my keyfob, or I don't want to install vpn software) a reverse ssh tunnel to a trusted host can be used instead.

Basically, a reverse ssh tunnel is set up by the private host (i.e., any host that's not directly accessible from the Internet). The private host connects to a trusted host (that is accessible on the Internet).

A reverse ssh tunnel is usually started with something like,

$ ssh -R 12345:localhost:22 user@example.com

This will create a tunnel from example.com back to the private host. In other words, you would simply log in to example.com and issue the following command,

$ ssh user@localhost -p 12345

And you would now be logged into the private host.

The only problem is that you must establish the reverse tunnel from the private host. In order to keep this tunnel alive (without you physically being able to log in to the private host), you can use a periodic cron, e.g.,

#!/bin/bash
#
# e.g., crontab every 5 minutes
# */5 * * * * ~/.ssh/reverse_tunnel >/dev/null 2>&1
#
# -OR-
# */5 * * * * ~/.ssh/reverse_tunnel hostname port >/dev/null 2>&1

## set defaults
REMOTE_HOST=user@hostname
REMOTE_PORT=12345

SSH_KEY=/path/to/your/private.key

if [ $# == 2 ]; then
  REMOTE_HOST=$1
  REMOTE_PORT=$2
fi

## reverse tunnel command
REVERSE_TUNNEL="ssh -i $SSH_KEY -q -N -R $REMOTE_PORT:localhost:22 $REMOTE_HOST"

## start tunnel if not already running
pgrep -f -x "$REVERSE_TUNNEL" > /dev/null 2>&1 || $REVERSE_TUNNEL

## test tunnel by executing a command on the remote host
ssh -i $SSH_KEY $REMOTE_HOST netstat -an | egrep "tcp.*:$REMOTE_PORT.*LISTEN"  > /dev/null 2>&1
if [ $? -ne 0 ] ; then
   pkill -f -x "$REVERSE_TUNNEL"
   $REVERSE_TUNNEL
fi

This will maintain a tunnel on the public host back to the private host. Now you can easily log in to the private host from anywhere that has access to the public host.

Posted in shell tips, ssh

ssh agent across multiple hosts

I would like secure single sign-on across multiple hosts. An easy way to do this is with ssh-agent; however, ssh-agent is a bit limited.

For example, the normal use of ssh-agent looks like this,

$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-stSwW11394/agent.11394; export SSH_AUTH_SOCK;
SSH_AGENT_PID=11395; export SSH_AGENT_PID;
echo Agent pid 11395;

You would need to set those environment variables to use the newly created agent, and then use ssh-add to add your credentials, e.g.,

$ eval `ssh-agent`
Agent pid 11464
$ ssh-add
Enter passphrase for /home/foobar/.ssh/id_dsa: 
Identity added: /home/foobar/.ssh/id_dsa (/home/foobar/.ssh/id_dsa)
$ ssh-add -l
1024 84:e3:23:7b:f4:22:a2:da:53:fb:04:19:67:78:2b:3d /home/foobar/.ssh/id_dsa (DSA)
$

From here you can access any host that has your public key in its ~/.ssh/authorized_keys

However, any new shell instance on this same host would have to go through this process. Ideally, I want to log in ONCE and ONLY ONCE while working across multiple hosts.

The following snippet can be added to your .bashrc; it maintains a single agent across multiple shells per host and also respects ssh auth forwarding. This code will also work where your home directory is mounted across multiple hosts (maintaining an ssh-agent per host as needed, depending on where you log in first).

# ssh agent -- for shared home directory across hosts
SSH_ENV=$HOME/.ssh/.environment.`hostname`
function start_agent {
  echo "Starting a new ssh-agent on this host"
  ssh-agent | sed 's/^echo/#echo/' > ${SSH_ENV}
  chmod 600 ${SSH_ENV}
  . ${SSH_ENV} > /dev/null
  ssh-add;
  echo succeeded
}

## ssh-agent
if [ -e "$SSH_AUTH_SOCK" ]; then
  echo "Using ${SSH_AUTH_SOCK}"
elif [ -f "${SSH_ENV}" ]; then
  echo "Using ${SSH_ENV}"
  . ${SSH_ENV} > /dev/null
  ps -ef | grep ${SSH_AGENT_PID} | grep ssh-agent$ > /dev/null || {
    echo "${SSH_ENV} agent is no longer running"
    start_agent;
  }
else
  start_agent;
fi

With this scenario you typically need to log in only once per work session (regardless of which host you first log in to). And with appropriate use of ssh auth forwarding, e.g., ssh -A user@host, you can jump around from host to host without constantly typing in the same password.

Posted in shell tips, ssh

python, finding recurring pairs of data

I would like to find all pairs of data that appear together at least 10 times.

For example, given a large input file of keywords:

foo, bar, spam, eggs, durian, stinky tofu, ...
fruit, meat, vinegar, sphere, foo, ...
merlot, red, hearty, spam, oranges, durian, ...
...

"durian" and "spam" appear together twice and all other combinations appear together only once.

Rather than a brute-force approach counting through all possible keyword pairs, we can use python set operations to quickly determine the number of times two keywords appeared together.
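
As a quick illustration of the core idea before the full script, using the sample input above (both "durian" and "spam" appear in rows 1 and 3):

>>> durian = {1, 3}      # rows where "durian" appears
>>> spam = {1, 3}        # rows where "spam" appears
>>> len(durian & spam)   # rows where both appear together
2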

#!/usr/bin/env python
'''
Processes the given input and returns all pairs that appear together in 
at least 10 different lines.

Algorithm Explanation:

Hash each value such that each value maps to a 'set' of row indexes
(i.e., each line in the file)

Next, rather than compare all possible combinations of values, create a 
shorter list of candidate values that have 10 or more row indexes.

The set-intersection will contain all rows that BOTH values were listed in;
if the size of the set-intersection is at least 10, output this pair
'''

import codecs, csv, argparse, sys

def setIntersection(infile):
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_sets = {}
    i = 0
    for row in csv_data:
        i = i+1
        for a in row:
            if a in data_sets:
                data_sets[a].add(i)
            else:
                data_sets[a] = set([i])
    candidates = {a:s for a,s in data_sets.items() if len(s) >= 10}
    pairc = {a for a in candidates.keys()}
    for a,s in candidates.items():
        pairc.remove(a)
        for c in pairc:
            if len(s & candidates[c]) >= 10:
                print('%s, %s, %s' % (a,c,len(s & candidates[c])))

def main():
    p = argparse.ArgumentParser()
    p.add_argument('--output-file', '-o', default='')
    p.add_argument('--bitmap', '-b', action="store_true")
    p.add_argument('files', nargs='+')
    args = p.parse_args()

    if (args.output_file != ''):
        sout = sys.stdout
        fh = open(args.output_file, 'w')
        sys.stdout = fh

    for infile in args.files:
        setIntersection(infile)

    if (args.output_file != ''):
        sys.stdout = sout
        fh.close()

if __name__ == '__main__':
    main()

Alternatively, we could use a bitmap comparison and compute the Hamming Weight, e.g.,

def bitmapIntersection(infile):
    '''
    The bitmap example is more general purpose and easily ported to C.  In fact, 
    the bitCount function is a python version of computing Hamming Weight from K&R.

    A simple example illustrates the bitmap,

      keyword1 : [0 0 0 0 0 0 1 0 1 0 ... ]
      keyword2 : [0 1 1 0 0 0 1 1 1 0 ... ]
             & : [0 0 0 0 0 0 1 0 1 0 ... ] <- bitwise-AND
    '''
    datafile = codecs.open(infile, 'r', 'utf-8')
    csv_data = csv.reader(datafile)
    data_maps = {}
    i = 0
    for row in csv_data:
        for a in row:
            if a in data_maps:
                data_maps[a] |= (2 ** i)  # |= so a keyword repeated on the same line sets the bit only once
            else:
                data_maps[a] = (2 ** i)
        i = i+1
    candidates = {a:b for a,b in data_maps.items() if bitCount(b) >= 10}
    pairc = {a for a in candidates.keys()}
    for a,b in candidates.items():
        pairc.remove(a)
        for c in pairc:
            bits = bitCount(b & candidates[c])
            if bits >= 10:
                print('%s, %s, %s' % (a,c,bits))

def bitCount(int_type):
    ''' ported from Kernighan & Ritchie's "The C Programming Language"
    '''
    count = 0
    while(int_type):
        int_type &= int_type - 1
        count += 1
    return(count)
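
For a quick sanity check of bitCount and the bitwise-AND step, the docstring's illustration can be reproduced at the prompt (the two bitmaps have 1-bits at positions 6 and 8 in common):

>>> k1 = (2 ** 6) + (2 ** 8)
>>> k2 = (2 ** 1) + (2 ** 2) + (2 ** 6) + (2 ** 7) + (2 ** 8)
>>> bitCount(k1 & k2)          # rows where both keywords appear
2
>>> bin(k1 & k2).count('1')    # same count via the builtin bin()
2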

A bitmap comparison is functionally identical to the set intersection but provides potential optimization enhancements such as using encoded vectors like the BitVector module.

Posted in python

python, analyzing csv files, part 2

Previously, we discussed analyzing CSV files, parsing the csv into a native python object that supports iteration while providing easy access to the data (such as a sum by column header).

For very large files this can be cumbersome, especially where more advanced analytics are desired.

I would like to keep the same simple interface but use an in-memory database connection, thus transforming the CSV files into database tables for deeper analysis.

For example, I would like to do the following (leveraging the builtin sqlite3 module):

>>> 
>>> reports = Reports('/home/user/reports-1109')
>>> reports.billing_detail.sum('Tax Amount', {'Fiscal Period':'2011-09'})
Decimal('123321.1')
>>> 
>>> reports.tax_summary.sum('Amount', {'Fiscal Period':'2011-09'})
Decimal('123321.1')
>>> 
>>> len(reports.billing_detail)
719153
>>> 
>>> curs = reports.conn.cursor()
>>> curs.execute('SELECT name FROM sqlite_master WHERE type="table"').fetchall()
[(u'billing_detail',), (u'billing_summary',)]
>>> 

This approach can be orders of magnitude faster for even the most basic analysis. Furthermore, this allows OLAP cube analysis of the data from the CSV files, e.g.,

>>> 
>>> curs.execute('CREATE TABLE t_fact(id TEXT UNIQUE, b INT, t INT, r INT)').fetchall()
[]
>>> curs.execute('CREATE INDEX IF NOT EXISTS idxt ON t_fact(id)').fetchall()
[]
>>> 
>>> ## load some data into the fact table
>>> curs.execute('''INSERT OR REPLACE INTO t_fact(id,b,t,r)
                SELECT bd.%(id)s as id, bd.ROWID as b, ts.ROWID as t, rf.ROWID as r
                FROM billing_detail bd
                LEFT OUTER JOIN tax_summary ts ON bd.%(id)s = ts.%(tax_id)s
                LEFT OUTER JOIN refunds rf ON bd.%(id)s = rf.%(ref_id)s
                ''' % query_dict).fetchall()
[]
>>> 
>>> ## e.g., find billing records without tax summaries
>>> billings_without_tax = curs.execute('SELECT id FROM t_fact WHERE t IS NULL').fetchall()
>>> 

Using the same Report and Reports objects discussed previously, the code can be modified to leverage a database connection to support this type of analytics:

class Report(collections.Mapping):
    def __init__(self, filehint, table = None, conn = sqlite3.connect(':memory:')):
        self.filename = Reports.find_report(filehint)
        self.info = []
        self.headers = []
        self.table = table
        self.conn = conn
        self.indexes = []
        self._load()

    def _load(self):
        logging.debug('loading %s' %(self.filename))
        curs = self.conn.cursor()
        fh = open(self.filename)
        reader = csv.reader(fh)
        self.info = reader.next()
        self.headers = reader.next()
        columns = ', '.join(['c'+str(x) for x in range(len(self.headers))])
        columnTypes = ' TEXT, '.join(['c'+str(x) for x in range(len(self.headers))]) + ' TEXT'
        try:
            curs.execute('CREATE TABLE %s(%s)' %(self.table, columnTypes))
        except sqlite3.OperationalError as e:
            logging.debug('%s -- using existing table' %(e))
        else:
            curs.executemany('INSERT INTO %s (%s) VALUES(%s)' %(
                self.table, columns,
                '?, ' * (len(self.headers) -1) + '?'
            ), reader)
            self.conn.commit()
        curs.close()

    def _column(self, key):
        if key.lower() not in [x.lower() for x in self.headers]:
            raise IndexError('%s not in %s'%(key, self.table))
        return 'c' + str([x.lower() for x in self.headers].index(key.lower()))

    def create_index(self, col):
        col = self._column(col)
        icol = 'i' + col
        if icol not in self.indexes:
            logging.debug('adding index %s to %s(%s)' %(icol, self.table, col))
            curs = self.conn.cursor()
            curs.execute('CREATE INDEX IF NOT EXISTS %s ON %s(%s)' %(icol, self.table, col))
            curs.close()
            self.indexes.append(icol)

    def __getitem__(self, key):
        curs = self.conn.cursor()
        res = list(curs.execute('SELECT * FROM %s WHERE ROWID = %s' %(self.table, key+1)).fetchall()[0])
        curs.close()
        return res

    def __iter__(self):
        # note: the cursor must remain open while the caller iterates
        curs = self.conn.cursor()
        self.__iter = curs.execute('SELECT * FROM %s' %(self.table))
        return self

    def next(self):
        return self.__iter.next()

    def __len__(self):
        curs = self.conn.cursor()
        ret = curs.execute('SELECT COUNT(*) FROM %s' %(self.table)).fetchall()[0][0]
        curs.close()
        return ret

    def get(self, column, value):
        '''get rows where column matches value'''
        curs = self.conn.cursor()
        column = self._column(column)
        ret = curs.execute('SELECT * FROM %s WHERE %s = "%s"' %(self.table, column, value)).fetchall()
        curs.close()
        return ret

    def sum(self, col, filter = {}):
        curs = self.conn.cursor()
        _where = []
        for k,v in filter.iteritems():
            _where.append(' %s = "%s" ' %(self._column(k),v) )
        ret = Decimal(str(curs.execute('SELECT SUM(%s) FROM %s %s' %(
            self._column(col),
            self.table,
            ' WHERE ' + ' AND '.join(_where) if _where else ''
            )).fetchall()[0][0]))
        curs.close()
        return ret
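
The remaining helpers follow the same pattern; continuing the reports session from the top of the post, create_index() adds an index on a column and get() pulls the matching rows, e.g.,

>>> reports.billing_detail.create_index('Fiscal Period')
>>> rows = reports.billing_detail.get('Fiscal Period', '2011-09')
>>> # each row is a tuple of column values in the original header order
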
Posted in data arch., python, software arch.