Geek as a platform

String manipulation exercise: Perl, Python, Awk

Here is a small comparison of the performance of Perl, Awk and Python when parsing and splitting lines in a BIG ldif file with thousands or millions of subscriber profiles like this one (I created this ldif file as an example, just for the sake of clarity).

As in the example, one of the attributes in my ldif files was a huge base64 string (more than 10k bytes long) on a single line, which is not supported by slapadd/slapd (I have checked the LDIF RFC and I don't see any mention of a 4096-byte limit, so it could be our own implementation, not sure about this), so the idea was to split this line into 76-character lines (as per the recommendation)... Something like this:

Original line (short version):

...
Service: SSBhbSBoYXBweSB0byBqb2luIHdpdGggeW91IHRvZGF5IGluIHdoYXQgd2lsbCBnbyBkb3duIGluIGhpc3RvcnkgYXMgdGhlIGdyZWF0ZXN0IGRlbW9uc3RyYXRpIGV2ZXJ5IHN0YXRlIGZa
...

Replaced by:

...
Service: SSBhbSBoYXBweSB0byBqb2luIHdpdGggeW91IHRvZGF5IGluIHdoYXQgd2lsbCBnbyBkb3
 duIGluIGhpc3RvcnkgYXMgdGhlIGdyZWF0ZXN0IGRlbW9uc3RyYXRpIGV2ZXJ5IHN0YXRlIGZa
...

(Notice the blank at the beginning of the second line)
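
That leading blank is what marks a continuation: RFC 2849 describes folding as inserting a line separator followed by exactly one space, and a reader undoes it by removing both. As a rough illustration only (the unfold_ldif name is made up for this sketch, it is not part of any library), the reverse operation could look like this:

```python
def unfold_ldif(lines):
    """Join LDIF continuation lines (starting with one space) to the previous line."""
    unfolded = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(" ") and unfolded:
            # Continuation line: drop the single leading space and append the rest.
            unfolded[-1] += line[1:]
        else:
            unfolded.append(line)
    return unfolded


# Hypothetical folded input, much shorter than a real profile.
folded = [
    "Service: SSBhbSBoYXBweSB0byBqb2lu",
    " IHdpdGggeW91IHRvZGF5",
    "cn: example",
]
for entry in unfold_ldif(folded):
    print(entry)
```

Running the split and then this unfold should give back the original line, which is a handy sanity check for any of the scripts below.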

My first approach was to use Python's standard textwrap module:

import sys
import textwrap

try:
    ldiffile = sys.argv[1]
except IndexError:
    print("Input file missing... Exit")
    sys.exit(1)


# The "U" flag was removed in Python 3.11; universal newlines are the default
with open(ldiffile, "r") as f:
    for line in f:
        if line.startswith("Service:"):
            print(textwrap.fill(line, width=76,subsequent_indent=' '))
        else:
            print(line.strip())

It was really simple and produced the expected result, but it was ridiculously slow. RIDICULOUSLY. Really, difficult to believe...

So I decided to do something similar, but this time I dealt with the line split myself:

import sys

try:
    ldiffile = sys.argv[1]
except IndexError:
    print("Input file missing... Exit")
    sys.exit(1)

def fixlen(s, n):
    #First line, no blank at the beginning of the line
    print (s[:n])
    s = s[n:]
    #now, starting with blank
    while s:
        print (" " + s[:n])
        s = s[n:]

with open(ldiffile, "r") as f:
    for line in f:
        if line.startswith("Service:"):
            # Drop the trailing newline so the last chunk does not print an empty line
            fixlen(line.rstrip("\n"), 76)
        else:
            print(line.strip())

The results I got were way better, but still, it took more than I expected... Time to give it a try with other tools: AWK and Perl

Basically I "translated" the python script to awk and perl, almost to the letter...

Perl version:

#!/usr/bin/perl
my $noindent = 0;
while ($line = <>) {
  chomp($line);
  if ( $line =~ /^Service:.*/)
  {
    for (unpack("(A76)*",$line)) {
        if ($noindent == 0) {
            print "$_\n";
            $noindent = 1; }
        else { print " $_\n"; }
    }
  }
  else
  {
    $noindent = 0;
    print "$line\n";
  }
}

Awk version:

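awk '
  {
    if ($0 ~ /^Service:/) {
      # awk strings are 1-indexed; reset the state for every Service line
      indent = 0;
      for (i = 1; i <= length($0); i += 76) {
        # First chunk without blank, the rest starting with a blank
        if (indent == 0) { printf "%s\n", substr($0, i, 76); indent = 1; }
        else { printf " %s\n", substr($0, i, 76); }
      }
    }
    else {
      print $0;
    }
  }' "$1"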

I "faked" several input files with thousands of lines (discarding the output), executed all the scripts in Cygwin64, and then checked the time it took.. Something like this:

time python3 textwrapv0.py XXXX.ldif &> /dev/null
time python3 pythonv1.py XXXX.ldif &> /dev/null
time python3 pythonv2.py XXXX.ldif &> /dev/null
time perl split.pl  XXXX.ldif &> /dev/null
time ./split.awk XXXX.ldif &> /dev/null

All of them generated the exact same output (except the textwrap version that handles the first line in a different way), but the execution time differs:

'results_table'

It is easier to see in a chart:

'results_chart'

My takeaways after this small exercise:

  • Awk rocks.
  • Perl's black magic is almost as fast as awk.
  • Python is not really fast at file processing. Yes, there are ways to improve this, by splitting in chunks, parallel processing and whatnot... But it takes more than 15 lines.
  • Stay away from the textwrap module for heavy usage.
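
If you want to check the textwrap point on your own machine without building a big ldif file, here is a minimal sketch using timeit. The sample string and both function names are invented for the illustration, and the numbers will depend on your Python version and hardware, so take it as a way to measure rather than as a result:

```python
import textwrap
import timeit


def fold_with_textwrap(line, width=76):
    # Same call as in the first script.
    return textwrap.fill(line, width=width, subsequent_indent=" ")


def fold_with_slices(line, width=76):
    # Plain slicing: cut fixed-size chunks and join them with newline + blank.
    chunks = [line[i:i + width] for i in range(0, len(line), width)]
    return "\n ".join(chunks)


# Synthetic stand-in for a long base64 attribute (about 10k characters).
sample = "Service: " + "QUJD" * 2500

for fold in (fold_with_textwrap, fold_with_slices):
    seconds = timeit.timeit(lambda: fold(sample), number=20)
    print(f"{fold.__name__}: {seconds:.4f}s for 20 calls")
```

The two functions do not produce identical output (textwrap counts the blank as part of the width), which matches what I saw with the first line in the real scripts.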

It was fun... Bye!

PS: As you can see in the results, there is a "Python v2" column which produced better results... It is a modification of the original script where I used a generator expression (in the style of a list comprehension):

import sys

try:
    ldiffile = sys.argv[1]
except IndexError:
    print("Input file missing... Exit")
    sys.exit(1)

def fixlen(s, n):
    first = True
    tmp = (s[0+i:n+i] for i in range(0, len(s), n))
    for x in tmp:
        #First line, no blank at the beginning of the line
        if first:
            print(x)
            first=False
        else:
            #now, starting with blank
            print(" " + x)

with open(ldiffile, "r") as f:
    for line in f:
        if line.startswith("Service:"):
            # Drop the trailing newline so the last chunk does not print an empty line
            fixlen(line.rstrip("\n"), 76)
        else:
            print(line.strip())

Issues running gpg in a container

In case it helps...

I was trying out this Docker Compose example and, for some strange reason, docker got stuck on this block of code in the Dockerfile for one of the components:

# grab gosu for easy step-down from root
ENV GOSU_VERSION 1.7
RUN set -x \
        && apt-get update && apt-get install -y --no-install-recommends ca-certificates wget && rm -rf /var/lib/apt/lists/* \
        && wget -O /usr/local/bin/gosu "https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg --print-architecture)" \
        && wget -O /usr/local/bin/gosu.asc "https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg --print-architecture).asc" \
        && export GNUPGHOME="$(mktemp -d)" \
        && gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4 \
        && gpg --batch --verify /usr/local/bin/gosu.asc /usr/local/bin/gosu \
        && rm -r "$GNUPGHOME" /usr/local/bin/gosu.asc \
        && chmod +x /usr/local/bin/gosu \
        && gosu nobody true

Specifically, in this line:

gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4 

It is a really simple command... I could ping the keyserver, I checked the website and everything looked ok, but it wasn't working, neither on the host machine nor inside a container... The response was the same: timeout while getting the keys.

So... Here comes netstat to the rescue:

# netstat -natop | grep gpg
tcp        0      1 192.168.1.10:34340      104.236.209.43:11371    SYN_SENT    8653/gpg2keys_hkp    on (7,10/3/0)

SYN_SENT? So it was trying to establish the connection, but there wasn't any response from the remote host... on that port.

It turns out I recently upgraded my internet connection, and the new router allows you to customize the firewall security levels (you know, low, medium, high and paranoid... yeah, I am using the last one). I noticed that port 11371 was not defined as a "known service", so I wasn't able to reach it from within my home network.

As soon as I allowed the connection to that port:

# gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4
gpg: requesting key BF357DD4 from hkp server ha.pool.sks-keyservers.net
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key BF357DD4: public key "Tianon Gravi <tianon@tianon.xyz>" imported
gpg: no ultimately trusted keys found
gpg: Total number processed: 1
gpg:               imported: 1  (RSA: 1)

So make sure all your IP flows are open, even for such a small thing as this one...
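
If netstat is not at hand, a TCP connection attempt with a timeout tells a similar story. This is only a sketch: can_connect is a helper name made up here, and a False result just means "no connection within the timeout", it cannot tell a firewall drop apart from a dead server.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (filtered port), refused connections and DNS failures.
        return False
```

For the case in this post the call would be can_connect("ha.pool.sks-keyservers.net", 11371), run once from the host and once from inside the container.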

Take care out there!

Using different sudo passwords for the same host group in Ansible

I'll write this down here before I forget.

I have 2 VPSs that I manage with Ansible. I recently updated the root password on one of the servers, and my old playbook, which worked fine as long as all the hosts in the group had the same password, stopped working.

I tried some stuff that didn't work, until I found out that the way to solve this is described in this link:

Ansible - Become

Basically I added a connection variable called ansible_become_pass for each of my servers, and set each one to the name of a variable stored in my Vault. Like this:

[vps]
foo.com ansible_user=foo ansible_ssh_private_key_file=/server/id_rsa_foo ansible_become_pass='{{ foo_sudo_pass }}'
bar.foo.com ansible_user=bar ansible_ssh_private_key_file=/server/id_rsa_bar ansible_become_pass='{{ bar_sudo_pass }}'

The variables foo_sudo_pass and bar_sudo_pass are the root passwords for each server and they are stored in my Vault file (called passwords.yml).

Those variables are read at the time the playbook is executed... So you could have something as simple as this:

$ cat upgradeAll.yml
---
- hosts: vps
  become: true
  tasks:
  - name: "apt-get update"
    apt:
      upgrade: dist
      update_cache: yes
      cache_valid_time: 7200

And run the playbook with the verbose option to get some more information with this command:

$ ansible-playbook upgradeAll.yml -e@/server/playbooks/passwords.yml --vault-password-file vault_pass.txt -v

Let's see an example:

$ ansible-playbook upgradeAll.yml -e@/server/playbooks/passwords.yml --vault-password-file vault_pass.txt  -v
PLAY [vps] *********************************************************************************

TASK [Gathering Facts] *********************************************************************************
ok: [foo.com]
ok: [bar.foo.com]

TASK [read vars] *********************************************************************************
ok: [foo.com] => {"ansible_facts": {"foo_sudo_pass": "pass1", "bar_sudo_pass": "pass2", "test_sudo_pass": "testpass"}, "changed": false}
ok: [bar.foo.com] => {"ansible_facts": {"foo_sudo_pass": "pass1", "bar_sudo_pass": "pass2", "test_sudo_pass": "testpass"}, "changed": false}

TASK [apt-get update] *********************************************************************************
changed: [foo.com] => {"changed": true, "msg": "Reading package lists.
[...]

PLAY RECAP *********************************************************************************
foo.com                : ok=3    changed=1    unreachable=0    failed=0
bar.foo.com            : ok=3    changed=1    unreachable=0    failed=0

$

Note how I passed the extra variables on the command line with -e@/server/playbooks/passwords.yml; otherwise it won't work!

Later!

Interview question: chmod chmod?

Not sure why, but today I remembered a question from a job interview I did probably 5 years ago... It was a pretty easy question, and my answer was pretty dumb.

Q: "Ok, imagine you remove execution permissions from chmod... How would you put them back???"

A: "quick and dirty, I'd copy the binary from the machine sitting next to it".

How about that? I remember it was very late in the evening (I was working in Tokyo at that time and they called me from Europe), but anyway, there was no excuse. I think the guy who interviewed me is still laughing :)

There are multiple ways to solve it. Perl, Python, Go, C++, etc. all let you change the permissions of any file, including chmod... Let me show you how I'd do it in Python:

#Current permissions: 755
root@94d9407f1002:/# ls -l /bin/chmod
-rwxr-xr-x 1 root root 56112 Feb 18  2016 /bin/chmod

#Let's remove execution permission
root@94d9407f1002:/# chmod -x /bin/chmod

#New permissions: 644 
root@94d9407f1002:/# ls -l /bin/chmod
-rw-r--r-- 1 root root 56112 Feb 18  2016 /bin/chmod

#Checking execution is not allowed
root@94d9407f1002:/# chmod +x /bin/chmod
bash: /bin/chmod: Permission denied

#Here comes python to the rescue
root@94d9407f1002:/# python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.chmod("/bin/chmod",0o755)
>>> exit()

#Confirming it worked as expected
root@94d9407f1002:/# ls -l /bin/chmod
-rwxr-xr-x 1 root root 56112 Feb 18  2016 /bin/chmod
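
Hard-coding 0o755 works here because I knew the original mode. If you don't, a slightly more careful variant is to read the current mode and only add the execute bits. A minimal sketch, assuming a POSIX system; add_exec_bits is a name invented for this example and the demo runs on a throwaway temp file instead of /bin/chmod:

```python
import os
import stat
import tempfile


def add_exec_bits(path):
    """Add execute permission for user, group and others, keeping the other bits."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Demonstrate on a temporary file so nothing important is touched.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
os.chmod(demo_path, 0o644)
add_exec_bits(demo_path)
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
os.remove(demo_path)
```

For the interview scenario the call would simply be add_exec_bits("/bin/chmod"), run as root.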

Not sure why this came to my mind today... But in case someone asks you this very same question, don't f$ck it up as I did. Just breathe normally, and think...

\\psgonza

All the images missing... Thanks Dropbox :(

I bet nobody noticed, but all the images in the blog were gone... No pictures, no logo, etc. Ooops!

I was using Dropbox as storage for the blog images, and a while ago they decided to change the policy about their public folders. Bummer!

I don't think Dropbox is to blame here though. They provide a really good service (for free!) and I guess they have to draw a line somewhere.

Anyway, I will be storing the images on the server from now on, so I am updating the posts and changing the image paths.

Bye