What?
Continuing with the same theme as the previous post (Let's build a (nano) web framework), in this post we'll be building a nano HTTP client similar to urllib3.
Introducing httpy
The http client will be built using Python. Here's the usage we're aiming to get:
import httpy
req = httpy.get('http://httpbin.org/robots.txt')
req.status # => 200
req.data # => 'User-agent: *\nDisallow: /deny'
To achieve this, we need to build:
- URL parser: This will allow us to go from the string http://httpbin.org/robots.txt to knowing what the scheme, host & path are.
- Communication function: This will send the HTTP request and get the raw response.
- HTTP response parser: This will allow us to parse the raw HTTP response to get the response status and the page content.
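By the end, these three pieces will compose into a single call chain, roughly like this (the function names are the ones we'll build over the rest of the post):

host, port, path = parse_url('http://httpbin.org/robots.txt')  # URL parser
raw_response = get(host, port, path)                           # communication function
status, body = parse_response(raw_response)                    # HTTP response parser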
How deep are we gonna go?
We're going to base our work on Python's built-in socket module and the third-party pyparsing library.
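If you haven't used pyparsing before, the core idea is that you compose small parser objects into a grammar and call parseString on it. Here's a minimal taste (a toy grammar of my own, not part of httpy):

import pyparsing as pp

# A toy grammar: a word, an equals sign, and an integer, e.g. "answer=42"
pair = pp.Word(pp.alphas).setResultsName('key') + pp.Suppress('=') + pp.pyparsing_common.integer.setResultsName('value')
result = pair.parseString('answer=42')
print(result.get('key'), result.get('value'))  # => answer 42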
Compromises
As the goal of this experiment is to get an MVP, these are the compromises for this library:
- Only works with HTTP
- Only a subset of the URL format is supported
- Only IPv4
URL parser
The usage will be:
try:
    host, port, path = parse_url('http://httpbin.org/robots.txt')
except ValueError as e:
    print(e)
    exit()
Let's start with a couple of test cases. I'm going to use doctest for the examples.
def parse_url(url):
    """Parse url based on this format:
    http://host[:port]path[?query][#fragment]

    >>> parse_url('http://httpbin.org/')
    ('httpbin.org', 80, '/')
    >>> parse_url('http://httpbin.org/robots.txt')
    ('httpbin.org', 80, '/robots.txt')
    >>> parse_url('http://test:1234/lorem?a=b#c')
    ('test', 1234, '/lorem?a=b')
    >>> parse_url('https://mhasbini.com/')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    >>> parse_url('httpmhasbini.com')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    """
    return ""
We can define a context-free grammar to parse the url based on the specified format. I opted to use pyparsing to achieve this. The grammar is as follows:
import pyparsing as pp

# Hostname: letters, digits, dots and dashes, e.g. httpbin.org
host_pp = pp.Word(pp.alphanums + '.-').setResultsName('host')
port_pp = pp.pyparsing_common.signed_integer.setResultsName('port')
# Path (query string included): a subset of the characters allowed by RFC 3986
path_pp = pp.Combine('/' + pp.Optional(pp.Word(pp.srange("[a-zA-Z0-9._~!$&'()*+,;=:@/?-]")))).setResultsName('path')
# Fragment: parsed so the grammar accepts it, but never returned
fragment_pp = pp.Optional('#' + pp.Word(pp.srange("[a-zA-Z0-9/?]"))).setResultsName('fragment')
This is a simple implementation that only supports http and doesn't adhere completely to the syntax specified in RFC 3986, but it's good enough for our MVP.
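To get a feel for what each piece produces on its own, here's a quick sanity check (a ParseResults object prints like a list):

print(host_pp.parseString('httpbin.org'))  # => ['httpbin.org']
print(port_pp.parseString('1234'))         # => [1234] (signed_integer converts the token to an int)
print(path_pp.parseString('/robots.txt'))  # => ['/robots.txt']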
What's left is to combine these parts into the full grammar while capturing parsing exceptions and returning the values in the correct format:
import pyparsing as pp

def parse_url(url):
    """Parse url based on this format:
    http://host[:port]path[?query][#fragment]

    >>> parse_url('http://httpbin.org/')
    ('httpbin.org', 80, '/')
    >>> parse_url('http://httpbin.org/robots.txt')
    ('httpbin.org', 80, '/robots.txt')
    >>> parse_url('http://test:1234/lorem?a=b#c')
    ('test', 1234, '/lorem?a=b')
    >>> parse_url('https://mhasbini.com/')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    >>> parse_url('httpmhasbini.com')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    """
    host_pp = pp.Word(pp.alphanums + '.-').setResultsName('host')
    port_pp = pp.pyparsing_common.signed_integer.setResultsName('port')
    # Path (query string included): a subset of the characters allowed by RFC 3986
    path_pp = pp.Combine('/' + pp.Optional(pp.Word(pp.srange("[a-zA-Z0-9._~!$&'()*+,;=:@/?-]")))).setResultsName('path')
    # Fragment: parsed so the grammar accepts it, but never returned
    fragment_pp = pp.Optional('#' + pp.Word(pp.srange("[a-zA-Z0-9/?]"))).setResultsName('fragment')
    syntax_pp = 'http://' + host_pp + pp.Optional(':' + port_pp) + path_pp + fragment_pp

    try:
        result = syntax_pp.parseString(url)
    except pp.ParseException:
        raise ValueError('Invalid URL')

    return result.get('host'), result.get('port', 80), result.get('path')
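One caveat: parseString is happy to stop early, so trailing garbage after a valid URL is silently ignored. If we wanted to be stricter, pyparsing's parseAll flag makes the grammar consume the whole input:

result = syntax_pp.parseString(url, parseAll=True)  # raises ParseException on leftover input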
Communication function
This is the main part of the library. The usage will be:
try:
    raw_response = get(host, port, path)
except ConnectionError as e:
    print(e)
    exit()
As usual, we start by defining a couple of test cases:
def get(host, port, path):
    """Open connection and send GET HTTP request and return raw response

    >>> get('httpbin.org', 80, '/robots.txt') # doctest:+ELLIPSIS
    'HTTP/1.1 200 OK\\r\\nDate: ...\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n'
    >>> get('mhasbini.com', 1234, '/robots.txt')
    Traceback (most recent call last):
        ...
    request_helper.ConnectionError: [Errno 113] No route to host
    """
    ...
To execute this logic, we need to:
- Construct the request header.
- Open a socket and send the request: we'll use the socket.AF_INET family because we're only supporting IPv4 and the socket.SOCK_STREAM type because we're using TCP.
- Read the response a chunk at a time and then return it.
Putting the above requirements into code:
import socket

class ConnectionError(OSError):
    """Raised when a socket connection fails for any reason"""
    pass

def get(host, port, path):
    """Open connection and send GET HTTP request and return raw response

    >>> get('httpbin.org', 80, '/robots.txt') # doctest:+ELLIPSIS
    'HTTP/1.1 200 OK\\r\\nDate: ...\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n'
    >>> get('mhasbini.com', 1234, '/robots.txt')
    Traceback (most recent call last):
        ...
    request_helper.ConnectionError: [Errno 113] No route to host
    """
    # Generate request message
    request_m = f'GET {path} HTTP/1.1\r\n'
    request_m += f'Host: {host}:{port}\r\n'
    request_m += 'Connection: close\r\n'
    request_m += '\r\n'

    try:
        # AF_INET -> ipv4
        # SOCK_STREAM -> TCP
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((host, port))
        sock.sendall(request_m.encode())

        # Get data 1024 bytes at a time
        data = b''
        while True:
            _buffer = sock.recv(1024)
            if not _buffer:
                break
            data += _buffer
        sock.close()

        # Return the decoded response as-is; the response parser expects real CRLFs
        return data.decode()
    except OSError as e:
        raise ConnectionError(str(e)) from None
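For concreteness, here's the request message the doctest call above puts on the wire (every line ends with \r\n, and a final empty line terminates the headers):

GET /robots.txt HTTP/1.1
Host: httpbin.org:80
Connection: close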
HTTP response parser
This is where we take the raw response from the function above and parse it to get the status code and the response body. The API is as follows:
status, body = parse_response(response)
Examples of responses:
HTTP/1.1 200 OK
Date: Thu, 05 Nov 2020 00:05:19 GMT
Content-Type: text/plain
Content-Length: 30
Connection: close
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

User-agent: *
Disallow: /deny
HTTP/1.1 404 NOT FOUND
Date: Thu, 05 Nov 2020 00:05:46 GMT
Content-Type: text/html
Content-Length: 233
Connection: close
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>
We start by defining test cases:
def parse_response(raw_response):
    """Parse raw http response and return status code and body.

    >>> parse_response('HTTP/1.1 200 OK\\r\\nDate: Thu, 05 Nov 2020 03:22:48 GMT\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n')
    (200, 'User-agent: *\\nDisallow: /deny')
    >>> parse_response('lorem ipsum')
    Traceback (most recent call last):
        ...
    ValueError: Invalid raw response
    """
    ...
The headers and the body are separated by \r\n\r\n, so we parse them as follows:
DELIMITER = '\r\n\r\n'

status_pp = pp.pyparsing_common.signed_integer.setResultsName('status')
# Consume the delimiter, then lazily capture everything up to the trailing newline;
# (?s) lets '.' match newlines, since the body can span multiple lines
body_pp = pp.Suppress(DELIMITER) + pp.Regex(r'(?s)(.*?)$').setResultsName('body')
response_pp = pp.LineStart() + 'HTTP/1.1' + status_pp + pp.SkipTo(DELIMITER) + body_pp
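A quick check of this fragment on a minimal response shows the two named results we care about:

r = response_pp.parseString('HTTP/1.1 200 OK\r\n\r\nhello')
print(r.get('status'), r.get('body'))  # => 200 hello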
Putting it together:
import pyparsing as pp

def parse_response(raw_response):
    """Parse raw http response and return status code and body.

    >>> parse_response('HTTP/1.1 200 OK\\r\\nDate: Thu, 05 Nov 2020 03:22:48 GMT\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n')
    (200, 'User-agent: *\\nDisallow: /deny')
    >>> parse_response('lorem ipsum')
    Traceback (most recent call last):
        ...
    ValueError: Invalid raw response
    """
    DELIMITER = '\r\n\r\n'

    status_pp = pp.pyparsing_common.signed_integer.setResultsName('status')
    # Consume the delimiter, then lazily capture everything up to the trailing newline
    body_pp = pp.Suppress(DELIMITER) + pp.Regex(r'(?s)(.*?)$').setResultsName('body')
    response_pp = pp.LineStart() + 'HTTP/1.1' + status_pp + pp.SkipTo(DELIMITER) + body_pp

    try:
        result = response_pp.parseString(raw_response)
    except pp.ParseException:
        raise ValueError('Invalid raw response')

    return result.get('status'), result.get('body')
Plugging everything together
This is the final step where we plug everything together and add some syntactic sugar.
# httpy.py
from parsers import parse_url, parse_response
from request_helper import get as raw_get

class Result:
    def __init__(self, status, data):
        self.status = status
        self.data = data

def get(url):
    return Result(*parse_response(raw_get(*parse_url(url))))
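Result is just the smallest object that makes the demo read nicely. If you prefer, the standard library's namedtuple gives the same shape with less code (an equivalent alternative, not what the code above uses):

from collections import namedtuple

Result = namedtuple('Result', ['status', 'data'])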
And we have our demo working! 🎉
import httpy
req = httpy.get('http://httpbin.org/robots.txt')
req.status # => 200
req.data # => 'User-agent: *\nDisallow: /deny'
Conclusion
The full source code is hosted in this gist.
I enjoyed using doctest and pyparsing for this project. Working on this made me appreciate the effort put into keeping the internet running as it is. So many things can go wrong, and I'm glad that there are smart people at the helm.