What?
Continuing with the same theme as the previous post (Let's build a (nano) web framework), in this post we'll be building a nano HTTP client similar to urllib3.
Introducing httpy
The http client will be built using Python. Here's the usage we're aiming to get:
import httpy
req = httpy.get('http://httpbin.org/robots.txt')
req.status # => 200
req.data # => 'User-agent: *\nDisallow: /deny'
To achieve this, we need to build:
- URL parser: This will allow us to go from the string http://httpbin.org/robots.txt to knowing what the scheme, host & path are.
- Communication function: This will send the HTTP request and get the raw response.
- HTTP response parser: This will allow us to parse the raw HTTP response to get the response status and the page content.
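By the end, these three pieces will compose into a single call chain, roughly like this (the function names are the ones we'll build over the rest of the post):

host, port, path = parse_url('http://httpbin.org/robots.txt')  # URL parser
raw_response = get(host, port, path)                           # communication function
status, body = parse_response(raw_response)                    # HTTP response parser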
How deep are we gonna go?
We're going to base our work on Python's built-in socket module and the third-party pyparsing library.
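If you haven't used pyparsing before, the core idea is that you compose small parser objects into a grammar and call parseString on it. Here's a minimal taste (a toy grammar of my own, not part of httpy):

import pyparsing as pp

# A toy grammar: a word, an equals sign, and an integer, e.g. "answer=42"
pair = pp.Word(pp.alphas).setResultsName('key') + pp.Suppress('=') + pp.pyparsing_common.integer.setResultsName('value')
result = pair.parseString('answer=42')
print(result.get('key'), result.get('value'))  # => answer 42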
Compromises
As the goal of this experiment is to get an MVP, these are the compromises for this library:
- Only works with HTTP
- Only a subset of the URL format is supported
- Only IPv4
URL parser
The usage will be:
try:
    host, port, path = parse_url('http://httpbin.org/robots.txt')
except ValueError as e:
    print(e)
    exit()
Let's start with a couple of test cases. I'm going to use doctest for the examples.
def parse_url(url):
    """Parse url based on this format:
    http://host[:port]path[?query][#fragment]

    >>> parse_url('http://httpbin.org/')
    ('httpbin.org', 80, '/')
    >>> parse_url('http://httpbin.org/robots.txt')
    ('httpbin.org', 80, '/robots.txt')
    >>> parse_url('http://test:1234/lorem?a=b#c')
    ('test', 1234, '/lorem?a=b')
    >>> parse_url('https://mhasbini.com/')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    >>> parse_url('httpmhasbini.com')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    """
    return ""
We can define a context-free grammar to parse the url based on the specified format. I opted to use pyparsing to achieve this. The grammar is as follows:
import pyparsing as pp

# Hostname: letters, digits, dots and dashes, e.g. httpbin.org
host_pp = pp.Word(pp.alphanums + '.-').setResultsName('host')
port_pp = pp.pyparsing_common.signed_integer.setResultsName('port')
# Path (query string included): a subset of the characters allowed by RFC 3986
path_pp = pp.Combine('/' + pp.Optional(pp.Word(pp.srange("[a-zA-Z0-9._~!$&'()*+,;=:@/?-]")))).setResultsName('path')
# Fragment: parsed so the grammar accepts it, but never returned
fragment_pp = pp.Optional('#' + pp.Word(pp.srange("[a-zA-Z0-9/?]"))).setResultsName('fragment')
This is a simple implementation that only supports http and doesn't adhere completely to the syntax specified in RFC 3986, but it's good enough for our MVP.
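To get a feel for what each piece produces on its own, here's a quick sanity check (a ParseResults object prints like a list):

print(host_pp.parseString('httpbin.org'))  # => ['httpbin.org']
print(port_pp.parseString('1234'))         # => [1234] (signed_integer converts the token to an int)
print(path_pp.parseString('/robots.txt'))  # => ['/robots.txt']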
What's left is to combine these parts into the full grammar while capturing parsing exceptions and returning the values in the correct format:
import pyparsing as pp

def parse_url(url):
    """Parse url based on this format:
    http://host[:port]path[?query][#fragment]

    >>> parse_url('http://httpbin.org/')
    ('httpbin.org', 80, '/')
    >>> parse_url('http://httpbin.org/robots.txt')
    ('httpbin.org', 80, '/robots.txt')
    >>> parse_url('http://test:1234/lorem?a=b#c')
    ('test', 1234, '/lorem?a=b')
    >>> parse_url('https://mhasbini.com/')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    >>> parse_url('httpmhasbini.com')
    Traceback (most recent call last):
        ...
    ValueError: Invalid URL
    """
    host_pp = pp.Word(pp.alphanums + '.-').setResultsName('host')
    port_pp = pp.pyparsing_common.signed_integer.setResultsName('port')
    # Path (query string included): a subset of the characters allowed by RFC 3986
    path_pp = pp.Combine('/' + pp.Optional(pp.Word(pp.srange("[a-zA-Z0-9._~!$&'()*+,;=:@/?-]")))).setResultsName('path')
    # Fragment: parsed so the grammar accepts it, but never returned
    fragment_pp = pp.Optional('#' + pp.Word(pp.srange("[a-zA-Z0-9/?]"))).setResultsName('fragment')
    syntax_pp = 'http://' + host_pp + pp.Optional(':' + port_pp) + path_pp + fragment_pp

    try:
        result = syntax_pp.parseString(url)
    except pp.ParseException:
        raise ValueError('Invalid URL')

    return result.get('host'), result.get('port', 80), result.get('path')
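One caveat: parseString is happy to stop early, so trailing garbage after a valid URL is silently ignored. If we wanted to be stricter, pyparsing's parseAll flag makes the grammar consume the whole input:

result = syntax_pp.parseString(url, parseAll=True)  # raises ParseException on leftover input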
Communication function
This is the main part of the library. The usage will be:
try:
    raw_response = get(host, port, path)
except ConnectionError as e:
    print(e)
    exit()
As usual, we start by defining a couple of test cases:
def get(host, port, path):
    """Open connection and send GET HTTP request and return raw response

    >>> get('httpbin.org', 80, '/robots.txt') # doctest:+ELLIPSIS
    'HTTP/1.1 200 OK\\r\\nDate: ...\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n'
    >>> get('mhasbini.com', 1234, '/robots.txt')
    Traceback (most recent call last):
        ...
    request_helper.ConnectionError: [Errno 113] No route to host
    """
    ...
To execute this logic, we need to:
- Construct the request header.
- Open a socket and send the request: we'll use the socket.AF_INET family because we're only supporting IPv4 and the socket.SOCK_STREAM type because we're using TCP.
- Read the response a chunk at a time and then return it.
Putting the above requirements into code:
import socket

class ConnectionError(OSError):
    """Raised when a socket connection fails for any reason"""
    pass

def get(host, port, path):
    """Open connection and send GET HTTP request and return raw response

    >>> get('httpbin.org', 80, '/robots.txt') # doctest:+ELLIPSIS
    'HTTP/1.1 200 OK\\r\\nDate: ...\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n'
    >>> get('mhasbini.com', 1234, '/robots.txt')
    Traceback (most recent call last):
        ...
    request_helper.ConnectionError: [Errno 113] No route to host
    """
    # Generate request message
    request_m = f'GET {path} HTTP/1.1\r\n'
    request_m += f'Host: {host}:{port}\r\n'
    request_m += 'Connection: close\r\n'
    request_m += '\r\n'

    try:
        # AF_INET -> ipv4
        # SOCK_STREAM -> TCP
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((host, port))
        sock.sendall(request_m.encode())

        # Get data 1024 bytes at a time
        data = b''
        while True:
            _buffer = sock.recv(1024)
            if not _buffer:
                break
            data += _buffer
        sock.close()

        # Return the decoded response as-is; the response parser expects real CRLFs
        return data.decode()
    except OSError as e:
        raise ConnectionError(str(e)) from None
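For concreteness, here's the request message the doctest call above puts on the wire (every line ends with \r\n, and a final empty line terminates the headers):

GET /robots.txt HTTP/1.1
Host: httpbin.org:80
Connection: close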
HTTP response parser
This is where we take the raw response from the function above and parse it to get the status code and the response body. The API is as follows:
status, body = parse_response(response)
Examples of responses:
HTTP/1.1 200 OK
Date: Thu, 05 Nov 2020 00:05:19 GMT
Content-Type: text/plain
Content-Length: 30
Connection: close
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

User-agent: *
Disallow: /deny
HTTP/1.1 404 NOT FOUND
Date: Thu, 05 Nov 2020 00:05:46 GMT
Content-Type: text/html
Content-Length: 233
Connection: close
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>404 Not Found</title>
<h1>Not Found</h1>
<p>The requested URL was not found on the server. If you entered the URL manually please check your spelling and try again.</p>
We start by defining test cases:
def parse_response(raw_response):
    """Parse raw http response and return status code and body.

    >>> parse_response('HTTP/1.1 200 OK\\r\\nDate: Thu, 05 Nov 2020 03:22:48 GMT\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n')
    (200, 'User-agent: *\\nDisallow: /deny')
    >>> parse_response('lorem ipsum')
    Traceback (most recent call last):
        ...
    ValueError: Invalid raw response
    """
    ...
The headers and the body are separated by \r\n\r\n, so we parse them as follows:
DELIMITER = '\r\n\r\n'

status_pp = pp.pyparsing_common.signed_integer.setResultsName('status')
# Consume the delimiter, then lazily capture everything up to the trailing newline;
# (?s) lets '.' match newlines, since the body can span multiple lines
body_pp = pp.Suppress(DELIMITER) + pp.Regex(r'(?s)(.*?)$').setResultsName('body')
response_pp = pp.LineStart() + 'HTTP/1.1' + status_pp + pp.SkipTo(DELIMITER) + body_pp
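A quick check of this fragment on a minimal response shows the two named results we care about:

r = response_pp.parseString('HTTP/1.1 200 OK\r\n\r\nhello')
print(r.get('status'), r.get('body'))  # => 200 hello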
Putting it together:
import pyparsing as pp

def parse_response(raw_response):
    """Parse raw http response and return status code and body.

    >>> parse_response('HTTP/1.1 200 OK\\r\\nDate: Thu, 05 Nov 2020 03:22:48 GMT\\r\\nContent-Type: text/plain\\r\\nContent-Length: 30\\r\\nConnection: close\\r\\nServer: gunicorn/19.9.0\\r\\nAccess-Control-Allow-Origin: *\\r\\nAccess-Control-Allow-Credentials: true\\r\\n\\r\\nUser-agent: *\\nDisallow: /deny\\n')
    (200, 'User-agent: *\\nDisallow: /deny')
    >>> parse_response('lorem ipsum')
    Traceback (most recent call last):
        ...
    ValueError: Invalid raw response
    """
    DELIMITER = '\r\n\r\n'

    status_pp = pp.pyparsing_common.signed_integer.setResultsName('status')
    # Consume the delimiter, then lazily capture everything up to the trailing newline
    body_pp = pp.Suppress(DELIMITER) + pp.Regex(r'(?s)(.*?)$').setResultsName('body')
    response_pp = pp.LineStart() + 'HTTP/1.1' + status_pp + pp.SkipTo(DELIMITER) + body_pp

    try:
        result = response_pp.parseString(raw_response)
    except pp.ParseException:
        raise ValueError('Invalid raw response')

    return result.get('status'), result.get('body')
Plugging everything together
This is the final step where we plug everything together and add some syntactic sugar.
# httpy.py
from parsers import parse_url, parse_response
from request_helper import get as raw_get

class Result:
    def __init__(self, status, data):
        self.status = status
        self.data = data

def get(url):
    return Result(*parse_response(raw_get(*parse_url(url))))
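Result is just the smallest object that makes the demo read nicely. If you prefer, the standard library's namedtuple gives the same shape with less code (an equivalent alternative, not what the code above uses):

from collections import namedtuple

Result = namedtuple('Result', ['status', 'data'])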
And we have our demo working! 🎉
import httpy
req = httpy.get('http://httpbin.org/robots.txt')
req.status # => 200
req.data # => 'User-agent: *\nDisallow: /deny'
Conclusion
The full source code is hosted in this gist.
I enjoyed using doctest and pyparsing for this project. Working on this made me appreciate the effort put into keeping the internet running as it is. So many things can go wrong, and I'm glad that there are smart people at the helm.