Decompressing HTTP bodies

Apr 4, 2011 at 10:05 PM

Hi Boaz,

First, pcap.net is awesome.  It's rare to find such a well-documented, well-supported and USEFUL utility.  Thanks for all your work.

My question is: what is the best way to decompress a gzipped/compressed HTTP body?  I've tried a few things, including:

      string httpBody = http.Body.ToString(Encoding.UTF8);
      if (http.Header.ToString().ToLower().Contains("content-encoding: gzip"))
      {
       // body compressed, need to decompress
       MemoryStream decompressedStream = new MemoryStream();
       using (GZipStream gz = new GZipStream(http.Body.ToMemoryStream(), CompressionMode.Decompress))
       {
           byte[] bufffer = new byte[0x400];
           int count = gz.Read(bufffer, 0, bufffer.Length);
           while (count != 0)
           {
               decompressedStream.Write(bufffer, 0, count);
               count = gz.Read(bufffer, 0, bufffer.Length);
           }
       }
       httpBody = Encoding.UTF8.GetString(decompressedStream.ToArray());
      }

However, this just throws the error: "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream."
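For reference, GZipStream raises this error when the bytes it is given do not start with the gzip magic number 0x1f 0x8b. A minimal sketch of a guard that checks for those two bytes before decompressing is shown here; `GzipBodyHelper`, `LooksLikeGzip` and `TryGunzip` are hypothetical names, not part of Pcap.Net, and the input is assumed to be the raw body bytes (for example from `http.Body.ToArray()`). A body that is truncated can still fail or decompress only partially.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipBodyHelper
{
    // True when the buffer starts with the gzip magic bytes 0x1f 0x8b.
    public static bool LooksLikeGzip(byte[] data)
    {
        return data != null && data.Length >= 2 && data[0] == 0x1f && data[1] == 0x8b;
    }

    // Decompresses a gzip payload; returns null when the input does not look like gzip.
    public static byte[] TryGunzip(byte[] data)
    {
        if (!LooksLikeGzip(data)) { return null; }

        using (MemoryStream input = new MemoryStream(data))
        using (GZipStream gz = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            gz.CopyTo(output);   // Stream.CopyTo needs .NET 4.0 or later
            return output.ToArray();
        }
    }
}
```

Returning null instead of throwing makes it easy to log the first few body bytes and see what actually arrived in place of the gzip header.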

Any ideas you have would be very appreciated.

Thanks again!

Jeff

Coordinator
Apr 16, 2011 at 8:09 PM

Hi Jeff,

 

Could you provide more details about the exception? Does it happen in the first iteration of the loop? What is the full call stack?

Also, note that a single HTTP packet usually doesn't contain the entire gzip data due to TCP fragmentation (the response body is split across several TCP segments).
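One way to see whether a single packet holds the whole body is to compare the number of body bytes in that packet with the Content-Length header. The following is only a rough sketch under that assumption; `HttpBodyCheck` and its methods are hypothetical helpers working on the raw header text, and responses without a Content-Length header (for example chunked ones) are reported as incomplete.

```csharp
using System;
using System.Globalization;

public static class HttpBodyCheck
{
    // Returns the Content-Length value from raw header text, or -1 when absent or unparsable.
    public static int GetContentLength(string headerText)
    {
        string[] lines = headerText.Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0) { continue; }   // status line or malformed line

            string name = line.Substring(0, colon).Trim();
            if (!name.Equals("Content-Length", StringComparison.OrdinalIgnoreCase)) { continue; }

            int value;
            string text = line.Substring(colon + 1).Trim();
            return int.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out value) ? value : -1;
        }
        return -1;
    }

    // True only when a Content-Length was found and at least that many body bytes were seen.
    public static bool IsBodyComplete(string headerText, int bodyBytesSeen)
    {
        int expected = GetContentLength(headerText);
        return expected >= 0 && bodyBytesSeen >= expected;
    }
}
```

If the check fails, decompression should wait until the remaining TCP segments have been collected.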

 

Boaz.

May 11, 2011 at 11:25 PM

Hi Brickner,

I think I found the problem - it appears the TCP payload is never populated with the gzip payload data (the length only reports that of the header plus the two carriage returns), and thus the HttpDatagram body never gets populated - the body.length is always zero.

Steps to replicate:

Here's a very small, compressed JavaScript file hosted at Yahoo that I used to test with: http://e.yimg.com/ii/yabcs.js

Via CharlesProxy and HTTPWatch, the request and responses look like this:
--------------------------------------------------------------------------------------

GET http://e.yimg.com/ii/yabcs.js HTTP/1.1
Accept: image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; InfoPath.1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C; .NET4.0E; MS-RTC LM 8)
Accept-Encoding: gzip, deflate
Proxy-Connection: Keep-Alive
Host: e.yimg.com


HTTP/1.1 200 OK
Date: Wed, 11 May 2011 20:39:09 GMT
Cache-Control: max-age=28800
Expires: Thu, 12 May 2011 04:39:09 GMT
Last-Modified: Mon, 01 Feb 2010 20:47:05 GMT
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Type: application/x-javascript
Content-Encoding: gzip
Age: 4064
Content-Length: 913
Server: YTS/1.19.5
Proxy-Connection: Keep-alive

[compressed javascript here in body]

--------------------------------------------------------------------------------------

With PCAP.net the TCP payload is identical, except it does not contain any of the compressed body bytes in the HTTP response - just the header data.

Here's my code.  You can run it, then simply launch the URL to the .js file in a browser - note the response from yahoo will only include the header and carriage returns - but no body bytes.

  public static void sniff()
  {
   // control via separate thread
   Sniffer.on = true;

   // Send anonymous statistics about the usage of Pcap.Net
   PcapDotNet.Analysis.PcapDotNetAnalysis.OptIn = false;

   // Retrieve the device list from the local machine
   string deviceName = ConfigurationManager.AppSettings["deviceName"];
   IList<LivePacketDevice> allDevices = LivePacketDevice.AllLocalMachine;
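   // FirstOrDefault returns null when no device matches, so the null check below can actually fire
   PacketDevice selectedDevice = allDevices.FirstOrDefault(item => item.Description.Contains(deviceName));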
   if (selectedDevice == null) { throw new System.Exception("Cannot find device named '" + deviceName + "'."); }

   // 65536 guarantees that the whole packet will be captured on all the link layers
   int i = 0;
   int snapshotLength = 65536;
   string myIp = ConfigurationManager.AppSettings["localIpAddress"];
   Packet packet;
   PacketCommunicator communicator = selectedDevice.Open(snapshotLength, PacketDeviceOpenAttributes.NoCaptureLocal, 1000);
   // SetFilter compiles and applies the filter; CreateFilter alone only compiles it
   communicator.SetFilter("ip and tcp");

   do
   {
    PacketCommunicatorReceiveResult result = communicator.ReceivePacket(out packet);
    if (packet == null) { continue; }
    if (packet.Ethernet == null) { continue; }
    if (packet.Ethernet.IpV4 == null) { continue; }
    if (packet.Ethernet.IpV4.Tcp == null) { continue; }
    if (packet.Ethernet.IpV4.Tcp.Http == null) { continue; }

    int sourcePort = packet.Ethernet.IpV4.Tcp.SourcePort;
    int destinationPort = packet.Ethernet.IpV4.Tcp.DestinationPort;
    IpV4Address sourceAddress = packet.Ethernet.IpV4.Source;
    IpV4Address destinationAddress = packet.Ethernet.IpV4.Destination;

    if (sourceAddress.ToString() == myIp || destinationAddress.ToString() == myIp)
    {
     IpV4Datagram ip = packet.Ethernet.IpV4;
     TcpDatagram tcp = ip.Tcp;
     HttpDatagram http = tcp.Http;
     string httpBody = "";
     string httpHeader = "";

     try
     {
      // parse packet
      if (tcp.IsValid && tcp.PayloadLength > 0)
      {
       // pull the payload
       Datagram dg = tcp.Payload;
       MemoryStream ms = dg.ToMemoryStream();
       StreamReader sr = new StreamReader(ms);
       string content = sr.ReadToEnd();

       // skip if encrypted / non parsable
       if (content.IndexOf("HTTP") == -1) { continue; }

       // parse out header
       int endHeader = content.IndexOf("\r\n\r\n");
       if (endHeader == -1) { throw new System.Exception("Cant discern header breakpoint."); }
       httpHeader = content.Substring(0, endHeader);

       // parse out body
       // but make sure it isn't just composed of only the CRLF CRLF breaks
       if (http.Body != null && (content.Length - endHeader > 4))
       {
        // we have some body content
        // parse out and decompress if necessary
        Stream bodyStream = new MemoryStream(http.Body.ToArray());
        if (http.Header.ToString().ToLower().Contains("content-encoding: gzip"))
        {
         bodyStream = new GZipStream(bodyStream, CompressionMode.Decompress);
        }
        if (http.Header.ToString().ToLower().Contains("content-encoding: deflate"))
        {
         bodyStream = new DeflateStream(bodyStream, CompressionMode.Decompress);
        }

         // ERROR: for gzip streams, getting:
         // "The magic number in GZip header is not correct. Make sure you are passing in a GZip stream."
         // works fine for non-compressed streams
        byte[] bodyBytes = Utils.readStream(bodyStream, 0);
        httpBody = Encoding.UTF8.GetString(bodyBytes);
       }
      }
     }
     catch (Exception ex)
     {
      // do something
     }
    }
   } while (Sniffer.on);
  }


 

May 13, 2011 at 5:02 PM

Hi Boaz -- Browsing through the issue tracker, it seems this bug is related to http://pcapdotnet.codeplex.com/workitem/7907 .  Can you replicate?

Coordinator
May 20, 2011 at 3:24 PM

Can you attach a .pcap file to show an example of a packet?

May 26, 2011 at 3:50 PM
Edited May 26, 2011 at 4:13 PM

I had the same issue with HttpDatagram and gzip-encoded HTTP packets.

I found the source of the issue in my case:

When I get the content from HttpDatagram or HttpResponseLayer for a gzipped response, the data has some additional header bytes before and after the gzip content.

This causes the decompression to fail.

These additional bytes are delimited from the gzipped content by \r\n (0x0d 0x0a).

PS: One thing helps me a lot: in my situation the packets are small, so I don't need to piece chunked data together. Each packet contains the full gzip data.

I am curious what these bytes before and after the gzip content are, but I didn't have time to check.

So my solution for this (very rough, just a test that it works):
Detect the extra bytes as follows:
- at the start: find the magic number 1f8b and cut off everything before it
- at the end: find the first 0d0a bytes and cut off everything from there on
- the result can be decompressed

Test sample.
PS: this works only when one packet contains the full chunk of data.

        // Callback function invoked by libpcap for every incoming packet
        private static void PacketHandler( Packet packet ) {
            // print timestamp and length of the packet
            // Console.WriteLine(packet.Timestamp.ToString("yyyy-MM-dd hh:mm:ss.fff") + " length:" + packet.Length);

            IpV4Datagram ip = packet.Ethernet.IpV4;
            TcpDatagram tcp = ip.Tcp;
            HttpDatagram httpdgrm = ip.Tcp.Http;


            // for http packets only
            if ( httpdgrm != null && httpdgrm.Body != null ) {

                //------- response
                if ( httpdgrm.ExtractLayer() is HttpResponseLayer ) {
                    HttpResponseLayer http2 = ( PcapDotNet.Packets.Http.HttpResponseLayer ) httpdgrm.ExtractLayer();

                    string httpBody = http2.Body.ToString();

                    // decode from gzip
                    // body compressed, need to decompress
                    if ( ( !String.IsNullOrEmpty( httpBody ) ) &&
                            http2.Header.ToString().ToLower().Contains( "content-encoding: gzip" ) ) {


                        // at first we can have some additional headers before real data
                        // so we must remove them
                        // check this by finding Magick header
                        String sMagick = "1f8b";
                        Int32 iMagickPos = httpBody.IndexOf( sMagick );
                        if ( iMagickPos >= 0 ) {
                            httpBody = httpBody.Substring( iMagickPos );
                        } else {
                            Console.WriteLine( "No gzipped data" );
                            return;
                        }

                        // we can have at end some additional headers
                        // they usually begin with "0d0a"
                        // so we remove them

                        // find the position of the first "0d0a"
                        String sFind = "0d0a";
                        Int32 iFindPos = httpBody.IndexOf( sFind );
                        if ( iFindPos >= 0 ) {
                            httpBody = httpBody.Substring( 0, iFindPos );
                        }



                        // now convert string to bytes
                        // we have string that look like 
                        // "1f8b08000000000000035d8f4baec2300c45f7e271a8ec381f2733d681106ad3... "
                        // we must convert it to bytes
                        // each 2 chars converted to byte
                        String OneByte;

                        Int32 ilen = 0;
                        Int32 iLenMax = httpBody.Length;
                        byte[] byteArray = new Byte[ httpBody.Length / 2 ];
                        while ( ilen < iLenMax ) {
                            OneByte = httpBody.Substring( ilen, 2 );
                            byteArray[ ilen / 2 ] = Convert.ToByte( OneByte, 16 );
                            ilen += 2;
                        }

                        // now begin unzip
                        MemoryStream gZipSourceData = new MemoryStream( byteArray );

                        MemoryStream decompressedStream = new MemoryStream();
                        // using ( GZipStream gz = new GZipStream( http2.Body.ToMemoryStream(), CompressionMode.Decompress ) ) {
                        using ( GZipStream gz = new GZipStream( gZipSourceData, CompressionMode.Decompress ) ) {
                            byte[] bufffer = new byte[ 0x400 ];
                            int count = gz.Read( bufffer, 0, bufffer.Length );
                            while ( count != 0 ) {
                                decompressedStream.Write( bufffer, 0, count );
                                count = gz.Read( bufffer, 0, bufffer.Length );
                            }
                        }
                        httpBody = Encoding.ASCII.GetString( decompressedStream.ToArray() );
                    }


                    Console.WriteLine( httpBody );
                }
            }
        }
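The extra bytes described above are consistent with HTTP/1.1 chunked transfer encoding, where each chunk is preceded by a hexadecimal size line ending in \r\n and followed by another \r\n. Assuming that is what is happening here, a rough sketch of stripping the chunk framing on the raw bytes (rather than searching a hex string, where "1f8b" or "0d0a" could also match inside the compressed data) could look like this. `ChunkedBody` is a hypothetical helper, not part of Pcap.Net.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class ChunkedBody
{
    // Removes chunk-size lines and chunk-terminating CRLFs and returns the joined chunk data.
    // Stops at the zero-size chunk, or early when the buffer is truncated.
    public static byte[] Decode(byte[] body)
    {
        using (MemoryStream output = new MemoryStream())
        {
            int pos = 0;
            while (pos < body.Length)
            {
                int lineEnd = IndexOfCrLf(body, pos);
                if (lineEnd < 0) { break; }                        // size line is incomplete

                string sizeLine = Encoding.ASCII.GetString(body, pos, lineEnd - pos);
                int semicolon = sizeLine.IndexOf(';');            // drop optional chunk extensions
                if (semicolon >= 0) { sizeLine = sizeLine.Substring(0, semicolon); }

                int size;
                if (!int.TryParse(sizeLine.Trim(), NumberStyles.AllowHexSpecifier,
                                  CultureInfo.InvariantCulture, out size) || size < 0)
                {
                    throw new InvalidDataException("Body does not start with a chunk-size line.");
                }
                if (size == 0) { break; }                          // last chunk

                int dataStart = lineEnd + 2;
                int available = Math.Min(size, body.Length - dataStart);
                output.Write(body, dataStart, available);
                if (available < size) { break; }                   // chunk continues in a later packet
                pos = dataStart + size + 2;                        // skip chunk data and its trailing CRLF
            }
            return output.ToArray();
        }
    }

    // Index of the first CR LF pair at or after start, or -1 when none is found.
    private static int IndexOfCrLf(byte[] data, int start)
    {
        for (int i = start; i + 1 < data.Length; i++)
        {
            if (data[i] == 0x0d && data[i + 1] == 0x0a) { return i; }
        }
        return -1;
    }
}
```

The decoded bytes can then be wrapped in a MemoryStream and passed to GZipStream. This only applies when the response uses Transfer-Encoding: chunked, and like the sample above it gives the full body only when all chunks are in the buffer.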

Maybe this will help tune the classes.
Great package, by the way. It helps a lot, thanks guys!

Coordinator
May 27, 2011 at 2:57 PM

If you can send me a pcap example for that I might be able to incorporate the solution in Pcap.Net.

May 28, 2011 at 11:11 PM

Hi,

is there an easy way to get around the TCP fragmentation of the HTTP Body? Or do I have to parse the tcp stream by myself?

Thank you

May 29, 2011 at 3:40 PM
Edited May 29, 2011 at 3:46 PM
brickner wrote:

If you can send me a pcap example for that I might be able to incorporate the solution in Pcap.Net.

Hi, thanks. I have prepared captured packets as an example.

A zip file with the pcap data is here:
http://mylinenpants.com/_test/all_packets_from session.zip 

It contains a few packets from one HTTP GET/response session.

The data is chunked.

The problem packets are the 2 HTTP response packets (#4 and #8) with HTTP headers like:
HTTP/1.1 200 OK  (text/html)

They contain gzipped data, and the library returns a body with additional bytes before the magic bytes "0x1f 0x8b".

Coordinator
Jun 11, 2011 at 1:35 PM
sbetzin wrote:

Hi,

is there an easy way to get around the TCP fragmentation of the HTTP Body? Or do I have to parse the tcp stream by myself?

Thank you

Currently there isn't.

You can vote for this in the Issue Tracker.

Coordinator
Jun 11, 2011 at 1:55 PM
checkitmore wrote:
brickner wrote:

If you can send me a pcap example for that I might be able to incorporate the solution in Pcap.Net.

Hi, thanks. I have prepared captured packets as an example.

A zip file with the pcap data is here:
http://mylinenpants.com/_test/all_packets_from session.zip 

It contains a few packets from one HTTP GET/response session.

The data is chunked.

The problem packets are the 2 HTTP response packets (#4 and #8) with HTTP headers like:
HTTP/1.1 200 OK  (text/html)

They contain gzipped data, and the library returns a body with additional bytes before the magic bytes "0x1f 0x8b".

I took a look at the pcap file.

TCP reconstruction is needed here.

Packet 4 is actually the continuation of packet 3.

If you take packet 3 and try to parse it using Pcap.Net, it will parse the HTTP part and give you some of the body.

It won't give you all the body because TCP reconstruction is not supported in Pcap.Net.

 

The situation with packets 7 and 8 seems to be similar.
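Until reconstruction is supported, the payloads of one direction of a connection can be collected and ordered by TCP sequence number in user code. The following is only a minimal sketch of that idea, not how Pcap.Net works internally; `StreamReassembler` is a hypothetical class, the sequence number and payload bytes are assumed to be taken from each TCP segment of the same connection and direction, and it ignores sequence number wrap-around and partially overlapping retransmissions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed class StreamReassembler
{
    // Payloads keyed by TCP sequence number; SortedDictionary keeps them in ascending order.
    private readonly SortedDictionary<uint, byte[]> _segments = new SortedDictionary<uint, byte[]>();

    public void Add(uint sequenceNumber, byte[] payload)
    {
        if (payload == null || payload.Length == 0) { return; }      // pure ACKs carry no data
        if (!_segments.ContainsKey(sequenceNumber))                   // keep the first copy of a retransmission
        {
            _segments.Add(sequenceNumber, payload);
        }
    }

    // Concatenates contiguous payloads starting at firstSequenceNumber and stops at the first gap.
    public byte[] ToArray(uint firstSequenceNumber)
    {
        using (MemoryStream output = new MemoryStream())
        {
            uint expected = firstSequenceNumber;
            foreach (KeyValuePair<uint, byte[]> segment in _segments)
            {
                if (segment.Key < expected) { continue; }             // data before the current position
                if (segment.Key != expected) { break; }               // gap: a segment is still missing
                output.Write(segment.Value, 0, segment.Value.Length);
                expected += (uint)segment.Value.Length;
            }
            return output.ToArray();
        }
    }
}
```

For the capture above, that would mean adding the payloads of packets 3 and 4 (and likewise 7 and 8) to one reassembler and parsing the HTTP response from the combined bytes.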

 

I hope this helps,

 

Boaz.