Let's make web more responsive

This document contains condensed knowledge about how to save bytes on the link and (second priority) millisecond of rendering on user's screen. Doing many of these things make sense esp. on heavily visited webpages where difference between serving 20requests/sec and 200req./sec in peak hours is not just about speed, but also about not loosing potential customers.

DNS

Use long TTL (time-to-live), like 1 day or more, to make better use of dns-caching and thus speed-up resolving your domain name: e.g. if clients A and B use the same DNS server D, client A visits your website and client B visits it 30 minutes later (and if TTL for your domain is 30+ minutes): the client B will resolve DNS faster, because server D will have the data about your domain cached.
Note: using higher TTL will means DNS data will be cached longer on caching-DNS-servers, so any changes you'll make will take longer to propagate over the Internet.
Note2: benefits of increasing TTL doesn't scale linearly, that is increasing TTL from 5 minutes to just 10 minutes give better results that increasing TTL from 1 day to 2 days.

Avoid using CNAME (canonical name), use A record. CNAME records are "redirects" and CNAME resolved domain name has to be resolved by a client once again: e.g. if somesite.net has a CNAME record foo.someisp.com then after resolving somesite.net to foo.someisp.com there is still work to do before a browser knows the IP address.

If you use your own nameserver, ensure the domain of the nameserver itself has a long TTL. Also, check out this DNS-server test – it checks (among other things) if parent nameserver provide so called "glue records" (i.e., if it returns also A records of your nameservers), or not: it returns just names that needs to be resolved to A records.
If you have a possibility to turn "glue records" on, and they are not enabled yet, turn them on. Also, play with these DNS tools/tests, they could give you hints what could be improved.

TCP/IP

When client finally knows the IP address of the server: it sends SYN packet to the server, server replies with SYN+ACK, and after receiving the SYN+ACK (not earlier) client finally can send a query, along with ACK for the SYN+ACK. If you use QoS (Quality of Service – control mechanisms that can provide different priority to different data flows or packets) put ACK and SYN packets in the highest possible priority and minimal delay.

When query is received by a server (the query fits into one TCP packet): server replies with sending the asked file. It sends 1-4 packets (depending on the system and settings) and waits for receiving ACK packet before sending the rest. TCP protocol, in this way, sends more packets faster and faster, probing the pipe for its maximum bandwidth. Problem is: it starts slowly; high latency links (or just being far away from server, light takes 66ms to travel to the opposite point of the Earth, TCP packet cannot travel faster) and requests for many small files can result in most time spend on just waiting for an ACK packet from the other side. Even if your server sits on a low-latency link and is connected to low-latency routers: take into account that many clients have a high latency link, are on the other side of the Earth, have the link overloaded. With satellite links: RTT (round-trip-time) is ~600ms; with an overloaded DSL link: RTT can be even 1500ms; receiving a web-page results in many round-trips.

DSL

If you host websites on a DSL link: make sure the DSL works in FastPath mode (not in interleaved mode), it reduces latency by >= 15ms in RTT. Ping to the first hop in traceroute should be <=10ms; if it's >=25ms: you're on interleaved mode: complain to your ISP, ISP can usually change it without bad side-effects unless you're far away from DSLAM or have a noisy line (in such cases less error-proness of interleaved mode starts to matter).

Fitting data in packets

Most Internet pipes have MTU 1500B (DSL in PPPoE mode has MTU=1492B). 40B is used by TCP header (or 62B, if you have timestamps turned on), about 300B takes HTTP header. We are left with ~1160B for data in the first packet and ~1460B for data in subsequent packets. If a web-page fits into these initial ~1160B, it'll be send just in one TCP packet, and it'll be really fast, all data will be transmitted in one shot, without waiting for ACK from the other side.
However, modern operating system now sends 2 or even 4 TCP packets at once on start, so you may use up to ~1160B + ~1360B (or ~1160B + ~1360B * 3) for the trick. (Use a packet sniffer to make sure how many bytes TCP and HTTP headers really take and how many packets your operating system sends on start). Knowing the exact border could be useful: e.g. you may discover that if you trim you CSS file just few bytes, it'll be served in one shoot, without the overhead of header in the next TCP packet and without waiting for ACK.

Persistent connections

The overhead of opening new TCP connection (and wasting time on SYN, SYN+ACK packets and TCP slow-start) can be eliminated thank to feature called TCP persistent-connection: here all data is send over one TCP connection instead of opening separate TCP connections for every file (CSS, images, JavaScript, etc.). It saves bandwidth (on SYN and RST packets) and makes web more responsive as further requests don't spend time on establishing new TCP connection.
However, the feature is implemented in many web-servers (incl. Apache) in a such way that it completely blocks one web-server instance for waiting over such idle persistent connection, and every persistent connection could take hours or days. Web-servers thus use a timeout (Apache's default is 15s) to knock-out an idle persistent connection and free a web-server instance from sitting idle on such connection. In same cases timeout of 15 seconds can be too much: still too many web-server instances sitting idle and taking RAM memory, and there is no memory for more Web-server instances for serving new clients.
If you have problem with that, don't turn off persistent connections completely, just use a low timeout (1-5 seconds), until even this will be too much. Ideally, persistent connection should last as long as a client have a web-page opened and possibly can click and request another file from the same server, so latency of serving such request is minimized by one RTT.
A good solution to KeepAlive problem is thread-based "event" module in Apache-2.2 that uses separate thread for all keep-alive connections, so you can leave a long timeout without sacrificing resources on the web-server.
If you use classic (prefork) Apache2 module, and want to use KeepAlive, but worry about sudden, unexpected surges in web traffic, you can make a script (i.e., in cron) that will monitor Apache's error.log for "MaxClients reached" phrase and, if detected, lower KeepAlive timeout (or turn off KeepAlive) in apache2.conf and then restart Apache – it's an ugly solution, but ensures a compromise between using KeepAlive (on prefork module) to minimize latency, and not worrying that during unexpected peaks KeepAlive will do much harm.

HTTP pipelining

A natural "extension" to persistent-connections is called pipelining: a client doesn't wait for asked file to be received before sending more requests over the same TCP connection; in fact, client can send all the requests at once over the TCP connection, so a web-server can send all the files, one by one, without waiting for a next request just after fulfilling a previous request.
Sadly, some browsers don't support pipelining (IE <=7), Firefox browser have it disabled by default and only(?) Opera makes use of it be default (allegedly using some heuristics and back-offs from using pipelining if it senses the web-server doesn't support pipelining).

Unfortunately, there is nothing we can do about the issue (on the web-server side), other than making sure our web-server supports pipelining (I'm almost sure that all modern web servers do support pipelining, problem was with old IIS) and persistent-connections are turned on. I.e. there is no way to force browser to use pipelining (from web-server point of view) or inform browser that we do support pipelining. More about HTTP pipelining on Wikipedia.

OS specific tuning

URLs

sysctl settings
OS	setting	default	comment
OpenBSD/FreeBSD/NetBSD?	rfc3390=1	0?	increases initial TCP window from 2 to 4 packets
Solaris	tcp_slow_start_initial=X	?	sets initial TCP window to X packets
Linux/?	tcp_timestamps=0 or 1	1?	turning it off gives you 12B in every TCP packet, but also results in less data for TCP congestion algorithms: less accurate RTT measurement. benchmarks needed
Linux/?	tcp_ecn=1	0	turns on using Explicit Congestion Notification (less latency over congested link if both sides support it) but can result (unlikely these days?) in your TCP packets dropped by some broken/old routers
Linux/?	~~tcp_recycle=1~~	0	?
Linux/?	~~tcp_reuse=1~~	0	?

http://asite.com is not the same URL as http://asite.com/
If you'll type URL without ending "/": web-server will redirect you to URL with ending "/", wasting a RTT in the process.

http://asite.com/ ≠ http://asite.com/index.html for web-agents, so stick to "http://asite.com/" and don't use URL with "index.html" (or vice versa), and perhaps use a redirect from index.html to URL without the file.

http://www.asite.com/ ≠ http://somesite.com/ for web-agents, so stick to one of these URLs and use redirect from one to another.

The two optimization above are especially important if you make use of web-caching on client and on proxies, and it's also a positive search-engine-ranking optimization (i.e., your web-ranking will cumulate on one URL).

Use relative URLs for internal links (e.g. <a href=hello> instead of <a href="http://mydomain.com/mysite/hello">). Hide your filename extensions (.html, .php, .asp, etc.) using internal redirects.

HTML

Normal human visitors don't care about HTML code readability, nor comments in HTML code. You can remove comments on-the-fly or use some tools to keep a readable file for you, yet the web-server's file can be without comments and useless whitespaces. You can usually remove new-lines completely and use only space (thus you save bytes after compression later: with less unique characters in file, compression will have easier job).

In HTML (but not in XHTML) some things are optional and browsers work perfectly fine without them (and W3C validator doesn't complain):

Use > instead of >. Use only lowercase in HTML tags (e.g. <p> instead of <P>) – usually you'll have slightly better compression ratios, also you'll be ready for XHTML with lowercased tags. If you edit HTML files on MS Windows, remove \r characters in the on-line files.

Fast tables

Tables, esp. long tables, can take long time to render. "table{table-layout:fixed}" attribute in CSS2 allow browser to use so called fixed, fast table layout algorithm that starts to render table just after downloading the first row. To make it work, browser has to know the table width. Width of every column is based on explicit width (if set) or (if there is no explicit data about column's width) based on contents of the column in the first row.

XHTML

If you want to use minimum bandwidth: stick to HTML, as XHTML always requires closing tags and quotation marks around values. If you want to have XHTML backward compatible with HTML, you have to waste even more bytes on whitespaces (e.g. <br />). Default encoding in XML/XTML is UTF-8, thus encoding="UTF-8" is superfluous in XML prolog (e.g. <?xml version="1.0" ~~encoding="UTF-8"~~?>). In fact, entire XML prolog is optional in XHTML.

Character encoding

UTF-8 is nice, but if you use only characters from one country, it does make sense to use (in output file) ISO-8859-x encoding; ISO-8859-x takes 1B per every character covered, while UTF-8 at least 2B per every non-ASCII character.

CSS

If you care (you should) about readability and maintainability of the code: keep original file intact and serve an optimized file; optimization can be done by a script; or, if you have content-caching on the server-side, optimizing the file on the-fly with few simple regular-expressions may be OK.

Keep short class names.
Colors like #00000 or #ffffff can be written as #000 or #fff
Colors like #FFF can be #fff (if you keep all the letters in lowercase and there will be no uppercase 'f': file will be better compressed by gzip/deflate)
color:white can be written as color:#fff (1B saved)
No need to keep unit-type if number is 0 (e.g. border:0; instead of ~~border:0px~~)
Semicolon at the end of a CSS rule is optional, e.g. p{font:10px verdana;border:0} instead of ~~p{font:10x verdana;border:0;}~~
Instead of 100px you can write 99px in many cases (1B saved).
Combine many rules into one, e.g.: p,table,li{font:14px sans-serif}
Use shorthand properties, e.g. instead of:
font-family:verdana;font-size:12px;font-weight:bold;font-style:italic
you can write:
font:italic bold 12px verdana

Instead of:
border-left:auto;border-right:auto;border-top:9px;border-bottom:5px
you can write:
border:9px auto 5px

Use external CSS file if you have more than one HTML-file using one CSS: CSS file can be cached on client-side and thus fetched only once. Use lowercased font and class names (it'll result in better compression). Avoid fixed position, it slows down scrolling (position:scroll is the default, so unless you use "fixed" keyword in your CSS, you don't have to worry about the issue).

JavaScript

If a script doesn't have to be fetched and executed on page load, it may be worth to add attribute "defer": fetching (if it's an external file) and execution of the script will be delayed until page is loaded, so that page-load will be faster.

There are some programs that "condenses" JavaScript: replaces variables with short ones, removes whitespaces, etc.

Caching on client-side (and on web proxies in-the-middle)

No excuse to not use client-side-caching when it comes to static content, a smart caching on client-side can (and should) be done safely even in cases of most dynamic content. That is: even if your dynamic content changes frequently, you force a browser (and/or caching proxies in-between) to cache received content and do freshness-validations on subsequent reloads of the page, so a browser asks for a file with a conditional request to a web-server "send me the file only if version X of the file is now outdated": web-server can reply with a fresh file or with a short response saying that the version X is still fresh and can be displayed to user immediately.

Most web-servers already send (for static files) file-modification time and ETag in HTTP headers and they handle conditional requests by browsers: if-modified-since:date and if-none-match:etag.
But for dynamic content (e.g. PHP, ASP, CGI), a web-server doesn't return file-modification-time nor ETag. You can do it by yourself though.

Optional Cache-Control: line in HTTP header says to browsers and proxies-in-between if/how a content can be cached. No such line means a content can be cached. Example usage in PHP: header('Cache-Control: max-age=86400, no-public'); or header('Cache-Control: max-age=86400');

Cache-Control directives
public	this is the default behavior; if there is no Cache-Control line: it's like there is `Cache-Control: public`; it means the content may be cached by any cache
private	do not cache content by public caches in-the-middle (i.e., web-proxies)
no-cache	do not cache the content at all
no-store	do not store the content on disk (can be still cached in memory if no-cache is not specified)
max-age: X	consider the content fresh without validation for X seconds. 1 day = 86400 seconds. X shouldn't be more than 1 year.
s-maxage: X	overrides max-age (and Expiry header) for shared caches; private caches caches ignore this token
must-revalidate	content can be cached (if there is no no-cache) but must be revalidated (on subsequent use, every time) with conditional-request like if-modified-since
proxy-revalidate	it's like must-revalidate above, but it doesn't apply to non-shared caches

Note: URLs with question mark (e.g. somesite.com/bla.php?a=3&b=6) are never cached by some browsers and proxies, no matter what HTTP header says about caching.

Now, most dynamic content can return file-modification time in HTTP header and/or ETag (e.g. hash of contents) and can handle conditional requests. Example PHP code handling it: header('Cache-Control:max-age=600'); // consider content fresh for 600s, after the 600s for subsequent refreshes browser should use conditional requests
$lastmod = getLastModTime(); // lets say getLastModTime() returns time in UNIX date format (seconds since 1970-01-01)
$dateToSend = date('D, d M Y H:i:s', $lastmod).' GMT'; // date in HTTP header should always use GMT

if(isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) && $_SERVER['HTTP_IF_MODIFIED_SINCE'] === $dateToSend) {
  header('HTTP/1.1 304 Not Modified');
  die();
}
$etag = '"'.base_convert(substr($lastmod, 3), 10, 36).'"'; // $etag could be simpler, but here we make it short

if(isset($_SERVER['HTTP_IF_NONE_MATCH']) && stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
  header('HTTP/1.1 304 Not Modified');
  die();
}
header('Last-Modified:' . $dateToSend);
header("ETag:$etag");

Even if you cannot supply something like getLastModTimeToSend() function in your code, you probably could send an ETag. At the worst: you could buffer all the output, compute sha1() (or another fast hash) of the buffered output (and use it as ETag) and check the conditional HTTP_IF_NONE_MATCH: you still save on bandwidth and on clients time by returning just "Not modified" instead of the contents. If you use an output compression, you're buffering the output anyway.

You don't have to send file-modification-time and ETag; in fact, sending both is waste of bytes in HTTP-header until there is a browser (or caching-web-proxy) that only handle one and not another(?).

Expiry line in HTTP header

Besides Cache-Control: line in HTTP/1.1 header, you can also use Expiry: line (which is compatible with HTTP/1.0 and HTTP/1.1). Expiry: is less powerful and if you already use Cache-Control, sending Expiry: line in HTTP header is waste of bytes (unless there is a proxy/cache that is still limited to HTTP/1.0).

Evil cookies

If you set a cookie for mysite.com/ with path=/, then every request from the client that has the cookie (to any page on mysite.com) cannot be satisfied from a public cache, i.e. if client connects though a web-proxy that has the file already cached, the web-proxy still needs to connect to your server and fetch the web page anyway because of the cookie.

Thus cookies break caching in-the-middle. Additionally, cookies make requests containing them larger. Solution is to use cookies selectively for some paths only (cookies will be send by browser only for requests under the path), or serve images (and other heavy files) from another domain without setting cookies for the domain, so the heavy files can be cached by public web-proxies.

Yahoo! uses yimg.com domain for images. Idea of serving images from another domain is worth a digression: for some webmasters main reason for keeping images on another domain is to force browser to open another TCP connection(s): if there is no keep-alive connection between server and browser, then this is what you can do to speed-up loading a web-page that has many images: force browser to open more TCP connections. Browsers has some limits of connections per domain (in order to not be a burden for a web-server), and offering images from another domain (even if it's on the same IP) allow the browser to open even more TCP connections. Such solution indeed improves load-times of setups without keep-alive turned on, but at the cost of more wasted resources at the both sides (more web-server instances busy per client, more wasted packets (SYNs,RSTs); more work for browser: additional DNS lookup, more TCP connections). In pre HTTP/1.1 times this solution seemed OK, but with HTTP/1.1 it seems obsoleted: keep-alive (even without pipelining) offer even better perfomance and don't waste the resources at the same time.
It doesn't mean it doesn't pay to keep images on another domain for other reasons, like aforementioned caching in-the-middle thank to no cookies per domain. Serving images from another domain also allows to do some load balancing: the images can be served from another server physically, sitting on another link, by web-server software that can be more optimized for serving static content, without any PHP modules loaded into memory.

Back to the cookies: make them short as possible to not overload uplink of individual clients (many have asymmetric DSL with slow upload) and web-server downlink, and (ideally) to make every HTTP request fit into just one TCP packet.

HTTP headers

Saving bytes in HTTP headers is important because HTTP header is never compressed and is sent along every HTTP response. Many HTTP headers are optional and are not used by browsers at all, and normal human visitors don't care if you run PHP 4.3.10 or 5.2.3.

Example HTTP header with crossed out things (351B) that are superfluous in this example:

HTTP/1.1 200 OK
Date: Sun, 01 Jul 2007 00:03:34 GMT
Server: Apache/1.3.35 (Unix) mod_fastcgi/2.4.2 mod_jk/1.2.14-dev mod_ssl/2.2.10 OpenSSL/0.9.9a PHP/4.3.10 mod_perl/1.29 FrontPage/5.0.2.2510
X-Powered-By: PHP/4.3.10
Pragma: no-cache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Content-Encoding: gzip
~~Vary: Accept-Encoding~~
Keep-Alive: timeout=1, max=20
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html

Prefetching

Use <link rel=next href=url..> (in <head> section of a web-page) if there is a logical next web-page. It not only makes possible surfing "forward" button (in theory), it also (in practice) makes possible to prefetch the next web-page to local cache (Firefox does it). Also, Firefox uses <link rel=prefetch href=...url> to prefetch a web-page while idle.

Apache tuning

Sadly, you can't remove some HTTP-headers from Apache output just by changing Apache-config (you can remove them by modifying the sources) and they aren't needed:

interesting Apache config directives
Directive	What it does / comment
ServerTokens Prod	eliminates several bytes from every Apache response: every HTTP request will contain "Server: Apache" instead of e.g. ~~"Server: Apache/1.3.35 (Unix) mod_jk/1.2.14-dev mod_ssl/2.2.10 OpenSSL/0.9.9a PHP/4.3.10 mod_perl/1.29 FrontPage/5.0.2.2510"~~
AllowOverride None	don't search for a .htaccess file on every request in every (sub)directory of a www-directory; cons: htaccess settings have to be moved to Apache main config and such change is usually not possible on shared hosting
BufferedLogs on	don't sync the log files after every request
Options +FollowSymLinks -FollowSymLinksIfOwnerMatch -Multiviews	If FollowSymLinks option is not set: Apache has to check if requested file is a symlink. If FollowSymLinksifOwnerMatch is set: Apache has to check ownership of requested file. Multiviews also add an overhead.
HostnameLookups off	dns-lookups take time
MaxKeepAliveRequests 0	saves ~9B in HTTP header ("Keep-Alive: timeout=15~~, max=100~~") if KeepAlive setting is on

By default, Apache returns in ETag header data about inode, modification-time and file-size. Chances are, you'll be fine just with modification-fime ("FileEtag -Inode -Size +Mtime") and "ETag" header will look like this: Etag: 4592b30d instead of Etag: 14482a6-2dca-4592b30d

Consider using mod_cache, especially with dynamic content that can be cached for some time. Don't turn RewriteEngine if you don't have to.

You can switch Apache with a more lightweight web-server, but don't expect here much benefits, if at all. Properly configured Apache is fast and your main slow-downs are most likely network-pipe-latency and dynamic language and database.
If you want to speed-up serving static files, a HTTP server working in kernel-space (AKA "HTTP accelerator") can help you, but beware of security consequences (a potential buffer overflow in such code could give attacker full access to your system).

MPM

MPM	description	pros	cons
prefork	many processes (heavy), no threads; one process handles one request at a time	good isolation (i.e., crash of one process won't take apart other processes)	high memory consumption, slower
worker	many processes, but every process uses threads (lightweight)	low memory consumption, more available memory for caching, buffers, MySQL, etc.	bad isolation (a crash of one process will disturb many requests)
event	like worker, but uses dedicated thread to handle listening socket and idle keep-alive connections	very low memory consumption, good handling of keep-alive connections	experimental + cons of worker
perchild	non-functional
winnt	optimized for MS Windows

Theoretical reasons why worker and event are faster than heavyweight prefork:

So, for maximum speed you would want to use worker (or event); if you use PHP, PHP is usually compiled with thread-safety disabled, so cannot be used with worker (nor event) which uses threads. You can check if your PHP is compiled with thread-safety by looking at the output of phpinfo().
Solution is compiling PHP by yourself to be thread-safe (it needs an argument to `./configure`, see `./configure --help`) and pay attention that PHP modules you use are thread-safe.

PHP

Parsing text in single quotes ('bla') is faster than parsing in double quotes ("bla"), because PHP parser looks for special characters (e.g. "\n") and variables in a double-quoted string. [According to Ilia Alshanetsky ' is not faster than ", benchmark needed.]

if($a === $b) is faster than if($a == $b) (== also uses casts, so a string with content of "0" will be ==equal to number 0, while ===equality is also type-equality).

echo() is slightly faster than print(), because print() also returns value of 1, echo() doesn't return anything.

true is slightly faster than TRUE. Internally names are always stored lowercased.

Set expose_php=Off (can be set in php.ini only): it'll make PHP not add PHP header in every HTTP request; this header is redundant.
Set register_argc_argv=Off if you don't use argc/argv variables.

Avoid calling functions with name prepended with "@" (to suppres errors) – according to Ilia Alshanetsky, it has high overhead.

Be careful with for() loops end-condition. Instead of:
for($ii=0; $ii < strlen($str); ++$ii) foo($str, $ii);
you can write:
$len = strlen($str); for($ii=0; $ii < $len; ++$ii) foo($str, $ii);

While PHP doesn't offer multi-threaded programming environment, you can use some things asynchronously. E.g., pg_send_query() function sends query to PostgreSQL and returns immediately (i.e., it's a non-blocking function), so you can do other things in the meantime and later call pg_get_result() – doing so makes sense on machines with more than 1 CPU or when database is on another computer.

Use a PHP-optimizer and PHP-compiled-code-cacher (like eAccelerator, XCache, APC) – it reduces overhead and makes PHP code about two times faster, without changes to PHP code.

Using include() with full path (e.g. include('./file.php') instead of include('file.php')) is faster if include_path doesn't begin with "." (current directory).

MySQL

I recommend Matthew Montgomery's script (tuning-primer.sh): www.day32.com/MySQL/, it collects data from your MySQL and gives recommendations about eventual changes in MySQL-config accordingly. If MySQL is on the same machine, connect to it by socket (e.g. /var/run/mysql.sock). Run mysqlcheck --optimize db_name from time to time, more frequently if there a lot of changes (UPDATEs, DELETEs) in the database.

Compression

All modern browsers ask-for and accept compressed (gzip/deflate) content and you can usually end up with 30-40% of the original file-size with typical HTML/CSS/JavaScript content; similar saving is also possible for other content (e.g. favicon.ico), but not for already well compressed files like JPG/PNG/GIF images. Web-agents that don't support gzip/deflate compressed content don't ask for it (i.e., they don't advertise that they support the compression when asking web-server for the content, so web-server sends them uncompressed files).

gzip/deflate level

gzip/deflate can have 10 level of compression (0 for no compression, 9 for best compression); 6 is the default level; higher compression levels gives little benefit in most cases, yet make compression much more CPU-intensive. However, decompression (done on client side) is faster (in case of gzip/deflate) when compression level is higher, thus if you can store pre-compressed file (i.e., don't compress on-the-fly during clients' requests) compression of level 9 is the best overall.

gzip vs deflate vs winner

gzip, deflate, deflate-raw
zlib format	header	data	footer	overhead	PHP function
gzip	10B: flags, version and timestamp	compressed data	8B: CRC32 checksum (4B) and length (4B) of the original data	18B + time spend on computing CRC32 (slow)	gzencode()
deflate-http	2B: compression flags	compressed data	4B checksum (lightweight ADLER32)	6B + time spend on computing ADLER32 (fast)	gzcompress()
deflate-raw	0	compressed data	0	0	gzdeflate()

Other possible tokens in Accept-Encoding header
compress	LZW compression, as used by compress utility in UNIX; LZW is very fast, but compress disappeared from UNIX beacause of a patent for LZW, fallen out of favor and seems forgotten along with the compress encoding in HTTP; patent for LZW expired in 2003-2006 in various countries.
x-compress	equivalent to compress (former experimental compress)
x-gzip	equivalent to gzip (former experimental gzip)
bzip2	high compression ratio (~5-25% better than deflate/gzip), slow compression (~2-3x slower than deflate/gzip), very slow decompression (~5-10x slower than deflate/gzip)

Support for HTTP-compression in browsers
Browser	gzip	deflate-http	deflate-raw	compress	bzip2	bzip
MSIE 6	OK	broken	OK	-	-	-
MSIE 7	OK	broken	OK	-	-	-
MSIE 8 (8.0.6001.18702)	OK	broken	OK	-	-	-
Opera 9	OK	OK	OK	-	-	-
Opera 10 (10.00) for Linux	OK	OK	OK	-	-	-
Firefox 2	OK	OK	OK	-	-	-
Google Chrome (3.0.195.27) for Windows	OK	OK	OK	-	-	-
Google Chrome (4.0.249.43) for Linux	OK	OK	OK	-	-	-
Konqueror 3.5	OK	broken	OK	-	-	-
Epiphany	OK	OK	OK	-	-	-
Lynx 2.8.5	OK	-	-	-	-	-
links 2.1	OK	OK	OK	-	-	-
Dillo 0.8.5	OK	-	-	-	-	-
w3m 0.5.1	OK	OK	OK	OK	OK	OK

gzip and deflate are using the same compression algorithm (called deflate), the differences are in header. gzip was designed for files and has longer header with heavier (slower to compute, more reliable) CRC checksum; I/O from/to disk (disk tapes, CD, etc.) is already slow, so overhead in gzip format is relatively (to disk access) small and good checksum might be important with file archiving, storing modification time also may make sense with files.
deflate (deflate-format, not deflate-algorithm) was designed for purposes such as compression of the web, where minimal overhead is more important and timestamp or file-header are superfluous.
Yet, gzip is used almost exclusively to compress the web, because there is a misreading of HTTP specification related to deflate; using gzip is not ambiguous. In other words, when HTTL specification says "deflate" it means something I here refer to as "deflate-http" (this is in fact: "ZLIB format"), while some implementations incorrectly produce (or expect) something I here refer to as "deflate-raw".

Many browsers now play safe and correctly handle deflate-http and deflate-raw when http-header says 'Content-Encoding: deflate'; browsers can distinguish deflate-http and deflate-raw by first two bytes. Some browsers, however, still expect only deflate-raw. Which is bad, according to HTTL spec, but which is good from our point of view: deflate-raw is the most space efficient.

deflate-raw doesn't have a checksum, but using compression without checksum is not dangerous, TCP already uses checksums. The checksum in the compression stream acts more like a debug thing here, in case a decompressor on client-side finds the checksums don't match: it cannot repair the contents, can only discard it.
I do use deflate-raw on few moderately visited web-pages since few years, and I have never had any problem with it (Opera, IE6/7, Firefox1.5/2, lynx, links2, Konqueror – all these were tested with it), but still: no browser (nor web-bot like search-engine) has to support this unofficial format; a next generation browser can break on it, a search-engine upgraded to support deflate compression (strictly with the official specification) could not read your site, and thus drop your site from the index.
Reality for deflate-raw doesn't look black in the meantime, the only web-agent I spotted (this is 2007) that breaks on raw-deflate is `linkchecker` utility, and I haven't found a browser, or search-engine, that would advertise support for deflate, and yet couldn't read deflate-raw.

Note: zlib's implementation (used in Apache, PHP, and many other projects, including closed-source) of gzip/deflate compression is not optimal when it comes to compression-ratio, even with max. compression level: it's not exhaustive in searching for patterns. If you don't need fast compression (e.g. you use caching of compressed data) you could squeeze better compression-ratio from using other code or a re-compression tool like AdvanceCOMP (1-4% saving of file-size with AdvanceCOMP is possible (in contrast to zlib implementation), but it's still not exhaustive; the only exhaustive deflate-compressor I'm aware was done by Ken Silverman (Linux version here) and further 0.5% of saving with it (in contrast to AdvanceCOMP) is usually possible.).
If you'll decide to use AdvanceCOMP with deflate or deflate-raw, or use KZIP.EXE with gzip or deflate or deflate-raw: you'll have to play with headers: internal compression is the same, but file headers are different (or lack of header in case of deflate-raw). I've written a simple program in C that extracts deflate-raw data from KZIP.EXE compressed file.

Output compression in practice: PHP

Adding basic support for output compression (gzip) to PHP scripts can be as simple as adding this line at the beginning of your code (e.g. to a header.php): ob_start('ob_gzhandler');

When you'll be implementing more advanced solutions, esp. while serving pre-compressed data, try to also return "Content-Length" HTTP header to avoid chunked Transfer-Encoding. Chunked encoding is used when total size of response is uknown, data is then send in chunks and every chunk is stamped with its size – it adds a bit of overhead.

Making easier work for gzip/deflate

The compression makes use of statistics and dictionary. Dictionary makes use of the fact that some words and phrases are repeated. Statistics makes use of the fact that on a web-page some characters are used more frequently than others, and some are not used at all. The shorter the list of characters on a web-page: the more compression gzip/deflate can squeeze. This may be of more importance in CSS where you can freely choose to use only lowercase characters (e.g. class names) and not use rare characters at all as class names. Also, with the compression, rare characters take more bits than those frequently used, so using several times frequently used character (like 'a') could take less bits than using one time a rarely used character like 'x'.

Graphics file optimization

In HTML, on the beginning of the page (i.e., part visible without scrolling down) specify height and width of every image (e.g. <img src=i.png width=55 height=39 alt=foo>), so space can be rendered, without further re-rendering, before images are downloaded. Replace graphic files with CSS/HTML if possible. Combine many graphic files into one (to avoid overhead of many requests), e.g. you can use background-position with negative values, in CSS, to use only part of an image in a place.

PNG

PNG uses the same compression as http (deflate), and again: the best possible deflate compressor (I'm aware of) was made by Ken Silverman (here is Linux version) – PNGOUT uses the compressor to recompress a .png file.
Before using Ken's compressor, it may be worth, using a program like GIMP, to change colors from true-RGB-color-mode (24bit per pixel) to indexed-mode (8bpp to 1bpp) if the image doesn't have more than 256 colors (you lose nothing then, and gain size-saving). For some true-color images, converting to 8bpp without sacrificing much of quality is possible with pngquant. Use lowest possible index: if a PNG file has 4 colors then use index with 4 values (not default 256) to have the lowest possible file-size. In GIMP, if the file is already indexed (often with 256 colors), you'll have to convert it to RGB and then you can convert it back to indexed-mode with specific index-value.

GIF

Converting from GIF to PNG (and optimizing as PNG file – see above) can be a small win. Otherwise, reducing index of colors to a lower value can significantly decrease file-size without decrease in quality (as long as the image itself contains no more colors than the index).

JPEG

Avoid changing and re-compressing JPEG files: JPEG is a lossy compression, after many changes and re-compressions and re-saving image will be of bad quality and file-size not necessary lower. It's good to have a reference file stored with lossless compression (e.g. TIFF, PNG). Blur part of the JPEG file that don't have to be sharp (e.g. background), blurring allows better JPEG compression. jpegoptim utility can sometimes squeeze a bit more JPEG compression without affecting quality. Avoid sharp lines in JPEG file; I once had a JPEG file with 1 pixel black frame on the top and both sides of the image – removing the "frame" allowed to decrease file-size to about 50%, and such frame can be made cheaply using CSS (img.withframe{border:1px black solid}).

Local I/O optimizations

Disk elevator algorithm

Linux (2.6.21) offers 4 I/O schedulers to choose from: no-op (no scheduler), deadline, anticipatory, CFQ (Complete Fair Queuing). If you use SCSI with TCQ (Tagged Command Queuing) or SATA with NCQ (Native Command Queuing), and have TCQ/NCQ turned-on, then electronics of disk itself should (in theory) do the best possible job at re-ordering i/o commands to have the highest possible efficiency (for all processes, generally), as it knows its own disk structure, so choosing a simple scheduler (e.g. deadline or even no-op) makes sense.

Excerpt from Anticipatory IO scheduler documentation (/usr/src/linux-2.6.19/Documentation/block/as-iosched.txt):

Database servers, especially those using "TCQ" disks should investigate performance with the 'deadline' IO scheduler.

On the other hand, if you have on the box processes with low priority that do a lot of i/o operations: it make sense to lower their i/o priority; only CFQ implements concept of i/o priorities. O/S cannot control TCQ/NCQ queue (with exception for possibility to issue high-priority commands in case of TCQ). A bad TCQ/NCQ design can do bad things, like delay some i/o commands for too long (see this example), so (ideally) make a benchmark to make sure your TCQ/NCQ design does a good work.

Note: TCQ adds a small overhead, so for light load where TCQ don't re-order commands (i.e., there is no queue), performance can be a bit worse than without TCQ.

In theory, with TCQ/NCQ, no-op should be the best, as you don't want to queue/delay commands in O/S, but issue them all to the disk ASAP, and disk itself should do the best job. CFQ should be good if on box some low-priority processes do a lot of i/o, to lower their i/o priority, but then you should turn-off TCQ/NCQ when using CFQ. If possible, off-load low-priority processes that do a lot of i/o to another machine.

Cache

TCQ vs NCQ
	when designed	interface	maximum queue	additional features	notes
TCQ	mid-1990	SCSI, some (rare) ATA	256, typically 64	1. high-priority command that should be done ASAP; 2. strict order	2 interrupts per command, thus higher latency
NCQ	~2002-'03	SATA	32		max. only 1 interrupt per command

Linux i/o schedulers
i/o scheduler	description
no-op	very simple, minimal overhead; a fifo queue that merge some requests (i.e., would merge requests A and B, if A and B relate to continuous area, that is: A is just after B, or B is just after A)
deadline	a simple scheduler; re-orders requests, to send many requests one-by-one, that apply to the same/near area of disk, but adds a deadline per request (read: 500ms, write: 5sec); favors reads over writes
anticipatory	the default i/o scheduler between in 2.6.0 - 2.6.17; assumes 1 physical head, bad for RAID, bad for TCQ/NCQ; it's a non-work-conserving scheduler, can make disk sit idle even though there is a request waiting, so it adds latency in some cases
CFQ	Complete Fair Queuing; the default i/o scheduler since 2.6.18; complicated; big advantage is that is uses concept of priorities per-process and per-user (see "ionice" command): e.g. you can make web-server have highest possible i/o priority

How to turn-on write-cache on a disk
system	disk interface	example command
Linux	IDE	hdparm -W1 /dev/hda
Linux	SCSI	sdparm --set WCE=1 /dev/sda

O/Ss are conservative when it comes to initial I/O settings and play safe, so even if your power will fail: almost all data saved just before a power-failure can be recovered. If you have a back-up power and/or don't mind loosing data from last minutes: you can tweak things here: data you write to disk don't have to be really written physically to disk immediately (taking time when client waits for a web-page), but can take place where there is an opportunity to do I/O operation cheaply/fast (e.g. we have some idle time or disk's head is already near the track anyway). Psychical I/O from/to disk is slow and have high latency, I/O from/to cache-memory is very fast.

Directories

Remove unnecessary files and directories from the web-directory, avoid long directories names, avoid placing the files in a deep directory structure.
Example: if a file is /var/www/foo.html then file-system starts with root and scans linearly (and compares) directories-names traversing linked-list until it finds "/var/", this inode points to contents of /var/ directory; now it have to traverse contents of /var/ directory to find inode pointing to "/var/www/"; at least two hard-disk seeks already (1:/var/, 2:/var/www/) uless the data is in a cache; when we're in /var/www/, we traverse file-names that are in /var/www/ until we finds foo.html file.
It's clear that root / directory should have as few directories as possible, /var/ directory should have as few directories as possible, and /var/www/ should have as few files as possible.
Also, directories and files are compared (most likely) using something like C function strcmp(), and strcmp() returns a value faster when closer to beginning of string there is a difference [benchmark needed]. In other words: try to make begining of file-names different, e.g. instead of abc1.html, abc2.html, abc3.html, it could be faster to access files where names would be a1bc.html, a2bc.html, a3bc.html (or even better, 1abc.html, 2abc.html, 3abc.html). The trick can be especially more useful with graphics files, where there are many graphics files in one directory and less human-friendly URL doesn't matter.

Some file-systems (e.g. ReiserFS) use (or, like ext3, use optionally) hashes, so above recommendations are less important and you can store huge number of files in one directory without worrying about access time.
Be warned though, that ext3 with hashes (i.e., ext3 mounted with dir_index option) is slower (than traditional directory traversing) when directory have few files [benchmarked on www.namesys.com/benchmarks.html, but the link is dead now (2008-05-28)].

Avoid symbolic links, use hard links: hard links point directly to an inode.
Avoid using ~ (tilde) in paths, ~ has to be resolved (e.g. using /etc/passwd file) to user's home directory.

Priorities

By default web-server usually runs with a default priority, so if another process with the same priority competes for CPU-time: it'll slow down web-server (unless another CPU, on SMP box, is idle). Many processes don't need high priority (e.g. mail-server): if they will be granted CPU time 200ms later, nobody will notice, so it makes sense to lower theirs priority and/or increase priority of web-server. Utilities like `nice` and `renice` can be used to (indirectly) alter priorities on UNIX/Linux. Don't be afraid of using maximum/minimum possible nice levels (20, -19), you can go even further than that, with caution, by using `chrt` command on Linux, which can be used to set real-time policy for a process (e.g. web-server). "nice level" is, in fact, not a priority, it's just a suggestion for O/S to lower/rise priority. Why with caution? Because process in real-time policy (SCHED_FIFO or SCHED_RR) can (e.g. by doing infinite loop) make all processes in SCHED_OTHER policy starve for cpu-time.

Process CPU-affinity on SMP system [Linux]

Linux and scheduling policies: a process can be in one of the following policy
policy	description
SCHED_OTHER	default Linux time-sharing scheduling; uses dynamic priority (from 100 to 140) that is influenced by "nice level" (see `nice` command) and process cpu-hungriness; process with minimum possible priority or nice level in this policy still will be granted cpu-time (from time-to-time) even if another process in SCHED_OTHER with maximum priority or nice level needs CPU-time 100% of the time.
SCHED_FIFO	uses static priority (from 0 to 99), concept of "nive level" doesn't apply here; if a process in SCHED_FIFO wants cpu-time, it'll immediately preempt all processes in SCHED_OTHER policy; process in SCHED_FIFO can be preempted only by another process in SCHED_FIFO/SCHED_RR that has higher priority
SCHED_RR	RR stands for round-robin; SCHED_RR is just like SCHED_FIFO, but each process is allowed only quantum of time before another process in SCHED_RR with the same static priority is granted cpu-time

In SMP system Linux makes good job at assigning processes to specific CPU and migrating them to less-busy CPU if there is a much less busy CPU. However, switching a process to another CPU is expensive, such switch loses data in L1/L2 caches, so in some (rare) cases it makes sense to assign some processes to specific CPUs manually. E.g. if you have two CPUs, have two important processes that often run at the same time (like MySQL and Apache) it could sometimes make sense to assign them to a different CPUs. Linux doesn't know that these two processes will run over-and-over always at the same time. You know, so you can use `taskset` command to bound apache to one cpu and mysql to other. (MySQL+Apache is a bad example here, but you get the idea).

IRQ-handler CPU-affinity on SMP system [Linux]

By the same token, you can assign IRQ-handler to specific CPU, so interrupt-handler-code will have a higher chance of being (at least partially) in L1/L2 cache of specific CPU. More on SMP IRQ affinity.

DMA

How to turn DMA for IDE disk
system	command
Linux	hdparm -d1 /dev/hda
*BSD	DMS is always turned on if device supports DMA, isn't it?
Windows	settings are in "Device Manager" in your IDE controller settings, if your IDE controller supports DMA

In case of IDE drive: make sure Direct-Memory-Access is turned on for your HDD, I/O without DMA takes much more CPU-time.

ECC

RAM with turned on Error-Correcting-Code can be slower. If you don't run a server with mission critical data, and speed is more important to you, on the tweaked server, than reliability, then you may consider running a benchmark to see if there is a better RAM performance with ECC turned off in BIOS and maybe eventually leaving ECC off. On my AP550 (2xPIII, i820) with Rambus memory, turning off "ECC Support" in BIOS resulted in memory faster by 5-25% (depending on sub-test) according to lmbench with writes/copies, but reads were not affected.
Others have found similar performance gain with ECC disabled:

CAS, RAS and others

Usually you shouldn't have to touch these memory settings in BIOS: modern motherboards detects SPD to set RAM parameters. But older motherboards (and some dumb not-so-old) set by default conservative settings and don't use SPD (and some offer extra parameters that you can tweak, that are not covered in SPD). If you know what your RAM parameters are (CAS is the most important) and have such a dumb motherboard, you can set settings by yourself. Google for "RAS, CAS" for more.

Last-access timestamp

O/Ss track last-access-time to every file. Most people never use this feature, yet they pay a small price of tracking last-access timestamp.

Specific filesystem tricks

ext3

How to turn off tracking last-access timestamp
system	example command
Linux/*BSD	mount partitions with noatime option, e.g. modify /etc/fstab, fourth option could look like this: defaults,noatime
Windows	fsutil behavior set disablelastaccess 1

data=writeback,commit=180 in fstab options may do the trick (faster I/O, cache(buffer?) metadata in memory up to 180 seconds), but please research it further before jumping-in with this change; data=orded is safer, yet in same cases [read-and-write at the same time?] is faster than writeback (in most: it's slower). If you decide using writeback, you may consider using also 'nobh' mount option – it'll make code not use a buffer that it's not needed and may result in less overhead, but also more I/O operations. Benchmark needed. Also, changing only options in /etc/fstab may be not enough with root partition, please read this before making the change to "writeback".

ReiserFS

NTFS

interesting ReiserFS mount options
option	comment
notail	can speed-up I/O to [and "from" too?] small files at the cost of being less space-efficient about storing small files
nolog	?
data=ordered / journal / writeback	ReiserFS already uses writeback (the fastest) as default

NTFS (in Windows NT/2000/XP/2003 at least) by default creates additional "8.3" names that add some overhead and are not needed at all until you use an old 16-bit application. Read how to turn it off on Microsoft's site.

Compiling

Don't use stock pre-compiled kernel (if you have the sources): it's compiled to work on many different computers, and you can benefit from having kernel optimized just for your hardware.
Recompiling your web-software, database and everything that is used during clients web-requests (if you have the sources) with latest version of your favorite compiler, with turned on maximum optimizations for your specific processor can also speed-up some things a bit. prelink can speed-up starting/forking binaries.

Load-balancing with DNS round-robin

You can have more than one A record for every FQDN (fully qualified domain name). For example, the web page you're reading was on two different IPs in 2007:
$ dig david.fullrecall.com
; [...]
;; ANSWER SECTION:
david.fullrecall.com. 86400 IN A 84.40.140.214
david.fullrecall.com. 86400 IN A 83.149.104.72

Such mirroring gives ability to spread traffic on more than one server, more than one link, and increases reliability: if one the links/servers is down, a web-browser cannot connect to it and tries another IP [it's is not the ideal failover mechanism though: if the broken server doesn't reply with RST packet, browser can wait for SYN+ACK for some time before a timeout; I'm also not certain if all browsers failover to another A record].
Browser connect initially to a random IP from the possible pool of addresses (at least this is end effect if you use BIND as nameserver: BIND hands the addresses in cyclic order, but can be changed to fully random). With such mirroring, and dynamic web-services, you obviously have to ensure efficient synchronization of the data between your servers.

If you have servers on geographically different locations, you can use geoip database in nameserver to return a range of IPs (or just one IP) that are close (geographically) to the visitor (google for GeoDNS, BIND, geoip if you are interested), this can significantly decrease latency, for visitors, for successive requests from the web-server.

HTTPS/SSL

Avoid https if you can: content transferred via https is never cached and there is more overhead (ciphering). A dated, but still interesting, paper about https' overhead: A Performance Analysis of Secure HTTP Protocol. Today's CPU should cope better with https, as they're faster; but on the other hand, we now use stronger cryptographic algorithms and they eat more CPU cycles.

Write concisely~~, omit not needed words~~

~~Try to~~ write concisely ~~as much as possible~~, you'll save the bandwidth and ~~other~~ people's time spend on reading ~~your webpage~~. Be to the point. But don't be too concise: don't force ~~your~~ readers to spend ~~their precious~~ time on guessing what you ~~really~~ meant.

Footnote

Please write me if you have additional tips or you spot incorrect information here.

website optimization tips – don't make me wait!

DNS