<?php
/**
 * <https://y.st./>
 * Copyright © 2015 Alex Yst <mailto:copyright@y.st>
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
 **/
$xhtml = array(
    'title' => 'The spider\'s first run',
    'body' => <<<END
<p>
For my new search engine to be at all effective, it needs to know how to handle relative $a[URI]s.
It obviously cannot request pages with them directly, so I built a function that takes a base $a[URI] and a relative $a[URI] and merges them to form a new absolute $a[URI].
I found that $a[PHP]'s <a href="https://php.net/manual/en/function.parse-url.php"><code>parse_url()</code> function</a> was very helpful for breaking $a[URI]s into their components so that they could be merged.
However, all accounting for <code>.</code> and <code>..</code> directories, as well as all processing of the $a[URI] components to form the new absolute $a[URI], had to be coded by hand.
It took several hours to get right, but I think that my new function now does what I need it to.
</p>
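<p>
In rough outline, the merging logic works something like this.
This is a simplified sketch with an illustrative name, not the actual function from my code; it ignores query strings, fragments, and a few path edge cases:
</p>
<pre><code>function merge_uris(\$base, \$relative)
{
    // A "relative" URI that already names a scheme is actually absolute.
    \$rel = parse_url(\$relative);
    if (isset(\$rel['scheme'])) {
        return \$relative;
    }
    \$b = parse_url(\$base);
    \$host = isset(\$rel['host']) ? \$rel['host'] : \$b['host'];
    \$base_path = isset(\$b['path']) ? \$b['path'] : '/';
    if (isset(\$rel['path']) and substr(\$rel['path'], 0, 1) === '/') {
        // An absolute path replaces the base path outright.
        \$path = \$rel['path'];
    } else {
        // A relative path is appended to the base path's directory.
        \$dir = substr(\$base_path, 0, strrpos(\$base_path, '/') + 1);
        \$path = \$dir . (isset(\$rel['path']) ? \$rel['path'] : '');
    }
    // Collapse "." and ".." segments, working left to right.
    \$output = array();
    foreach (explode('/', \$path) as \$segment) {
        if (\$segment === '.') {
            continue;
        } elseif (\$segment === '..') {
            if (count(\$output) > 1) {
                array_pop(\$output);
            }
        } else {
            \$output[] = \$segment;
        }
    }
    return \$b['scheme'] . '://' . \$host . implode('/', \$output);
}</code></pre>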
<p>
My next task was to find a way to locate hyperlinks in a downloaded page so that the $a[URI]s can be collected, normalized, and added to the database.
I thought that this task would be more difficult than the task of normalizing relative $a[URI]s, but I was pleasantly surprised.
$a[PHP]'s <a href="https://php.net/manual/en/function.xml-parse-into-struct.php"><code>xml_parse_into_struct()</code> function</a> performs most of the legwork.
This function's output will also make it easy to take into account the page's preferred base $a[URI], a feature that I had planned to add much later, but will now be able to build very early on.
I was worried that this would only work on $a[XHTML] pages, as $a[HTML] is not $a[XML]-compliant, but it also seems to work on the few $a[HTML] pages that I tested it on.
Like the <code>curl_*()</code> functions, the <code>xml_*()</code> functions require passing around a resource handle, so I wrapped them up in a class as well.
</p>
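<p>
For example, pulling the hyperlink targets out of a page takes little more than the following sketch.
The function name here is just for illustration, and by default, the parser folds tag and attribute names to upper case:
</p>
<pre><code>function extract_links(\$page)
{
    \$parser = xml_parser_create();
    xml_parse_into_struct(\$parser, \$page, \$values);
    xml_parser_free(\$parser);
    \$links = array();
    foreach (\$values as \$node) {
        // With default case folding, anchor tags come back as 'A'
        // and their attribute names as 'HREF'.
        if (\$node['tag'] === 'A') {
            if (isset(\$node['attributes']['HREF'])) {
                \$links[] = \$node['attributes']['HREF'];
            }
        }
    }
    return \$links;
}</code></pre>
<p>
The page's preferred base $a[URI] can be spotted in the same way; it is just another tag in the same array, so very little extra code should be needed.
</p>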
<p>
I quickly found a strange issue with the new class, though.
Each object instantiated from it can only be used to parse a single $a[XML] document.
I do not think that this is an error in the class, but rather an issue with the underlying $a[PHP] functions; in fact, I witnessed the same behavior when I removed my wrapper class.
</p>
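<p>
The behavior can be seen with the bare functions, assuming two pages are already sitting in variables; for now, the workaround is simply to give every document a freshly-created parser:
</p>
<pre><code>// The first document parses as expected.
\$parser = xml_parser_create();
xml_parse_into_struct(\$parser, \$first_page, \$values);
// A second document on the same parser yields nothing usable;
// the parser seems to be spent after one document.
xml_parse_into_struct(\$parser, \$second_page, \$values);
xml_parser_free(\$parser);</code></pre>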
<p>
I performed my first trial run of the search engine's spider today, and at first, it seemed to be doing very well.
However, it found a large file that someone had linked to and got stuck there.
I think that it ate all of my machine's memory too, as the machine ended up locking up entirely after a couple of hours of being stuck on this one file.
I considered screening the <code>Content-Type</code> headers of files before downloading them, but the particular file that clogged up the spider claims to be of type <code>text/plain; charset=UTF-8</code>.
It was not a text file though; it was an XZ-compressed file.
Headers cannot always be trusted, as servers can be misconfigured.
However, I think that the real threat is not misconfigured servers, but maliciously-configured servers.
I should not rely on headers for anything as important as keeping the spider unclogged.
There does not seem to be a direct way to limit the download size, but someone on $a[IRC] gave me <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">a hint</a> as to how to set a download limit in a less direct way.
It seems a little confusing to me, potentially because it has been a long day, so I will try working with this again tomorrow.
</p>
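<p>
If I am reading the hint correctly, the idea is to have cURL call a progress function during the transfer and to abort from inside that function once too much data has arrived.
Something like this seems right, though it is untested and the one-mebibyte limit is only a placeholder:
</p>
<pre><code>\$limit = 1048576; // Give up after one mebibyte.
\$curl = curl_init('https://example.com/');
curl_setopt(\$curl, CURLOPT_RETURNTRANSFER, true);
// The progress callback only fires when CURLOPT_NOPROGRESS is off.
curl_setopt(\$curl, CURLOPT_NOPROGRESS, false);
// Returning a non-zero value from the callback aborts the transfer.
curl_setopt(\$curl, CURLOPT_PROGRESSFUNCTION,
    function (\$handle, \$expected, \$downloaded, \$upload_size, \$uploaded) use (\$limit) {
        return (\$downloaded > \$limit) ? 1 : 0;
    });
\$page = curl_exec(\$curl);
curl_close(\$curl);</code></pre>
<p>
When the callback aborts the transfer, <code>curl_exec()</code> should simply report failure, which the spider can treat like any other download error.
</p>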
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
END
);