
<?php
/**
* <https://y.st./>
* Copyright © 2015 Alex Yst <mailto:copyright@y.st>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program. If not, see <https://www.gnu.org./licenses/>.
**/
$xhtml = array(
'title' => 'The spider&apos;s first run',
'body' => <<<END
<p>
For my new search engine to be at all effective, it needs to know how to handle relative $a[URI]s.
It obviously cannot request pages with them directly, so I built a function that takes a base $a[URI] and a relative $a[URI] and merges them to form a new absolute $a[URI].
I found that $a[PHP]&apos;s <a href="https://php.net/manual/en/function.parse-url.php"><code>parse_url()</code> function</a> was very helpful for breaking $a[URI]s into their components so that they could be merged.
However, all accounting for <code>.</code> and <code>..</code> directories, as well as all processing of the $a[URI] components to form the new absolute $a[URI], had to be coded by hand.
It took several hours to get it right, but I think that my new function does what I need it to now.
</p>
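<p>
For illustration, here is a minimal sketch of the general approach as a hypothetical <code>merge_uris()</code> function, not my actual code; real merging has to handle more components, such as query strings, fragments, and missing parts, than shown here:
</p>
<pre><code>&lt;?php
// A simplified sketch only, not my actual merging function.
function merge_uris(\$base, \$relative)
{
    if(parse_url(\$relative, PHP_URL_SCHEME) !== null) {
        return \$relative; // Already absolute; nothing to merge.
    }
    \$b = parse_url(\$base);
    \$r = parse_url(\$relative);
    \$host = isset(\$r['host']) ? \$r['host'] : \$b['host'];
    if(isset(\$r['path']) and substr(\$r['path'], 0, 1) === '/') {
        \$path = \$r['path']; // Rooted path; the base path is ignored.
    } else {
        // Drop the base's file name, then append the relative path.
        \$path = substr(\$b['path'], 0, strrpos(\$b['path'], '/') + 1);
        \$path .= isset(\$r['path']) ? \$r['path'] : '';
    }
    // Collapse "." and ".." segments by walking the path.
    \$segments = array();
    foreach(explode('/', \$path) as \$segment) {
        if(\$segment === '..') {
            array_pop(\$segments);
        } elseif(\$segment !== '.') {
            \$segments[] = \$segment;
        }
    }
    return \$b['scheme'].'://'.\$host.implode('/', \$segments);
}</code></pre>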
<p>
My next task was to find a way to locate hyperlinks in a downloaded page so that the $a[URI]s can be collected, normalized, and added to the database.
I thought that this task would be more difficult than normalizing relative $a[URI]s, but I was pleasantly surprised.
$a[PHP]&apos;s <a href="https://php.net/manual/en/function.xml-parse-into-struct.php"><code>xml_parse_into_struct()</code> function</a> performs most of the leg work.
This function&apos;s output will also make it easy to take into account the page&apos;s preferred base $a[URI], a feature that I had planned to add much later, but will now be able to build very early on.
I was worried that this would only work on $a[XHTML] pages, as $a[HTML] is not $a[XML]-compliant, but it also seems to work on the few $a[HTML] pages that I tested it on.
Like the <code>curl_*()</code> functions, the <code>xml_*()</code> functions require passing around a resource handle, so I wrapped them up in a class as well.
</p>
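<p>
As a rough sketch, pulling the <code>href</code> values out of a page with this function can look something like the hypothetical <code>extract_hrefs()</code> below; my actual code wraps the parser in a class and does more normalization:
</p>
<pre><code>&lt;?php
// A sketch only; it assumes reasonably well-formed markup.
function extract_hrefs(\$page)
{
    \$parser = xml_parser_create();
    // Keep tag and attribute names in their original case.
    xml_parser_set_option(\$parser, XML_OPTION_CASE_FOLDING, false);
    xml_parse_into_struct(\$parser, \$page, \$tags);
    xml_parser_free(\$parser);
    \$hrefs = array();
    foreach(\$tags as \$tag) {
        if(\$tag['tag'] === 'a' and isset(\$tag['attributes']['href'])) {
            \$hrefs[] = \$tag['attributes']['href'];
        }
    }
    return \$hrefs;
}</code></pre>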
<p>
I quickly found a strange issue with the new class, though.
Each object instantiated from it can only be used to parse a single $a[XML] document.
I do not think that this is an error in the class, but rather an issue with the underlying $a[PHP] functions; in fact, I witnessed the same behavior when I removed my wrapper class.
</p>
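<p>
The practical consequence, as far as I can tell, is that a fresh parser has to be created for each document; a minimal sketch, assuming a hypothetical <code>\$documents</code> array of downloaded pages:
</p>
<pre><code>&lt;?php
// \$documents is a placeholder for whatever pages the spider holds.
foreach(\$documents as \$document) {
    \$parser = xml_parser_create(); // One new parser per document.
    xml_parse_into_struct(\$parser, \$document, \$tags);
    xml_parser_free(\$parser); // Release the resource right away.
    // Process \$tags here; reusing \$parser on the next document fails.
}</code></pre>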
<p>
I performed my first trial run of the search engine&apos;s spider today, and at first, it seemed to be doing very well.
However, it found a large file that someone had linked to and got stuck there.
I think that it ate all of my machine&apos;s memory, too, as the machine ended up locking up entirely after a couple of hours of being stuck on this one file.
I considered screening the <code>Content-Type</code> headers of files before downloading them, but the particular file that clogged up the spider claims to be of type <code>text/plain; charset=UTF-8</code>.
It was not a text file, though; it was an XZ-compressed file.
Headers cannot always be trusted, as servers can be misconfigured.
However, I think that the real threat is not misconfigured servers, but maliciously-configured ones.
I should not rely on headers for anything as important as keeping the spider unclogged.
There does not seem to be a direct way to limit file download size, but someone on $a[IRC] gave me <a href="https://stackoverflow.com/questions/17641073/how-to-set-a-maximum-size-limit-to-php-curl-downloads">a hint</a> as to how to set a download limit in a less direct way.
It seems a little confusing to me, potentially because it has been a long day, so I will try working with it again tomorrow.
</p>
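<p>
From what I understand of the hint so far, the trick seems to be to enable cURL&apos;s progress callback and abort the transfer from inside it once too many bytes have arrived.
A rough sketch of that, with an arbitrary one-mebibyte limit and a placeholder $a[URI]:
</p>
<pre><code>&lt;?php
\$limit = 1048576; // Arbitrary example limit: 1 MiB.
\$ch = curl_init('https://example.com./some-file');
curl_setopt(\$ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt(\$ch, CURLOPT_BUFFERSIZE, 8192); // Check the limit often.
curl_setopt(\$ch, CURLOPT_NOPROGRESS, false); // Enable the callback.
curl_setopt(\$ch, CURLOPT_PROGRESSFUNCTION,
    function(\$ch, \$expected, \$downloaded, \$upsize, \$uploaded) use (\$limit) {
        // Returning a nonzero value makes cURL abort the transfer.
        return (\$downloaded > \$limit) ? 1 : 0;
    });
\$page = curl_exec(\$ch);
if(curl_errno(\$ch) === CURLE_ABORTED_BY_CALLBACK) {
    \$page = false; // Too large; skip this page.
}
curl_close(\$ch);</code></pre>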
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
END
);