
<?php
/**
 * <https://y.st./>
 * Copyright © 2016 Alex Yst <mailto:copyright@y.st>
 *
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
 **/
$xhtml = array(
'title' => "An error in the handling of relative $a[URI]s",
'body' => <<<END
<p>
I awoke this morning to find that the spider had choked on a bad $a[URI] that it had pulled from its own database.
It assumes that all $a[URI]s in its database are valid, but somehow, it had managed to put an invalid $a[URI] there.
Before attempting to diagnose the problem, though, I made sure to check my email to see if the school had written me back.
They had not.
With that out of the way, I ran a query against the database to find the offending page so that I could run tests on my <code>merge_uris()</code> function, which had to have returned invalid results for this to happen.
The link was to <code>https://5jp7xtmox6jyoqd5.onion</code>, so I just searched for the page that linked to it.
Only one result came up: <a href="http://52wdeibt3ivmcapq.onion/darknet.html"><code>http://52wdeibt3ivmcapq.onion/darknet.html</code></a>.
This page contains many technically invalid $a[URI]s that my spider should have successfully sanitized.
Or rather, my <code>merge_uris()</code> function should have successfully sanitized them.
Running another query against the database, I found that this page alone had been allowed to add eight $a[URI]s with no path component to my database: $a[URI]s that are technically invalid and which, as a feature rather than a bug, would choke <code>merge_uris()</code> when used as the &quot;absolute $a[URI]&quot; parameter of that function.
But before reaching the database, they should have been given as the &quot;relative $a[URI]&quot; parameter, causing the function to return each $a[URI] with a slash at the end, making them valid.
To make sure that it was in fact a bug in the function and not the spider, I tested the function on every invalid $a[URI] that the page had added to my database.
The function returned the expected incorrect result every time.
</p>
<p>
I added a call to <code>\\var_dump()</code> right before the return statement, but it was not getting executed.
Searching for another spot where the function returns, I immediately found the problem.
If the relative $a[URI] has a different scheme than the absolute $a[URI] that it is merged with, the relative $a[URI] is assumed to be absolute.
In theory, this should be accurate.
If the scheme is different, relative $a[URI]s should not be used at all, so the hyperlink should point to an absolute $a[URI].
In practice though, some webmasters do not respect this, or do not even know about it.
It does not help that many Web browsers stupidly hide the path of a $a[URI] when the path is only a slash.
All this does is breed ignorance of the fact that the $a[URI] even has a trailing slash, resulting in more people writing bad hyperlinks, such as those on the page that messed up my database.
That said, my function should account for bad input as well, at least if the bad input is said to be a relative $a[URI].
This function&apos;s whole purpose is to take incomplete $a[URI]s and make them whole in an automated way.
I think I managed to fix the function, and I then used a short script to repair the damage so that I would not have to throw out the whole database.
</p>
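<p>
Roughly speaking, the fix boils down to something like the sketch below.
This is not my actual code, and the function shown here ignores query strings, fragments, and most of the real merging work, but it illustrates the point: even when the supposedly-relative $a[URI] turns out to carry its own scheme, an empty path still needs to be turned into a single slash before the $a[URI] is returned.
</p>
<pre><code>// Simplified sketch only; the function and variable names are made up.
function merge_uris_sketch(\$absolute, \$relative)
{
  \$parts = parse_url(\$relative);
  if(isset(\$parts['scheme'])) {
    // The relative parameter carries its own scheme, so it is really absolute.
    // The old code returned it untouched; now an empty path becomes a single slash.
    if(empty(\$parts['path'])) {
      \$relative .= '/';
    }
    return \$relative;
  }
  // ... the real merging of a true relative reference against \$absolute goes here ...
  return \$absolute;
}</code></pre>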
<p>
In the process of scanning the database for errors, I found several erroneous $a[IRC] $a[URI]s, which got me thinking.
At some point, I should scan the database and see how many onion-based $a[IRC] networks I can find.
</p>
<p>
Upon the next run of the spider, I found that it was requesting $a[FTP] and Gopher pages.
I thought that I had implemented a protocol whitelist and only allowed $a[HTTPS]- and $a[HTTP]-based $a[URI]s, but it seems that I neglected to make the spider actually check the whitelist.
That means that the spider will be requesting pages that it does not know how to handle yet.
However, because of the new logic flow that allows use of the MySQL database, if I fix the protocol whitelist feature, the spider will loop endlessly.
</p>
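<p>
When I get around to fixing this, the check will probably look something like the sketch below; this is only an illustration of the idea, not code from the spider, and what each branch does here is made up.
</p>
<pre><code>// Rough sketch of per-scheme handling built around a switch() statement.
switch(parse_url(\$uri, PHP_URL_SCHEME)) {
  case 'http':
  case 'https':
    // Fetch and parse the page as usual.
    break;
  default:
    // Mark the URI as handled without requesting it, so that the spider
    // does not keep coming back to it forever.
    break;
}</code></pre>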
<p>
Having needed to update my base library, I put aside work on the spider, despite its current issues, to work on a wrapper class, as I had agreed to include at least one new wrapper class in every update.
I have also decided to restructure the library version numbers a bit.
Currently, the version numbers are increasing quickly, making it look like I am making more progress on it than I am.
I will also be holding the version numbers back a bit until progress catches up with the version number.
After building the <a href="https://secure.php.net/manual/en/ref.fdf.php">FDF</a> wrapper class, I started work on a <a href="https://secure.php.net/manual/en/ref.ftp.php">$a[FTP]</a> wrapper class.
I found though that some $a[FTP] functions rely on file resources, so I would need to complete a <a href="https://secure.php.net/manual/en/function.fopen.php">file</a> wrapper class first.
However, for this wrapper class, I would need a wrapper class for stream resources.
Looking into stream resources, I found that there is a documented prototype for implementing <a href="https://secure.php.net/manual/en/class.streamwrapper.php">stream objects</a>.
My understanding of this class prototype though is that it does not meet my goals.
Instead of wrapping up a stream resource with its related functions, it replaces stream resources altogether.
I might try creating a class that both implements this prototype and wraps up the functions I want wrapped up.
This class prototype though, and probably stream resources in general, requires stream context support, so I moved on and built a class wrapping stream context resources.
</p>
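<p>
As a taste of what I mean by wrapping a resource, the sort of thing I have in mind looks roughly like the sketch below; the class and method names here are made up for this entry and are not taken from my library.
</p>
<pre><code>// Rough sketch of a wrapper around a stream context resource.
class StreamContextSketch
{
  private \$context;

  public function __construct(array \$options = array(), array \$params = array())
  {
    // Hold the underlying resource so that the related functions become methods.
    \$this->context = stream_context_create(\$options, \$params);
  }

  public function setOption(\$wrapper, \$option, \$value)
  {
    return stream_context_set_option(\$this->context, \$wrapper, \$option, \$value);
  }

  public function getResource()
  {
    return \$this->context;
  }
}</code></pre>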
<p>
For my own reference, current work that needs to be done on the spider includes:
</p>
<ul>
<li>find a way to avoid trying to crawl uncrawlable $a[URI]s without creating an endless loop</li>
<li>replace mis-implemented protocol whitelist with a <code>switch()</code> statement to allow handling different protocols in different ways</li>
<li>restructure the program flow so that <code>&lt;a/&gt;</code>s are scanned for and saved before the <code>&lt;title/&gt;</code> is recorded, making spider interruptions no longer detrimental</li>
<li>fix handling of <code>&lt;a/&gt;</code>s that have child nodes (see the sketch after this list)</li>
</ul>
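<p>
For that last item, the rough idea is that an <code>&lt;a/&gt;</code> such as <code>&lt;a href=&quot;/foo&quot;&gt;&lt;em&gt;foo&lt;/em&gt;&lt;/a&gt;</code> has its link text buried in child nodes rather than sitting directly in the <code>&lt;a/&gt;</code> itself.
The sketch below shows one way to cope with that using PHP&apos;s DOM extension; it is only an illustration, not code from the spider, and the variable names are made up.
</p>
<pre><code>// Illustration only: pull the href and the flattened link text out of each
// anchor, even when the anchor's text lives inside child elements.
\$document = new DOMDocument();
@\$document->loadHTML(\$page_body);
foreach(\$document->getElementsByTagName('a') as \$anchor) {
  \$href = \$anchor->getAttribute('href');
  \$text = \$anchor->textContent; // includes text from all descendant nodes
  // ... hand \$href and \$text to the rest of the spider ...
}</code></pre>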
<p>
My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
END
);