XRay
====

XRay parses structured content from a URL.

## Discovering Content

The contents of the URL are checked in the following order:

* A silo URL from one of the following websites:
  * Instagram
  * Twitter
  * GitHub
  * XKCD
  * Facebook (public events)
  * (more coming soon)
* Microformats
  * h-card
  * h-entry
  * h-event
  * h-review
  * h-recipe
  * h-product
  * h-item

## Library

XRay can be used as a library in your PHP project. The easiest way to install it and its dependencies is via Composer.

```
composer require p3k/xray
```

### Parsing

```php
$xray = new p3k\XRay();
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/');
```

If you already have an HTML or JSON document you want to parse, you can pass it as a string in the second parameter.

```php
$xray = new p3k\XRay();
$html = '<html>....</html>';
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html);
```

In both cases, you can pass an additional array parameter to configure how XRay behaves. The supported options are:

* `timeout` - The timeout in seconds to wait for any HTTP requests
* `max_redirects` - The maximum number of redirects to follow
* `include_original` - Will also return the full document fetched
* `target` - Specify a target URL, and XRay will first check whether that URL is on the page and will continue parsing only if it is. This is useful when you're using XRay to verify an incoming webmention.

Additional credential parameters are supported when making requests that use the Twitter or GitHub API. See the authentication section below for details.

```php
$xray = new p3k\XRay();

// Options only
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', [
  'timeout' => 30
]);

// Already-fetched document plus options
$parsed = $xray->parse('https://aaronparecki.com/2017/04/28/9/', $html, [
  'target' => 'http://example.com/'
]);
```

The `$parsed` return value will look like the below. See "Primary Data" below for an explanation of the vocabularies returned.

```
$parsed = Array
(
    [data] => Array
        (
            [type] => card
            [name] => Aaron Parecki
            [url] => https://aaronparecki.com/
            [photo] => https://aaronparecki.com/images/profile.jpg
        )
    [url] => https://aaronparecki.com/
    [code] => 200
)
```

### Rels

You can also use XRay to fetch all the rel values on a page, merging the rel values from HTTP `Link` headers with the HTML rel values on the page.

```php
$xray = new p3k\XRay();
$rels = $xray->rels('https://aaronparecki.com/');
```

This will return a similar response to the parser, but instead of a `data` key containing the parsed page, there will be `rels`, an associative array. Each key will contain an array of all the values that match that rel value.

```
$rels = Array
(
    [url] => https://aaronparecki.com/
    [code] => 200
    [rels] => Array
        (
            [hub] => Array
                (
                    [0] => https://switchboard.p3k.io/
                )
            [authorization_endpoint] => Array
                (
                    [0] => https://aaronparecki.com/auth
                )
            ...
```
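
Because each rel maps to an array of values, callers often want just the first value for a given rel. A minimal sketch of such a lookup, assuming the response has the shape shown above; `firstRel()` is a hypothetical helper and not part of XRay:

```php
<?php
// Hypothetical helper (not part of XRay): return the first value for a rel,
// or null if the rel is missing from the response.
function firstRel(array $response, string $rel): ?string {
  $values = $response['rels'][$rel] ?? [];
  return $values[0] ?? null;
}

// Sample data copied from the response shown above.
$response = [
  'url' => 'https://aaronparecki.com/',
  'code' => 200,
  'rels' => [
    'hub' => ['https://switchboard.p3k.io/'],
    'authorization_endpoint' => ['https://aaronparecki.com/auth'],
  ],
];

echo firstRel($response, 'hub'), "\n";
var_dump(firstRel($response, 'webmention')); // this sample contains no webmention rel
```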

### Customizing the User Agent

Some websites require a user agent to be set. To set a unique user agent, set the `http` property of the object to a `p3k\HTTP` object.

```php
$xray = new p3k\XRay();
$xray->http = new p3k\HTTP('MyProject/1.0.0 (http://example.com/)');
$xray->parse('http://example.com/');
```

## API

XRay can also be used as an API to provide its parsing capabilities over an HTTP service.

To parse a page and return structured data for the contents of the page, simply pass a url to the parse route.

```
GET /parse?url=https://aaronparecki.com/2016/01/16/11/
```

To conditionally parse the page after first checking if it contains a link to a target URL, also include the target URL as a parameter. This is useful when using XRay to verify an incoming webmention.

```
GET /parse?url=https://aaronparecki.com/2016/01/16/11/&target=http://example.com
```
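
The URLs in these examples are shown unencoded for readability. When building the request in code, it is safer to percent-encode the `url` and `target` values so that any query string they carry is not mistaken for a parameter of the parse route. A sketch using PHP's `http_build_query`; the base URL `https://xray.example.com` and the function `buildParseUrl()` are placeholders for illustration, not part of XRay:

```php
<?php
// Hypothetical helper: build a GET URL for the parse route.
function buildParseUrl(string $base, string $url, ?string $target = null): string {
  $params = ['url' => $url];
  if ($target !== null) {
    $params['target'] = $target;
  }
  // http_build_query percent-encodes each value.
  return rtrim($base, '/') . '/parse?' . http_build_query($params, '', '&');
}

echo buildParseUrl(
  'https://xray.example.com', // placeholder for wherever your XRay instance runs
  'https://aaronparecki.com/2016/01/16/11/',
  'http://example.com'
), "\n";
```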

In both cases, the response will be a JSON object containing a key of "type". If there was an error, "type" will be set to the string "error", otherwise it will refer to the kind of content that was found at the URL, most often "entry".

You can also make a POST request with the same parameter names.

If you already have an HTML or JSON document you want to parse, you can include that in the parameter `body`. This POST request would look like the below:

```
POST /parse
Content-type: application/x-www-form-urlencoded

url=https://aaronparecki.com/2016/01/16/11/
&body=<html>....</html>
```

or for Twitter/GitHub/Facebook where you might have JSON,

```
POST /parse
Content-type: application/x-www-form-urlencoded

url=https://github.com/aaronpk/XRay
&body={"repo":......}
```
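
The request bodies above are split across lines for readability; on the wire, a form-encoded body is a single percent-encoded string. A sketch of building such a request with PHP's stream-context options, assuming a placeholder endpoint; `buildParsePost()` is a hypothetical helper:

```php
<?php
// Hypothetical helper: build stream-context options for a form-encoded POST
// to the parse route.
function buildParsePost(string $url, string $body): array {
  return [
    'http' => [
      'method' => 'POST',
      'header' => 'Content-Type: application/x-www-form-urlencoded',
      'content' => http_build_query(['url' => $url, 'body' => $body], '', '&'),
      'ignore_errors' => true, // keep the response body readable on 4xx/5xx
    ],
  ];
}

$options = buildParsePost('https://aaronparecki.com/2016/01/16/11/', '<html>....</html>');
echo $options['http']['content'], "\n";

// To send it (the endpoint below is a placeholder):
// $json = file_get_contents('https://xray.example.com/parse', false, stream_context_create($options));
```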

### Authentication

If the URL you are fetching requires authentication, include the access token in the parameter "token", and it will be included in an "Authorization" header when fetching the URL. (It is recommended to use a POST request in this case, to avoid the access token potentially being logged as part of the query string.) This is useful for [Private Webmention](https://indieweb.org/Private-Webmention) verification.

```
POST /parse

url=https://aaronparecki.com/2016/01/16/11/
&target=http://example.com
&token=12341234123412341234
```

### API Authentication

XRay uses the Twitter, GitHub and Facebook APIs to fetch posts, and those APIs require authentication. In order to keep XRay stateless, you must pass in the credentials with each parse call.

You should only send the credentials when the URL you are trying to parse is a Twitter URL, a GitHub URL or a Facebook URL, so you'll want to check whether the hostname is `twitter.com`, `github.com`, etc. before you include credentials in this call.

#### Twitter Authentication

XRay uses the Twitter API to fetch Twitter URLs. You can register an application on the Twitter developer website, generate an access token for your account without writing any code, and then use those credentials when making an API request to XRay.

* `twitter_api_key` - Your application's API key
* `twitter_api_secret` - Your application's API secret
* `twitter_access_token` - Your Twitter access token
* `twitter_access_token_secret` - Your Twitter secret access token

#### GitHub Authentication

XRay uses the GitHub API to fetch GitHub URLs, which provides higher rate limits when used with authentication. You can pass a GitHub access token along with the request and XRay will use it when making requests to the API.

* `github_access_token` - A GitHub access token

#### Facebook Authentication

XRay uses the Facebook API to fetch Facebook URLs. You can create a Facebook App on Facebook's developer website.

* `facebook_app_id` - Your application's App ID
* `facebook_app_secret` - Your application's App Secret

At this moment, XRay is able to get its own access token from those credentials.
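
The hostname check recommended above can be sketched as follows. The host list, the stripping of `www.` and `m.` prefixes, and the function name `credentialsFor()` are all assumptions made for illustration; only the parameter names come from the lists above.

```php
<?php
// Hypothetical helper: choose which credentials to send based on the URL's host.
// The host list is an assumption; adjust it to the hostnames you expect.
function credentialsFor(string $url, array $credentials): array {
  $host = strtolower((string) parse_url($url, PHP_URL_HOST));
  $host = preg_replace('/^(www|m)\./', '', $host);

  $prefixes = [
    'twitter.com' => 'twitter_',
    'github.com' => 'github_',
    'facebook.com' => 'facebook_',
  ];
  if (!isset($prefixes[$host])) {
    return []; // not a recognized silo URL: send no credentials at all
  }

  // Keep only the parameters whose names start with the matching prefix.
  $prefix = $prefixes[$host];
  return array_filter(
    $credentials,
    fn ($key) => strpos($key, $prefix) === 0,
    ARRAY_FILTER_USE_KEY
  );
}

$all = [
  'twitter_api_key' => 'placeholder',
  'twitter_api_secret' => 'placeholder',
  'twitter_access_token' => 'placeholder',
  'twitter_access_token_secret' => 'placeholder',
  'github_access_token' => 'placeholder',
];

print_r(credentialsFor('https://github.com/aaronpk/XRay', $all)); // GitHub token only
print_r(credentialsFor('https://aaronparecki.com/', $all));       // empty array
```

Returning an empty array for unrecognized hosts keeps credentials from leaking into requests for ordinary web pages.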

### Error Response

```json
{
  "error": "not_found",
  "error_description": "The URL provided was not found"
}
```

Possible errors are listed below:

* `not_found`: The URL provided was not found. (Returned 404 when fetching)
* `ssl_cert_error`: There was an error validating the SSL certificate. This may happen if the SSL certificate has expired.
* `ssl_unsupported_cipher`: The web server does not support any of the SSL ciphers known by the service.
* `timeout`: The service timed out trying to connect to the URL.
* `invalid_content`: The content at the URL was not valid. For example, providing a URL to an image will return this error.
* `no_link_found`: The target link was not found on the page. When a target parameter is provided, this is the error that will be returned if the target could not be found on the page.
* `no_content`: No usable content could be found at the given URL.
* `unauthorized`: The URL returned HTTP 401 Unauthorized.
* `forbidden`: The URL returned HTTP 403 Forbidden.
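
A client will typically check for the `error` key before touching `data`. A minimal sketch, assuming the error shape shown in the JSON sample above; `decodeXRayResponse()` is a hypothetical helper and the choice of exception classes is arbitrary:

```php
<?php
// Hypothetical helper: decode a response body and throw if it reports an error.
function decodeXRayResponse(string $json): array {
  $response = json_decode($json, true);
  if (!is_array($response)) {
    throw new UnexpectedValueException('Response was not a JSON object');
  }
  if (isset($response['error'])) {
    $description = $response['error_description'] ?? '';
    throw new RuntimeException($response['error'] . ': ' . $description);
  }
  return $response;
}

try {
  decodeXRayResponse('{"error": "not_found", "error_description": "The URL provided was not found"}');
} catch (RuntimeException $e) {
  echo $e->getMessage(), "\n"; // not_found: The URL provided was not found
}
```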

### Response Format

```json
{
  "data":{
    "type":"entry",
    "published":"2017-03-01T19:00:33-08:00",
    "url":"https://aaronparecki.com/2017/03/01/14/hwc",
    "category":[
      "indieweb",
      "hwc"
    ],
    "photo":[
      "https://aaronparecki.com/2017/03/01/14/photo.jpg"
    ],
    "syndication":[
      "https://twitter.com/aaronpk/status/837135519427395584"
    ],
    "content":{
      "text":"Hello from Homebrew Website Club PDX! Thanks to @DreamHost for hosting us! 🍕🎉 #indieweb",
      "html":"Hello from Homebrew Website Club PDX! Thanks to <a href=\"https://twitter.com/DreamHost\">@DreamHost</a> for hosting us! <a href=\"https://aaronparecki.com/emoji/%F0%9F%8D%95\">🍕</a><a href=\"https://aaronparecki.com/emoji/%F0%9F%8E%89\">🎉</a> <a href=\"https://aaronparecki.com/tag/indieweb\">#indieweb</a>"
    },
    "author":{
      "type":"card",
      "name":"Aaron Parecki",
      "url":"https://aaronparecki.com/",
      "photo":"https://aaronparecki.com/images/profile.jpg"
    }
  },
  "url":"https://aaronparecki.com/2017/03/01/14/hwc",
  "code":200
}
```

#### Primary Data

The primary object on the page is returned in the `data` property. This will indicate the type of object (e.g. `entry`), and will contain the vocabulary's properties that it was able to parse from the page.

If a property supports multiple values, it will always be returned as an array. The following properties support multiple values:

* in-reply-to
* like-of
* repost-of
* bookmark-of
* syndication
* photo (of entry, not of a card)
* video
* audio
* category

The content will be an object that always contains a "text" property and may contain an "html" property if the source document published HTML content. The "text" property must always be HTML escaped before displaying it as HTML, as it may include unescaped characters such as `<` and `>`.
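
The escaping rule above can be sketched as follows. `contentToHtml()` is a hypothetical helper; it assumes the "html" value comes from a source you trust or sanitize separately, which this document does not address.

```php
<?php
// Hypothetical helper: turn a content object into a string that can be output as HTML.
// Assumption: any "html" value is trusted or sanitized elsewhere.
function contentToHtml(array $content): string {
  if (isset($content['html'])) {
    return $content['html'];
  }
  // Plain text may contain < and >, so escape it before output.
  return htmlspecialchars($content['text'] ?? '', ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

echo contentToHtml(['text' => 'Use <b> & </b> for bold']), "\n";
// Use &lt;b&gt; &amp; &lt;/b&gt; for bold
```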

The author will always be set in the entry if available. The service follows the [authorship discovery](http://indiewebcamp.com/authorship) algorithm to try to find the author information elsewhere on the page if it is not inside the entry in the source document.

All URLs provided in the output are absolute URLs. If the source document contains a relative URL, it will be resolved first.

In a future version, replies, likes, reposts, etc. of this post will be included if they are listed on the page.

```json
{
  "data": {
    "type": "entry",
    ...
    "like": [
      {
        "type": "cite",
        "author": {
          "type": "card",
          "name": "Thomas Dunlap",
          "photo": "https://s3-us-west-2.amazonaws.com/aaronparecki.com/twitter.com/9055c458a67762637c0071006b16c78f25cb610b224dbc98f48961d772faff4d.jpeg",
          "url": "https://twitter.com/spladow"
        },
        "url": "https://twitter.com/aaronpk/status/688518372170977280#favorited-by-16467582"
      }
    ],
    "comment": [
      {
        "type": "cite",
        "author": {
          "type": "card",
          "name": "Poetica",
          "photo": "https://s3-us-west-2.amazonaws.com/aaronparecki.com/twitter.com/192664bb706b2998ed42a50a860490b6aa1bb4926b458ba293b4578af599aa6f.png",
          "url": "http://poetica.com/"
        },
        "url": "https://twitter.com/poetica/status/689045331426803712",
        "published": "2016-01-18T03:23:03-08:00",
        "content": {
          "text": "@aaronpk @mozillapersona thanks very much! :)"
        }
      }
    ]
  }
}
```

#### Other Properties

Other properties are returned in the response at the same level as the `data` property.

* `url` - The effective URL that the document was retrieved from. This will be the final URL after following any redirects.
* `code` - The HTTP response code returned by the URL. Typically this will be 200, but if the URL returned an alternate HTTP code that also included an h-entry (such as a 410 deleted notice with a stub h-entry), you can use this to find out that the original URL was actually deleted.
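
For instance, a consumer that caches posts might drop its copy when the source reports a deletion. A small sketch; `isDeleted()` is a hypothetical helper and the sample response is made up for illustration:

```php
<?php
// Hypothetical helper: report whether the source URL answered with 410 Gone.
function isDeleted(array $response): bool {
  return (int) ($response['code'] ?? 0) === 410;
}

// Made-up sample: a stub entry served alongside a 410 status.
$response = [
  'data' => ['type' => 'entry'],
  'url' => 'https://example.com/post',
  'code' => 410,
];

var_dump(isDeleted($response));
```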

## Rels

There is also an API method to parse and return all rel values on the page, including HTTP `Link` headers and HTML rel values.

```
GET /rels?url=https://aaronparecki.com/
```

## Token API

When verifying [Private Webmentions](https://indieweb.org/Private-Webmention#How_to_Receive_Private_Webmentions), you will need to exchange a code for an access token at the token endpoint specified by the source URL.

XRay provides an API that will do this in one step. You can provide the source URL and code you got from the webmention, and XRay will discover the token endpoint, and then return you an access token.

```
POST /token

source=http://example.com/private-post
&code=1234567812345678
```

The response will be the response from the token endpoint, which will include an `access_token` property, and possibly an `expires_in` property.

```
{
  "access_token": "eyJ0eXAXBlIjoI6Imh0dHB8idGFyZ2V0IjoraW0uZGV2bb-ZO6MV-DIqbUn_3LZs",
  "token_type": "bearer",
  "expires_in": 3600
}
```

If there was a problem fetching the access token, you will get one of the errors below in addition to the HTTP related errors returned by the parse API:

* `no_token_endpoint` - Unable to find an HTTP header specifying the token endpoint.
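
Since `expires_in` is optional and relative to the time of the response, a client may want to convert it to an absolute expiry before storing the token. A sketch under the assumption that token errors use the same `error` key as the parse API; `parseTokenResponse()` is a hypothetical helper and the token value below is a placeholder:

```php
<?php
// Hypothetical helper: extract the token and an absolute expiry time
// from a token response. $now is a Unix timestamp, passed in for testability.
function parseTokenResponse(string $json, int $now): array {
  $response = json_decode($json, true);
  if (!is_array($response)) {
    throw new UnexpectedValueException('Response was not a JSON object');
  }
  if (isset($response['error'])) {
    throw new RuntimeException($response['error']);
  }
  if (!isset($response['access_token'])) {
    throw new UnexpectedValueException('No access_token in response');
  }
  return [
    'access_token' => $response['access_token'],
    // null means the token endpoint gave no expiry.
    'expires_at' => isset($response['expires_in']) ? $now + (int) $response['expires_in'] : null,
  ];
}

$token = parseTokenResponse(
  '{"access_token": "placeholder-token", "token_type": "bearer", "expires_in": 3600}',
  time()
);
echo $token['access_token'], "\n";
```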

## Installation

### From Source

```
# Clone this repository
git clone git@github.com:aaronpk/XRay.git
cd XRay

# Install dependencies
composer install
```

### From Zip Archive

* Download the latest release from https://github.com/aaronpk/XRay/releases
* Extract to a folder on your web server

### Web Server Configuration

Configure your web server to point to the `public` folder.

Make sure all requests are routed to `index.php`. XRay ships with `.htaccess` files for Apache. For nginx, you'll need a rule like the following in your server config block.

```
try_files $uri /index.php?$args;
```