Newscred Data

Students at GT have access to a purchased collection of Newscred news articles. The articles are formatted in XML as follows:

<article> The root tag of the XML document
   <category> Describes the category of the news article.
      <dashed_name> The name of the category formatted to be read by a machine
      <name> The name of the category
   <description> A description of the article
   <title> The title of the article
      <topic_set> Begins a set of topics for the article
         <topic> Tag for a specific topic
         <name> The name of the topic
         <topic_group> The same as the name
         <dashed_name> The name of the topic formatted to be read by a machine
         <image_url> The url of the image used in the news article
         <link> Link to the topic (usually Wikipedia)
         <guid> Globally unique identifier of the topic
         <description> A short description of the topic
   <created_at> When the article was created
   <author_set> The set of authors who wrote the article
      <author> The set of tags describing an individual author
         <guid> A globally unique identifier of the article author
         <first_name> The first name of the author
         <last_name> The last name of the author
      <source> A tag whose child tags describe where the article came from
         <website> The base website that the article came from
         <name> The name of the news source (e.g. New York Times)
         <guid> The globally unique identifier of the source
         <country> The country that the news source is in
         <company_type> Whether the company is public or private
         <founded> When the source was founded
         <frequency> How often the source publishes articles
         <owner> The parent company of the news source
         <media_type> Set of tags describing the type of media (blog or mainstream)
            <name> The name of the media type (blog or mainstream)
            <dashed_name> The name with dashes instead of spaces
         <description> A description of the company
         <guid> An identifier for the company
         <thumbnail> A link to a thumbnail of the company's logo
      <published_at> When the article was published
      <link> A link to the article
      <guid> An identifier of the article
      <metadata> Extra data about the article
         <social metrics> A set of social metrics about the article
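
The tag layout above can be read with Python's standard xml.etree.ElementTree module. The following is only a minimal sketch: the SAMPLE document, its values, and the helper names first_text and summarize_article are made up for illustration, and the exact nesting in the real files is an assumption. To stay tolerant of nesting differences, the lookups search by tag name at any depth rather than by a fixed path.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample article. The values are placeholders, and the real
# files may nest these tags differently or carry many more fields.
SAMPLE = """<article>
  <category>
    <dashed_name>business-news</dashed_name>
    <name>Business News</name>
  </category>
  <title>Example title</title>
  <topic_set>
    <topic>
      <name>Example Topic</name>
      <guid>topic-guid-1</guid>
    </topic>
  </topic_set>
  <author_set>
    <author>
      <first_name>Jane</first_name>
      <last_name>Doe</last_name>
    </author>
  </author_set>
  <published_at>2010-01-01</published_at>
</article>"""


def first_text(elem, tag):
    """Return the stripped text of the first <tag> at or below elem, or None."""
    if elem is None:
        return None
    node = next(elem.iter(tag), None)
    if node is None or node.text is None:
        return None
    return node.text.strip() or None


def summarize_article(xml_text):
    """Pull a few commonly needed fields out of one article document."""
    root = ET.fromstring(xml_text)
    category = next(root.iter("category"), None)
    # One name per <topic>; topics without a usable <name> are skipped.
    topics = [first_text(t, "name") for t in root.iter("topic")]
    # Join first and last name, tolerating a missing half.
    authors = []
    for author in root.iter("author"):
        parts = (first_text(author, "first_name"), first_text(author, "last_name"))
        authors.append(" ".join(p for p in parts if p))
    return {
        "title": first_text(root, "title"),
        "category": first_text(category, "name"),
        "topics": [t for t in topics if t],
        "authors": authors,
        # Kept as a raw string: the date format is not documented above.
        "published_at": first_text(root, "published_at"),
    }


if __name__ == "__main__":
    print(summarize_article(SAMPLE))
```

Because several different parents have a <name> child (category, topic, source, media_type), the sketch always looks up <name> relative to a specific parent element instead of searching the whole document for it.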