Coolthing Of Theday

Friday, 11 October 2013

Some TweetSharp, Accord.Net and the author's code = Machine Learning to detect hacked tweets...

Posted on 16:21 by Unknown

Primary Objects - Detecting a Hacked Tweet with Machine Learning

Introduction

This article is part of a presentation for the Associated Press 2013 Technology Summit.

On April 23, 2013, the stock market experienced one of its biggest flash-crash drops of the year, with the Dow Jones Industrial Average falling 143 points (roughly 1%) in a matter of minutes. Unlike the 2012 stock market blip, this one wasn't caused by an individual trade, but by a single tweet from the Associated Press (AP) account on the social network Twitter. The tweet, of course, wasn't written by AP, but by an impostor who had temporarily gained control of the account. Considering the impact of real-time messaging services such as Twitter, could it be possible to detect the tweet as hacked?

In this article, we'll discuss how to use machine learning and so-called "big data" analysis to mine large amounts of information and extract meaningful relationships from it. In particular, we'll walk through a prototype machine learning example that attempts to classify tweets as having been authored by AP or not. We'll examine learning curves to see how they help validate machine learning algorithms and models. As a final test, we'll run the program on the hacked tweet and see if it's able to successfully classify the tweet as authentic or hacked.

...

Why, Hello There, Twitter

The most important part of a machine learning solution is the amount and quality of the data that learning is based on. To classify AP's tweets by authorship, we'll need to extract tweets from AP's Twitter account history to serve as the positive cases. We'll also need a collection of non-AP tweets to serve as the negative cases.

To aid in the collection of tweets, the C# .NET library TweetSharp was used. Queries were initially prepared to extract AP tweets, using the search term "from:AP", and later refined to include date ranges.
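As a rough illustration of the query refinement described above, the sketch below builds a search string in Python. It is not the author's C# code: the function name is hypothetical, and it assumes the date range is expressed with Twitter's `since:` and `until:` search operators, which the article does not confirm.

```python
from datetime import date


def build_search_query(screen_name, since=None, until=None):
    """Build a search string such as 'from:AP since:2013-01-01'.

    Assumption: date ranges are expressed with the since:/until: operators.
    """
    parts = [f"from:{screen_name}"]
    if since is not None:
        parts.append(f"since:{since.isoformat()}")
    if until is not None:
        parts.append(f"until:{until.isoformat()}")
    return " ".join(parts)


# The dates here are arbitrary examples, not the ranges used in the article.
query = build_search_query("AP", since=date(2013, 1, 1), until=date(2013, 4, 1))
print(query)
```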

...

The results from TweetSharp are then saved to a CSV file, using the C# .NET library CsvHelper.
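The same idea can be sketched with Python's standard `csv` module instead of CsvHelper. The column names (`author`, `text`, `is_ap`) and the sample rows are made up for illustration; the article does not describe the actual file layout.

```python
import csv
import io

# Hypothetical column layout; the article's real CSV schema is not shown.
FIELDS = ["author", "text", "is_ap"]


def write_tweets_csv(handle, tweets):
    """Write labeled tweets (a list of dicts) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet)


def read_tweets_csv(handle):
    """Read labeled tweets back, converting the label column to int."""
    return [
        {"author": row["author"], "text": row["text"], "is_ap": int(row["is_ap"])}
        for row in csv.DictReader(handle)
    ]


# Invented sample rows: one positive (AP) case and one negative case.
sample = [
    {"author": "AP", "text": "Example headline, with a comma", "is_ap": 1},
    {"author": "someone", "text": 'An example "quoted" reply', "is_ap": 0},
]

buffer = io.StringIO()  # stands in for a file on disk
write_tweets_csv(buffer, sample)
buffer.seek(0)
restored = read_tweets_csv(buffer)
print(restored == sample)
```

Using a CSV library rather than joining strings by hand matters here because tweet text routinely contains commas and quotes.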

While TweetSharp worked quite well for extracting a limited history of tweets, the API apparently limits how far back in time tweets can be retrieved. This would leave us with about 1,100 examples to train on; ideally, we would have far more. Note that initial training on this minimal data set still achieved 94% accuracy, although the learning curves indicated that more data could raise it further.

...

Digitizing Tweets

To allow the machine learning algorithm to process the tweets, each tweet needs to be converted into a numerical format. There are a few different methods for doing this, such as TF-IDF, but the best-performing method appeared to be word indexing.

First, the collection of tweets was separated into two portions: the training set and the cross-validation (CV) set. The training set would be used for all learning-based examples, while the CV set would be used for calculating accuracy scores.
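A minimal Python sketch of such a split follows. The 25% CV fraction, the fixed seed, and the function name are assumptions for illustration; the article does not state the ratio it used.

```python
import random


def split_train_cv(examples, cv_fraction=0.25, seed=42):
    """Shuffle a copy of the examples, then hold out a fraction as the CV set.

    Assumption: cv_fraction and seed are illustrative defaults only.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cv_size = int(len(shuffled) * cv_fraction)
    return shuffled[cv_size:], shuffled[:cv_size]


# Placeholder (text, label) pairs standing in for real tweets.
examples = [(f"tweet {i}", i % 2) for i in range(8)]
train_set, cv_set = split_train_cv(examples)
print(len(train_set), len(cv_set))
```

Shuffling before splitting keeps the CV set from being, say, only the most recent tweets, which would skew the accuracy score.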

A vocabulary was built from the training set by tokenizing the text of the tweets and then applying the Porter stemmer algorithm (Centivus.EnglishStemmer.dll) to obtain the collection of distinct base words.
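The vocabulary step might look roughly like this in Python. Note that `crude_stem` is a hypothetical stand-in that only strips a few suffixes; it is not the Porter stemmer the article uses, and the sample tweets are invented.

```python
import re


def crude_stem(word):
    """Hypothetical stand-in for a real Porter stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(tweets):
    """Return the sorted list of distinct stems seen in the training tweets."""
    stems = {crude_stem(token) for tweet in tweets for token in tokenize(tweet)}
    return sorted(stems)


# Invented training tweets, for illustration only.
training_tweets = [
    "Markets falling after reports",
    "Report: market falls",
]
vocabulary = build_vocabulary(training_tweets)
print(vocabulary)
```

Stemming collapses "markets" and "market" into one entry, which keeps the vocabulary (and therefore each tweet's vector) smaller.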

We then digitize each tweet in the training set into an array of ints, with one entry per vocabulary word. For each tweet, we check each word in the vocabulary and see if it exists in the current tweet. ...
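One way to picture this word-indexing step is the Python sketch below. It assumes a 1 marks a vocabulary word that is present in the tweet and a 0 marks one that is absent; the article's exact encoding is elided, so treat this as an interpretation rather than the author's implementation.

```python
def digitize(tweet_tokens, vocabulary):
    """Map a tokenized tweet to a list of ints aligned with the vocabulary.

    Assumption: 1 means the vocabulary word occurs in the tweet, 0 means it
    does not. Words outside the vocabulary are ignored.
    """
    present = set(tweet_tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical vocabulary and tweet tokens (already stemmed).
vocabulary = ["after", "fall", "market", "report"]
vector = digitize(["market", "fall", "today"], vocabulary)
print(vector)
```

Every tweet ends up as a vector of the same length as the vocabulary, which is what lets a classifier treat tweets of different lengths uniformly.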

...

Proof That We're Learning Something

Learning curves are an excellent way to tell whether a machine learning algorithm is actually learning. Plotting accuracy against the number of training examples makes it apparent whether the algorithm improves as the data grows, and whether adding more data will help or hinder accuracy.

For machine learning algorithms in C# .NET, the Accord.NET library was used.
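To make the learning-curve idea concrete, here is a small Python sketch that trains on growing prefixes of the training set and records training and CV accuracy at each size. The classifier is a deliberately simple toy (per-feature vote weights), not an Accord.NET algorithm, and the data is invented.

```python
def fit(examples):
    """Toy classifier: weight per feature = positive uses minus negative uses."""
    weights = [0] * len(examples[0][0])
    for features, label in examples:
        sign = 1 if label == 1 else -1
        for i, value in enumerate(features):
            weights[i] += sign * value
    return weights


def predict(weights, features):
    """Predict 1 when the weighted score is positive, otherwise 0."""
    score = sum(w * v for w, v in zip(weights, features))
    return 1 if score > 0 else 0


def accuracy(weights, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for f, y in examples if predict(weights, f) == y)
    return correct / len(examples)


def learning_curve(train, cv, sizes):
    """Return (size, training accuracy, CV accuracy) for each training size."""
    points = []
    for m in sizes:
        subset = train[:m]
        model = fit(subset)
        points.append((m, accuracy(model, subset), accuracy(model, cv)))
    return points


# Invented (feature vector, label) pairs; label 1 stands for "authored by AP".
train_set = [([1, 0, 0], 1), ([0, 1, 0], 0), ([0, 0, 1], 1), ([0, 1, 0], 0)]
cv_set = [([1, 0, 0], 1), ([0, 0, 1], 1), ([0, 1, 0], 0), ([0, 1, 0], 0)]

points = learning_curve(train_set, cv_set, sizes=[1, 2, 4])
for m, train_acc, cv_acc in points:
    print(f"{m:>3} examples  train={train_acc:.2f}  cv={cv_acc:.2f}")
```

If CV accuracy is still climbing at the largest training size, more data is likely to help; if it has flattened well below training accuracy, more data alone probably will not.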

...

Results?

The best algorithm was trained on 6,054 tweets. Roughly half were authored by AP, and the rest were authored by other users.

The program achieved a final accuracy of 100% on the training set, 97.38% on the CV set, and 96.23% on the test set. Judging by the learning curve, there is still room to go further by providing more training examples.

Here is a view of the resulting program running on real live data (the test set). The program never saw these tweets before in its whole life. Honest!

...

There's some great code and ideas here, but best of all the author makes the concepts understandable and relevant...

Posted in C#, Development, EDD | No comments