
5. Feasibility of applications

5.2 Practical challenges

The overall feasibility of realising each of these projects depends on (1) the availability of data sources, (2) the availability of infrastructure and (3) the expected development costs. Note that infrastructure also includes the software needed, for example, to connect the various systems to one another. For each of these requirements we now set out where the greatest risks and bottlenecks lie.

For each of these projects the required data sources are available at Coosto. For the jihad monitor we need, besides the messages, the links between profiles. For Twitter we can use mentions, and for other platforms we can use discussions. Both types of data are in Coosto's data set. For measuring opinions, social media messages are in principle sufficient. Also for the time series

Table 6

                       Sources   Labelled      Infrastructure
Opinion mining         Present   Not present   Present
Time series analysis   Present   Not needed    Partly present
Network analysis       Present   Not present   Partly present

Table 7 | The costs listed are in months. Columns: research costs, development costs (backend), development costs (frontend), total.

                       Total
Opinion mining         8
Time series analysis   7
Network analysis       7

We use labelled data to train our methods and to evaluate them. Both opinion mining and measuring in 150 different languages will require labelled data. Since this data is not yet available, we will have to generate it, which is a labour-intensive process.

The required infrastructure is partly present. For tasks such as opinion mining and finding jihad travellers based on (the content of) messages we can use the existing infrastructure. The extraction of social networks is also partly done already, but it must be extended before the analysis techniques we discussed can be implemented. Analysing time series will most likely require some adjustments to the infrastructure. To make time series searchable, they must be made indexable. Although we have the data, we do not yet do this for time series.

The development costs consist of the costs of the interface and the costs of developing the models and algorithms. All three applications require at least a change to the interface; this holds for time series analysis in particular. The development time for the models will be relatively low for time series analysis and relatively high for the other two applications.
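The remark that time series must be made indexable before they can be searched can be illustrated with a small sketch. The report does not say which technique would be used. The example below assumes one common option, piecewise aggregate approximation (PAA) on z-normalised series, and the names `znormalize`, `paa` and `TimeSeriesIndex` are hypothetical. For brevity it scans the compact sketches linearly; a production system would presumably store them in a proper search index.

```python
import math


def znormalize(series):
    """Scale a series to zero mean and unit variance (constant series become zeros)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-width frame."""
    if len(series) % segments != 0:
        raise ValueError("series length must be divisible by the number of segments")
    width = len(series) // segments
    return [sum(series[i * width:(i + 1) * width]) / width for i in range(segments)]


class TimeSeriesIndex:
    """Stores a short fixed-length sketch per series (an assumption, not the report's design)."""

    def __init__(self, segments):
        self.segments = segments
        self.entries = {}  # key -> PAA sketch of the z-normalised series

    def add(self, key, series):
        self.entries[key] = paa(znormalize(series), self.segments)

    def query(self, series, k=1):
        """Return the keys of the k stored series whose sketches are closest to the query."""
        sketch = paa(znormalize(series), self.segments)
        scored = sorted((math.dist(sketch, s), key) for key, s in self.entries.items())
        return [key for _, key in scored[:k]]


if __name__ == "__main__":
    index = TimeSeriesIndex(segments=4)
    index.add("rising", [1, 2, 3, 4, 5, 6, 7, 8])
    index.add("falling", [8, 7, 6, 5, 4, 3, 2, 1])
    index.add("peak", [1, 2, 4, 8, 8, 4, 2, 1])
    # A rising series on a different scale should still match "rising".
    print(index.query([10, 20, 30, 40, 50, 60, 70, 80]))
```

Because the series are normalised before they are reduced, the lookup matches on shape rather than absolute volume, which seems the relevant notion when comparing message counts of topics of very different size.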

Table 9 | The costs listed are in months.

                      Research costs   Development costs (backend)   Development costs (frontend)   Total
Jihad monitor         7                6                             1                              14
Trend analysis        2                4                             1                              7
Safety monitor        4                3                             1                              8
“150 languages”       9                3                             1                              10
Offence recognition   4                3                             1                              8

Tables 6 and 7 show the expected bottlenecks and costs of developing the techniques discussed. Tables 8 and 9 do the same for the applications. The costs of the applications are based on the costs of the techniques to be developed (for example, for the jihad monitor we assume the development of a network analysis technique and an opinion mining technique). For the “150 languages” application we have added an uncertainty margin of 3 months, since this application has to be carried out in a very large number of languages (which does not hold a priori for the other applications).
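The cost figures can be checked with simple arithmetic. The sketch below uses the values listed in Table 9 (row labels translated) and reports every row whose three components do not add up to the stated total. The function name `check_totals` is hypothetical.

```python
# Rows as listed in Table 9: (research, backend, frontend, stated total), in months.
TABLE_9 = {
    "Jihad monitor": (7, 6, 1, 14),
    "Trend analysis": (2, 4, 1, 7),
    "Safety monitor": (4, 3, 1, 8),
    "150 languages": (9, 3, 1, 10),
    "Offence recognition": (4, 3, 1, 8),
}


def check_totals(table):
    """Return {application: (computed sum, stated total)} for rows that do not add up."""
    mismatches = {}
    for name, (research, backend, frontend, stated) in table.items():
        computed = research + backend + frontend
        if computed != stated:
            mismatches[name] = (computed, stated)
    return mismatches


if __name__ == "__main__":
    for name, (computed, stated) in check_totals(TABLE_9).items():
        print(f"{name}: components sum to {computed}, table states {stated}")
```

With these figures only the “150 languages” row is reported: its components sum to 13 months, while the table states 10. The difference equals the 3-month uncertainty margin, so it may reflect how that margin was booked; the text does not say in which column it is included.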

The costs of hardware are not covered in this report. We can indicate, however, that both network analyses and time series analyses will certainly require extra hardware. For opinion mining

                       Sources   Labelled      Infrastructure
Opinion mining         Present   Not present   Partly present
Time series analysis   Present   Not needed    Partly present
Network analysis       Present   Not present   Present
“150 languages”        Present   Not present   Present
Offence recognition    Present   Not present   Present



From the document Onderzoek Toepassing Social Media Data-Analytics voor het Ministerie van Veiligheid en Justitie (pages 52-59)