Archive for the ‘Data’ Category

Obviously, the elites haven’t learnt from the Panama Papers. Setting up these offshore structures takes a long time, and dismantling them takes significant effort too. So what are the Paradise Papers?

They are 13.4 million records acquired from Appleby, a law firm and corporate services company. The records involve both people and organisations.

Interestingly, Singapore has largely kept itself out of this leak.

Anyway, here’s a copy of the Paradise Papers dataset.

In 2007, a team of Google engineers needed more accurate time for servers. Accurate time is especially important for synchronising data, transactional data above all. Technologies like Cassandra depend on accurate time between database servers to be able to reconstruct the order of events on a database. The end goal is to be sure about the “State-of-Data”.
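To see why clock accuracy matters for the order of events, here is a minimal sketch, assuming a "last write wins" rule in which the write with the highest timestamp is treated as the latest. The `Write` type, the node names and the timestamps are invented for illustration and are not taken from any real database.

```python
from typing import NamedTuple


class Write(NamedTuple):
    node: str
    timestamp: float  # as reported by the writing node's own clock
    value: str


def latest_value(writes: list[Write]) -> str:
    """Return the value of the write with the highest reported timestamp."""
    return max(writes, key=lambda w: w.timestamp).value


# True order of events: node A writes "v1", then node B writes "v2" a second later.
in_sync = [Write("A", 10.0, "v1"), Write("B", 11.0, "v2")]

# The same events, but node B's clock runs 3 seconds slow.
skewed = [Write("A", 10.0, "v1"), Write("B", 8.0, "v2")]

print(latest_value(in_sync))  # the newer write wins
print(latest_value(skewed))   # the older write wins because of the clock skew
```

With synchronised clocks the newer value survives; with a skewed clock the older value silently replaces it, which is exactly the kind of uncertainty about the state of data that accurate time is meant to remove.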

NTP (Network Time Protocol) is what Unix servers and other internet machines use to synchronise time. Due to network or processing delays, a computer’s time can easily drift out of sync with its peers. For most uses the margin of error is minor, but for a large computing company like Google the demand for high accuracy was crucial: keeping thousands of systems accurate was a prerequisite for creating “Spanner”.
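As a rough illustration of how a client can estimate its clock error despite network delay, the sketch below applies the standard NTP offset and round-trip delay formulas to four timestamps. The `NtpSample` class and the timestamp values are made up for the example; a real NTP client does considerably more filtering than this.

```python
from dataclasses import dataclass


@dataclass
class NtpSample:
    t0: float  # client clock: request sent
    t1: float  # server clock: request received
    t2: float  # server clock: reply sent
    t3: float  # client clock: reply received


def clock_offset(s: NtpSample) -> float:
    """Estimated server-minus-client clock difference (positive: client is behind)."""
    return ((s.t1 - s.t0) + (s.t2 - s.t3)) / 2


def round_trip_delay(s: NtpSample) -> float:
    """Time spent on the network, excluding the server's processing time."""
    return (s.t3 - s.t0) - (s.t2 - s.t1)


# Invented sample: the client is 0.5 s behind the server,
# each network leg takes 0.1 s and the server takes 0.1 s to reply.
sample = NtpSample(t0=100.0, t1=100.6, t2=100.7, t3=100.3)

print(round(clock_offset(sample), 3))      # about 0.5 seconds
print(round(round_trip_delay(sample), 3))  # about 0.2 seconds
```

The offset estimate is only exact when the outbound and return legs take the same time, which is one reason network delays limit how accurate this approach can be.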

Spanner is Google’s globally distributed database, and its time-keeping layer is built on GPS receivers and atomic clocks. Fortunately, the distances we have to cover are at most the span of the earth. I am sure the folks at NASA and other space-faring agencies have to consider time differences across far larger stretches of space.

This is all in the effort to maintain the “State-of-Data”.

 

 

Estonia is a small country bordering Russia and Latvia, with Finland just across the Gulf of Finland. It boasts an advanced information management platform for government.

This platform is X-Road, an invisible but crucial backbone for data transactions between the various e-services databases in the public and private sectors. X-Road facilitates harmonious interoperability.

Estonia’s data stores are decentralised, meaning:

- There is no single owner or controller
- Every government agency or business can choose the products that suit it
- Services are added one at a time, as they are ready

All Estonian services that use multiple data stores use X-Road as a central connection between these data stores. All outgoing data from X-Road is digitally signed and encrypted. All incoming data is authenticated and logged.
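The sign-on-the-way-out, verify-and-log-on-the-way-in pattern can be sketched as follows. This is not X-Road's actual protocol: an HMAC with a shared demo key stands in for its digital signatures, encryption is left out entirely, and the function names, the key and the in-memory `AUDIT_LOG` are all hypothetical.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key-not-for-production"  # hypothetical key for the sketch
AUDIT_LOG = []  # stand-in for a real, durable audit log


def _signature(body: str) -> str:
    return hmac.new(SHARED_KEY, body.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_outgoing(payload: dict) -> dict:
    """Serialise the payload and attach a signature over the exact bytes sent."""
    body = json.dumps(payload, sort_keys=True)
    return {"body": body, "signature": _signature(body)}


def accept_incoming(message: dict) -> dict:
    """Check the signature, log the attempt, and return the payload if it is valid."""
    valid = hmac.compare_digest(_signature(message["body"]), message["signature"])
    AUDIT_LOG.append({"body": message["body"], "accepted": valid})
    if not valid:
        raise ValueError("signature check failed")
    return json.loads(message["body"])
```

A message whose body is altered after signing fails the check and is rejected, but the attempt still ends up in the log.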

X-Road was a system built to facilitate multi-data store queries, but has evolved to also facilitate multi-data store writes, and transmit large datasets. It was also designed for growth and currently supports:

- 287 million queries (2013)
- Connects 170 databases in Estonia
- Provides 2,000 services in Estonia
- Connects 900 organisations daily
- Supports the more than 50% of Estonians who use the government portal Eesti.ee

Services provided via X-Road include:

- Electronic registration of residency
- Updating personal data (such as address, exam results and health insurance)
- Declaring taxes electronically
- Checking driving license validity
- Checking for registered vehicles
- Registering newborn children for health insurance

Estonia showcases its e-society here. To transform its society into a community of digital governance and tech-savvy individuals, children as young as 7 are taught the principles and basics of coding.

Estonians are driven, forward-thinking and entrepreneurial, and the same goes for the government. It takes only five minutes to register a company there and, according to The Economist, the country in 2013 held the world record for the number of startups per person. And it’s not quantity over quality: Many Estonian startups are now successful companies that you may recognize, such as Skype, Transferwise, Pipedrive, Cloutex, Click & Grow, GrabCAD, Erply, Fortumo, Lingvist and others.

If all this sounds enticing and you wish to become an entrepreneur there, you’re in luck; starting a business in Estonia is easy, and you can do it without packing your bags, thanks to its e-residency service, a transnational digital identity available to anyone. An e-resident can not only establish a company in Estonia through the Internet, but they can also have access to other online services that have been available to Estonians for over a decade. This includes e-banking and remote money transfers, declaring Estonian taxes online, digitally signing and verifying contracts and documents, and much more.

E-residents are issued a smart ID card, a legal equivalent to handwritten signatures and face-to-face identification in Estonia and worldwide. The cards themselves are protected by 2048-bit encryption, and the signature/ID functionality is provided by two security certificates stored on the card’s microchip.

But great innovations don’t stop there. Blockchain, the principle behind bitcoin that also secures the integrity of e-residency data, will be used to provide unparalleled safety to 1 million Estonian health records. The blockchain will be used to register any and all changes, illicit or otherwise, done to the health records, protecting their authenticity and effectively eliminating any abuse of the data therein.
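The tamper-evidence idea can be illustrated with a simple hash chain, in which every logged change includes the hash of the entry before it. This is a minimal sketch of the principle only, not Estonia's actual implementation; the record fields and function names are invented for the example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, change: dict) -> str:
    """Hash a change together with the hash of the entry that precedes it."""
    data = prev_hash + json.dumps(change, sort_keys=True)
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def append_change(chain: list, change: dict) -> None:
    """Add a change to the end of the chain, linking it to the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {"change": change, "prev_hash": prev_hash, "hash": entry_hash(prev_hash, change)}
    )


def is_intact(chain: list) -> bool:
    """Recompute every hash; any edit to an earlier entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["change"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_change(log, {"record": "patient-001", "field": "allergy", "new": "penicillin"})
append_change(log, {"record": "patient-001", "field": "blood_type", "new": "O+"})
print(is_intact(log))
```

Changing any earlier entry after the fact makes `is_intact` return `False`, so illicit edits are detectable even though nothing physically prevents them.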

There are many lessons we can learn from Estonia. To increase the efficiency and maturity of its services, a country needs to be willing to adapt and evolve its infrastructure to the needs of the new economy. These include transparency and the precise, equitable delivery of services to the community.

Talend has started leveraging Apache Spark as part of its big data integration platform. Spark uses fast in-memory execution to accelerate data ingestion, and migrating to it can improve performance by 5 to 100 times.

Talend promises to make the migration literally as simple as the push of a button, with a new refactoring option that can automatically convert data pipelines written for MapReduce, the previous leader in high-performance data integration, to Spark. In theory, the conversion requires no changes to the high-level workflows that a user has defined for a cluster.

New projects also benefit from the upgrade, which brings some 100 pre-implemented data ingestion and integration functions that make it possible to pull data into Spark without having to do any programming. According to Talend, the result is an up to tenfold improvement in developer productivity.

There are a number of new Talend features, the biggest addition being “masking”, also commonly known as tokenisation. This allows an organization to replace a sensitive file with a structurally similar placeholder that doesn’t reveal any specific details. That’s useful in scenarios where, say, an analyst at a hospital who doesn’t have permission to view patient treatment history wants to check how many medical records there are in a given dataset coming into Spark.
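To make the idea concrete, here is a small sketch of masking, assuming a simple character-for-character substitution that keeps length, letter case and punctuation. It is not Talend's implementation, and the field names and sample records are invented.

```python
import random
import string


def mask_value(value: str, rng: random.Random) -> str:
    """Swap letters for random letters and digits for random digits; keep the rest."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # spaces, dashes and other punctuation stay in place
    return "".join(out)


def mask_records(records, sensitive_fields, seed=0):
    """Return copies of the records with the sensitive fields masked."""
    rng = random.Random(seed)
    return [
        {k: mask_value(v, rng) if k in sensitive_fields else v for k, v in rec.items()}
        for rec in records
    ]


records = [
    {"patient_id": "P-1001", "treatment": "Chemo 2015-03"},
    {"patient_id": "P-1002", "treatment": "Physio 2015-04"},
]
masked = mask_records(records, sensitive_fields={"treatment"})

# The analyst can still count records without seeing any treatment details.
print(len(masked))
```

Because the placeholder keeps the shape of the original, downstream checks such as record counts or format validation keep working on the masked data.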

 

Here are some examples of reported correlations that make little sense.

- Cat owners drink craft beer (in USA) (r=0.70)
- Shorter commutes for people who work for men (r=0.71)
- Obese kids like listening to “Purple Rain” by Prince (r=0.77)
- Watch porn if you want to make more money (r=0.54)
- Walmart shoppers own dogs and drink wine (r=0.76, r=0.79)
- Norwegian oil and train accidents
- USA space spending and suicides by hanging/strangulation/suffocation
- Age of Miss America and murders by steam/hot vapours/hot objects
- Eating organic food causes autism

See more, more, more and more ….
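The r values quoted above are Pearson correlation coefficients. The sketch below computes r for two invented series that have nothing to do with each other but both happen to rise over time, which is all it takes to get a high score.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented yearly figures for two unrelated quantities that both trend upwards.
series_a = [10, 12, 15, 19, 24, 30]
series_b = [3.1, 3.0, 3.6, 3.9, 4.6, 5.0]

print(round(pearson_r(series_a, series_b), 2))
```

A shared trend is enough to push r close to 1, which says nothing about whether one series causes the other.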

 

Reading or processing the Ashley Madison data is fairly straightforward. The dataset comes with a suite of files:

- am_am.dump.gz
- aminno_member.dump.gz
- aminno_member_email.dump.gz
- member_details.dump.gz
- member_login.dump.gz
- CreditCardTransactions.7z
- README

Each of these files comes with a PGP signature. You can use gunzip on a Mac (or any Unix platform) to extract the .gz files. The 7z file will require 7-Zip software on a Windows computer.
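If gunzip is not to hand, the same extraction can be done with Python's standard library. This is a minimal sketch: the `gunzip` helper and the `extracted` folder name are just illustrative choices, and it does not check the PGP signatures.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dest_dir: Path) -> Path:
    """Decompress a .gz file into dest_dir and return the path of the output file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.stem  # "member_details.dump.gz" -> "member_details.dump"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, so large dumps are fine
    return dest


# Example usage, assuming the .gz files sit in the current directory:
# for archive in Path(".").glob("*.dump.gz"):
#     print(gunzip(archive, Path("extracted")))
```

Streaming the copy keeps memory use flat, which matters because database dumps of this kind can run to several gigabytes once decompressed.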

You will need MySQL software from Oracle to load this data. MySQL Community edition is free.

Top Cities by Users for Ashley Madison

Here are the top 100 cities. It’s interesting that Singapore doesn’t feature on the list at all. The city-state had banned the site in the interest of the family, and it looks like the ban worked. Sydney, New York and Toronto look like hotbeds of infidelity.

- São Paulo 374542
- New York 268171
- Sydney 251813
- Toronto 222982
- Santiago 218125
- Melbourne 213847
- Houston 186795
- Los Angeles 181918
- London 179129
- Chicago 162444
- Rio de Janeiro 156572
- Madrid 135294
- Bogotá 123559
- Brisbane 118857
- Brooklyn 110859
- Miami 109505
- Calgary 107021
- San Antonio 99157
- Dallas 97736
- Brasília 97096
- San Diego 94953
- Perth 88754
- Las Vegas 87720
- Atlanta 86897
- Philadelphia 86018
- Edmonton 84971
- Lima 82279
- Phoenix 81913
- Belo Horizonte 77834
- 香港 77561
- Austin 77432
- Columbus 73377
- Montreal 72304
- Washington 71779
- Jacksonville 70134
- Denver 70043
- Mississauga 69403
- Curitiba 68916
- Barcelona 68513
- Dublin 65658
- Ciudad de México 64516
- Orlando 63549
- San Francisco 62333
- Minneapolis 61403
- 灣仔 60674
- Portland 60672
- Charlotte 59686
- Ottawa 58463
- Seattle 56935
- Indianapolis 56741
- Buenos Aires 56701
- Adelaide 55490
- Tampa 55321
- Cleveland 55031
- Vancouver 52651
- Fort Lauderdale 52554
- Cincinnati 52055
- Springfield 51644
- Arlington 51345
- Salvador 51069
- San Jose 51043
- Fort Worth 50976
- Medellín 50308
- Beverly Hills 49437
- Bronx 49067
- Boston 47951
- Pittsburgh 47815
- Kansas City 47793
- Louisville 47239
- Winnipeg 47202
- Porto Alegre 47018
- Saint Louis 46547
- Richmond 46546
- Buffalo 46532
- North York 46223
- Roma 46000
- Johannesburg 45831
- Sacramento 45777
- Rochester 45216
- Columbia 44541
- Tucson 43293
- Central 41900
- Oklahoma City 41809
- Salt Lake City 41773
- El Paso 40914
- Milwaukee 40392
- Hamilton 40096
- Cali 38847
- Colorado Springs 38696
- New Delhi 38620
- London 38561
- Brampton 38446
- Madison 37813
- Paris 37641
- Saint Paul 37412
- Cape Town 37001
- Fortaleza 36922
- Scarborough 35952
- Albuquerque 35802
- תל אביב יפו 35602
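A ranking like this one boils down to counting members per city. The sketch below shows the idea on a handful of invented rows; the `city` field name is an assumption, since the real dump's column names are not given here, and the actual ranking may well have been produced with a SQL query instead.

```python
from collections import Counter


def top_cities(rows, n):
    """Count rows per city and return the n most common as (city, count) pairs."""
    counts = Counter(
        row["city"].strip() for row in rows if row.get("city")  # skip blank cities
    )
    return counts.most_common(n)


# Invented member rows, standing in for records loaded from the database.
rows = [
    {"city": "Sydney"},
    {"city": "Toronto"},
    {"city": "Sydney "},
    {"city": ""},
    {"city": "Sydney"},
    {"city": "Toronto"},
    {"city": "Perth"},
]

for city, users in top_cities(rows, 2):
    print(city, users)
```

Free-text city fields are messy, which would explain why the same place can appear more than once in a list like this under different spellings or districts.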

Yes, we have a copy of it. No, we’re not selling it. However, we’ll be putting our data analytics and data quality glasses on to see what lies within.

Several Australian cities featured prominently on the list of AM users. Singapore, which had banned the website, was hardly represented.

There is nothing special about Ashley Madison’s leak except that the brand attracts a lot of negative emotions. They probably stepped on the toes of a capable geek. The reality is that nothing is safe on the internet, and transparency is your only defence.

Should Ashley Madison have done more to protect their data? Yes! The simple step of basic data encryption should have been taken. But it wasn’t.

Bye bye Ashley Madison. You’re not the first, and I am sure you won’t be the last. There is no defence against cheating spouses except character, honesty, truth and love.

 

- Data is never clean.
- You will spend most of your time cleaning and preparing data.
- 95% of tasks do not require deep learning (or other forms of machine learning).
- In 90% of cases generalized linear regression will do the trick.
- Big Data is just a tool.
- You should embrace the Bayesian approach.
- No one cares how you did it.
- Academia and business are two different worlds.
- Presentation is key: be the master of PowerPoint.
- All models are false, but some are useful.
- There is no fully automated Data Science. You need to get your hands dirty.

SAP re-launched the HANA (High-performance ANalytic Appliance) platform (http://www.sap.com/solutions/technology/in-memory-computing-platform/hana/overview/index.epx) in 2012 and looks to it as the “game changing” technology for BI/DW/analytics. But is it?

Driven by the corporate demand for real time analytics, the HANA platform seeks to put data into memory and dramatically improve performance. This will help address the demand for big data, predictive capabilities, and text-mining capabilities.

But doesn’t this sound like the typical rhetoric from computing vendors that previously addressed technology issues by recommending the addition of more CPU, RAM, or disk space? SAP HANA is delivered as a software appliance focused on the underlying infrastructure for SAP Business Objects. This white paper (http://download.sap.com/download.epd?context=B576F8D167129B337CD171865DFF8973EBDC14E3C34A18AF1CF17ED596163658ABE46C2191175A1415B54F1837F5F0A13487B903339C6F98) suggests a lot of the scoping is centred around hardware and infrastructure design.

HANA makes incredible claims that traditional BI/DW folks would hesitate even to whisper. The one that stands out is the “Combination of OLAP and OLTP” into one database. Ouch! Feel the wrath of the stakeholders of business operations. Another claim is running analytics in “mixed operations”. Double ouch!

It’s already challenging enough to get DW/BI solutions deployed without affecting operations. BI folks have constantly advocated separate infrastructure for analytics, with the ETL window as the firewall between systems. The same ETL window has also created delays for real-time analytics. To advocate moving the BI/DW infrastructure back into operations is going to be a challenge. Yes, it facilitates “closer to real-time”, but it’s going to be a challenge to make it work politically.

For other BI/DW vendors, this solution would be unfeasible, but because SAP also happens to be the largest ERP application platform on the planet, they definitely have a good shot at consolidating their ERP and HANA’s BI analytics. Google, Facebook and the large online behemoths already do it. So why not?!

This is indeed exciting, and it’s definitely time to take a closer look at SAP HANA.


If you thought “Big Data” was already quite unmanageable, IEEE predicts a 1500% (x15) growth in data by 2015. That is 3 years from now.

On a similar scale, IEEE also suggests that terabit networks should be implemented soon to cater for demand in network traffic by 2015. That is 40 to 1,000 times the capacity of today’s gigabit networks.

This probably also suggests that capacity for data processing and delivery will need to increase on a similar scale, by some 10 to 40 times.
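To put the headline figure in yearly terms, here is a quick back-of-the-envelope calculation, assuming the x15 growth is spread evenly (compounding) over the three years. The function name is just for illustration.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over the period."""
    return total_factor ** (1 / years)


factor = annual_growth_factor(15, 3)
print(round(factor, 2))             # roughly 2.47
print(round((factor - 1) * 100))    # yearly growth in percent, roughly 147
```

In other words, x15 in three years means data volumes growing to nearly two and a half times their size every single year.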

What products and skills will power the delivery of services for “Humungous Data”?

- New data systems: like GFS, Bigtable, Hadoop, Hive, MapReduce
- New data patterns: NoSQL
- Cloud computing: a must for elastic computing vs BYO data centres
- Open data systems skills: unless you plan to pay for expensive database licenses
- Web services: to tie it all together
- Agile architecture: often under-rated, but increasingly important to focus corporate development
- Agile security: also under-rated, but increasingly important

With corporations already struggling to manage data growth and demand, will this mean x15 growth in data staffing, or will a data specialist have to be 15 times more productive? I believe it’s a combination of both. New tools will make the data professional more effective. At the same time, because of the lack of training and skills transfer, there will always be a need for the human bridge.

 

 

The future is indeed exciting.